#013: How to Automate & Improve Data Quality

Oct 01, 2022

“If code runs and manual checks pass, then we’re good!”

For most of my 8 years on data teams, this was the extent of Data Quality.

But this leads to sloppy errors that kill user confidence.

 

Today, I want to share 3 ways you can automate & improve Data Quality by using:

  1. Continuous Integration (CI) workflows

  2. Linters

  3. Task Automations

     

CI workflows help you automate deployments, testing & docs.

The biggest confidence killer is broken logic or missing data.

And it can take months to regain that trust.

Instead, build workflows to validate changes beforehand.

Soon stakeholders will be focused on new features, not bug fixes.

 

Example: Use GitHub Actions to deploy & test every Pull Request before it merges.
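Here is a minimal sketch of the kind of check a CI workflow could run on each Pull Request. It is plain Python, and everything in it is an assumption for illustration: the `validate_rows` helper, the `order_id` and `amount` columns, and the sample rows. A real check would read from the tables your PR actually builds.

```python
def validate_rows(rows, required_columns, key_column):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    seen_keys = set()
    for index, row in enumerate(rows):
        # Missing data: every required column must have a value.
        for column in required_columns:
            if row.get(column) is None:
                problems.append(f"row {index}: missing value for '{column}'")
        # Broken logic often shows up as duplicate keys.
        key = row.get(key_column)
        if key is not None:
            if key in seen_keys:
                problems.append(f"row {index}: duplicate {key_column} {key!r}")
            seen_keys.add(key)
    return problems


# Hypothetical sample standing in for the output of a changed model.
sample = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 5.0},
]

problems = validate_rows(sample, ["order_id", "amount"], "order_id")
for problem in problems:
    print(problem)

# A CI step fails when the process exits non-zero, so a real script
# would finish with sys.exit(exit_code).
exit_code = 1 if problems else 0
```

The design choice is simple: the script reports every problem it finds, then signals failure through the exit code, which is what a CI runner looks at to block the merge.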

 

Linters establish clear style rules for your code.

Everyone has their own take on the “right” way to code.

But this leads to petty arguments and wasted time.

Linters hard-code styling rules and auto-check that they’re being followed.

The result is more consistent and maintainable code.

 

Example: SQLFluff or Pylint
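To make the idea concrete, here is a toy lint rule written with Python's standard `ast` module. It is not how SQLFluff or Pylint are implemented, just a sketch of the principle: one hard-coded rule (function names should be snake_case) checked automatically. The `find_naming_violations` name and the example source are made up for illustration.

```python
import ast
import re

# The rule: lowercase letters, digits and underscores only.
SNAKE_CASE = re.compile(r"^_*[a-z][a-z0-9_]*$")


def find_naming_violations(source):
    """Return sorted (line_number, name) pairs for non-snake_case functions."""
    tree = ast.parse(source)
    violations = []
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and not SNAKE_CASE.match(node.name):
            violations.append((node.lineno, node.name))
    return sorted(violations)


# Hypothetical code under review: one function breaks the rule.
example = (
    "def loadOrders():\n"
    "    return []\n"
    "\n"
    "def clean_orders():\n"
    "    return []\n"
)

violations = find_naming_violations(example)
for line_number, name in violations:
    print(f"line {line_number}: function '{name}' is not snake_case")
```

Once a rule lives in code like this, nobody has to argue about it in review. In practice you would configure an existing linter rather than write your own.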

 

Task automations push, pull and update data on your behalf.

Fair or not, you’re expected to be aware of the full data platform.

But without the right systems, this is an impossible task.

Push notifications and task orchestrators close that gap by telling you when something breaks.

Once in place, you’ll feel more in control and can quickly address issues.

 

Example: Slack notifications from Airflow
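A rough sketch of what that could look like, assuming a Slack incoming webhook and an Airflow failure callback. Airflow calls a failure callback with a context dictionary that includes the task instance and the exception. The function names, the `SLACK_WEBHOOK_URL` environment variable, and the fake context below are assumptions for illustration. The sketch avoids importing Airflow so it runs on its own.

```python
import json
import os
import urllib.request
from types import SimpleNamespace


def build_failure_message(context):
    """Turn an Airflow-style callback context into a Slack webhook payload."""
    task_instance = context["task_instance"]
    text = (
        f":red_circle: Task failed: {task_instance.dag_id}.{task_instance.task_id}\n"
        f"Error: {context.get('exception')}\n"
        f"Logs: {task_instance.log_url}"
    )
    return {"text": text}


def notify_slack_on_failure(context):
    """Post the failure message to Slack. Intended as an on_failure_callback."""
    # Assumption: the webhook URL is stored in this environment variable.
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    payload = json.dumps(build_failure_message(context)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)


# Fake context for a dry run; Airflow would supply the real one.
fake_context = {
    "task_instance": SimpleNamespace(
        dag_id="daily_orders",
        task_id="load_orders",
        log_url="http://localhost:8080/log",
    ),
    "exception": ValueError("connection refused"),
}

message = build_failure_message(fake_context)
print(message["text"])
```

In a DAG, you would pass `notify_slack_on_failure` as the `on_failure_callback` of a task or in `default_args`. Keeping the message builder separate from the HTTP call makes it easy to test without hitting Slack.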

 

In summary:

Better data quality = Happy stakeholders.

Happy stakeholders = Happy engineers.
