Automation

Data quality checks are automated tests that tell you whether data is fit to use before it reaches a report, model, customer workflow, or operational decision. The goal is not to test every possible thing. The goal is to catch the failures that would create wrong decisions, broken workflows, or a loss of trust.

What data quality checks are

Data quality checks are rules that compare actual data against an expected condition. If the condition is not met, the check warns, blocks, quarantines, or opens an investigation.

A simple check might say: the orders table must be updated by 7:00 a.m. every weekday. Another might say: every paid subscription must have a non-null customer_id. A more advanced check might compare revenue in the warehouse against revenue in a billing export and alert if the difference is outside an agreed tolerance.

The best checks are tied to how the data is used. A column can be technically messy and still harmless. A small error in a board metric, billing workflow, or customer-facing automation can be expensive. Start with impact, not perfection.

The operator principle: test the contract, not the database

Reliable data systems need clear expectations. A check is useful when it encodes an expectation between a producer and a consumer of data.

For example, an application team may produce product usage events. An analytics team may use those events to calculate activation. The contract is not just that a table exists. The useful contract includes when data arrives, which fields are required, how identifiers behave, and what changes are allowed without warning.

When checks are written without a contract, they often become noise. Teams add dozens of generic tests, ignore half of the alerts, and still miss the failures that matter. When checks represent real expectations, failures become easier to triage.

  • Bad check: every column in every table must never be null.
  • Better check: customer_id must be present for completed orders because revenue reporting and lifecycle automation depend on it.
  • Bad check: row count must be exactly the same every day.
  • Better check: daily order count should not fall more than 40% below the trailing weekday average unless there is a known business reason.
Operator rule

A check without an owner and response path is not reliability. It is only a signal hoping someone notices.

Checklist step 1: Start with critical data products

Do not begin by testing every table. Begin with the dashboards, models, exports, and workflows that people actually rely on.

List the data products where incorrect data would cause visible harm. This usually includes executive dashboards, revenue reporting, customer health scores, billing operations, marketing attribution, compliance-sensitive exports, and machine learning or AI inputs used in production workflows.

For each data product, identify the upstream tables, transformations, and source systems that feed it. This gives you a practical testing boundary. You are not trying to certify the whole warehouse on day one. You are protecting the flows where failure matters most.

  • Which dashboard or workflow would create the most confusion if it were wrong?
  • Which metrics are discussed in leadership, investor, finance, or customer meetings?
  • Which data feeds trigger customer communication, billing, sales routing, or operational action?
  • Which datasets are reused across many downstream models or reports?
  • Which source systems have changed recently or failed before?

Checklist step 2: Check freshness first

Freshness checks answer a basic question: did the data arrive when users expect it? Freshness is often the first quality dimension to automate because stale data can make every downstream number misleading even when the transformation logic is correct.

A good freshness check needs a business expectation, not just a technical timestamp. If a sales dashboard is reviewed at 9:00 a.m., then the relevant pipeline may need to complete by 8:30 a.m. If a weekly board package is built on Monday afternoon, daily delay may matter less than Monday completeness.

  • Define the expected arrival time for each critical table or model.
  • Use the timestamp that best represents new business data, not only the warehouse load timestamp.
  • Account for weekends, holidays, source system schedules, and known batch windows.
  • Alert the owner who can diagnose the source, not a broad channel that nobody owns.
  • Separate late but recoverable data from stale data that should block downstream consumption.
Practical checkpoint

Freshness checks should be based on when people need the data, not only when the pipeline usually finishes.

Checklist step 3: Add volume and distribution checks

Volume checks look for suspicious changes in row counts, event counts, file counts, or aggregate totals. They catch failures such as partial loads, duplicate ingestion, missing partitions, broken tracking, and source exports that silently changed.

A useful volume check usually compares the current batch with a historical baseline. Exact comparisons are brittle unless the business process is truly fixed. Most operating data has seasonality, weekday patterns, launches, campaigns, and normal volatility.

Distribution checks go one level deeper. Instead of asking only whether 100,000 events arrived, they ask whether the mix of event types, countries, plans, statuses, or platforms looks plausible.

  • Compare daily volume against the same day of week, not only yesterday.
  • Use tolerances that reflect the business process and the cost of false alarms.
  • Check for both drops and spikes. Duplicates can be as damaging as missing data.
  • Track important categories separately, such as product, region, plan, or channel.
  • Document known events that explain expected changes, such as launches or migrations.

Checklist step 4: Validate required fields and allowed values

Validity checks confirm that values follow expected rules. Completeness checks confirm that required values are present. Together, they prevent common failures where records exist but cannot be joined, classified, grouped, or trusted.

Examples include checking that email fields have a plausible format, order_amount is not negative unless refunds are represented that way, currency codes come from an allowed set, and customer_id is present for records that require customer attribution.

Be careful with universal null checks. Many columns are optional by design. A null is only a defect when the business process says a value should exist at that point in the record lifecycle.

  • Mark which fields are required for each important use case.
  • Check uniqueness for business keys that should identify one entity or transaction.
  • Check accepted values for statuses, types, plan names, currencies, and source labels.
  • Check numeric ranges where impossible values would distort metrics.
  • Check timestamp logic, such as completed_at not occurring before created_at.

Checklist step 5: Reconcile across tables and systems

Consistency checks compare data that should agree. They are especially important when metrics depend on multiple systems, such as product analytics, billing, CRM, support, and finance tools.

Reconciliation does not always mean exact equality. Different systems may have different timing, definitions, exchange rates, refund rules, or status lifecycles. The work is to define the expected relationship and tolerance.

For example, monthly recurring revenue in a finance model may not exactly match raw subscription events because the finance model applies business rules. But the difference should be explainable. If the gap suddenly doubles, the check should surface it.

  • Compare source record counts to warehouse record counts after ingestion.
  • Compare key aggregates between raw, staged, and final models.
  • Check that orphan records are within expected limits, such as orders without customers.
  • Reconcile high-value metrics against a source of record when one exists.
  • Document timing differences so normal lag does not become alert noise.

Checklist step 6: Put checks at the right layer

Not every check belongs in the same place. Some checks should run when data is ingested. Some belong after transformations. Some belong before a dashboard, reverse ETL sync, or production model is updated.

A simple rule: run checks as close as possible to the point where failure becomes expensive, but early enough to prevent bad data from spreading.

  • Source and ingestion layer: freshness, file arrival, schema changes, duplicate batches, source row counts.
  • Staging layer: type casting, required identifiers, accepted values, basic deduplication, parse failures.
  • Core modeling layer: uniqueness, relationships, metric logic, entity state transitions, cross-table consistency.
  • Consumption layer: executive metric reconciliation, dashboard readiness, model feature availability, reverse ETL audience sanity checks.

If a check fails in the source layer, the response may be to pause downstream jobs. If a check fails in a dashboard layer, the response may be to hide or annotate the report until the data is repaired. The check location should match the operational response.

Layer Good checks Typical failure caught
Ingestion File arrival, source freshness, schema change, duplicate batch The source export did not arrive or arrived in an unexpected shape
Staging Required IDs, type casting, accepted values, parse errors Raw data exists but cannot be reliably joined or transformed
Core models Uniqueness, relationships, metric logic, reconciliation The business entity or metric is wrong after transformation
Consumption Dashboard readiness, audience size, feature availability Bad data is about to reach a decision, customer workflow, or model

Checklist step 7: Decide what happens when a check fails

A data quality check is incomplete until the team knows what to do with the result. Alerting is only one response. Some checks should block a deployment. Some should stop a sync. Some should create a warning and allow the pipeline to continue.

Use the severity of downstream harm to choose the action. A missing optional marketing attribute may not justify waking anyone up. A broken revenue model before a board report may justify blocking publication.

  • Warn: the issue is visible and should be reviewed, but downstream use can continue.
  • Block: the issue would cause a materially wrong report, workflow, or model output.
  • Quarantine: bad records can be isolated while good records continue moving.
  • Annotate: users need context that the data is delayed, incomplete, or under review.
  • Escalate: the owner, severity, and response time are clear enough to require attention.
Warning

Do not make every failure a blocking failure. If blocking becomes routine for low-risk issues, teams will work around the system.

Checklist step 8: Assign owners and runbooks

Data quality checks fail in real operating conditions. APIs change, event tracking breaks, vendors delay exports, transformations drift, and business definitions evolve. The difference between a reliable system and a noisy system is usually ownership.

Each important check should have an owner, a reason, an expected response, and a place to record known exceptions. Without this, alerts become background noise and teams lose confidence in the system.

  • Assign one accountable owner for each critical dataset or data product.
  • Write the first diagnostic step in plain English.
  • Record likely causes, such as source delay, schema change, tracking release, or transformation bug.
  • Define who should be informed when user-facing reports or workflows are affected.
  • Review noisy checks regularly and either tune, downgrade, or remove them.

Common failure modes to watch for

Most data quality programs fail because they produce too many alerts, test the wrong things, or lack a response path. The checklist should make the system easier to operate, not just more instrumented.

  • Testing everything equally: low-risk tables consume attention while critical data products remain weakly protected.
  • Using brittle thresholds: exact row-count rules fail during normal seasonality, campaigns, or growth.
  • Ignoring business definitions: checks pass technically while the metric is still wrong for the decision being made.
  • Alerting without ownership: failures are posted to a channel but nobody is responsible for diagnosis.
  • Only checking final dashboards: issues are discovered too late, after bad data has already spread.
  • Only checking raw ingestion: source data arrives, but transformation logic still creates incorrect metrics.
  • Never retiring checks: outdated checks create noise after products, schemas, or business processes change.
Symptom Likely cause Operator response
Alerts are ignored Too many low-impact checks or unclear ownership Reduce noise, assign owners, and keep only checks tied to real use cases
Checks pass but users distrust dashboards Tests validate tables, not business definitions Add metric-level and reconciliation checks for critical dashboards
Pipelines fail during normal growth Thresholds are too rigid Use historical baselines, tolerances, and known business context
Issues are found by executives first Checks run too late or do not cover critical data products Map critical reports to upstream dependencies and add freshness, volume, and metric checks
The same issue repeats No runbook or root-cause follow-up Document diagnosis steps and add a preventive check after repair

A minimum viable check set for a small team

If you are building or repairing a data system, start with a small set of high-signal checks. A small team should prefer ten checks that people respond to over one hundred checks that people ignore.

For each critical data product, choose one or two checks from each category below. Add more only after the existing checks are trusted and actionable.

  1. Freshness: the main upstream table or model updated before the business needs it.
  2. Volume: row count or event count stayed within a reasonable historical range.
  3. Completeness: required identifiers and metric fields are populated for records in scope.
  4. Validity: key statuses, dates, amounts, and categories follow expected rules.
  5. Consistency: the final metric reconciles with its source or an agreed comparison model.
  6. Ownership: every failed check points to a named owner and a first diagnostic step.

How to review and improve checks over time

Data quality checks are not a one-time setup task. They should evolve as systems, users, and business processes change.

Set a regular review cadence for critical checks. Look at which checks caught real issues, which checks produced false positives, which incidents were missed, and which datasets became more important since the last review.

  • After every material data incident, ask which check would have caught it earlier.
  • After every noisy alert, ask whether the threshold, severity, or owner is wrong.
  • After every schema or metric definition change, update the affected checks.
  • After every new critical dashboard, sync, or model, add checks before launch.
  • After every retired process, remove or downgrade checks that no longer protect anything important.

Key takeaways

  • Data quality checks should protect important decisions and workflows, not certify every table equally.
  • Start with critical data products, then add freshness, volume, completeness, validity, and consistency checks where they reduce real risk.
  • Every important check needs an owner, severity, expected action, and first diagnostic step.
  • Thresholds should reflect business reality, including seasonality, timing differences, launches, and normal volatility.
  • Review checks after incidents and noisy alerts so the system becomes more useful over time.

Next step

Pick one critical dashboard, model, or operational sync. Map its three most important upstream dependencies, then add one freshness check, one volume check, and one business-rule check with a named owner and response path.

Controlled internal links