Data quality checks are not about proving that every table is perfect. They are about catching the data problems that would make your company take the wrong action. For a founder or operator, the right framework starts with business risk, then translates that risk into simple checks on freshness, completeness, validity, uniqueness, consistency, and accuracy.
Why data quality checks matter before the data team feels mature
Most early companies discover data quality problems through embarrassment: a board metric changes after the meeting, a dashboard shows more active customers than total customers, sales and finance disagree on revenue, or a lifecycle email goes to the wrong segment.
The instinct is often to ask for “better dashboards” or a “modern data stack.” Those can help, but the root problem is usually simpler: important assumptions are not being checked. The system accepts broken, late, duplicated, or misunderstood data until a human notices.
Data quality checks create explicit tripwires. They turn hidden assumptions into visible rules. A check can alert the team that yesterday’s payments did not arrive, that customer IDs are duplicated, that an event name changed, or that a metric suddenly moved outside a reasonable range.
The founder version is not a massive testing program. It is a short list of checks protecting the numbers and workflows that run the business.
What a data quality check is in plain English
A data quality check asks whether data is fit for the job you are using it for. The same dataset can be good enough for one use and unsafe for another.
For example, product event data that is 95 percent complete may be acceptable for exploring feature usage trends. It may not be acceptable for calculating customer invoices, triggering contractual usage limits, or compensating sales reps.
A useful check has four parts:
- The object: the table, model, metric, file, API response, or event stream being checked.
- The rule: the condition that must be true, such as “order_id is unique” or “loaded_at is within the last 24 hours.”
- The severity: what happens if the rule fails, from log-only to blocking a downstream report.
- The owner: the person or team responsible for deciding whether the failure is a real issue and what to do next.
Without severity and ownership, checks become noisy. Without a rule, quality remains a vague complaint.
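As a rough illustration, the four parts can be written down as a small record. This is a minimal sketch, not a prescribed implementation: the `Check` class, the `Severity` names, the `finance-data` owner, and the in-memory list of rows are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Mapping, Sequence

Row = Mapping[str, object]


class Severity(Enum):
    INFORMATIONAL = "informational"  # track it, do not block
    WARNING = "warning"              # someone needs to review
    CRITICAL = "critical"            # block downstream use until resolved


@dataclass(frozen=True)
class Check:
    target: str                                 # the object being checked
    rule: str                                   # the rule, in plain language
    predicate: Callable[[Sequence[Row]], bool]  # returns True when the rule holds
    severity: Severity                          # what happens when the rule fails
    owner: str                                  # who decides what to do next

    def passes(self, rows: Sequence[Row]) -> bool:
        return self.predicate(rows)


def order_ids_unique(rows: Sequence[Row]) -> bool:
    ids = [row["order_id"] for row in rows]
    return len(ids) == len(set(ids))


order_id_check = Check(
    target="orders",
    rule="order_id is unique",
    predicate=order_ids_unique,
    severity=Severity.CRITICAL,
    owner="finance-data",  # hypothetical team name
)

print(order_id_check.passes([{"order_id": 1}, {"order_id": 1}]))
```

Keeping the plain-language rule next to the executable predicate means a failure message can name the rule, its severity, and its owner without anyone reading the code.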
The founder framework: start with decisions, not tables
The fastest way to design useful data quality checks is to work backward from decisions. A founder does not need to begin by cataloging every column in the warehouse. Start with the operating moments where bad data would cause damage.
Use this sequence:
- Name the decision or workflow. Examples: weekly revenue review, cash forecast, activation dashboard, paid acquisition spend, churn risk list, customer health scoring, investor reporting.
- Name the metric or dataset used. Examples: net revenue retention, active accounts, qualified pipeline, trial conversion, open invoices.
- Name the failure that would mislead you. Examples: data is late, test accounts are included, duplicates inflate counts, refunds are missing, event definitions changed.
- Write the smallest check that catches that failure. Do not start with a generic test suite. Start with the business failure mode.
- Assign a response. Decide whether the failure should block a dashboard refresh, send an alert, create a ticket, or simply appear in a daily quality report.
This keeps quality work tied to operating risk. It also prevents the team from spending days testing low-value staging tables while the CEO dashboard still depends on unchecked revenue logic.
Do not ask, “What tests can we add?” Ask, “What bad decision are we trying to prevent?”
Six data quality checks most teams need first
Most practical data quality checks fall into six categories. You do not need all of them everywhere. You need the right ones on the data that carries business meaning.
Freshness checks confirm that data arrived or updated when expected. If your sales dashboard depends on CRM data, a freshness check might verify that the latest sync happened today.
Completeness checks confirm that required records or fields are present. For example, every paid order should have an order ID, customer ID, amount, and timestamp.
Validity checks confirm that values fit expected formats or allowed ranges. For example, an email field should look like an email, a subscription status should be one of the allowed statuses, and a date should not be in the far future.
Uniqueness checks confirm that identifiers are not duplicated where they should be unique. This is essential for customers, orders, invoices, accounts, and other entities that drive counts or joins.
Consistency checks confirm that related data agrees across tables or systems. For example, total invoice revenue in the finance model should reconcile within an expected tolerance to the source billing export.
Accuracy checks are harder because they ask whether data reflects reality, not just whether it follows a rule. Accuracy often requires comparison to a trusted source, manual review, sampling, or reconciliation to an operational system.
| Check type | Question it answers | Example rule |
|---|---|---|
| Freshness | Did the data arrive or update on time? | Latest payment record is less than 24 hours old. |
| Completeness | Are required records or fields present? | Every order has customer_id, order_id, amount, and created_at. |
| Validity | Do values follow expected formats or ranges? | Subscription status is one of active, trialing, past_due, canceled. |
| Uniqueness | Are identifiers duplicated when they should not be? | Each invoice_id appears once in the invoice model. |
| Consistency | Do related datasets agree? | Modeled monthly revenue reconciles to billing exports within an approved tolerance. |
| Accuracy | Does the data reflect reality? | A sample of account owners matches the CRM records used by the sales team. |
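
The first five rows of the table translate into short functions. The sketch below assumes rows are plain dictionaries already loaded into memory. The function names and the 0.5 percent default tolerance are illustrative choices, not values from any particular tool. Accuracy is left out because it usually needs a trusted source or manual sampling rather than a rule.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_STATUSES = {"active", "trialing", "past_due", "canceled"}


def is_fresh(latest_record_at, now, max_age=timedelta(hours=24)):
    """Freshness: the latest record is less than max_age old."""
    return now - latest_record_at < max_age


def missing_required(rows, required_fields):
    """Completeness: rows where any required field is absent or None."""
    return [r for r in rows if any(r.get(f) is None for f in required_fields)]


def invalid_values(rows, field, allowed):
    """Validity: rows whose field value is not in the allowed set."""
    return [r for r in rows if r.get(field) not in allowed]


def duplicate_ids(rows, key):
    """Uniqueness: identifier values that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        value = r[key]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


def reconciles(modeled_total, source_total, tolerance=0.005):
    """Consistency: totals agree within a relative tolerance (assumed 0.5%)."""
    if source_total == 0:
        return modeled_total == 0
    return abs(modeled_total - source_total) / abs(source_total) <= tolerance


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(is_fresh(now - timedelta(hours=3), now))
```

Each function returns either a boolean or the offending rows, so the same logic can feed an alert, a ticket, or a daily quality report.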
Where data quality checks belong in a modern data stack
Checks can run in several places. The durable principle is to place checks close to the failure they are designed to catch.
At ingestion, check whether expected source data arrived and whether basic structure changed. This catches missing files, broken API syncs, schema changes, and empty loads.
During transformation, check whether modeled data follows business rules. This catches duplicate customer records, invalid joins, impossible dates, orphaned foreign keys, and broken metric logic.
At the metric or dashboard layer, check whether numbers are plausible and ready for decision-making. This catches sudden drops, spikes, stale executive dashboards, and mismatches between important reported metrics.
Before operational activation, check whether the data is safe to use in workflows such as customer messaging, sales routing, product personalization, billing, or support prioritization. Bad analytics can mislead a meeting. Bad operational data can touch customers directly.
For an early team, the first checks usually belong around ingestion freshness, core transformation logic, and the few dashboards or workflows that leadership actually uses.
Freshness checks belong near ingestion. Business rule checks belong near modeled data. Plausibility checks belong near metrics and dashboards.
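A plausibility check at the metric layer can be as simple as comparing today's value to a recent baseline. The sketch below is one possible approach under stated assumptions: the baseline is the median of recent values, the 50 percent threshold is arbitrary, and `is_plausible` is a hypothetical helper name.

```python
from statistics import median


def is_plausible(today_value, recent_values, max_relative_change=0.5):
    """Return True when today's value is within the allowed relative
    distance of the median of recent values."""
    if not recent_values:
        # Assumption: with no history there is nothing to compare against,
        # so the check passes rather than raising a false alarm.
        return True
    baseline = median(recent_values)
    if baseline == 0:
        return today_value == 0
    return abs(today_value - baseline) / abs(baseline) <= max_relative_change


# An activation count that drops to zero against a baseline near 100 fails.
print(is_plausible(0, [95, 105, 100]))
```

The median is used instead of the mean so that a single earlier spike does not distort the baseline. A team with strong weekly seasonality would likely compare against the same weekday instead.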
Common failure modes that checks should catch
Data quality problems are rarely mysterious once you know where to look. The same patterns appear across analytics stacks.
- Late data: a source sync fails, a warehouse load stalls, or an upstream vendor delays delivery.
- Empty data: a table technically updates, but it loads zero rows or far fewer rows than usual.
- Schema drift: a source changes field names, data types, nested structures, or allowed values.
- Duplicate records: a retry, merge, backfill, or source bug creates more than one record for the same entity.
- Broken joins: foreign keys do not match, causing records to disappear or multiply during transformation.
- Definition drift: teams use the same metric name for different logic, such as “active customer” or “qualified lead.”
- Test and internal data leakage: employee accounts, demo customers, sandbox transactions, or test events enter production reporting.
- Backfill side effects: a historical reload changes previously reported numbers without explanation.
The goal is not to predict every possible issue. It is to catch the repeatable failure modes that have real decision cost.
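Three of these failure modes (schema drift, empty or shrunken loads, and broken joins) lend themselves to very small checks. The following is a minimal sketch assuming column names and rows are available as Python collections. The helper names and the 50 percent row-count floor are assumptions for illustration.

```python
def schema_drift(expected_columns, actual_columns):
    """Schema drift: columns that disappeared or appeared unexpectedly."""
    expected, actual = set(expected_columns), set(actual_columns)
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }


def row_count_ok(row_count, typical_count, min_ratio=0.5):
    """Empty data: fail on zero rows or far fewer rows than usual."""
    return row_count > 0 and row_count >= typical_count * min_ratio


def orphaned_keys(child_rows, parent_rows, child_key, parent_key):
    """Broken joins: child key values with no matching parent record."""
    parent_ids = {r[parent_key] for r in parent_rows}
    return {r[child_key] for r in child_rows} - parent_ids


drift = schema_drift(
    ["id", "email", "plan"],
    ["id", "email_address", "plan"],
)
print(drift)
```

Here a renamed source field shows up as one missing column and one unexpected column, which is usually enough for the owner to recognize a rename rather than a deletion.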
How to prioritize checks when everything feels risky
If every table feels important, prioritize by damage. A check deserves attention when failure would cause one of four outcomes: a bad executive decision, a customer-facing mistake, a financial reporting error, or a recurring loss of trust in the data system.
A simple scoring model is enough:
- Impact: How costly is the decision or workflow if the data is wrong?
- Frequency: How often is the data used?
- Volatility: How often does the source, logic, or business process change?
- Detectability: Would a human notice the issue quickly without an automated check?
High-impact, high-frequency, high-volatility, low-detectability data should be checked first. A revenue model used every Monday deserves more attention than an exploratory table used once by an analyst.
| Priority | Use case | Recommended checks first |
|---|---|---|
| Highest | Board reporting, revenue, billing, customer-facing automation | Freshness, uniqueness, reconciliation, required fields, severity-based alerting |
| High | Executive dashboards, sales pipeline, retention reporting | Freshness, accepted values, duplicate detection, metric plausibility |
| Medium | Team dashboards and recurring analysis | Freshness, completeness, basic validity, documented definitions |
| Lower | Exploratory analysis and temporary datasets | Light checks only if reused or promoted into operating reports |
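
The scoring model can be kept deliberately crude. The sketch below assumes each factor is rated from 1 to 5 and weighted equally, with detectability inverted so that hard-to-notice problems score higher. The scale, the weights, and the example ratings are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    impact: int         # 1 = low cost if wrong, 5 = high cost
    frequency: int      # 1 = rarely used, 5 = used constantly
    volatility: int     # 1 = stable source and logic, 5 = changes often
    detectability: int  # 1 = nobody would notice, 5 = noticed quickly


def priority_score(asset):
    """Higher scores mean the asset should get checks sooner."""
    return (
        asset.impact
        + asset.frequency
        + asset.volatility
        + (6 - asset.detectability)  # invert: low detectability raises priority
    )


assets = [
    Asset("exploratory_table", impact=2, frequency=1, volatility=2, detectability=4),
    Asset("weekly_revenue_model", impact=5, frequency=5, volatility=4, detectability=2),
]

for asset in sorted(assets, key=priority_score, reverse=True):
    print(asset.name, priority_score(asset))
```

The exact numbers matter less than the ranking. The point is to make the team argue about which assets come first, using the same four questions for each.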
Practical examples of founder-grade checks
Founder-grade checks are specific, readable, and tied to operating consequences. They do not need to be fancy.
For a revenue dashboard, useful checks might include: billing data loaded in the last 24 hours, invoice IDs are unique, invoice amounts are non-negative except for credit memos, currency is present, and the daily total does not move outside a defined tolerance without review.
For a product activation dashboard, useful checks might include: signup events arrived yesterday, user IDs are present, internal domains are excluded, the activation event exists in the expected event list, and activation count has not dropped to zero unexpectedly.
For a sales pipeline report, useful checks might include: opportunity IDs are unique, close dates are valid, stage names match the approved CRM stages, deleted opportunities are handled consistently, and pipeline totals reconcile to the CRM within an agreed tolerance.
For a customer health workflow, useful checks might include: every scored account has an account owner, usage data is fresh, plan type is present, churned accounts are excluded, and no account is sent to an automated action list without required fields.
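Two of the rules above are easy to get subtly wrong, so they are worth sketching: negative invoice amounts that are only acceptable on credit memos, and internal domains that must be excluded from activation counts. The field names (`amount`, `type`), the `credit_memo` value, and the example domain are hypothetical.

```python
def invalid_invoice_amounts(invoices):
    """Invoices with a negative amount that are not credit memos."""
    return [
        inv for inv in invoices
        if inv["amount"] < 0 and inv.get("type") != "credit_memo"
    ]


def is_internal(email, internal_domains):
    """True when the email's domain is on the internal list (case-insensitive)."""
    domain = email.rsplit("@", 1)[-1].lower()
    return domain in internal_domains


invoices = [
    {"invoice_id": "inv_1", "amount": 120.0, "type": "invoice"},
    {"invoice_id": "inv_2", "amount": -40.0, "type": "credit_memo"},
    {"invoice_id": "inv_3", "amount": -15.0, "type": "invoice"},
]
print([inv["invoice_id"] for inv in invalid_invoice_amounts(invoices)])

signups = ["ana@customer.io", "dev@Example.com"]
external = [s for s in signups if not is_internal(s, {"example.com"})]
print(external)
```

Both helpers return the offending or filtered records rather than a bare pass or fail, which makes it easier for the owner to see what went wrong.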
Avoid check theater: more tests do not always mean better data
A common mistake is to create hundreds of checks because it feels responsible. This can make the system worse if failures are noisy, unactioned, or poorly understood.
Bad checks create alert fatigue. Teams start ignoring failures. Dashboards display warnings nobody trusts. Engineers spend time maintaining low-value tests while the real business definitions remain ambiguous.
Useful checks have a clear reason to exist. If nobody knows what action to take when a check fails, it is probably not ready. Rewrite it, lower its severity, route it to the right owner, or remove it.
The best early quality system is small and respected. It protects the most important data, fails loudly when leadership should care, and stays quiet when a minor issue has no decision impact.
A failed check that nobody owns is not a control. It is noise with a nicer name.
The operating model: ownership, severity, and response
Data quality is not only a technical problem. It is an operating model problem. A check without an owner is a dashboard decoration.
Define severity levels in plain language:
- Critical: The data should not be used for a key decision or customer-facing action until resolved.
- Warning: The data may be usable, but someone needs to review the issue and explain the risk.
- Informational: The issue should be tracked, but it does not block current use.
Then define response expectations. Who receives the alert? Who confirms whether the issue is upstream, transformation-related, or a business process change? Who communicates to dashboard users? Who decides whether historical numbers need restatement?
These questions matter more than the tool. A basic check with a clear owner usually beats a sophisticated check that nobody investigates.
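The severity levels and response expectations can be captured in a small lookup so that every failed check produces a predictable action. This is a sketch under assumptions: the action names, the `plan_response` helper, and the owner name are invented for illustration, and real routing would depend on the team's alerting tools.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFORMATIONAL = "informational"


# Hypothetical action names; map these to whatever alerting the team uses.
ACTIONS = {
    Severity.CRITICAL: "block_use_and_alert_owner",
    Severity.WARNING: "alert_owner_for_review",
    Severity.INFORMATIONAL: "add_to_daily_report",
}


def plan_response(check_name, passed, severity, owner):
    """Decide what should happen after a check runs."""
    if passed:
        return {"check": check_name, "action": "none",
                "notify": None, "blocks_use": False}
    return {
        "check": check_name,
        "action": ACTIONS[severity],
        "notify": owner,
        "blocks_use": severity is Severity.CRITICAL,
    }


print(plan_response("invoice_id is unique", False, Severity.CRITICAL, "finance-data"))
```

Only critical failures block use in this sketch, which matches the plain-language definitions above: warnings ask for review, and informational failures are tracked without interrupting anyone.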
A simple 30-day path to better data quality
You can make meaningful progress without trying to fix everything at once. A practical first month looks like this:
- Pick three business-critical assets. Examples: revenue dashboard, activation funnel, sales pipeline, or customer health list.
- Document the grain and definition of each asset. State what one row represents and how the main metric is calculated.
- List the top five ways each asset can be wrong. Use past incidents, stakeholder complaints, and known source system quirks.
- Create one to three checks per failure mode. Start with freshness, uniqueness, required fields, accepted values, and reconciliation.
- Assign severity and owner. Decide which failures block use and which only warn.
- Review failures weekly for one month. Remove noisy checks, tighten useful checks, and add missing checks only where repeated problems appear.
At the end of 30 days, you should have fewer surprises in the numbers that matter most. You should also have a better map of where the data system is fragile.
Key takeaways
- Data quality checks should protect decisions, not chase abstract perfection.
- Start from business risk: which metric, dashboard, or workflow would cause damage if it were wrong?
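- The first useful checks usually cover freshness, completeness, validity, uniqueness, and consistency (including reconciliation to source systems). Accuracy is harder because it needs a trusted source to compare against.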
- Severity and ownership matter as much as the check logic itself.
- A small set of respected checks on critical data is better than hundreds of noisy tests nobody investigates.
Next step
Choose one important dashboard or operational workflow this week. Write down the top five ways it could be wrong, then add the smallest checks that would catch those failures before the next decision cycle.