What Is dbt Column-Level Lineage?
Length: about 3,100 words · Reading time: about 14–17 minutes
Short Answer
dbt column-level lineage maps each output column in a dbt model back to the upstream columns, expressions, filters, joins, and transformations that produced it.
dbt already gives teams a model graph: source → staging model → intermediate model → mart model. That graph is useful, but it usually answers lineage at the model or table level. Column-level lineage answers a more precise question:
If this dbt model outputs `customer_lifetime_value`, which source columns, intermediate model columns, filters, joins, and aggregations contributed to that value?
This matters for impact analysis, PII tracking, metric debugging, data catalog accuracy, governance, and AI data workflows. To recover accurate dbt column-level lineage, a system usually needs more than dbt refs and sources. It needs dbt artifacts, compiled SQL, catalog context, and SQL semantic analysis.
Key Takeaways
- dbt model-level lineage shows how models depend on other models, sources, and exposures.
- dbt column-level lineage shows how each output column depends on upstream columns and expressions.
- Model-level lineage can tell you that `fct_orders` depends on `stg_orders`; column-level lineage should tell you that `fct_orders.gross_revenue` depends on `stg_orders.amount`, `stg_orders.status`, and possibly exchange-rate or tax fields.
- Accurate dbt column lineage usually requires analyzing compiled SQL, not raw Jinja SQL.
- Data catalogs such as DataHub or OpenMetadata can display lineage, but complex SQL still needs a semantic lineage engine to recover field-level dependencies.
- dbt column-level lineage is useful for impact analysis, PII flow tracing, metric explanation, CI checks, and SQL governance.
- A practical sidecar approach is to read dbt artifacts, analyze compiled SQL, generate `column_lineage.json`, and emit lineage into DataHub, OpenMetadata, SQLFlow, or CI workflows.
Why dbt Column-Level Lineage Matters
dbt gives teams a useful model graph, but many data teams need a more precise view of how individual columns move through models, joins, filters, aggregations, and transformations.
This matters when you need to answer questions such as:
- If a source column changes, which downstream dbt model columns are affected?
- Which columns contribute to a critical metric?
- Where does sensitive data flow across staging, intermediate, and mart models?
- Can a data catalog such as DataHub or OpenMetadata show field-level impact, not only model-level dependencies?
To answer those questions, teams usually need to combine dbt artifacts, compiled SQL, catalog context, and SQL semantic analysis.
dbt Already Has Lineage — So What Is Missing?
dbt projects are built around dependencies. A typical dbt project contains sources, staging models, intermediate models, marts, tests, exposures, and documentation. dbt understands references such as:
```sql
select *
from {{ ref('stg_orders') }}
```
and sources such as:
```sql
select *
from {{ source('raw', 'orders') }}
```
From those references, dbt can build a model graph:
```
raw.orders
→ stg_orders
→ int_order_revenue
→ fct_customer_revenue
→ dashboard / exposure
```
That graph is valuable. It helps analytics engineers understand build order, dependency structure, and high-level impact. If stg_orders changes, the team can see which downstream models may be affected.
But this is usually not enough for column-level questions.
For example, suppose raw.orders.discount_amount changes type. A model graph can show that several downstream models depend on raw.orders, but it may not tell you exactly which downstream columns depend on discount_amount.
That is the difference:
- Model-level lineage answers: Which dbt models depend on this model or source?
- Column-level lineage answers: Which output columns depend on this specific input column, and how?
Both are useful. They solve different problems.
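The model-level side of this difference can be sketched directly from dbt's manifest, which contains a `child_map` of node dependencies. The snippet below is a minimal illustration with invented node IDs; a real check would load `target/manifest.json` instead of an inline dict:

```python
from collections import deque

# Hypothetical fragment of a dbt manifest's child_map. Real manifests
# at target/manifest.json contain a child_map with this shape, but the
# node IDs here are invented for illustration.
child_map = {
    "source.analytics.raw.orders": ["model.analytics.stg_orders"],
    "model.analytics.stg_orders": [
        "model.analytics.int_order_revenue",
        "model.analytics.fct_customer_revenue",
    ],
    "model.analytics.int_order_revenue": ["model.analytics.fct_customer_revenue"],
    "model.analytics.fct_customer_revenue": [],
}

def downstream_models(node_id: str) -> set[str]:
    """Walk child_map breadth-first to collect every downstream node."""
    seen: set[str] = set()
    queue = deque([node_id])
    while queue:
        for child in child_map.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

affected = downstream_models("source.analytics.raw.orders")
print(sorted(affected))
```

Everything this walk returns is at model granularity; it cannot say whether a given downstream column actually touches `discount_amount`, which is exactly the gap column-level lineage fills.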
dbt Model-Level Lineage vs dbt Column-Level Lineage
| Question | Model-level lineage | Column-level lineage |
|---|---|---|
| Which models depend on `stg_orders`? | Yes | Yes, indirectly |
| Which mart columns depend on `orders.amount`? | No | Yes |
| Which metrics are affected if `customer_email` is removed? | Usually no | Yes, if lineage is accurate |
| Does a PII field flow into a BI-facing model? | Limited | Yes |
| Does `gross_revenue` depend on a filter, join, or aggregation? | No | Yes, if influence is modeled |
| Can a PR show which columns lost upstream lineage? | Not usually | Yes, with column-level diff support |
| Can a catalog explain how a field was derived? | Limited | Yes |
Model-level lineage is like a map of roads between cities. Column-level lineage is like knowing which pipes, valves, and meters carry a specific flow inside each building.
For governance, debugging, and impact analysis, teams usually need both.
A Simple dbt Example
Consider a simplified dbt model:
```sql
-- models/marts/fct_customer_revenue.sql
with orders as (

    select
        order_id,
        customer_id,
        amount,
        status,
        created_at
    from {{ ref('stg_orders') }}

),

customers as (

    select
        customer_id,
        country,
        segment
    from {{ ref('stg_customers') }}

)

select
    c.customer_id,
    c.country,
    c.segment,
    date_trunc('month', o.created_at) as revenue_month,
    sum(o.amount) as gross_revenue,
    count(distinct o.order_id) as order_count
from orders o
join customers c
    on o.customer_id = c.customer_id
where o.status = 'paid'
group by
    c.customer_id,
    c.country,
    c.segment,
    date_trunc('month', o.created_at)
```
At the model level, dbt can show:
```
fct_customer_revenue
  depends on stg_orders
  depends on stg_customers
```
That is correct, but incomplete.
Column-level lineage should go further:
```
fct_customer_revenue.customer_id
  <- stg_customers.customer_id
  <- joined through stg_orders.customer_id = stg_customers.customer_id

fct_customer_revenue.country
  <- stg_customers.country

fct_customer_revenue.segment
  <- stg_customers.segment

fct_customer_revenue.revenue_month
  <- stg_orders.created_at
  <- transformed by date_trunc('month', ...)
  <- filtered by stg_orders.status = 'paid'

fct_customer_revenue.gross_revenue
  <- stg_orders.amount
  <- transformed by sum(...)
  <- filtered by stg_orders.status = 'paid'
  <- joined through stg_orders.customer_id = stg_customers.customer_id

fct_customer_revenue.order_count
  <- stg_orders.order_id
  <- transformed by count(distinct ...)
  <- filtered by stg_orders.status = 'paid'
```
This output is more useful because it tells the team which columns matter, how they are transformed, and which conditions influence them.
For a data catalog, this can improve field-level documentation. For a governance team, it can identify sensitive-field propagation. For an analytics engineer, it can explain which metrics are affected by a source change.
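One way to make that lineage consumable is to store it as structured facts rather than prose. The sketch below hand-codes the `gross_revenue` dependencies from the example above; the field names are illustrative, not a standard schema:

```python
# Hypothetical machine-readable facts for fct_customer_revenue.gross_revenue,
# mirroring the prose lineage above. Field names are invented for illustration.
gross_revenue_lineage = [
    {"upstream": "stg_orders.amount",      "influence": "projection", "transformation": "sum"},
    {"upstream": "stg_orders.status",      "influence": "filter",     "transformation": "where status = 'paid'"},
    {"upstream": "stg_orders.customer_id", "influence": "join",       "transformation": "join on customer_id"},
]

def upstream_columns(edges, influence=None):
    """Return upstream columns, optionally limited to one influence type."""
    return [e["upstream"] for e in edges
            if influence is None or e["influence"] == influence]

print(upstream_columns(gross_revenue_lineage))            # every contributor
print(upstream_columns(gross_revenue_lineage, "filter"))  # filter influence only
```

Once the facts are in this shape, a catalog, a governance check, or a CI script can all query them the same way.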
Why Compiled SQL Matters
A dbt model file is often not pure SQL. It may contain Jinja, macros, variables, adapter dispatch, conditional logic, and references:
```sql
select
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'order_id']) }} as order_key,
    amount
from {{ ref('stg_orders') }}
where status = '{{ var("paid_status") }}'
```
A normal SQL parser should not be expected to fully understand dbt Jinja. The safer path is to let dbt do what dbt does best: compile the project.
After `dbt compile` or `dbt build`, dbt produces artifacts such as:

```
target/manifest.json
target/catalog.json
target/run_results.json
target/compiled/.../*.sql
```
The compiled SQL is closer to what the warehouse will actually execute. The manifest also contains dbt metadata such as model IDs, refs, sources, resource types, adapter information, and compiled code.
That is why a practical dbt column-lineage workflow often looks like this:
```
dbt build / dbt compile
        ↓
target/manifest.json + compiled SQL
        ↓
SQL semantic lineage analyzer
        ↓
column_lineage.json
        ↓
DataHub / OpenMetadata / SQLFlow / CI
```
This approach avoids pretending that raw Jinja is ordinary SQL. It also avoids replacing dbt. dbt remains responsible for compilation and project metadata. The lineage engine analyzes the resulting SQL and maps column dependencies.
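The handoff from dbt artifacts to a lineage analyzer can be sketched as a small selection step: keep only model nodes that have compiled SQL. The manifest fragment below imitates the shape of `target/manifest.json` (`nodes`, `resource_type`, and `compiled_code` are real manifest fields, but the nodes themselves are invented):

```python
# Sketch of the handoff step: pair each dbt model with its compiled SQL
# so a lineage engine analyzes real SQL instead of raw Jinja.
manifest = {
    "nodes": {
        "model.analytics.stg_orders": {
            "resource_type": "model",
            "compiled_code": "select order_id, amount, status from raw.orders",
        },
        "test.analytics.not_null_stg_orders_order_id": {
            "resource_type": "test",
            "compiled_code": "select order_id from raw.orders where order_id is null",
        },
    }
}

def models_to_analyze(manifest: dict) -> dict[str, str]:
    """Keep only model nodes that actually carry compiled SQL."""
    return {
        node_id: node["compiled_code"]
        for node_id, node in manifest["nodes"].items()
        if node["resource_type"] == "model" and node.get("compiled_code")
    }

for node_id, sql in models_to_analyze(manifest).items():
    print(node_id, "→", sql)
```

A real implementation would load the manifest with `json.load`, and would also carry node metadata (database, schema, alias) so the analyzer can resolve table names against the warehouse catalog.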
Why Column-Level Lineage Is Harder Than Finding Column Names
A basic extractor can list the columns that appear in a SQL statement. Column-level lineage requires more.
It needs to understand:
- aliases;
- nested CTE scopes;
- subqueries;
- joins;
- filters;
- aggregations;
- window functions;
- `case when` expressions;
- `union` and set operations;
- dialect-specific syntax;
- catalog metadata;
- compiled SQL generated by macros.
For example:
```sql
select
    customer_id,
    sum(case when status = 'paid' then amount else 0 end) as paid_revenue
from {{ ref('stg_orders') }}
group by customer_id
```
The output column paid_revenue does not merely depend on amount. It is also influenced by status, because status determines which rows contribute to the sum.
A useful lineage system should distinguish at least these kinds of relationships:
| Relationship | Example | Why it matters |
|---|---|---|
| Projection | `email as customer_email` | Direct field dependency |
| Transformation | `lower(email)` | Derived field tracking |
| Aggregation | `sum(amount)` | Metric explanation |
| Filter influence | `where status = 'paid'` | The output depends on row selection |
| Join influence | `join customers on orders.customer_id = customers.customer_id` | Output rows depend on join keys |
| Case condition | `case when risk_score > 80 then ...` | Conditional logic affects derived value |
| Window dependency | `row_number() over (partition by customer_id order by created_at)` | Ranking and deduplication depend on partition/order fields |
This is why dbt column-level lineage is not just a metadata problem. It is a SQL semantics problem.
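A minimal data model for these relationship kinds might look like the sketch below, which encodes the `paid_revenue` example as typed edges. The class and enum names are assumptions for illustration, not an existing API:

```python
from dataclasses import dataclass
from enum import Enum

class Influence(Enum):
    PROJECTION = "projection"
    AGGREGATION = "aggregation"
    FILTER = "filter"
    JOIN = "join"
    CASE = "case"
    WINDOW = "window"

@dataclass(frozen=True)
class LineageEdge:
    output_column: str
    upstream_column: str
    influence: Influence
    expression: str

# The paid_revenue example: amount is aggregated, but status also
# influences the result because it decides which rows contribute.
edges = [
    LineageEdge("paid_revenue", "stg_orders.amount", Influence.AGGREGATION,
                "sum(case when status = 'paid' then amount else 0 end)"),
    LineageEdge("paid_revenue", "stg_orders.status", Influence.CASE,
                "case when status = 'paid'"),
]

# A naive column extractor would surface only plain projections and
# miss both of these edges entirely.
non_projection = [e.upstream_column for e in edges
                  if e.influence is not Influence.PROJECTION]
print(non_projection)
```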
How dbt Column-Level Lineage Supports Impact Analysis
Impact analysis is one of the clearest reasons to care about dbt column-level lineage.
Suppose a data platform team plans to rename or remove a source column:
```
raw.customers.email → raw.customers.email_address
```
Model-level lineage can show which models depend on raw.customers. But a large project may have hundreds of models downstream of that source. Not every model depends on email.
Column-level lineage can answer more specific questions:
- Which dbt models use `raw.customers.email`?
- Which downstream columns expose or transform that field?
- Does `email` flow into a BI-facing mart?
- Does it flow into a hashed key, masked field, or derived segment?
- Which tests, metrics, or exposures may need review?
This can reduce noisy reviews. Instead of telling every model owner that a source table changed, the team can identify the specific downstream columns that depend on the changed field.
That is the difference between broad lineage and actionable lineage.
How dbt Column-Level Lineage Supports PII and Governance
Column-level lineage is also important for sensitive data governance.
A source field may be classified as sensitive:
```
raw.customers.email: pii.email
raw.customers.ssn: pii.national_id
raw.payments.card_last4: pii.payment
```
The governance question is not only whether a dbt model depends on raw.customers. The real question is:
Where do these sensitive fields flow, and are they still protected in downstream models?
Column-level lineage helps answer:
- Does `email` flow into a marketing mart?
- Is `ssn` projected directly, hashed, masked, or excluded?
- Does a sensitive field appear only in a join or filter, or is it exposed as an output column?
- Which BI-facing models contain derived sensitive fields?
- Which role or policy should be required before a generated SQL query can access those columns?
This is also where dbt lineage connects to SQL governance and Text-to-SQL safety. If an AI assistant generates a query against a dbt-backed semantic environment, field-level access checks need to know which columns are sensitive and where they flow.
A model graph alone is not precise enough for that.
Where DataHub and OpenMetadata Fit
DataHub, OpenMetadata, and similar catalogs are valuable because they give teams a place to search, browse, document, and govern data assets. They can display datasets, schemas, owners, tags, glossary terms, lineage graphs, and usage metadata.
But catalogs are only as accurate as the lineage they receive.
For dbt projects, a catalog may ingest dbt artifacts and show model-level dependencies. It may also attempt column-level lineage extraction. The hard part is complex SQL semantics, especially when compiled SQL contains nested CTEs, macro-generated SQL, dialect-specific constructs, stored procedure patterns, or advanced warehouse syntax.
A practical architecture is:
```
dbt artifacts + compiled SQL
        ↓
SQL semantic lineage sidecar
        ↓
column_lineage.json
        ↓
DataHub / OpenMetadata / SQLFlow / CI
```
In this model, the catalog remains the discovery and governance surface. The sidecar improves the quality of the SQL-derived column lineage before that lineage reaches the catalog.
This is not about replacing DataHub or OpenMetadata. It is about giving them better column-level facts.
What Should a dbt Column-Lineage Output Contain?
A useful dbt column-lineage output should be machine-readable and honest about confidence.
At minimum, it should include:
| Field | Purpose |
|---|---|
| dbt node ID | Connect lineage to the dbt model |
| output column | The column produced by the model |
| upstream model or source | The upstream dbt object or warehouse table |
| upstream column | The source field that contributes to the output |
| transformation type | Projection, expression, aggregation, window, case, etc. |
| influence type | Projection, filter, join, control, or unknown |
| confidence | Whether the mapping is high-confidence or partial |
| unresolved items | Ambiguous columns, parser errors, missing catalog metadata, unsupported syntax |
| evidence | SQL fragments, line numbers, parser diagnostics, or backend notes where available |
A simplified JSON shape might look like this:
```json
{
  "node_id": "model.analytics.fct_customer_revenue",
  "output_column": "gross_revenue",
  "upstream": [
    {
      "node_id": "model.analytics.stg_orders",
      "column": "amount",
      "influence": "projection",
      "transformation": "sum",
      "confidence": "high"
    },
    {
      "node_id": "model.analytics.stg_orders",
      "column": "status",
      "influence": "filter",
      "transformation": "where status = 'paid'",
      "confidence": "high"
    }
  ],
  "unresolved": []
}
```
The exact schema can vary by implementation. The important point is that column-level lineage should not be only a screenshot or a vague graph. It should produce facts that downstream systems can consume.
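Because the output is meant for machines, it is worth validating before other systems consume it. The sketch below checks an entry against the illustrative schema above; the required fields and the influence/confidence vocabularies are taken from the tables in this section, not from any standard:

```python
# Minimal validation sketch for entries in a column_lineage.json file.
REQUIRED = {"node_id", "output_column", "upstream", "unresolved"}
INFLUENCES = {"projection", "filter", "join", "control", "unknown"}
CONFIDENCES = {"high", "partial"}

def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry is usable."""
    problems = []
    missing = REQUIRED - entry.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    for edge in entry.get("upstream", []):
        if edge.get("influence") not in INFLUENCES:
            problems.append(f"unknown influence: {edge.get('influence')!r}")
        if edge.get("confidence") not in CONFIDENCES:
            problems.append(f"unknown confidence: {edge.get('confidence')!r}")
    return problems

entry = {
    "node_id": "model.analytics.fct_customer_revenue",
    "output_column": "gross_revenue",
    "upstream": [{"node_id": "model.analytics.stg_orders",
                  "column": "amount", "influence": "projection",
                  "confidence": "high"}],
    "unresolved": [],
}
print(validate_entry(entry))
```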
What dbt Column-Level Lineage Should Not Promise
It is important to set realistic expectations.
A dbt column-lineage system should not claim that it can perfectly understand every macro, every dynamic SQL pattern, every warehouse-specific construct, and every runtime behavior automatically.
Common limitations include:
- raw Jinja that has not been compiled;
- dynamic SQL generated outside dbt;
- macros whose semantics are not visible in the compiled SQL;
- missing catalog metadata;
- ambiguous unqualified column names;
- unsupported dialect features;
- runtime behavior that cannot be inferred from static SQL alone;
- partial parser failures;
- incomplete mapping through complex procedural logic.
A trustworthy lineage system should expose these limitations instead of hiding them. It should report unresolved columns, parser diagnostics, low-confidence edges, and unsupported constructs.
For governance and CI, an honest partial result is often better than a polished but false graph.
Common Use Cases
1. Source column impact analysis
A source column is renamed, removed, or changes type. Column-level lineage identifies the downstream dbt model columns that may be affected.
2. PII and sensitive-field tracing
A sensitive field appears in a source table. Column-level lineage shows whether it is projected, transformed, hashed, joined, filtered, or exposed downstream.
3. Metric explanation
A finance or operations team asks how gross_revenue, net_revenue, or active_customer_count is calculated. Column-level lineage shows the source fields and transformations behind the metric.
4. Data catalog accuracy
A catalog displays model-level lineage, but column-level panels are empty or incomplete for complex models. A SQL semantic lineage sidecar can enrich the catalog with better field-level facts.
5. CI and pull-request review
A dbt PR changes a model. Column-level lineage can help answer whether the change removed upstream dependencies, introduced sensitive-field propagation, or affected critical downstream outputs.
6. AI and Text-to-SQL governance
An AI assistant generates SQL over governed datasets. Field-level permission checks and policy decisions need accurate column facts, including where sensitive fields flow through dbt models.
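The CI and governance use cases above both depend on lineage being diffable. A minimal CI-style check can compare a baseline snapshot of `(output column, upstream column)` pairs against the current build and flag lost dependencies; the pairs below are invented for illustration, and a real check would read them from committed `column_lineage.json` files:

```python
# Sketch of a CI lineage diff: set difference over lineage edge pairs.
baseline = {
    ("fct_customer_revenue.gross_revenue", "stg_orders.amount"),
    ("fct_customer_revenue.gross_revenue", "stg_orders.status"),
    ("fct_customer_revenue.order_count", "stg_orders.order_id"),
}
current = {
    ("fct_customer_revenue.gross_revenue", "stg_orders.amount"),
    ("fct_customer_revenue.order_count", "stg_orders.order_id"),
}

lost = baseline - current    # dependencies that disappeared in this PR
added = current - baseline   # dependencies introduced by this PR

for output_col, upstream_col in sorted(lost):
    print(f"LOST: {output_col} no longer depends on {upstream_col}")
for output_col, upstream_col in sorted(added):
    print(f"NEW:  {output_col} now depends on {upstream_col}")
```

A lost filter dependency like `stg_orders.status` here is exactly the kind of change that silently alters a metric while leaving the model-level graph unchanged, which is why a per-column diff is more useful than comparing edge counts.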
Quick Reference
| Concept | Short definition |
|---|---|
| dbt model graph | The dependency graph built from refs, sources, models, tests, and exposures |
| Model-level lineage | Lineage between dbt models, sources, and downstream assets |
| Column-level lineage | Lineage from each output column to upstream columns and expressions |
| Compiled SQL | SQL produced after dbt resolves Jinja, refs, sources, vars, and macros |
| dbt artifact | Files such as manifest.json, catalog.json, and run_results.json |
| SQL semantic analysis | Name binding, scope resolution, expression analysis, and dialect-aware interpretation of SQL |
| Sidecar lineage | A separate tool that reads dbt artifacts and outputs lineage without replacing dbt |
| DataHub / OpenMetadata emitter | A connector that sends lineage facts into a catalog |
| Column-level diff | A CI check that compares which upstream columns were added or lost |
How This Connects to SQLFlow and GSP
SQLFlow and GSP are useful in this workflow because the hard part of dbt column-level lineage is SQL semantic analysis.
A practical pattern is:
```
dbt project
        ↓
dbt build / compile
        ↓
manifest.json + compiled SQL
        ↓
GSP / SQLFlow semantic lineage engine
        ↓
column_lineage.json
        ↓
DataHub / OpenMetadata / SQLFlow / CI
```
For dbt users, this should not feel like replacing dbt Docs, dbt Cloud Explorer, or their catalog. It should feel like a lineage enhancement layer for cases where model-level lineage is not enough.
For data platform teams, the useful question is not “which parser is better?” The useful question is:
Can we recover trustworthy column-level facts from the SQL our dbt project actually runs?
That is the problem a SQL semantic lineage engine is designed to solve.
Common Questions
Is dbt column-level lineage the same as dbt model lineage?
No. dbt model lineage shows dependencies between models, sources, and exposures. Column-level lineage shows dependencies between specific fields, such as fct_orders.gross_revenue depending on stg_orders.amount and stg_orders.status.
Why not analyze raw dbt model files directly?
Raw dbt model files often contain Jinja, macros, variables, and refs. A SQL parser is designed for SQL, not raw Jinja. The more reliable path is to analyze compiled SQL plus dbt artifacts.
Does dbt already provide column-level lineage?
dbt and related ecosystem tools can provide lineage signals, but many teams still need more precise field-level dependencies for complex SQL, macro-generated SQL, catalog integration, CI checks, and governance workflows.
Do DataHub or OpenMetadata solve this automatically?
They can display and ingest lineage, and they are valuable catalog surfaces. But complex SQL-derived column lineage still depends on the quality of the underlying SQL analysis. A sidecar can enrich those catalogs with more precise column-level facts.
Is column-level lineage only about selected columns?
No. Useful lineage may include projection dependencies, expression dependencies, filter influence, join influence, aggregation, window functions, and conditional logic. For governance, knowing that a sensitive field influenced a filter or join can matter even if it is not directly selected.
Can dbt column-level lineage support CI checks?
Yes, if the lineage output is machine-readable. A CI workflow can compare current lineage to a baseline and flag lost upstream columns, new sensitive-field flows, or unsupported SQL patterns. Per-column semantic diff is more useful than only comparing model-level edge counts.
How does this relate to Text-to-SQL governance?
Text-to-SQL systems need field-level facts to decide whether generated SQL should be allowed, denied, warned, or routed for approval. dbt column-level lineage helps explain where governed fields come from and how they flow through modeled datasets.
Summary
dbt column-level lineage is the field-level map behind a dbt project. It answers which upstream columns, expressions, filters, joins, and transformations produced each downstream model column.
This is different from dbt model-level lineage. Model-level lineage tells you which models depend on each other. Column-level lineage tells you which fields depend on which fields.
The practical path is to combine dbt artifacts, compiled SQL, catalog context, and SQL semantic analysis. That gives data teams a more reliable foundation for impact analysis, sensitive-field tracing, catalog enrichment, CI checks, and SQL governance.
If your dbt project already has model lineage but still cannot explain which source columns feed critical metrics or sensitive downstream fields, the next step is to evaluate dbt column-level lineage on your compiled SQL.
Try SQLFlow’s SQL lineage demo, contact DPRiver to review a representative dbt compiled SQL model, or try SQL Guard-style validation with your SQL.

