Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros — and How to Recover It

DataHub silently drops column-level lineage on the dbt-utils deduplicate macro because of how sqlglot's column resolver handles ARRAY_AGG + struct unpack. Here's why — and an open-source post-processor that recovers the missing lineage.

If your dbt + BigQuery stack lands models in DataHub, you may have hit a quiet failure mode on a subset of your tables: the table-level lineage edge appears, but the column-level lineage is empty. The graph looks complete — until you click into a column and find no upstream.

This is a known issue (datahub-project/datahub#11670, open since October 2024). The root cause is specific: the BigQuery SQL that the dbt-utils deduplicate macro emits uses two BigQuery-specific semantics — row-as-STRUCT and STRUCT.* field expansion — that DataHub’s bundled SQL parser (sqlglot) does not currently model in its column-level lineage walker.

This post explains exactly what’s going on, with an honest accounting of what’s recoverable from the SQL alone versus what needs schema metadata, and shows how to recover the missing edge today using an open-source post-processor — without modifying your DataHub installation.

What dbt-utils is actually doing on BigQuery

The dbt-utils deduplicate macro is dialect-dispatched. Most warehouses get a qualify/row_number() translation, but BigQuery gets a different implementation on purpose: array_agg ... limit 1 is significantly cheaper at scale than row_number() over (...) in BigQuery’s execution engine. The compiled SQL looks like this:

CREATE VIEW analytics.deduplicated_articles AS
SELECT unique.*
FROM (
    SELECT
        ARRAY_AGG(
            original
            ORDER BY article_name DESC
            LIMIT 1
        )[OFFSET(0)] AS unique
    FROM all_articles AS original
    GROUP BY id
);

The variable names are deceptive in a way that matters for parsers, so it’s worth being precise about what each one is:

  • original is the table alias for all_articles. Used as a value (inside ARRAY_AGG(original)), it’s the whole row of all_articles viewed as a STRUCT — not a column.
  • unique is a STRUCT-valued column alias in the inner query. ARRAY_AGG(original ...)[OFFSET(0)] returns one STRUCT per group (the deduplicated winner row), and AS unique names that STRUCT.
  • SELECT unique.* is STRUCT field expansion (per BigQuery’s SELECT expression.* rules), not a star against a table. It explodes the STRUCT’s fields into top-level output columns.
  • Net effect: analytics.deduplicated_articles has the same top-level column shape as all_articles, with one row per id chosen by article_name DESC.
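
The net effect described above can be sketched in plain Python. This is a hypothetical illustration of the macro's semantics, not its implementation: group rows by id and keep the row with the greatest article_name per group, mirroring ARRAY_AGG(original ORDER BY article_name DESC LIMIT 1)[OFFSET(0)].

```python
# Sketch of the dedup semantics: one winner row per id, chosen by
# article_name DESC -- the Python analogue of
# ARRAY_AGG(original ORDER BY article_name DESC LIMIT 1)[OFFSET(0)].
# Rows are modeled as dicts (a hypothetical stand-in for all_articles).

def deduplicate(rows, partition_by="id", order_by="article_name"):
    winners = {}
    for row in rows:
        key = row[partition_by]
        # Keep the row whose order_by value is the maximum per group,
        # i.e. the first row under ORDER BY article_name DESC.
        if key not in winners or row[order_by] > winners[key][order_by]:
            winners[key] = row
    return list(winners.values())

all_articles = [
    {"id": 1, "article_name": "alpha", "body": "old"},
    {"id": 1, "article_name": "zeta",  "body": "new"},
    {"id": 2, "article_name": "beta",  "body": "only"},
]

deduplicated_articles = deduplicate(all_articles)
# One row per id; id=1 keeps the 'zeta' row because 'zeta' > 'alpha'.
```

The output has the same column shape as the input, with one row per id — exactly the contract the macro promises.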

What lineage is recoverable from the SQL alone

Here’s an honest expected-vs-actual table for what a metadata-free parser can produce on this SQL:

Lineage you’d like | Recoverable without schema?
Table edge: all_articles → analytics.deduplicated_articles | Yes — sqlglot already gets this
Whole-row flow: all_articles.* → analytics.deduplicated_articles.* | Yes — but only by a parser that models row-as-STRUCT and STRUCT.* expansion
Per-column edges: all_articles.<col> → analytics.deduplicated_articles.<col> | No — needs schema for all_articles to enumerate the column names
Row-selection dependency on article_name and id | Yes — the GROUP BY and ORDER BY keys can be emitted as separate edges into the output (as control inputs, not value lineage)

For the deeper walkthrough on what BigQuery semantics a metadata-free parser needs to implement to recover the row-flow shape (SELECT AS STRUCT, ARRAY(SELECT AS STRUCT ...), row-as-STRUCT in array_agg, [OFFSET(n)], STRUCT.* field expansion), see the companion post: BigQuery Column-Level Lineage Without Metadata.

Where DataHub’s column walker stops short

DataHub’s dbt source runs the model SQL through sqlglot. sqlglot extracts the table-level edge here (all_articles → analytics.deduplicated_articles) and resolves the inner unique alias correctly. The gap is in the outer query: the projection is literally unique.*, and sqlglot’s column-lineage walker does not currently expand a STRUCT-valued column alias into its fields. Without that expansion, there are no concrete target columns to attach column-level lineage to.

You can see the symptom in a short REPL session:

>>> import sqlglot
>>> from sqlglot.lineage import lineage
>>> outer = """SELECT unique.* FROM (
...   SELECT ARRAY_AGG(original ORDER BY article_name DESC LIMIT 1)[OFFSET(0)] AS unique
...   FROM all_articles AS original GROUP BY id)"""
>>> # Tables: works fine
>>> [str(t) for t in sqlglot.parse_one(outer, dialect='bigquery').find_all(sqlglot.exp.Table)]
['all_articles AS original']
>>> # Lineage walker on the outer query
>>> lineage('unique', outer, dialect='bigquery')
SqlglotError: Cannot find column 'unique' in query.

The error message is a side effect, not the root cause. unique isn’t a final output column of the outer query — only unique.* (a star expansion) is. The walker needs to (a) recognize unique as a STRUCT-valued projection in the inner query, (b) carry that STRUCT type through the subquery boundary, and (c) expand unique.* into its field set. None of those steps are wired up today, so the projection drops out of column-level lineage.
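
The three steps can be made concrete with a toy query representation. This is a minimal sketch over plain dicts — not sqlglot's AST, and not how a real walker would be structured — just the type bookkeeping that is currently missing:

```python
# Toy model of the three missing steps, over hand-rolled dicts
# (NOT sqlglot's AST -- purely for illustration).

# (a) Recognize the STRUCT-valued projection: ARRAY_AGG(original ...)
#     [OFFSET(0)] yields one whole row of all_articles, so the inner
#     alias 'unique' is typed as a row-as-STRUCT of that table.
inner_projection = {
    "alias": "unique",
    "type": ("STRUCT_OF_ROW", "all_articles"),
}

# (b) Carry that type through the subquery boundary: the outer scope
#     maps column name -> type as seen from outside the subquery.
outer_scope = {inner_projection["alias"]: inner_projection["type"]}

# (c) Expand 'unique.*' into concrete fields -- possible only once a
#     schema for the source table is known (hypothetical schema below).
schema = {"all_articles": ["id", "article_name", "body"]}

def expand_struct_star(column, scope, schema):
    kind, source_table = scope[column]
    if kind != "STRUCT_OF_ROW":
        raise ValueError(f"{column} is not a row-valued STRUCT")
    # Each STRUCT field becomes an output column fed by the same-named
    # source column: the per-column lineage DataHub is missing today.
    return [(f"{source_table}.{field}", field) for field in schema[source_table]]

edges = expand_struct_star("unique", outer_scope, schema)
```

Step (c) is where the schema dependency from the table above shows up: without the hypothetical schema dict, the expansion can only produce the structural `*` edge, not the per-field list.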

This isn’t a defect in sqlglot per se — modeling row-as-STRUCT, array element extraction, and STRUCT.* expansion is genuinely additional surface area, and the upstream maintainers have asked (rightly) for a sqlglot-side issue to track it. But for teams running DataHub today on a BigQuery + dbt stack, the column lineage on every deduplicate-macro model is missing now.

Workaround: a post-processor sidecar

We built gsp-datahub-sidecar, an Apache-2.0 tool that fills this gap. It runs alongside your existing DataHub ingestion — no changes to DataHub itself, and it doesn’t replace anything sqlglot already produced.

The sidecar re-parses SQL that sqlglot’s column walker couldn’t fully resolve using Gudu SQLFlow, whose engine implements the row-as-STRUCT, [OFFSET(N)], and STRUCT.* semantics natively across BigQuery, Snowflake, and other dialects. It then emits the recovered lineage to DataHub via the standard REST API.

DataHub ingestion (unchanged)           gsp-datahub-sidecar
  dbt source -> sqlglot -> lineage        |
                  |                       |  re-parse the SQL with Gudu SQLFlow
                  v                       |  (handles row-as-STRUCT + STRUCT.*)
  table edge OK, column                   v
  expansion missing (#11670)  ------->  DataHub GMS (lineage filled in)

Try it in three commands

Step 1 — Install:

pip install gsp-datahub-sidecar

Step 2 — Dry-run on the dbt-utils deduplicate pattern. The repo ships with examples/bigquery_dbt_dedup.sql — the exact SQL from issue #11670, ready to run:

gsp-datahub-sidecar --sql-file examples/bigquery_dbt_dedup.sql --dry-run

Output (high-level summary, with the per-edge URN lines collapsed for readability — full output is shown in the GitHub repo):

[INFO] Processing 1 SQL statement(s) in 'anonymous' mode...
[INFO] Extracted 1 table-level lineage relationships
[INFO]   ALL_ARTICLES --> ANALYTICS.DEDUPLICATED_ARTICLES (3 columns)
[INFO] Built 2 MCPs for 1 downstream tables (3 column-level mappings)
[INFO] [DRY RUN] Would emit 4 MCPs to http://localhost:8080
[INFO] [DRY RUN]   UpstreamLineageClass (1 upstream table, 3 column-level lineages):
[INFO] [DRY RUN]     all_articles.*            ->  analytics.deduplicated_articles.*
[INFO] [DRY RUN]     all_articles.id           ->  analytics.deduplicated_articles.*
[INFO] [DRY RUN]     all_articles.article_name ->  analytics.deduplicated_articles.*

The three column-level edges map cleanly onto the dbt-utils macro’s structure:

• all_articles.* -> analytics.deduplicated_articles.* — the row-flow edge. Every column of the source flows into the same-named column on the target, because unique.* field-expands the row STRUCT.
• all_articles.id -> analytics.deduplicated_articles.* — id is the GROUP BY key. Different id values produce different output rows, so id is a row-selection (control) input to every output column.
• all_articles.article_name -> analytics.deduplicated_articles.* — article_name is the ORDER BY ... DESC LIMIT 1 key. It picks which row wins per group, so it’s also a row-selection input to every output column.
• The two row-selection edges (id, article_name) are the GROUP BY and ORDER BY columns annotated as control dependencies, not value lineage. That’s the right semantics for impact analysis: if article_name changes, the choice of winning row may change, but the value still flows from the row’s other columns.
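
The value/control distinction is what makes impact analysis work. A minimal sketch, assuming an edge list that mirrors the dry-run output above (the tuple format is illustrative, not the sidecar's internal representation):

```python
# Sketch: impact analysis over the two edge kinds. 'value' edges carry
# data; 'control' edges (GROUP BY / ORDER BY keys) pick which row wins.
# The edge list mirrors the dry-run output; the format is hypothetical.

edges = [
    ("all_articles.*",            "analytics.deduplicated_articles.*", "value"),
    ("all_articles.id",           "analytics.deduplicated_articles.*", "control"),
    ("all_articles.article_name", "analytics.deduplicated_articles.*", "control"),
]

def upstream_impact(target, edges):
    """Source columns whose change can affect `target`, split by edge kind."""
    return {
        kind: sorted(src for src, dst, k in edges if dst == target and k == kind)
        for kind in ("value", "control")
    }

impact = upstream_impact("analytics.deduplicated_articles.*", edges)
# Changing article_name can change WHICH row wins (control), even though
# the output values still arrive via the row-flow edge (value).
```

A tool that only recorded value lineage would report article_name as having no downstream impact — exactly the wrong answer for this macro.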

With schema metadata supplied (via SQLFlow’s database integrations or a supplied schema file), the engine expands the * on the right into concrete same-named column edges per known column of all_articles. Without metadata, you still get the structural row-flow edge in DataHub rather than a silent gap — which is enough for impact analysis even when the leaf column names aren’t enumerated.
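
That expansion is mechanical once the column names are known. A sketch under a hypothetical three-column schema for all_articles (the real engine does this internally; the function below only illustrates the fan-out):

```python
# Sketch: with a schema in hand, the structural edges from the dry run
# expand into concrete per-column edges. The schema dict is hypothetical.

schema = {"all_articles": ["id", "article_name", "body"]}

star_edges = [
    ("all_articles.*",            "value"),    # row-flow edge
    ("all_articles.id",           "control"),  # GROUP BY key
    ("all_articles.article_name", "control"),  # ORDER BY ... LIMIT 1 key
]

def expand(star_edges, schema, target="analytics.deduplicated_articles"):
    out = []
    for src, kind in star_edges:
        table, col = src.rsplit(".", 1)
        if col == "*":
            # The row-flow edge fans out to one same-named edge per column.
            out += [(f"{table}.{c}", f"{target}.{c}", kind) for c in schema[table]]
        else:
            # Control keys influence every output column of the target.
            out += [(src, f"{target}.{c}", kind) for c in schema[table]]
    return out

expanded = expand(star_edges, schema)
# 3 structural edges x 3 known columns = 9 concrete edges.
```

Note the asymmetry: the value edge expands column-to-same-named-column, while each control key fans out to every output column — the per-row winner choice affects all of them.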

Step 3 — Emit to DataHub. Once the dry-run looks right, drop --dry-run and point at your DataHub GMS:

gsp-datahub-sidecar \
  --sql-file examples/bigquery_dbt_dedup.sql \
  --datahub-server http://localhost:8080

You can run the sidecar on real dbt project SQL the same way — point --sql-file at the compiled SQL in target/run/.../<model>.sql, or feed a directory of compiled models with a wrapper script.
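
Such a wrapper can be a few lines of Python. This sketch assumes only the --sql-file and --datahub-server flags shown above; the project layout and model name are hypothetical:

```python
# Sketch of a wrapper that builds one sidecar invocation per compiled
# model under target/run/. Only the two CLI flags shown above are
# assumed; everything else is plain-Python illustration.

import tempfile
from pathlib import Path

def sidecar_commands(project_dir, datahub_server="http://localhost:8080"):
    """One command list per compiled model SQL file under target/run/."""
    compiled = sorted(Path(project_dir, "target", "run").rglob("*.sql"))
    return [
        ["gsp-datahub-sidecar",
         "--sql-file", str(sql_path),
         "--datahub-server", datahub_server]
        for sql_path in compiled
    ]

# Demo against a throwaway compiled-project layout (hypothetical model).
demo = Path(tempfile.mkdtemp())
(demo / "target" / "run" / "my_project").mkdir(parents=True)
(demo / "target" / "run" / "my_project" / "deduplicated_articles.sql").write_text("SELECT 1")
commands = sidecar_commands(str(demo))

# To actually emit, run each command, e.g.:
#   import subprocess
#   for cmd in commands:
#       subprocess.run(cmd, check=True)
```

Pointing this at a real dbt project directory picks up every compiled model, including the ones the deduplicate macro touched.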

Backend modes

Pick the backend based on your security and volume needs:

Mode | Auth | Limit | Data location | Use case
anonymous (default) | None | 50/day per IP | SQL sent to api.gudusoft.com | Quick evaluation
authenticated | userId + secretKey | 10k/month | SQL sent to api.gudusoft.com | Extended evaluation
self_hosted | userId + secretKey | Unlimited | SQL stays in your VPC | Production
local_jar | None (local subprocess) | Per JAR license | SQL never leaves the process | Air-gapped / CI

For production use with regulated SQL, the self-hosted SQLFlow Docker keeps everything in your network — no SQL leaves your infrastructure.

What’s next

The sidecar is a workaround, not a substitute for fixing this upstream. The honest path forward:

1. Patch sqlglot’s column-lineage walker to model row-as-STRUCT, array element extraction, and STRUCT.* expansion. That unblocks every downstream tool that uses sqlglot, not just DataHub.
2. Until then, run the sidecar on dbt models that hit the deduplicate macro (or any other array_agg-based dedup pattern), so your column lineage isn’t silently empty.
3. If you’re affected by #11670 or other lineage gaps in your DataHub instance, browse the GitHub repo for the full example library — BigQuery procedural SQL, Power BI comments, MSSQL stored procedures, Oracle CREATE VIEW. The companion post on BigQuery procedural SQL lineage covers DataHub issue #11654 with the same playbook. Issues, PRs, and feedback welcome on GitHub.

Disclosure: This tool is built by Gudu Software, the team behind General SQL Parser and SQLFlow. We specialize in deep SQL parsing and column-level lineage across SQL dialects.