{"id":3211,"date":"2026-05-01T14:24:30","date_gmt":"2026-05-01T06:24:30","guid":{"rendered":"https:\/\/www.dpriver.com\/blog\/?p=3211"},"modified":"2026-05-01T15:31:30","modified_gmt":"2026-05-01T07:31:30","slug":"why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re","status":"publish","type":"post","link":"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/","title":{"rendered":"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It"},"content":{"rendered":"\n<p>If your dbt + BigQuery stack lands models in DataHub, you may have hit a quiet failure mode on a subset of your tables: the <strong>table-level<\/strong> lineage edge appears, but the <strong>column-level<\/strong> lineage is empty. The graph looks complete \u2014 until you click into a column and find no upstream.<\/p>\n\n\n\n<p>This is a known issue (<a href=\"https:\/\/github.com\/datahub-project\/datahub\/issues\/11670\">datahub-project\/datahub#11670<\/a>, open since October 2024). 
The root cause is specific: the BigQuery SQL that the <a href=\"https:\/\/github.com\/dbt-labs\/dbt-utils\/blob\/main\/macros\/sql\/deduplicate.sql\">dbt-utils <code>deduplicate<\/code> macro<\/a> emits uses two BigQuery-specific semantics \u2014 row-as-STRUCT and <code>STRUCT.&#042;<\/code> field expansion \u2014 that DataHub&#8217;s bundled SQL parser (sqlglot) does not currently model in its column-level lineage walker.<\/p>\n\n\n\n<p>This post explains exactly what&#8217;s going on, with an honest accounting of what&#8217;s recoverable from the SQL alone versus what needs schema metadata, and shows how to recover the missing edge today using an open-source post-processor \u2014 without modifying your DataHub installation.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What dbt-utils is actually doing on BigQuery<\/h2>\n\n\n\n<p>The dbt-utils <code>deduplicate<\/code> macro is dialect-dispatched. Most warehouses get a <code>qualify<\/code>\/<code>row_number()<\/code> translation, but BigQuery gets a <a href=\"https:\/\/github.com\/dbt-labs\/dbt-utils\/issues\/335#issuecomment-788157572\">different implementation on purpose<\/a> \u2014 <code>array_agg ... limit 1<\/code> is significantly cheaper at scale than <code>row_number() over (...)<\/code> in BigQuery&#8217;s execution engine. The compiled SQL looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CREATE VIEW analytics.deduplicated_articles AS\nSELECT unique.*\nFROM (\n    SELECT\n        ARRAY_AGG(\n            original\n            ORDER BY article_name DESC\n            LIMIT 1\n        )[OFFSET(0)] AS unique\n    FROM all_articles AS original\n    GROUP BY id\n);<\/code><\/pre>\n\n\n\n<p>The variable names are deceptive in a way that matters for parsers, so it&#8217;s worth being precise about what each one is:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n\n<li><code>original<\/code> is the <strong>table alias<\/strong> for <code>all_articles<\/code>. 
Used as a value (inside <code>ARRAY_AGG(original)<\/code>), it&#8217;s the <em>whole row<\/em> of <code>all_articles<\/code> viewed as a STRUCT \u2014 not a column.<\/li>\n\n<li><code>unique<\/code> is a <strong>STRUCT-valued column alias<\/strong> in the inner query. <code>ARRAY_AGG(original ...)[OFFSET(0)]<\/code> returns one STRUCT per group (the deduplicated winner row), and <code>AS unique<\/code> names that STRUCT.<\/li>\n\n<li><code>SELECT unique.&#042;<\/code> is <strong>STRUCT field expansion<\/strong> (per BigQuery&#8217;s <a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/reference\/standard-sql\/query-syntax#select_expression-star\"><code>SELECT expression.*<\/code><\/a> rules), not a star against a table. It explodes the STRUCT&#8217;s fields into top-level output columns.<\/li>\n<\/ul>\n\n\n\n<p>Net effect: <code>analytics.deduplicated_articles<\/code> has the same top-level column shape as <code>all_articles<\/code>, with one row per <code>id<\/code> chosen by <code>article_name DESC<\/code>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What lineage is recoverable from the SQL alone<\/h2>\n\n\n\n<p>Here&#8217;s an honest accounting of which lineage a metadata-free parser can recover from this SQL:<\/p>\n\n\n\n<p><table> <thead> <tr style=\"background:#f0f4f8\"> <th style=\"padding:8px 12px;border:1px solid #ddd;text-align:left\">Lineage you&#8217;d like<\/th> <th style=\"padding:8px 12px;border:1px solid #ddd;text-align:left\">Recoverable without schema?<\/th> <\/tr> <\/thead> <tbody> <tr> <td style=\"padding:8px 12px;border:1px solid #ddd\">Table edge: <code>all_articles &rarr; analytics.deduplicated_articles<\/code><\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">Yes \u2014 sqlglot already gets this<\/td> <\/tr> <tr style=\"background:#f9fafb\"> <td style=\"padding:8px 12px;border:1px solid #ddd\">Whole-row flow: <code>all_articles.&#042; &rarr; analytics.deduplicated_articles.&#042;<\/code><\/td> <td style=\"padding:8px 12px;border:1px 
solid #ddd\">Yes \u2014 but only by a parser that models row-as-STRUCT and <code>STRUCT.&#042;<\/code> expansion<\/td> <\/tr> <tr> <td style=\"padding:8px 12px;border:1px solid #ddd\">Per-column edges: <code>all_articles.&lt;col&gt; &rarr; analytics.deduplicated_articles.&lt;col&gt;<\/code><\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">No \u2014 needs schema for <code>all_articles<\/code> to enumerate the column names<\/td> <\/tr> <tr style=\"background:#f9fafb\"> <td style=\"padding:8px 12px;border:1px solid #ddd\">Row-selection dependency on <code>article_name<\/code> and <code>id<\/code><\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">Yes \u2014 the <code>GROUP BY<\/code> and <code>ORDER BY<\/code> keys can be emitted as separate edges into the output (as control inputs, not value lineage)<\/td> <\/tr> <\/tbody> <\/table><\/p>\n\n\n\n<p>For the deeper walkthrough on what BigQuery semantics a metadata-free parser needs to implement to recover the row-flow shape (<code>SELECT AS STRUCT<\/code>, <code>ARRAY(SELECT AS STRUCT ...)<\/code>, row-as-STRUCT in <code>array_agg<\/code>, <code>[OFFSET(n)]<\/code>, <code>STRUCT.&#042;<\/code> field expansion), see the companion post: <a href=\"https:\/\/www.dpriver.com\/blog\/2026\/05\/bigquery-column-level-lineage-without-metadata-inferring-struct-and-array-types\/?utm_source=datahub&amp;utm_medium=oss&amp;utm_campaign=issue-11670&amp;utm_content=companion-blog\">BigQuery Column-Level Lineage Without Metadata<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Where DataHub&#8217;s column walker stops short<\/h2>\n\n\n\n<p>DataHub&#8217;s dbt source runs the model SQL through <strong>sqlglot<\/strong>. sqlglot extracts the table-level edge here (<code>all_articles \u2192 analytics.deduplicated_articles<\/code>) and resolves the inner <code>unique<\/code> alias correctly. 
The gap is in the outer query: the projection is literally <code>unique.&#042;<\/code>, and sqlglot&#8217;s column-lineage walker does not currently expand a STRUCT-valued column alias into its fields. Without that expansion, there are no concrete target columns to attach column-level lineage to.<\/p>\n\n\n\n<p>You can see the symptom in a short Python session:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&gt;&gt;&gt; import sqlglot\n&gt;&gt;&gt; from sqlglot.lineage import lineage\n&gt;&gt;&gt; outer = \"\"\"SELECT unique.* FROM (\n...   SELECT ARRAY_AGG(original ORDER BY article_name DESC LIMIT 1)[OFFSET(0)] AS unique\n...   FROM all_articles AS original GROUP BY id)\"\"\"\n&gt;&gt;&gt; # Tables: works fine\n&gt;&gt;&gt; [str(t) for t in sqlglot.parse_one(outer, dialect='bigquery').find_all(sqlglot.exp.Table)]\n['all_articles AS original']\n&gt;&gt;&gt; # Lineage walker on the outer query\n&gt;&gt;&gt; lineage('unique', outer, dialect='bigquery')\nSqlglotError: Cannot find column 'unique' in query.<\/code><\/pre>\n\n\n\n<p>The error message is a side effect, not the root cause. <code>unique<\/code> isn&#8217;t a final output column of the outer query \u2014 only <code>unique.&#042;<\/code> (a star expansion) is. The walker needs to (a) recognize <code>unique<\/code> as a STRUCT-valued projection in the inner query, (b) carry that STRUCT type through the subquery boundary, and (c) expand <code>unique.&#042;<\/code> into its field set. None of those steps are wired up today, so the projection drops out of column-level lineage.<\/p>\n\n\n\n<p>This isn&#8217;t a defect in sqlglot per se \u2014 modeling row-as-STRUCT, array element extraction, and <code>STRUCT.&#042;<\/code> expansion is genuinely additional surface area, and the upstream maintainers have asked (rightly) for a sqlglot-side issue to track it. 
But for teams running DataHub today on a BigQuery + dbt stack, the column lineage on every deduplicate-macro model is missing now.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Workaround: a post-processor sidecar<\/h2>\n\n\n\n<p>We built <a href=\"https:\/\/github.com\/gudusoftware\/gsp-datahub-sidecar\">gsp-datahub-sidecar<\/a>, an Apache-2.0 tool that fills this gap. It runs <strong>alongside<\/strong> your existing DataHub ingestion \u2014 no changes to DataHub itself, and it doesn&#8217;t replace anything sqlglot already produced.<\/p>\n\n\n\n<p>The sidecar re-parses SQL that sqlglot&#8217;s column walker couldn&#8217;t fully resolve using <a href=\"https:\/\/sqlflow.gudusoft.com\/?utm_source=datahub&amp;utm_medium=oss&amp;utm_campaign=issue-11670&amp;utm_content=engine-mention\">Gudu SQLFlow<\/a>, whose engine implements the row-as-STRUCT, <code>[OFFSET(N)]<\/code>, and <code>STRUCT.&#042;<\/code> semantics natively across BigQuery, Snowflake, and other dialects. It then emits the recovered lineage to DataHub via the standard REST API.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>DataHub ingestion (unchanged)        gsp-datahub-sidecar\n  dbt source -&gt; sqlglot -&gt; lineage     |\n                  |                    | re-parse the SQL with Gudu SQLFlow\n                  v                    | (handles row-as-STRUCT + STRUCT.*)\n       table edge OK, column            v\n       expansion missing for #11670 -&gt; DataHub GMS (lineage filled in)<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Try it in three commands<\/h3>\n\n\n\n<p><strong>Step 1 \u2014 Install:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install gsp-datahub-sidecar<\/code><\/pre>\n\n\n\n<p><strong>Step 2 \u2014 Dry-run on the dbt-utils <code>deduplicate<\/code> pattern.<\/strong> The repo ships with <code>examples\/bigquery_dbt_dedup.sql<\/code> \u2014 the exact SQL from issue #11670, ready to run:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>gsp-datahub-sidecar 
--sql-file examples\/bigquery_dbt_dedup.sql --dry-run<\/code><\/pre>\n\n\n\n<p>Output (high-level summary, with the per-edge URN lines collapsed for readability \u2014 full output is shown in <a href=\"https:\/\/github.com\/gudusoftware\/gsp-datahub-sidecar#examples\">the GitHub repo<\/a>):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>[INFO] Processing 1 SQL statement(s) in 'anonymous' mode...\n[INFO] Extracted 1 table-level lineage relationships\n[INFO]   ALL_ARTICLES --&gt; ANALYTICS.DEDUPLICATED_ARTICLES (3 columns)\n[INFO] Built 2 MCPs for 1 downstream tables (3 column-level mappings)\n[INFO] [DRY RUN] Would emit 4 MCPs to http:\/\/localhost:8080\n[INFO] [DRY RUN]   UpstreamLineageClass (1 upstream table, 3 column-level lineages):\n[INFO] [DRY RUN]     all_articles.&#042;            -&gt;  analytics.deduplicated_articles.&#042;\n[INFO] [DRY RUN]     all_articles.id           -&gt;  analytics.deduplicated_articles.&#042;\n[INFO] [DRY RUN]     all_articles.article_name -&gt;  analytics.deduplicated_articles.&#042;<\/code><\/pre>\n\n\n\n<p>The three column-level edges map cleanly onto the dbt-utils macro&#8217;s structure:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n\n<li><strong><code>all_articles.&#042; -&gt; analytics.deduplicated_articles.&#042;<\/code><\/strong> \u2014 the row-flow edge. Every column of the source flows into the same-named column on the target, because <code>unique.&#042;<\/code> field-expands the row STRUCT.<\/li>\n\n<li><strong><code>all_articles.id -&gt; analytics.deduplicated_articles.&#042;<\/code><\/strong> \u2014 <code>id<\/code> is the <code>GROUP BY<\/code> key. Different <code>id<\/code> values produce different output rows, so <code>id<\/code> is a row-selection (control) input to every output column.<\/li>\n\n<li><strong><code>all_articles.article_name -&gt; analytics.deduplicated_articles.&#042;<\/code><\/strong> \u2014 <code>article_name<\/code> is the <code>ORDER BY ... DESC LIMIT 1<\/code> key. 
It picks <em>which<\/em> row wins per group, so it&#8217;s also a row-selection input to every output column.<\/li>\n<\/ul>\n\n\n\n<p>The two row-selection edges (<code>id<\/code>, <code>article_name<\/code>) are the <code>GROUP BY<\/code> and <code>ORDER BY<\/code> columns annotated as control dependencies, not value lineage. That&#8217;s the right semantics for impact analysis: if <code>article_name<\/code> changes, the <em>choice<\/em> of winning row may change, but each output <em>value<\/em> still comes from the same-named column of whichever row wins.<\/p>\n\n\n\n<p>With schema metadata available (via SQLFlow&#8217;s database integrations or a supplied schema file), the engine expands the <code>&#042;<\/code> into concrete same-named column edges, one per known column of <code>all_articles<\/code>. Without metadata, you still get the structural row-flow edge in DataHub rather than a silent gap \u2014 which is often enough for impact analysis even when the leaf column names aren&#8217;t enumerated.<\/p>\n\n\n\n<p><strong>Step 3 \u2014 Emit to DataHub.<\/strong> Once the dry-run looks right, drop <code>--dry-run<\/code> and point at your DataHub GMS:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>gsp-datahub-sidecar \\\n  --sql-file examples\/bigquery_dbt_dedup.sql \\\n  --datahub-server http:\/\/localhost:8080<\/code><\/pre>\n\n\n\n<p>You can run the sidecar on real dbt project SQL the same way \u2014 point <code>--sql-file<\/code> at the compiled SQL in <code>target\/run\/...\/&lt;model&gt;.sql<\/code>, or feed a directory of compiled models with a wrapper script.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Backend modes<\/h2>\n\n\n\n<p>Pick the backend based on your security and volume needs:<\/p>\n\n\n\n<p><table> <thead> <tr style=\"background:#f0f4f8\"> <th style=\"padding:8px 12px;border:1px solid #ddd;text-align:left\">Mode<\/th> <th style=\"padding:8px 12px;border:1px solid #ddd;text-align:left\">Auth<\/th> <th style=\"padding:8px 12px;border:1px solid 
#ddd;text-align:left\">Limit<\/th> <th style=\"padding:8px 12px;border:1px solid #ddd;text-align:left\">Data location<\/th> <th style=\"padding:8px 12px;border:1px solid #ddd;text-align:left\">Use case<\/th> <\/tr> <\/thead> <tbody> <tr> <td style=\"padding:8px 12px;border:1px solid #ddd\"><code>anonymous<\/code> (default)<\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">None<\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">50\/day per IP<\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">SQL sent to api.gudusoft.com<\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">Quick evaluation<\/td> <\/tr> <tr style=\"background:#f9fafb\"> <td style=\"padding:8px 12px;border:1px solid #ddd\"><code>authenticated<\/code><\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">userId + secretKey<\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">10k\/month<\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">SQL sent to api.gudusoft.com<\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">Extended evaluation<\/td> <\/tr> <tr> <td style=\"padding:8px 12px;border:1px solid #ddd\"><code>self_hosted<\/code><\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">userId + secretKey<\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">Unlimited<\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">SQL stays in your VPC<\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">Production<\/td> <\/tr> <tr style=\"background:#f9fafb\"> <td style=\"padding:8px 12px;border:1px solid #ddd\"><code>local_jar<\/code><\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">None (local subprocess)<\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">Per JAR license<\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">SQL never leaves the process<\/td> <td style=\"padding:8px 12px;border:1px solid #ddd\">Air-gapped \/ CI<\/td> <\/tr> <\/tbody> <\/table><\/p>\n\n\n\n<p>For production use with regulated SQL, 
the <a href=\"https:\/\/docs.gudusoft.com\/docker\/?utm_source=datahub&amp;utm_medium=oss&amp;utm_campaign=issue-11670&amp;utm_content=docker-mention\">self-hosted SQLFlow Docker<\/a> deployment keeps everything in your network \u2014 no SQL leaves your infrastructure.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What&#8217;s next<\/h2>\n\n\n\n<p>The sidecar is a workaround, not a substitute for fixing this upstream. The honest path forward:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n\n<li><strong>Patch sqlglot&#8217;s column-lineage walker<\/strong> to model row-as-STRUCT, array element extraction, and <code>STRUCT.&#042;<\/code> expansion. That unblocks every downstream tool that uses sqlglot, not just DataHub.<\/li>\n\n<li><strong>Until then<\/strong>, run the sidecar on dbt models that hit the <code>deduplicate<\/code> macro (or any other <code>array_agg<\/code>-based dedup pattern), so your column lineage isn&#8217;t silently empty.<\/li>\n<\/ol>\n\n\n\n<p>If you&#8217;re affected by <a href=\"https:\/\/github.com\/datahub-project\/datahub\/issues\/11670\">#11670<\/a> or other lineage gaps in your DataHub instance, browse the <a href=\"https:\/\/github.com\/gudusoftware\/gsp-datahub-sidecar\">GitHub repo<\/a> for the full example library \u2014 BigQuery procedural SQL, Power BI comments, MSSQL stored procedures, Oracle CREATE VIEW. The companion post on <a href=\"https:\/\/www.dpriver.com\/blog\/2026\/04\/why-your-datahub-bigquery-lineage-silently-breaks-on-procedural-sql-and-how-to-f\/?utm_source=datahub&amp;utm_medium=oss&amp;utm_campaign=issue-11670&amp;utm_content=related-blog\">BigQuery procedural SQL lineage<\/a> covers DataHub issue #11654 with the same playbook. Issues, PRs, and feedback are welcome on GitHub.<\/p>\n\n\n\n<p><em>Disclosure: This tool is built by <a href=\"https:\/\/www.gudusoft.com\/?utm_source=datahub&amp;utm_medium=oss&amp;utm_campaign=issue-11670&amp;utm_content=disclosure\">Gudu Software<\/a>, the team behind General SQL Parser and SQLFlow. 
We specialize in deep SQL parsing and column-level lineage across SQL dialects.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>DataHub silently drops column-level lineage on the dbt-utils deduplicate macro because of how sqlglot&#8217;s column resolver handles ARRAY_AGG + struct unpack. Here&#8217;s why \u2014 and an open-source post-processor that recovers the missing lineage.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[66,14,93],"tags":[153,117,119,132,120,154,29,127],"blocksy_meta":{"styles_descriptor":{"styles":{"desktop":"","tablet":"","mobile":""},"google_fonts":[],"version":5}},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v19.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It<\/title>\n<meta name=\"description\" content=\"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It\" \/>\n<meta property=\"og:description\" content=\"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/\" \/>\n<meta property=\"og:site_name\" 
content=\"SQL and Data Blog\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-01T06:24:30+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-05-01T07:31:30+00:00\" \/>\n<meta name=\"author\" content=\"James\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"James\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.dpriver.com\/blog\/#organization\",\"name\":\"SQL and Data Blog\",\"url\":\"https:\/\/www.dpriver.com\/blog\/\",\"sameAs\":[],\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.dpriver.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.dpriver.com\/blog\/wp-content\/uploads\/2022\/07\/sqlpp-character.png\",\"contentUrl\":\"https:\/\/www.dpriver.com\/blog\/wp-content\/uploads\/2022\/07\/sqlpp-character.png\",\"width\":251,\"height\":72,\"caption\":\"SQL and Data Blog\"},\"image\":{\"@id\":\"https:\/\/www.dpriver.com\/blog\/#\/schema\/logo\/image\/\"}},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.dpriver.com\/blog\/#website\",\"url\":\"https:\/\/www.dpriver.com\/blog\/\",\"name\":\"SQL and Data Blog\",\"description\":\"SQL related blog for database professional\",\"publisher\":{\"@id\":\"https:\/\/www.dpriver.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.dpriver.com\/blog\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/\",\"url\":\"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/\",\"name\":\"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It\",\"isPartOf\":{\"@id\":\"https:\/\/www.dpriver.com\/blog\/#website\"},\"datePublished\":\"2026-05-01T06:24:30+00:00\",\"dateModified\":\"2026-05-01T07:31:30+00:00\",\"description\":\"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It\",\"breadcrumb\":{\"@id\":\"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.dpriver.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover 
It\"}]},{\"@type\":\"Article\",\"@id\":\"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/\"},\"author\":{\"name\":\"James\",\"@id\":\"https:\/\/www.dpriver.com\/blog\/#\/schema\/person\/7bbdbb6e79c5dd9747d08c59d5992b04\"},\"headline\":\"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It\",\"datePublished\":\"2026-05-01T06:24:30+00:00\",\"dateModified\":\"2026-05-01T07:31:30+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/\"},\"wordCount\":1180,\"publisher\":{\"@id\":\"https:\/\/www.dpriver.com\/blog\/#organization\"},\"keywords\":[\"array-agg\",\"bigquery\",\"column-lineage\",\"datahub\",\"dbt\",\"deduplication\",\"SQLFlow\",\"sqlglot\"],\"articleSection\":[\"Data Governance\",\"gsp\",\"SQLFlow\"],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.dpriver.com\/blog\/#\/schema\/person\/7bbdbb6e79c5dd9747d08c59d5992b04\",\"name\":\"James\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.dpriver.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/eeddf4ca7bdafa37ab025068efdc7302?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/eeddf4ca7bdafa37ab025068efdc7302?s=96&d=mm&r=g\",\"caption\":\"James\"},\"sameAs\":[\"http:\/\/www.dpriver.com\"],\"url\":\"https:\/\/www.dpriver.com\/blog\/author\/james\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It","description":"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/","og_locale":"en_US","og_type":"article","og_title":"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It","og_description":"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It","og_url":"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/","og_site_name":"SQL and Data Blog","article_published_time":"2026-05-01T06:24:30+00:00","article_modified_time":"2026-05-01T07:31:30+00:00","author":"James","twitter_card":"summary_large_image","twitter_misc":{"Written by":"James","Est. 
reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Organization","@id":"https:\/\/www.dpriver.com\/blog\/#organization","name":"SQL and Data Blog","url":"https:\/\/www.dpriver.com\/blog\/","sameAs":[],"logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.dpriver.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.dpriver.com\/blog\/wp-content\/uploads\/2022\/07\/sqlpp-character.png","contentUrl":"https:\/\/www.dpriver.com\/blog\/wp-content\/uploads\/2022\/07\/sqlpp-character.png","width":251,"height":72,"caption":"SQL and Data Blog"},"image":{"@id":"https:\/\/www.dpriver.com\/blog\/#\/schema\/logo\/image\/"}},{"@type":"WebSite","@id":"https:\/\/www.dpriver.com\/blog\/#website","url":"https:\/\/www.dpriver.com\/blog\/","name":"SQL and Data Blog","description":"SQL related blog for database professional","publisher":{"@id":"https:\/\/www.dpriver.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.dpriver.com\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/","url":"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/","name":"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It","isPartOf":{"@id":"https:\/\/www.dpriver.com\/blog\/#website"},"datePublished":"2026-05-01T06:24:30+00:00","dateModified":"2026-05-01T07:31:30+00:00","description":"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover 
It","breadcrumb":{"@id":"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.dpriver.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It"}]},{"@type":"Article","@id":"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/#article","isPartOf":{"@id":"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/"},"author":{"name":"James","@id":"https:\/\/www.dpriver.com\/blog\/#\/schema\/person\/7bbdbb6e79c5dd9747d08c59d5992b04"},"headline":"Why DataHub Loses Column-Level Lineage on dbt Deduplication Macros \u2014 and How to Recover It","datePublished":"2026-05-01T06:24:30+00:00","dateModified":"2026-05-01T07:31:30+00:00","mainEntityOfPage":{"@id":"https:\/\/www.dpriver.com\/blog\/2026\/05\/why-datahub-loses-column-level-lineage-on-dbt-deduplication-macros-and-how-to-re\/"},"wordCount":1180,"publisher":{"@id":"https:\/\/www.dpriver.com\/blog\/#organization"},"keywords":["array-agg","bigquery","column-lineage","datahub","dbt","deduplication","SQLFlow","sqlglot"],"articleSection":["Data 
Governance","gsp","SQLFlow"],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.dpriver.com\/blog\/#\/schema\/person\/7bbdbb6e79c5dd9747d08c59d5992b04","name":"James","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.dpriver.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/eeddf4ca7bdafa37ab025068efdc7302?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/eeddf4ca7bdafa37ab025068efdc7302?s=96&d=mm&r=g","caption":"James"},"sameAs":["http:\/\/www.dpriver.com"],"url":"https:\/\/www.dpriver.com\/blog\/author\/james\/"}]}},"_links":{"self":[{"href":"https:\/\/www.dpriver.com\/blog\/wp-json\/wp\/v2\/posts\/3211"}],"collection":[{"href":"https:\/\/www.dpriver.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.dpriver.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.dpriver.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.dpriver.com\/blog\/wp-json\/wp\/v2\/comments?post=3211"}],"version-history":[{"count":4,"href":"https:\/\/www.dpriver.com\/blog\/wp-json\/wp\/v2\/posts\/3211\/revisions"}],"predecessor-version":[{"id":3215,"href":"https:\/\/www.dpriver.com\/blog\/wp-json\/wp\/v2\/posts\/3211\/revisions\/3215"}],"wp:attachment":[{"href":"https:\/\/www.dpriver.com\/blog\/wp-json\/wp\/v2\/media?parent=3211"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.dpriver.com\/blog\/wp-json\/wp\/v2\/categories?post=3211"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.dpriver.com\/blog\/wp-json\/wp\/v2\/tags?post=3211"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}