Why Your DataHub BigQuery Lineage Silently Breaks on Procedural SQL

If you use DataHub to track data lineage from BigQuery, you may have noticed something unsettling: some of your most important queries produce no lineage at all. No upstream tables, no column mappings — just a silent gap in your lineage graph.

This isn’t a DataHub bug. It’s a parser limitation that affects a specific but increasingly common class of SQL: GoogleSQL procedural blocks.

This post explains exactly what breaks, why it happens, and how to recover the missing lineage using an open-source sidecar tool — without modifying your DataHub installation.

What breaks: procedural SQL in BigQuery

BigQuery’s GoogleSQL supports procedural constructs: DECLARE, IF/THEN/END IF, CALL, BEGIN...EXCEPTION...END, and CREATE TEMP TABLE inside scripting blocks. These are widely used in production BigQuery workloads for ETL orchestration, conditional logic, and error handling.

Here’s a simplified example (derived from datahub-project/datahub#11654):

DECLARE current_job_start TIMESTAMP DEFAULT CURRENT_TIMESTAMP;
DECLARE partitions STRUCT<max_record_ts TIMESTAMP, dates ARRAY<DATE>>;

CALL `internal_project.get_partitions`(
  ('project.dataset.view_name', 'EventTimestamp'),
  partitions
);

IF ARRAY_LENGTH(partitions.dates) > 0 THEN
  CREATE OR REPLACE TEMP TABLE temp_table AS
  SELECT * EXCEPT (SnapshotTimestamp)
  FROM `project.dataset.view_name`
  WHERE (IDField, FlagField, ForeignKeyField, StartDate)
        IN UNNEST(partitions.dates)
  ORDER BY EventTimestamp;

  IF (SELECT COUNT(1) FROM temp_table_delta) > 0 THEN
    CREATE OR REPLACE TEMP TABLE final_output AS
    SELECT DISTINCT IDField, Email, UserID,
           EventTimestamp, BusinessDate
    FROM temp_table_delta
    WHERE EventTimestamp BETWEEN '2023-01-01' AND '2023-12-31';
  END IF;
END IF;

This SQL contains two clear lineage relationships:

project.dataset.view_name → temp_table (6 columns)
temp_table_delta → final_output (5 columns)

But when DataHub ingests this query, both relationships are lost.

Why it happens

DataHub’s BigQuery ingestion uses sqlglot for SQL parsing and lineage extraction. sqlglot handles standard SQL well — SELECT, INSERT, CREATE VIEW, joins, CTEs, window functions — but its BigQuery dialect does not support procedural language constructs.

When sqlglot encounters DECLARE, IF/THEN, or CALL, it falls back to an opaque Command node. The SQL inside the procedural block is not parsed, and no lineage is extracted. This isn’t a bug in sqlglot — procedural SQL is genuinely hard to parse, and full procedural language support is outside sqlglot’s current scope.

The result: DataHub silently drops lineage for any BigQuery query that uses procedural syntax. No error message, no warning in the UI — the lineage simply doesn’t appear.

This issue was first reported in October 2024 (datahub-project/datahub#11654) and remains open.

The fix: a post-processor sidecar

We built gsp-datahub-sidecar, an open-source (Apache 2.0) tool that recovers the missing lineage. It works as a post-processor alongside your existing DataHub ingestion — no changes to your DataHub installation required.

The sidecar re-parses the SQL that sqlglot couldn’t handle using Gudu SQLFlow, which supports procedural SQL natively across 20+ dialects, including BigQuery’s GoogleSQL. It then emits the recovered lineage to DataHub via the standard REST API.

DataHub ingestion (unchanged)       gsp-datahub-sidecar
  BQ audit log → sqlglot → lineage     |
                    |                   | re-parse procedural SQL
                    v                   | with Gudu SQLFlow
              "Command fallback"        v
              (lineage lost)      DataHub GMS (lineage restored)

The sidecar does not replace or interfere with sqlglot-generated lineage. It adds to it — filling the gaps that procedural SQL creates.

How to use it

Step 1: Install

pip install git+https://github.com/gudusoftware/gsp-datahub-sidecar.git

Step 2: Dry-run (no DataHub needed)

Preview the lineage that would be recovered:

gsp-datahub-sidecar \
  --sql-file examples/bigquery_procedural.sql \
  --dry-run

Output:

Processing 1 SQL statement(s) in 'anonymous' mode...
  Lineage: PROJECT.DATASET.VIEW_NAME --> TEMP_TABLE (12 columns)
  Lineage: TEMP_TABLE_DELTA --> FINAL_OUTPUT (5 columns)
[DRY RUN] Would emit 2 MCPs to http://localhost:8080

The two lineage relationships that DataHub was missing are recovered, along with 11 column-level mappings.

Step 3: Emit to DataHub

Point the sidecar at your DataHub GMS:

gsp-datahub-sidecar \
  --sql-file examples/bigquery_procedural.sql \
  --datahub-server http://localhost:8080

Step 4: Verify in DataHub

Open the DataHub UI and search for temp_table. The lineage tab now shows the upstream relationship from project.dataset.view_name, with column-level lineage arrows:

temp_table lineage in DataHub — table-level and column-level arrows recovered

And for final_output, the lineage from temp_table_delta with all 5 column mappings:

final_output lineage in DataHub — 5 column-level lineage arrows from temp_table_delta

Both table-level and column-level lineage are fully visible in DataHub’s lineage graph — recovered from procedural SQL that was previously invisible.

Three backend modes

The sidecar supports three backends, depending on your security and volume requirements:

Mode	Auth	Limit	Data location	Use case
`anonymous` (default)	None	50/day	SQL sent to api.gudusoft.com	Quick evaluation
`authenticated`	userId + secretKey	10k/month	SQL sent to api.gudusoft.com	Extended evaluation
`self_hosted`	userId + secretKey	Unlimited	SQL stays in your VPC	Production

For production use with sensitive SQL, the self-hosted SQLFlow Docker keeps all data in your infrastructure — no SQL leaves your network.

What’s next

If you’re affected by datahub-project/datahub#11654 or similar lineage gaps in your DataHub instance:

Try the dry-run with your own SQL files to see what lineage you’re currently missing
Check the GitHub repo for full documentation, configuration options, and additional examples (including Oracle stored procedures)
Sign up for the authenticated tier at docs.gudusoft.com/sign-up for higher volume evaluation

The sidecar is Apache 2.0 licensed. Gudu SQLFlow is a commercial product with a free evaluation tier. For questions or feedback, open an issue on GitHub or reach out at support@gudusoft.com.

Disclosure: This tool is built by Gudu Software, the team behind General SQL Parser and SQLFlow. We specialize in deep SQL parsing and data lineage across 20+ SQL dialects.