BigQuery Column-Level Lineage Without Metadata: Inferring STRUCT and ARRAY Types from SQL Alone
If you are searching for a column-level lineage solution for BigQuery, you have probably already discovered the part of the problem nobody talks about up front: BigQuery is a typed, nested SQL dialect. The moment your warehouse adopts STRUCT, ARRAY<STRUCT>, SELECT AS STRUCT, or array_agg(table_alias), a parser that does not know the schema can stop dead, give up on the column, or — worse — silently produce wrong lineage.
This post walks through two real BigQuery patterns and shows what a SQL parser can actually infer from the SQL text alone, with no access to the BigQuery catalog, no INFORMATION_SCHEMA, no dbt manifest. We will use it to explain why generic SQL parsers struggle with BigQuery lineage, and what it takes to get column-level lineage right when metadata is missing.
The two patterns we will analyze:
ARRAY(SELECT AS STRUCT ...)— building a nestedARRAY<STRUCT>column from a flat input row.array_agg(original ORDER BY ... LIMIT 1)[OFFSET(0)] uniquefollowed bySELECT unique.*— the dbt-utils-style deduplication pattern that flattens a row STRUCT back into top-level columns.
Both produce correct BigQuery lineage even without metadata — but only if the parser understands BigQuery semantics, not just BigQuery grammar.
Why BigQuery lineage is harder than “normal” SQL lineage
A column-level lineage extractor for plain ANSI SQL has one job per output column: walk the SELECT list, resolve each expression back to a base table column. That breaks down in BigQuery for three reasons:
- Output columns are not always scalar. A
SELECT AS STRUCTexpression produces a single output column whose type is itself a record. A downstream view’scolumn_namemay actually expand into a dozen physical fields. - Row variables behave like STRUCTs. BigQuery lets you write
array_agg(table_alias)and get backARRAY<STRUCT>— without ever naming the columns in the SQL. STRUCT.is field expansion, not column projection. The same syntax (alias.) means very different things depending on whetheraliasis a table or a STRUCT-typed expression.
If your lineage tool was originally built for Postgres or MySQL, every one of those is a hole. Most popular open-source parsers — JSQLParser, sqlglot, sqlparse — handle two of the three at best, and even then only with metadata. Without metadata, generic parsers either drop the column from lineage entirely or attribute it to the wrong source.
The good news: a lot more is recoverable from the SQL alone than people assume, as long as the parser knows BigQuery’s nested-type rules.
Pattern 1: ARRAY(SELECT AS STRUCT ...) builds a nested column
Here is a typical BigQuery view that flattens an orders_raw table into a single nested orders column:
CREATE VIEW orders_view AS
SELECT
ARRAY(SELECT AS STRUCT
customer.name AS customer_name,
order_id AS order_id
) AS orders
FROM
orders_raw;
Read by a metadata-free parser, this SQL says quite a lot.
Step 1 — the inner query is a STRUCT constructor.
SELECT AS STRUCT
customer.name AS customer_name,
order_id AS order_id
SELECT AS STRUCT is a BigQuery-specific keyword. It is not “select two columns” — it is “build one row of type STRUCT“. The parser does not need to know whether customer.name is STRING or BYTES; it knows for certain that customer_name and order_id are fields of a STRUCT. That’s a semantic guarantee from the grammar itself.
Step 2 — ARRAY(...) wraps the STRUCT in an ARRAY.
ARRAY(SELECT AS STRUCT ...)
The result type is ARRAY<STRUCT>. Again, the leaf scalar types are unknown without metadata, but the shape is fully determined by the SQL.
Step 3 — the outer alias names a single column.
) AS orders
So orders_view has exactly one top-level column named orders. Its type form is:
orders ARRAY<STRUCT<customer_name ?, order_id ?>>
What lineage can we extract without metadata?
A BigQuery-aware parser can produce this column-level lineage:
orders_raw.customer.name -> orders_view.orders.customer_name
orders_raw.order_id -> orders_view.orders.order_id
That is real, navigable, nested column-level lineage from the SQL text — no schema lookup required. The only thing the parser cannot prove from SQL alone is the leaf scalar type (STRING? INT64?) and whether orders_raw.customer is itself a STRUCT in the source table. Those need either metadata or BigQuery’s own type checker at execution time.
What a generic parser typically gets wrong
Without BigQuery semantics, a generic parser tends to do one of these:
- Treat
ARRAY(SELECT AS STRUCT ...)as an unknown function and drop the lineage. - Emit a single edge from
orders_raw -> orders_view.orderswith no internal field structure, hiding the fact thatcustomer_nameandorder_idare separate fields downstream. - Flatten the STRUCT incorrectly, producing two top-level columns
customer_nameandorder_idonorders_view— which is wrong: there is only one top-level column calledorders.
If your data catalog tells you orders_view has columns customer_name and order_id, the parser feeding it has misread the SQL.
Pattern 2: array_agg(row)[OFFSET(0)] unique followed by unique.*
This is the canonical dbt-utils deduplication pattern, which is one of the most common SQL shapes in modern BigQuery warehouses:
CREATE TABLE deduplicated_articles AS
SELECT unique.*
FROM (
SELECT
array_agg(
original
ORDER BY article_name DESC
LIMIT 1
)[OFFSET(0)] unique
FROM all_articles original
GROUP BY id
);
This SQL is harder for a generic parser because no individual columns are ever named in the projection. And yet, a BigQuery-aware parser can still extract correct lineage from the structure.
Step 1 — original is a row variable, not a column.
FROM all_articles original
original is the table alias for all_articles. When passed as a value expression (instead of with a .column access), it represents the entire row, which BigQuery treats as a STRUCT containing every column of all_articles.
Step 2 — array_agg(original) aggregates whole rows.
array_agg(original ORDER BY article_name DESC LIMIT 1)
This produces ARRAY<STRUCT>. The ORDER BY ... LIMIT 1 keeps the top row per group.
Step 3 — [OFFSET(0)] extracts a STRUCT.
array_agg(...)[OFFSET(0)] unique
Now unique is a single STRUCT containing every column of all_articles. Crucially, the parser knows this without knowing what those columns are.
Step 4 — unique.* flattens the STRUCT into top-level columns.
SELECT unique.*
This is the syntactic twist that breaks naive parsers. unique. does not* mean “select the column named unique“. It means “expand every field of the STRUCT unique into its own top-level column”. So deduplicated_articles ends up with the same top-level column set as all_articles.
What lineage can we extract without metadata?
The parser can confidently emit:
all_articles.* -> deduplicated_articles.*
That is, “every column of all_articles flows into deduplicated_articles, picking the row with the largest article_name per id“. The parser can also note that article_name participates in the row-selection logic (for impact analysis), even though it does not change the value lineage.
What the parser cannot do without metadata is enumerate the exact column names. If you need deduplicated_articles.article_name -> all_articles.article_name as a concrete edge, you need either the schema or a metadata-aware second pass.
Why this pattern defeats most generic parsers
Three reasons:
- They treat
original(a bare table alias used as a value) as an error or as an unresolved column. - They treat
[OFFSET(0)]after a function call as unknown syntax. - They treat
unique.liketable_alias.and look for a table calledunique, fail to find one, and drop the projection.
Any one of those failures collapses the entire lineage edge.
What a BigQuery-aware parser needs to know
Stepping back, here is the irreducible set of BigQuery semantics a metadata-free lineage tool must implement:
Syntax Semantic What lineage gains SELECT AS STRUCTBuild a single STRUCT row, not multiple columns Output is one column whose fields are named in the SELECT list ARRAY(SELECT AS STRUCT ...)Wrap STRUCT rows into ARRAY<STRUCT<…>> Output column type form is fully recoverable array_agg(table_alias)Aggregate whole rows; result is ARRAY<row STRUCT> Whole-row flow is detectable even without column enumeration arr[OFFSET(n)]Extract one element of an ARRAY Result type is the array element type (often STRUCT) struct_alias.*Expand STRUCT fields into top-level columns Disambiguates “select all from table” vs “explode this STRUCT” nested.path.accessWalk into nested STRUCT fields Edges target the specific leaf field, not the parent column
If a parser does not implement at least these six semantics, BigQuery column-level lineage will be either incomplete or misleading whenever your team uses nested types — which, in modern BigQuery warehouses, is most of the time.
A simple memory aid for reading BigQuery SQL
When eyeballing BigQuery SQL for lineage, three patterns are worth memorizing:
- See
SELECT AS STRUCT→ the SQL is constructing a STRUCT. - See
ARRAY(SELECT AS STRUCT ...)→ the result column isARRAY<STRUCT>. - See
table_aliasused as a value (e.g.array_agg(original)) → the value is the entire row as a STRUCT. - See
alias.wherealiasis a STRUCT-typed expression → fields are being flattened* into top-level columns, not selecting from a table.
Anywhere those patterns appear, a metadata-free parser can still recover the lineage shape. What it cannot recover is the leaf scalar types, the full column list of a wildcard-flattened STRUCT, or the schema of the underlying table — those still need metadata or a second resolution pass.
How Gudu’s SQL engine handles it
All three of Gudu Software’s products — General SQL Parser (GSP), SQLFlow, and SQL Omni — share one column-level lineage engine. They differ in how you use it:
- GSP is the SDK: embed lineage extraction in a Java application or data platform.
- SQLFlow is the hosted lineage platform with REST APIs and visualization.
- SQL Omni is the VS Code extension for individual developers, 100% offline.
The engine implements BigQuery’s nested-type semantics — SELECT AS STRUCT, ARRAY(...), row-as-STRUCT in array_agg, [OFFSET(n)] array indexing, and STRUCT. field expansion — so that the two patterns in this post produce correct lineage with or without metadata. When metadata is available (via SQLFlow’s database integrations or a supplied schema), the engine fills in concrete column names where wildcards appeared. When metadata is not* available, you still get the structural lineage shown above, rather than silent gaps.
Disclosure: I work at Gudu Software.
Try it yourself
Paste either of the SQL samples in this post into the SQLFlow online demo and pick BigQuery as the dialect. You’ll see column-level lineage rendered as a graph, including the nested orders.customer_name and orders.order_id edges from Pattern 1, and the all_articles. -> deduplicated_articles. whole-row flow from Pattern 2.
If you’d rather embed the engine in your own pipeline, the GSP BigQuery quickstart shows how to extract the same lineage as JSON in Java. For VS Code users on regulated stacks, SQL Omni runs the same analysis 100% offline.
Takeaways
- BigQuery column-level lineage is harder than ANSI lineage because of
STRUCT,ARRAY<STRUCT>, row-as-STRUCT, andSTRUCT.*field expansion. - A lot is recoverable from SQL text alone, including the type shape of nested output columns, the field names of
SELECT AS STRUCT, and whole-row flows fromarray_agg(row). - A little is genuinely impossible without metadata: leaf scalar types and the column list of any wildcard-flattened STRUCT.
- A generic SQL parser that does not implement these BigQuery semantics will produce either silent gaps or wrong attributions on any modern BigQuery warehouse — which is why a BigQuery-specialized parser matters when you’re choosing a column-level lineage solution.

