BigQuery Column-Level Lineage Without Metadata: Inferring STRUCT and ARRAY Types from SQL Alone

How a BigQuery-aware SQL parser can extract column-level lineage from ARRAY<STRUCT>, SELECT AS STRUCT, and array_agg(row) patterns without any catalog metadata — and where generic parsers silently fail.

If you are searching for a column-level lineage solution for BigQuery, you have probably already discovered the part of the problem nobody talks about up front: BigQuery is a typed, nested SQL dialect. The moment your warehouse adopts STRUCT, ARRAY<STRUCT>, SELECT AS STRUCT, or array_agg(table_alias), a parser that does not know the schema can stop dead, give up on the column, or — worse — silently produce wrong lineage.

This post walks through two real BigQuery patterns and shows what a SQL parser can actually infer from the SQL text alone, with no access to the BigQuery catalog, no INFORMATION_SCHEMA, no dbt manifest. We will use it to explain why generic SQL parsers struggle with BigQuery lineage, and what it takes to get column-level lineage right when metadata is missing.

The two patterns we will analyze:

  1. ARRAY(SELECT AS STRUCT ...) — building a nested ARRAY<STRUCT> column from a flat input row.
  2. array_agg(original ORDER BY ... LIMIT 1)[OFFSET(0)] unique followed by SELECT unique.* — the dbt-utils-style deduplication pattern that flattens a row STRUCT back into top-level columns.

Both produce correct BigQuery lineage even without metadata — but only if the parser understands BigQuery semantics, not just BigQuery grammar.

Why BigQuery lineage is harder than “normal” SQL lineage

A column-level lineage extractor for plain ANSI SQL has one job per output column: walk the SELECT list, resolve each expression back to a base table column. That breaks down in BigQuery for three reasons:

  • Output columns are not always scalar. A SELECT AS STRUCT expression produces a single output column whose type is itself a record. A downstream view’s column_name may actually expand into a dozen physical fields.
  • Row variables behave like STRUCTs. BigQuery lets you write array_agg(table_alias) and get back ARRAY<STRUCT> — without ever naming the columns in the SQL.
  • STRUCT. is field expansion, not column projection. The same syntax (alias.) means very different things depending on whether alias is a table or a STRUCT-typed expression.

If your lineage tool was originally built for Postgres or MySQL, every one of those is a hole. Most popular open-source parsers — JSQLParser, sqlglot, sqlparse — handle two of the three at best, and even then only with metadata. Without metadata, generic parsers either drop the column from lineage entirely or attribute it to the wrong source.

The good news: a lot more is recoverable from the SQL alone than people assume, as long as the parser knows BigQuery’s nested-type rules.

Pattern 1: ARRAY(SELECT AS STRUCT ...) builds a nested column

Here is a typical BigQuery view that flattens an orders_raw table into a single nested orders column:

CREATE VIEW orders_view AS
SELECT
    ARRAY(SELECT AS STRUCT
        customer.name AS customer_name,
        order_id      AS order_id
    ) AS orders
FROM
    orders_raw;

Read by a metadata-free parser, this SQL says quite a lot.

Step 1 — the inner query is a STRUCT constructor.

SELECT AS STRUCT
    customer.name AS customer_name,
    order_id      AS order_id

SELECT AS STRUCT is a BigQuery-specific keyword. It is not “select two columns” — it is “build one row of type STRUCT“. The parser does not need to know whether customer.name is STRING or BYTES; it knows for certain that customer_name and order_id are fields of a STRUCT. That’s a semantic guarantee from the grammar itself.

Step 2 — ARRAY(...) wraps the STRUCT in an ARRAY.

ARRAY(SELECT AS STRUCT ...)

The result type is ARRAY<STRUCT>. Again, the leaf scalar types are unknown without metadata, but the shape is fully determined by the SQL.

Step 3 — the outer alias names a single column.

) AS orders

So orders_view has exactly one top-level column named orders. Its type form is:

orders ARRAY<STRUCT<customer_name ?, order_id ?>>

What lineage can we extract without metadata?

A BigQuery-aware parser can produce this column-level lineage:

orders_raw.customer.name -> orders_view.orders.customer_name
orders_raw.order_id      -> orders_view.orders.order_id

That is real, navigable, nested column-level lineage from the SQL text — no schema lookup required. The only thing the parser cannot prove from SQL alone is the leaf scalar type (STRING? INT64?) and whether orders_raw.customer is itself a STRUCT in the source table. Those need either metadata or BigQuery’s own type checker at execution time.

What a generic parser typically gets wrong

Without BigQuery semantics, a generic parser tends to do one of these:

  • Treat ARRAY(SELECT AS STRUCT ...) as an unknown function and drop the lineage.
  • Emit a single edge from orders_raw -> orders_view.orders with no internal field structure, hiding the fact that customer_name and order_id are separate fields downstream.
  • Flatten the STRUCT incorrectly, producing two top-level columns customer_name and order_id on orders_view — which is wrong: there is only one top-level column called orders.

If your data catalog tells you orders_view has columns customer_name and order_id, the parser feeding it has misread the SQL.

Pattern 2: array_agg(row)[OFFSET(0)] unique followed by unique.*

This is the canonical dbt-utils deduplication pattern, which is one of the most common SQL shapes in modern BigQuery warehouses:

CREATE TABLE deduplicated_articles AS
  SELECT unique.*
  FROM (
       SELECT
           array_agg(
               original
                   ORDER BY article_name DESC
                   LIMIT 1
           )[OFFSET(0)] unique
       FROM all_articles original
       GROUP BY id
  );

This SQL is harder for a generic parser because no individual columns are ever named in the projection. And yet, a BigQuery-aware parser can still extract correct lineage from the structure.

Step 1 — original is a row variable, not a column.

FROM all_articles original

original is the table alias for all_articles. When passed as a value expression (instead of with a .column access), it represents the entire row, which BigQuery treats as a STRUCT containing every column of all_articles.

Step 2 — array_agg(original) aggregates whole rows.

array_agg(original ORDER BY article_name DESC LIMIT 1)

This produces ARRAY<STRUCT>. The ORDER BY ... LIMIT 1 keeps the top row per group.

Step 3 — [OFFSET(0)] extracts a STRUCT.

array_agg(...)[OFFSET(0)] unique

Now unique is a single STRUCT containing every column of all_articles. Crucially, the parser knows this without knowing what those columns are.

Step 4 — unique.* flattens the STRUCT into top-level columns.

SELECT unique.*

This is the syntactic twist that breaks naive parsers. unique. does not* mean “select the column named unique“. It means “expand every field of the STRUCT unique into its own top-level column”. So deduplicated_articles ends up with the same top-level column set as all_articles.

What lineage can we extract without metadata?

The parser can confidently emit:

all_articles.* -> deduplicated_articles.*

That is, “every column of all_articles flows into deduplicated_articles, picking the row with the largest article_name per id“. The parser can also note that article_name participates in the row-selection logic (for impact analysis), even though it does not change the value lineage.

What the parser cannot do without metadata is enumerate the exact column names. If you need deduplicated_articles.article_name -> all_articles.article_name as a concrete edge, you need either the schema or a metadata-aware second pass.

Why this pattern defeats most generic parsers

Three reasons:

  • They treat original (a bare table alias used as a value) as an error or as an unresolved column.
  • They treat [OFFSET(0)] after a function call as unknown syntax.
  • They treat unique. like table_alias. and look for a table called unique, fail to find one, and drop the projection.

Any one of those failures collapses the entire lineage edge.

What a BigQuery-aware parser needs to know

Stepping back, here is the irreducible set of BigQuery semantics a metadata-free lineage tool must implement:

SyntaxSemanticWhat lineage gains
SELECT AS STRUCTBuild a single STRUCT row, not multiple columnsOutput is one column whose fields are named in the SELECT list
ARRAY(SELECT AS STRUCT ...)Wrap STRUCT rows into ARRAY<STRUCT<…>>Output column type form is fully recoverable
array_agg(table_alias)Aggregate whole rows; result is ARRAY<row STRUCT>Whole-row flow is detectable even without column enumeration
arr[OFFSET(n)]Extract one element of an ARRAYResult type is the array element type (often STRUCT)
struct_alias.*Expand STRUCT fields into top-level columnsDisambiguates “select all from table” vs “explode this STRUCT”
nested.path.accessWalk into nested STRUCT fieldsEdges target the specific leaf field, not the parent column

If a parser does not implement at least these six semantics, BigQuery column-level lineage will be either incomplete or misleading whenever your team uses nested types — which, in modern BigQuery warehouses, is most of the time.

A simple memory aid for reading BigQuery SQL

When eyeballing BigQuery SQL for lineage, three patterns are worth memorizing:

  • See SELECT AS STRUCT → the SQL is constructing a STRUCT.
  • See ARRAY(SELECT AS STRUCT ...) → the result column is ARRAY<STRUCT>.
  • See table_alias used as a value (e.g. array_agg(original)) → the value is the entire row as a STRUCT.
  • See alias. where alias is a STRUCT-typed expression → fields are being flattened* into top-level columns, not selecting from a table.

Anywhere those patterns appear, a metadata-free parser can still recover the lineage shape. What it cannot recover is the leaf scalar types, the full column list of a wildcard-flattened STRUCT, or the schema of the underlying table — those still need metadata or a second resolution pass.

How Gudu’s SQL engine handles it

All three of Gudu Software’s products — General SQL Parser (GSP), SQLFlow, and SQL Omni — share one column-level lineage engine. They differ in how you use it:

  • GSP is the SDK: embed lineage extraction in a Java application or data platform.
  • SQLFlow is the hosted lineage platform with REST APIs and visualization.
  • SQL Omni is the VS Code extension for individual developers, 100% offline.

The engine implements BigQuery’s nested-type semantics — SELECT AS STRUCT, ARRAY(...), row-as-STRUCT in array_agg, [OFFSET(n)] array indexing, and STRUCT. field expansion — so that the two patterns in this post produce correct lineage with or without metadata. When metadata is available (via SQLFlow’s database integrations or a supplied schema), the engine fills in concrete column names where wildcards appeared. When metadata is not* available, you still get the structural lineage shown above, rather than silent gaps.

Disclosure: I work at Gudu Software.

Try it yourself

Paste either of the SQL samples in this post into the SQLFlow online demo and pick BigQuery as the dialect. You’ll see column-level lineage rendered as a graph, including the nested orders.customer_name and orders.order_id edges from Pattern 1, and the all_articles. -> deduplicated_articles. whole-row flow from Pattern 2.

If you’d rather embed the engine in your own pipeline, the GSP BigQuery quickstart shows how to extract the same lineage as JSON in Java. For VS Code users on regulated stacks, SQL Omni runs the same analysis 100% offline.

Takeaways

  • BigQuery column-level lineage is harder than ANSI lineage because of STRUCT, ARRAY<STRUCT>, row-as-STRUCT, and STRUCT.* field expansion.
  • A lot is recoverable from SQL text alone, including the type shape of nested output columns, the field names of SELECT AS STRUCT, and whole-row flows from array_agg(row).
  • A little is genuinely impossible without metadata: leaf scalar types and the column list of any wildcard-flattened STRUCT.
  • A generic SQL parser that does not implement these BigQuery semantics will produce either silent gaps or wrong attributions on any modern BigQuery warehouse — which is why a BigQuery-specialized parser matters when you’re choosing a column-level lineage solution.