Tracking Column Lineage in BigQuery Views: A Practical Guide for the SELECT * Challenge

Data professionals often face a common challenge in BigQuery and other SQL databases: tracking the journey of a specific column through multiple layers of views, especially when those views are created with SELECT *. While SELECT * is convenient during development, it can obscure dependencies and make impact analysis a daunting task.

The Problem: “Where is This Column Used?”

Imagine this scenario, which is a common question on platforms like Reddit and Stack Overflow:

You have a base table in BigQuery, dataset.table1, with two columns, col1 and col2. You then create a view, dataset.view1, using a simple select * from dataset.table1. To make things more interesting, you create another view, dataset.view2, based on the first one: select * from dataset.view1.

Now, the critical question arises: How can you programmatically determine that table1.col1 is used in view2?

BigQuery’s INFORMATION_SCHEMA can help if columns are explicitly named in the view’s DDL, but it falls short when SELECT * is used. This leaves you with the difficult and error-prone task of manual inspection.

A Concrete Example

Here is the SQL code that illustrates this exact problem:

create table dataset.table1(
 col1 int,
 col2 char
);

create view dataset.view1 as select * from dataset.table1;
create view dataset.view2 as select * from dataset.view1;

How can we trace col1 and col2 from their origin in table1 all the way to view2?

The Solution: Automated Column-Level Data Lineage

The most reliable and efficient way to solve this is with an automated data lineage tool. These tools parse your SQL code, understand the dependencies created by statements like CREATE VIEW, and map the flow of data from source to destination, right down to the column level—even through complex SELECT * statements.

Here are three easy ways to get this done.

Method 1: Use a No-Code Web Application

For a quick and visual answer, you can use an online tool like SQLFlow. You can simply paste your SQL code into the web interface and get an instant, interactive diagram and report of the data lineage. This is perfect for quick, one-off analyses.

Method 2: Use a VS Code Extension

If you prefer to work within your development environment, you can use a free VS Code extension like Gudu SQL Omni. This brings the power of data lineage directly into your editor. You can analyze your SQL files and visualize the column-level relationships without ever leaving your IDE, streamlining your workflow.

Method 3: Use a REST API for Automation

For programmatic access, automation, or integration into a larger data governance framework, a REST API is the ideal solution. You can submit your SQL code via a simple curl command and receive the detailed, column-level lineage in a structured format like CSV.

Here’s how you can do it with the SQLFlow API. First, save the SQL above into a file (e.g., views.sql), then run the following command:

curl -X POST "https://api.gudusoft.com/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsCsv" \
-H "Request-Origion:testClientDemo" \
-H "Content-Type:multipart/form-data" \
-F "sqlfile=@/path/to/your/views.sql" \
-F "dbvendor=dbvbigquery" \
-F "userId=YOUR USER ID HERE" \
-F "token=YOUR SECRET KEY"

The API will process the SQL and return a detailed CSV report that explicitly maps the column relationships. This is the “Aha!” moment.

The Result

Here is a sample of the CSV output you would receive from the API:

source_db,source_schema,source_table_id,source_table,source_column_id,source_column,target_db,target_schema,target_table_id,target_table,target_column_id,target_column,relation_type,effect_type,procedure
DEFAULT,dataset,4,table1,5,COL1,DEFAULT,dataset,12,view1,14,COL1,fdd,create_view,
DEFAULT,dataset,4,table1,6,COL2,DEFAULT,dataset,12,view1,15,COL2,fdd,create_view,
DEFAULT,dataset,12,view1,14,COL1,DEFAULT,dataset,21,view2,23,COL1,fdd,create_view,
DEFAULT,dataset,12,view1,15,COL2,DEFAULT,dataset,21,view2,24,COL2,fdd,create_view,

As you can see, the output clearly shows the lineage:

table1.COL1 flows to view1.COL1
view1.COL1 flows to view2.COL1

This directly answers the question and provides the precise, actionable information needed for reliable impact analysis.

A Note on Other Databases

While this article uses BigQuery as the primary example, it’s important to note that this column-level lineage challenge is universal across all SQL databases. The solutions presented here are not limited to BigQuery; they work just as effectively for other major platforms, including Snowflake, Redshift, Oracle, SQL Server, PostgreSQL, MySQL, and many others. The core principles of parsing SQL to trace dependencies remain the same, regardless of the database vendor.

Conclusion

While SELECT * can be a convenient shortcut in SQL, it doesn’t have to be a black box for data lineage. By leveraging modern data lineage tools, you can easily trace data flows through complex views, ensuring you always have a clear understanding of your data’s journey.

The Problem: “Where is This Column Used?”

A Concrete Example

The Solution: Automated Column-Level Data Lineage

Method 1: Use a No-Code Web Application

Method 2: Use a VS Code Extension

Method 3: Use a REST API for Automation

The Result

A Note on Other Databases

Conclusion

Newsletter Updates

Related Posts

SQL Server Best Practices: [dbo] vs. dbo – A Guide to Consistency

Mastering SQL Syntax: A Comprehensive Guide for 2024 and Beyond

Free SQLFlow Premium Account Trial