Why Power BI SQL Comments Break Your DataHub Lineage — and How to Fix It

Power BI encodes newlines as #(lf) in M-language SQL. When DataHub parses queries with -- comments, it silently drops all subsequent JOINs from lineage. The gsp-datahub-sidecar recovers every missing relationship.

If you use Power BI with DataHub, some of your lineage is probably missing right now — and you don’t know it.

Background: How Power BI Handles SQL

A common question from newcomers: does Power BI have its own SQL dialect? The short answer is no. Power BI connects to external databases — SQL Server, Snowflake, PostgreSQL, and others — and sends SQL queries written in that database’s native dialect. If your Power BI report pulls from SQL Server, the SQL inside is T-SQL; if it pulls from Snowflake, it’s Snowflake SQL.

However, Power BI does have its own query language called M (also known as Power Query Formula Language). M wraps everything — including embedded SQL. When you write a Value.NativeQuery expression in Power BI, the SQL string lives inside an M expression, and M applies its own encoding rules.

The most important encoding rule: M replaces actual newline characters with #(lf) (M’s literal for “line feed”). So a query that a human writes as:

SELECT name
FROM customers
-- old filter
WHERE active = 1

gets stored by Power BI as:

select#(lf)name#(lf)from customers#(lf)-- old filter#(lf)where active = 1

This is where the trouble starts. The SQL itself is standard — the encoding around it is not.

The Problem: Why Comments Break Lineage

In standard SQL, -- marks a single-line comment: everything from -- to the next actual newline character is ignored. But when Power BI encodes newlines as #(lf), those aren’t real newline characters — they’re just ordinary text. So any SQL parser that receives this encoded string sees the -- comment as running all the way to the end of the entire query, because there is no real newline to terminate it.

This is not a parser bug in sqlglot or any other SQL parser — it’s a missing preprocessing step. Before parsing, someone needs to decode M’s #(lf) sequences back into real newline characters. DataHub’s Power BI ingestion connector does not do this, and neither does GSP (General SQL Parser) on its own — no SQL parser should be expected to understand Power BI M-language encoding.

Here’s a real query from DataHub issue #11251 that demonstrates the problem:

select
upper(cs.customercode) as customercode
, cs.ear2id as ear2id
, db.branch_rollup_name
, db.subregion1
, db.subregion2
from nast_truckload_domain.broker.dim_customer cs
--join ... (commented out)
join nast_customer_domain.broker.dim_customer_ear2_single_ownership_history as so
  on cs.ear2id = so.ear2_id and is_current = true
join nast_customer_domain.broker.ref_branch db
  on db.Branch_code = so.branch_code
-- join ... (commented out)
where cs.customerstatusid = 1 --active
and db.primary_business_line_id in ('62','73')

This query references two upstream tables: dim_customer and ref_branch. But when Power BI encodes it with #(lf) and DataHub’s parser reads it, the --join comment on line 8 swallows everything after it — because #(lf) doesn’t terminate the comment. DataHub sees only one table.

Why It’s Worse Than It Looks

SQL comments are everywhere in Power BI datasets. Developers comment out JOINs while debugging, leave notes like --active next to filter conditions, and disable WHERE clauses during development. This is normal SQL practice — but when Power BI encodes these queries with #(lf), every single -- comment becomes a lineage-destroying landmine.

Any Power BI dataset with even a single commented-out JOIN or WHERE clause will produce incomplete lineage. The lineage graph shows only a fraction of the upstream tables, and DataHub gives no warning. We verified this: sqlglot handles -- comments perfectly when the newlines are real — the problem is entirely that #(lf) is not decoded before parsing.

The issue has been open since August 2024. Three users have confirmed it blocks DataHub adoption in their organizations:

“This is something very important. Especially when SQL comments are very common in PBI M-queries.” — @AntonisCSt

“We also run into this issue and it is a real issue for part of our business to accept Datahub as a common catalog solution.” — @rospe

The Fix: gsp-datahub-sidecar

The gsp-datahub-sidecar solves this with a two-step approach:

  1. Preprocessing: The sidecar decodes Power BI’s M-language escape sequences (#(lf), #(cr), #(tab)) back into real characters before any SQL parsing happens. This is the step that DataHub’s ingestion pipeline is missing.
  2. Parsing: The decoded SQL — now with real newlines — is sent to Gudu’s General SQL Parser (GSP) via SQLFlow, which correctly handles comments, JOINs, and WHERE clauses to extract complete lineage.

We verified this with the exact query from the issue. First, the raw Power BI encoded form (all on one line with #(lf)):

select#(lf)upper(cs.customercode) as customercode#(lf), cs.ear2id ...#(lf)from nast_truckload_domain.broker.dim_customer cs#(lf)--join ...#(lf)join ...#(lf)where cs.customerstatusid = 1 --active#(lf)and ...

Without preprocessing, every SQL parser fails — GSP included. The --join on the encoded “line 8” swallows everything after it. Zero lineage extracted.

With the sidecar’s preprocessing, the #(lf) sequences become real newlines, and GSP correctly parses the result:

$ gsp-datahub-sidecar \
    --sql-file powerbi_comments.sql \
    --db-vendor dbvmssql \
    --dry-run

Extracted 2 table-level lineage relationships
  NAST_TRUCKLOAD_DOMAIN.BROKER.DIM_CUSTOMER --> DBO.CUSTOMER_BRANCHES (2 columns)
  NAST_CUSTOMER_DOMAIN.BROKER.REF_BRANCH --> DBO.CUSTOMER_BRANCHES (3 columns)
Column-level mappings extracted: 5

Both upstream tables found. All five column-level lineages recovered:

  • customercodedim_customer.customercode (through UPPER())
  • ear2iddim_customer.ear2id
  • branch_rollup_nameref_branch.branch_rollup_name
  • subregion1ref_branch.subregion1
  • subregion2ref_branch.subregion2

Comments correctly stripped. JOINs preserved. WHERE clause intact.

Quick Start

Three commands to recover your Power BI lineage:

# 1. Install the sidecar
pip install git+https://github.com/gudusoftware/gsp-datahub-sidecar.git

# 2. Verify with the built-in example
gsp-datahub-sidecar \
  --sql-file examples/powerbi_comments.sql \
  --db-vendor dbvmssql \
  --dry-run

# 3. Point at your DataHub instance
gsp-datahub-sidecar \
  --sql-file your_query.sql \
  --db-vendor dbvmssql \
  --datahub-url http://your-datahub:8080

The sidecar emits standard DataHub MCPs (Metadata Change Proposals) via the REST API — DataHub merges the lineage with your existing metadata automatically.

Your Power BI Connects to More Than One Database

The #(lf) encoding problem is Power BI specific, but the SQL inside your Power BI datasets comes from many different databases. One report might query SQL Server, another Snowflake, another Oracle. The sidecar handles all of them — GSP supports 20+ SQL dialects, so you just change the --db-vendor flag to match your backend.

This matters because SQL dialects differ in syntax — T-SQL’s TOP, Snowflake’s QUALIFY, Oracle’s CONNECT BY — and a parser that only handles one dialect will silently drop lineage from the others. The sidecar gives you correct lineage regardless of which database your Power BI report connects to.

Part of a Broader Pattern

The #(lf) comment issue is one example of a broader pattern: DataHub’s parser silently drops lineage on SQL constructs it can’t fully parse. We’ve documented similar gaps with BigQuery procedural SQL (DECLARE, IF/END IF, temp tables) and dbt deduplication macros (array_agg with struct unpacking).

The sidecar fills these gaps without replacing DataHub’s parser — it augments it.


Disclosure: The gsp-datahub-sidecar is built by Gudu Software, the company behind General SQL Parser and SQLFlow. The sidecar is open source under the Apache 2.0 license.

Have a SQL parsing challenge? Try SQLFlow with your own SQL, or visit the sidecar repo to get started.