Column impact analysis

Click to Try Live Demo

This demo illustrates how to use general sql parser to analyze column level dataflow in a complex SQL query with join clause and nested subquery. You can modify this demo to fit your own need, or just contact us for an improved demo to meet your requirement.

This is also an answer to the question in this post:
http://www.sqlparser.com/letTalk/viewtopic.php?f=1&t=16736


I wonder whether sqlparser provides the capability to match source and target columns to extract the column-level dataflow (lineage), even if the sources are in a subquery. This would be very useful to provide impact analysis feature to ETL mappings in data warehouses and marts.

Let’s take this SQL for example:

SELECT a.deptno "Department", 
       a.num_emp/b.total_count "Employees", 
       a.sal_sum/b.total_sal "Salary"
  FROM
(SELECT deptno, COUNT(*) num_emp, SUM(SAL) sal_sum
    FROM scott.emp
    GROUP BY deptno) a,
(SELECT COUNT(*) total_count, SUM(sal) total_sal
    FROM scott.emp) b

Our demo can trace the relationship between aliases in top-level select clause and table/view columns in bottom levels, And this demo will let you know that some of the target columns are influenced by a source table itself, for example by the number of records in previous sample SQL as well.

Search a.deptno
 --> scott.emp.deptno

Search a.num_emp
 --> num_emp(alias)
 --> aggregate function COUNT(* ) 
 --> table scott.emp 
 --> emp.deptno(group by)
 --> scott.emp.deptno

Search b.total_count
 --> total_count(alias)
 --> aggregate function COUNT(* ) 
 --> table scott.emp 

Search a.sal_sum
 --> sal_sum(alias)
 --> aggregate function SUM(SAL) 
 --> table scott.emp 
 --> scott.emp.SAL
 --> emp.deptno(group by)
 --> scott.emp.deptno

Search b.total_sal
 --> total_sal(alias)
 --> aggregate function SUM(sal) 
 --> table scott.emp 
 --> scott.emp.sal

This demo can trace data lineage in views as well. For each column in a view it traces the name or names of the source columns and the expression used (if any). Take this SQL for example:

CREATE VIEW vIMWhseBinID AS 
SELECT DISTINCT timWhseBin.WhseBinKey,
timWhseBin.WhseKey,
timWarehouse.WhseKey AS WhseKey2
FROM timWhseBin
JOIN timWarehouse ON timWhseBin.WhseKey = timWarehouse.WhseKey

run this demo like this:

Java version: java columnImpact test.sql /c
C# version: columnImpact test.sql /v

RESULT(Output):

view=vIMWhseBinID    column=WhseBinKey     source=timWhseBin.WhseBinKey   expression=
view=vIMWhseBinID    column=WhseKey        source=timWhseBin.WhseKey      expression=
view=vIMWhseBinID    column=WhseKey2       source=timWarehouse.WhseKey      expression=

Download this demo: C# version, Java version