Instantly get the data lineage from complex SQL statements

Data lineage plays an important role in Data Governance, you can refer to this blog for why data lineage is so important for an enterprise and how can we correctly use the data lineage. In this article, we will introduce how to use Gudu SQLFlow UI, SQLFlow Rest API and SQLFlow Java library to retrieve data lineage from complex SQL statements with which you can potentially integrate in your own data governance platform.

Automation with SQLFlow in building data lineage

A complex SQL statement example

select data.*, location.remote
from (
	select e.last_name, e.department_id, d.department_name
	from employees e
	left outer join department d
		on (e.department_id = d.department_id)
) data inner join
(
	select s.remote,s.department_id 
	from source s
	inner join location l 
	on s.location_id = l.id
) location on data.department_id = location.department_id;

Above SQL has some complexities and let’s take this SQL statement as an example source of retrieving data lineage with SQLFlow. Note that in the real situation, our SQL statements might be far more complicated than this example especially when we have views or stored procedures to deal with.

To generate the correspond data lineage, we need to know which fields or columns are included in the top-level select list, and what is the data source for these fields or columns comes from. After some analysis, an ideal result should be as shown in the figure below:

Data Lineage for Top Level Select List

To archive the above results, we have to:

1.Expand the asterisk (* in the SQL statement) in data.* to find the corresponding fields

The asterisk in the Select list stands for all fields of the relation, we will need to retrieve the exact fields to generate the data lineage. In this example, Gudu SQLFlow is able to get the fields info based on the contextual information provided in the SQL statement. Nevertheless, most of SQL statements are not able to provide sufficient information for us to detect what are the fields in the relation of the asterisk. In that case, giving metadata would be necessary to generate the result. You can refer to this article for how to get metadata.

2.Perform data tracing for each field in the select list to find the original tables and fields. This process may require multi-level tracing until we find the final data source as the result.

Gudu SQLFlow UI

You can check our online Gudu SQLFlow Cloud. This is probably the most easiest way to instantly archive data lineage cause no toolkit or software is required on your side. The usage is quite simple, directly paste the SQL statement to be processed to the input area, and then click Visualize button.

Gudu SQLFlow Data Lineage

If you prefer to not display the intermediate processing steps in the data lineage, you can show the source data with target data by config the setting as:

Gudu SQLFlow Setting

Download the data lineage in Json for further processing:

Gudu SQLFlow Data Lineage in JSON

Gudu SQLFlow Rest API

Calling SQLFlow Rest API in your code to instantly get the data lineage from SQL statements. You can embed the request in your code and processing the result after getting the Rest Api response.

A Gudu SQLFlow Cloud Premium Account is required to use the Rest Api feature. Check our docs here to get the way of calling our Rest Api.

You will also need to install SQLFlow On-Premise on your server if you would like to use SQLFlow internally.

You can also check our sample code, the sample code is in Python:

  • Sample code of connecting to the Gudu SQLFlow server ( Gudu SQLFlow on-premise server ) deployed on your own server.
  • Sample code with SQLFlow public cloud server

Check here for the data lineage generated by the above sample code.

Gudu SQLFlow Java library

Use the Gudu SQLFlow Java library to analyze the SQL statements and get the data lineage. The advantage of using the Gudu SQLFlow Java class library is that you don’t need to install the Gudu SQLFlow server software and it can be executed independently without relying on any third-party Java class library, easy to integrate into your own project.

We’ve prepared a Java file for you which you can directly use without compilation as long as you have Java install on your machine. To get the select list in the above example, you can run the Java file with the following parameters:

/s /topselectlist

The result by default is in XML. Check here for the result of the execution. You can config the result in Json as well by adding the /json flag.

Newsletter Updates

Enter your email address below to subscribe to our newsletter