Bound AST, Logical Plan, and Relational Algebra Explained

A deep technical guide to SQL parser ASTs, bound ASTs, Logical IR, Semantic IR, logical plans, and relational algebra for parser, lineage, and governance engineers.

For SQL parser engineers, the phrase “bound AST, logical plan, relational algebra” describes more than a sequence of compiler phases. It is the boundary between syntax and semantics, between recognizing a valid SQL string and understanding what that SQL means.

A raw parser AST can tell you that a query contains a SELECT, a JOIN, a WHERE, and a column reference named id. It cannot, by itself, tell you whether id means orders.id, customers.id, a select-list alias, a correlated outer reference, or an invalid ambiguous column. That requires binding. And once SQL has been bound, most serious analysis tasks—optimization, lineage, policy enforcement, dialect translation, impact analysis—benefit from an intermediate representation closer to relational algebra than to the surface grammar of SQL.

This article explains the layers in depth:

SQL text
  -> parser AST
  -> bound / resolved AST
  -> Semantic IR for lineage and governance questions
  -> Logical IR / logical plan for relational semantics
  -> optimized logical plan
  -> physical plan, execution, or governance output

The exact names differ across systems. PostgreSQL speaks of parse trees, query trees, rewrite, and plans. Apache Calcite uses SqlNode, validation, and RelNode. Spark Catalyst uses unresolved and analyzed logical plans. DuckDB has parsed statements, binding, logical operators, optimization, and physical operators. GoogleSQL/ZetaSQL distinguishes parse AST from resolved AST. But the underlying engineering problem is the same: SQL syntax must be converted into a semantically grounded representation before the system can reason about it reliably.

Executive summary

  • A parser AST is a syntactic tree. It preserves SQL grammar structure but usually does not resolve names, scopes, aliases, catalog objects, data types, or expression dependencies.
  • A bound AST (also called resolved AST, analyzed tree, or semantic AST) attaches meaning: table bindings, column bindings, function/operator resolution, types, scopes, aliases, and sometimes catalog metadata.
  • A Logical IR represents a query using normalized operators such as scan, project, filter, join, aggregate, window, sort, union, and limit. In database engines this is often called a logical plan; in parser SDKs it may be exposed as a reusable intermediate representation.
  • A Semantic IR is a task-oriented semantic representation for questions such as lineage, policy, risk, and catalog validation. It may use logical operators, but it does not have to be a full optimizer-grade logical plan.
  • Relational algebra is the mathematical normalization layer that lets a system reason about equivalence, transformation, optimization, lineage, and policy enforcement.
  • For data lineage and SQL governance, you usually do not need a full database optimizer. You do need enough semantic binding plus a fit-for-purpose Logical IR or Semantic IR to answer: “Which output column depends on which input columns, under which operations and scopes?”

1. Raw AST: necessary, but not sufficient

A SQL parser’s first job is to decide whether text belongs to the language grammar and to produce a structured representation. For example:

SELECT c.id, SUM(o.amount) AS revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.status = 'paid'
GROUP BY c.id;

A grammar-level AST might contain nodes roughly like:

SelectStmt
  SelectList
    ColumnRef(c.id)
    Alias(FunctionCall(SUM, ColumnRef(o.amount)), revenue)
  From
    Join
      TableRef(customers, alias=c)
      TableRef(orders, alias=o)
      On Equals(ColumnRef(o.customer_id), ColumnRef(c.id))
  Where Equals(ColumnRef(o.status), Literal('paid'))
  GroupBy ColumnRef(c.id)

This tree is useful. It preserves the surface structure of the query and gives visitor-based tools a reliable way to traverse expressions. Formatting, simple table extraction, statement classification, and many lint rules can start here.
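The kind of purely syntactic analysis that works at this layer can be sketched as a visitor over toy AST nodes (all class and function names here are illustrative, not any real parser's API):

```python
from dataclasses import dataclass, field

# Hypothetical grammar-level AST nodes; a real parser has many more kinds.
@dataclass
class Node:
    children: list = field(default_factory=list)

@dataclass
class TableRef(Node):
    name: str = ""
    alias: str = ""

@dataclass
class ColumnRef(Node):
    name: str = ""

def walk(node):
    """Depth-first traversal over a syntax tree."""
    yield node
    for child in node.children:
        yield from walk(child)

def referenced_tables(root):
    """Simple table extraction needs no semantics: collect every TableRef."""
    return sorted({n.name for n in walk(root) if isinstance(n, TableRef)})

tree = Node(children=[
    TableRef(name="customers", alias="c"),
    Node(children=[TableRef(name="orders", alias="o"),
                   ColumnRef(name="id")]),
])
print(referenced_tables(tree))  # ['customers', 'orders']
```

Note that the same traversal cannot say which relation the `ColumnRef` for `id` belongs to; that is exactly the gap binding closes.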

But for deeper analysis, this AST is under-specified. It does not necessarily know:

  • whether customers resolves to a table, view, CTE, synonym, or temporary table;
  • whether c.id resolves to catalog.schema.customers.id;
  • whether SUM resolves to a built-in aggregate, an overloaded function, or a dialect-specific construct;
  • whether revenue is visible to ORDER BY, GROUP BY, or another clause in this dialect;
  • whether id would be ambiguous if written unqualified;
  • what type SUM(o.amount) returns;
  • whether an expression is aggregate, windowed, scalar, or invalid in its scope.

This is why AST-only SQL tooling often fails on column-level lineage, permission checks, and Text-to-SQL validation. Syntax is not semantics.

2. Bound AST: where names acquire meaning

A bound AST is a syntax tree whose identifiers have been resolved against scopes and catalog metadata. Different systems use different terms: resolved AST, analyzed AST, query tree, semantic tree, or analyzer output. The essential work is the same.

For each relation reference, the binder/analyzer determines what it denotes:

TableRef(customers AS c)
  -> RelationBinding(
       relation_id = catalog.sales.customers,
       alias = c,
       columns = [id, name, email, created_at, ...]
     )

For each column reference, it determines the exact source column or derived expression:

ColumnRef(c.id)
  -> ColumnBinding(
       relation = catalog.sales.customers,
       column = id,
       type = BIGINT,
       scope = join_input_left
     )

For functions and operators, binding may resolve overloads and infer types:

SUM(o.amount)
  -> FunctionBinding(
       function = aggregate.sum(decimal) -> decimal,
       arguments = [orders.amount],
       aggregate = true
     )

For query blocks, binding builds scopes:

QueryBlock
  input scope: c.*, o.*
  group scope: c.id
  output scope:
    column 1 = c.id
    column 2 = SUM(o.amount) AS revenue

Binding is where many “simple” SQL cases become non-trivial. A robust binder must handle at least:

  • table aliases and column aliases;
  • CTEs and recursive CTEs;
  • derived tables and lateral references;
  • correlated subqueries;
  • view expansion or view references;
  • star expansion (*, t.*, and SELECT * EXCEPT/REPLACE variants in some dialects);
  • ambiguous and unqualified columns;
  • dialect-specific visibility rules for aliases;
  • aggregate and window scope rules;
  • case folding, quoting, and identifier normalization;
  • temporary tables, session state, search path, database/schema resolution;
  • function/operator overloads and type coercion.

This layer is the difference between saying “the query mentions amount” and saying “the output column revenue is derived from sales.orders.amount through an aggregate SUM after filtering sales.orders.status = 'paid'.”
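A minimal sketch of the core mechanism, scope-based column resolution, might look like this (the `Scope` class and its behavior are illustrative assumptions; real binders also track types, catalog links, and dialect rules):

```python
# Each Scope maps visible relation aliases to their column lists. Resolution
# walks from the innermost scope outward, the way a binder resolves
# correlated outer references in subqueries.

class AmbiguousColumn(Exception): pass
class UnresolvedColumn(Exception): pass

class Scope:
    def __init__(self, relations, parent=None):
        self.relations = relations      # {alias: [column, ...]}
        self.parent = parent

    def resolve(self, column, qualifier=None):
        if qualifier is not None:
            if qualifier in self.relations and column in self.relations[qualifier]:
                return (qualifier, column)
        else:
            hits = [a for a, cols in self.relations.items() if column in cols]
            if len(hits) > 1:
                raise AmbiguousColumn(column)   # ambiguous in this scope
            if len(hits) == 1:
                return (hits[0], column)        # innermost scope wins
        if self.parent:                          # try correlated outer scope
            return self.parent.resolve(column, qualifier)
        raise UnresolvedColumn(column)

outer = Scope({"c": ["id", "email"]})
inner = Scope({"o": ["id", "customer_id", "amount"]}, parent=outer)

print(inner.resolve("amount"))             # ('o', 'amount')
print(inner.resolve("email"))              # ('c', 'email'), via the outer scope
print(inner.resolve("id", qualifier="c"))  # ('c', 'id')
```

Even this toy version captures why an unqualified `id` in the inner scope resolves to `o.id` rather than raising an error: ambiguity is judged per scope, innermost first.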

3. Logical plan: from SQL grammar to relational operators

A bound AST is still shaped like SQL syntax. A logical plan is shaped like query semantics.

The earlier SQL query can be represented as a tree of logical operators:

Aggregate(group=[c.id], output=[c.id, SUM(o.amount) AS revenue])
  Filter(condition=(o.status = 'paid'))
    Join(type=inner, condition=(o.customer_id = c.id))
      Scan(table=sales.customers AS c)
      Scan(table=sales.orders AS o)

This form is deliberately less concerned with the spelling of the SQL and more concerned with the meaning of the dataflow:

  • Scan introduces base relations.
  • Join combines relations under a predicate.
  • Filter removes rows.
  • Aggregate groups rows and computes aggregate expressions.
  • Project chooses and computes output expressions.
  • Window computes analytic functions over partitions and orderings.
  • Union, Intersect, and Except combine compatible relation outputs.
  • Sort, Limit, and Offset affect ordering and cardinality.

A parser AST may distinguish many syntactic forms that are semantically equivalent. A logical plan attempts to normalize them into a smaller operator vocabulary. For example, a WHERE predicate, a join predicate, and a subquery predicate may remain distinct in the AST but become filter/join conditions in a plan. This makes transformations easier.
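The small operator vocabulary is what makes this tractable. A sketch of the plan above as plain data (node shapes are illustrative; real engines attach schemas, types, and traits to every node):

```python
from dataclasses import dataclass

# A deliberately tiny logical operator vocabulary.
@dataclass
class Scan:
    table: str
    alias: str

@dataclass
class Join:
    kind: str
    condition: str
    left: object
    right: object

@dataclass
class Filter:
    condition: str
    input: object

@dataclass
class Aggregate:
    group: list
    outputs: list
    input: object

plan = Aggregate(
    group=["c.id"],
    outputs=["c.id", "SUM(o.amount) AS revenue"],
    input=Filter(
        condition="o.status = 'paid'",
        input=Join(
            kind="inner",
            condition="o.customer_id = c.id",
            left=Scan("sales.customers", "c"),
            right=Scan("sales.orders", "o"),
        ),
    ),
)

def base_tables(node):
    """Every analysis starts as a uniform walk over a few operator kinds."""
    if isinstance(node, Scan):
        return [node.table]
    if isinstance(node, Join):
        return base_tables(node.left) + base_tables(node.right)
    return base_tables(node.input)

print(base_tables(plan))  # ['sales.customers', 'sales.orders']
```

Contrast this with an AST walk: here there is no WHERE clause, HAVING clause, or join syntax to special-case, only operators and their inputs.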

At this point it is useful to distinguish two terms that are often mixed together:

  • Logical IR is the normalized relational representation: scans, projections, filters, joins, aggregates, windows, set operations, and expression nodes. It is the representation a database optimizer or SQL analysis engine can transform systematically.
  • Semantic IR is a purpose-built semantic representation for answering product-level questions: column lineage, filter influence, join influence, sensitive-field access, policy violations, risk scoring, and catalog-aware validation.

A Semantic IR may be derived from a bound AST, from a Logical IR, or from both. It may also be intentionally flatter than a full logical plan. For example, a governance-oriented Semantic IR might store a StatementGraph with RelationSource, Projection, Filter, Join, Aggregation, OutputColumnMapping, and ExpressionDependency records. That is not enough to choose an index or produce a physical execution strategy, but it is enough to answer whether revenue depends on orders.amount, whether the value passed through SUM, whether the row set was filtered by orders.status, and whether the result depends on a join predicate.

This distinction matters for SQL parser and lineage products. A database engine usually needs a complete logical plan because it must execute the query. A SQL governance engine may instead need a lightweight Semantic IR that preserves evidence, confidence, diagnostics, and source locations while avoiding the complexity of a cost-based optimizer.

4. Relational algebra: the normalization contract

Relational algebra matters because it gives query systems a language of equivalence. Codd’s relational model introduced the mathematical foundation: relations, projections, selections, joins, set operations, and transformations over them. Modern SQL is much larger than classical relational algebra, but the optimizer and analyzer still rely on algebraic ideas.

Typical relational operators map naturally to SQL constructs:

SQL concept        Relational/logical operator
FROM table         scan / relation variable
SELECT a, b        projection
WHERE p            selection / filter
JOIN ... ON p      join
GROUP BY           aggregation / grouping
HAVING             filter after aggregation
UNION              union
EXCEPT             difference
INTERSECT          intersection
subquery           nested logical expression, semi-join, anti-join, apply, or scalar subplan
window function    window operator over partition/order frame

The algebraic representation enables rules such as:

  • push filters below joins when safe;
  • combine adjacent projections;
  • remove unused columns;
  • reorder inner joins under associativity and commutativity assumptions;
  • convert subqueries into semi-joins or anti-joins;
  • push projections into scans;
  • recognize common subexpressions;
  • reason about output column dependencies.

For an optimizer, these transformations improve performance. For a lineage or governance engine, the same representation improves correctness. If a column is removed by projection, introduced by aggregation, or derived through a window function, the plan representation should make that explicit.
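To make one such rule concrete, here is a sketch of filter pushdown over the tiny operator shapes used earlier. Predicates are modeled as the set of columns they read plus their text, which is enough to decide which join input can absorb them (all shapes are illustrative; a real rule also checks join type and null semantics):

```python
from dataclasses import dataclass

@dataclass
class Scan:
    table: str
    columns: frozenset

@dataclass
class Join:
    condition: str
    left: object
    right: object

@dataclass
class Filter:
    refs: frozenset      # columns the predicate reads
    text: str
    input: object

def output_columns(node):
    if isinstance(node, Scan):
        return node.columns
    if isinstance(node, Join):
        return output_columns(node.left) | output_columns(node.right)
    return output_columns(node.input)

def push_filter_below_join(node):
    """Filter(Join(l, r)) -> Join(Filter(l), r) when the predicate only
    reads one side's columns. Safe for inner joins as sketched here."""
    if isinstance(node, Filter) and isinstance(node.input, Join):
        join = node.input
        if node.refs <= output_columns(join.left):
            return Join(join.condition,
                        Filter(node.refs, node.text, join.left), join.right)
        if node.refs <= output_columns(join.right):
            return Join(join.condition, join.left,
                        Filter(node.refs, node.text, join.right))
    return node

plan = Filter(
    refs=frozenset({"o.status"}),
    text="o.status = 'paid'",
    input=Join("o.customer_id = c.id",
               Scan("customers", frozenset({"c.id", "c.email"})),
               Scan("orders",
                    frozenset({"o.customer_id", "o.amount", "o.status"}))),
)

rewritten = push_filter_below_join(plan)
print(type(rewritten).__name__)        # Join
print(type(rewritten.right).__name__)  # Filter, now below the join
```

The same column-set bookkeeping that justifies the rewrite for an optimizer is the bookkeeping a lineage engine uses to report predicate dependencies.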

5. What mature systems do

The architecture appears repeatedly in mature SQL engines and frameworks, although each system names the phases differently.

PostgreSQL: raw parse tree, parse analysis, rewrite, planner

PostgreSQL is a useful example precisely because it does not describe its internal pipeline with the product-neutral terms “Logical IR” and “Semantic IR.” Instead, it uses its own internal names: raw parse tree, query tree, rewritten query tree, and plan tree. The mapping is still useful for SQL parser and lineage engineers.

PostgreSQL documents the parser stage as consisting of two parts: grammar/lexer parsing and a transformation process. The raw parser checks syntax and builds a parse tree. The transformation process performs parse analysis and builds a query tree. PostgreSQL describes this query tree as the structure that represents information about the tables, columns, and other objects referenced by the query.

In the terminology of this article, PostgreSQL’s query tree after parse analysis is closest to a bound AST / semantic tree:

SQL text
  -> raw parse tree                -- grammar-shaped syntax
  -> query tree after analysis     -- names, scopes, relation references, target lists

This is where PostgreSQL has moved beyond syntax. Relation names have been looked up, column references have been analyzed, target lists have been built, expression nodes have type and function/operator information, and the tree is suitable for rewrite and planning. PostgreSQL does not call this a “Semantic IR,” but it plays the same role as the semantically resolved representation that a lineage or governance tool needs before it can trust column references.

The rewrite system then takes query trees as input and produces query trees as output. PostgreSQL’s documentation describes rewrite as a module between the parser stage and the planner/optimizer; it is often easiest to think of it as semantics-preserving expansion over the analyzed query tree, for example for views and rules:

analyzed query tree
  -> rewrite system
  -> rewritten query tree

The planner/optimizer then takes the rewritten query tree and produces executable plan alternatives and, eventually, a selected execution plan. PostgreSQL’s plan nodes are closer to what database engineers usually mean by a logical/physical planning representation, although PostgreSQL’s internal plan tree is designed for optimization and execution rather than for an external SDK-style “Logical IR.”

A practical mapping is:

  • Parser AST / raw AST -> raw parse tree. Grammar-shaped syntax produced before catalog lookups.
  • Bound AST / semantic tree -> query tree after parse analysis. Names, relation references, target entries, expression semantics, and types are resolved enough for rewrite and planning.
  • Semantic IR -> no separate public PostgreSQL layer. A lineage/governance product could derive one from the analyzed or rewritten query tree; PostgreSQL itself does not expose a task-oriented lineage/policy IR as a product API.
  • Logical IR / logical plan -> planner representation / plan trees. PostgreSQL plans are optimizer/execution-oriented; they are not the same as a portable SDK RelNode-style IR, but they perform the role of normalized planning structures inside the engine.
  • Physical plan -> selected executable plan. Contains concrete execution operators such as scans, joins, sorts, aggregates, and access paths.

So the precise statement is: PostgreSQL has semantically analyzed query trees and planner plan trees; it does not have a public layer literally named Logical IR or Semantic IR. If we use the terms from this article, PostgreSQL’s analyzed query tree corresponds most closely to the bound/semantic representation, while its planner structures correspond to the logical/physical planning side. A governance-oriented Semantic IR would be an additional product-facing layer derived from PostgreSQL-like semantic information, not something PostgreSQL itself exposes as a ready-made lineage API.

Apache Calcite: SqlNode, validation, RelNode

Apache Calcite is one of the clearest examples of SQL-to-relational-algebra architecture. Its documentation states that relational algebra is at the heart of Calcite: every query is represented as a tree of relational operators, and planner rules transform expression trees using semantics-preserving identities.

A typical Calcite pipeline is:

SQL string
  -> SqlNode parse tree
  -> validated SqlNode with names and types
  -> RelNode relational expression tree
  -> rule-based/cost-based optimization

Calcite is widely embedded because the relational algebra layer is not tied to one storage engine. Adapters can expose different data sources as schemas, while rules and planners operate on a common relational representation.

Spark Catalyst: unresolved plan to analyzed logical plan

Spark SQL’s Catalyst optimizer is another influential design. Catalyst represents computations as trees and applies rules to transform them. A parsed query begins with unresolved attributes and relations. The analyzer uses catalog information to resolve them, producing an analyzed logical plan. Optimization rules then transform the plan before physical planning.

The key idea for parser engineers is the explicit distinction between unresolved syntax and analyzed semantics. Until an attribute is resolved, it is unsafe to treat it as a real column dependency.

DuckDB: parser, binder, logical operators, optimizer, physical operators

DuckDB follows the conventional database pipeline: parse SQL, bind names and types, produce logical operators, optimize, then produce physical operators. The binder is where table and column names become catalog-bound references. The logical operator tree is then a compact representation for optimization and execution.

This separation is one reason DuckDB can be both SQL-rich and embeddable: surface syntax is separated from semantic binding and execution planning.

Trino/Presto: analysis and plan nodes

Trino analyzes SQL statements with scopes, relation types, and expression analysis before producing and optimizing plan nodes. Its optimizer includes cost-based and rule-based optimizations, predicate pushdown, projection pushdown, join-related transformations, and connector pushdown.

For federation engines like Trino, a logical plan is also a negotiation layer: some operations can be pushed into connectors, while others must be executed by the engine.

Apache DataFusion: SQL parser to logical plan and optimizer rules

Apache DataFusion, built on Apache Arrow, exposes logical plans and optimizer rules as reusable Rust components. Its optimizer documentation describes a modular optimizer for logical plans and physical plans, with analyzer, logical optimizer, and physical optimizer rules.

This is a good example of logical plans as a library interface, not only an internal database implementation detail.

GoogleSQL / ZetaSQL: parse AST vs resolved AST

GoogleSQL’s ZetaSQL documentation explicitly distinguishes the parser AST from the analyzer’s resolved AST. The resolved AST contains nodes such as resolved columns and resolved scans. This is close to what many parser vendors mean by a bound AST: an AST-like representation after names, types, and catalog references have been resolved.

SQLGlot: AST-first, with optimizer and lineage utilities

SQLGlot is a practical open-source SQL parser and transpiler. Its AST is expression-oriented and convenient for traversal and transformation. SQLGlot also includes scope, qualification, optimization, and lineage utilities. It demonstrates a common engineering path for SQL tooling: start with a strong AST, then add semantic layers when transpilation, qualification, or lineage requires more than syntax.

6. Bound AST vs logical plan: which one should a parser expose?

For a parser or SQL analysis SDK, this is a product and API design question.

A bound AST is often better when users care about preserving source-level structure:

  • IDE features;
  • formatting with semantic awareness;
  • precise error messages;
  • source-to-source SQL rewriting;
  • dialect transpilation that must preserve user intent;
  • clause-level explanations;
  • mapping findings back to source spans.

A logical plan is better when users care about normalized semantics:

  • column-level lineage;
  • impact analysis;
  • policy checks;
  • query risk scoring;
  • optimization;
  • equivalence reasoning;
  • dependency graphs;
  • cross-dialect normalization.

In practice, mature tooling often needs both. The bound AST keeps a connection to the original SQL. The logical plan gives a cleaner semantic graph.

A useful internal design is:

Parser AST
  - source spans
  - grammar nodes
  - dialect-specific syntax

Bound AST
  - all AST nodes plus relation/column/function/type bindings
  - scopes and aliases
  - catalog links

Logical IR
  - normalized relational operators
  - expression trees
  - output schema
  - relation provenance

Semantic IR
  - output column mapping
  - value/filter/join/aggregation dependencies
  - evidence, confidence, diagnostics
  - task-specific governance facts

This does not require every layer to be a full database optimizer. For governance workloads, the Semantic IR can be intentionally lightweight, while the Logical IR can grow over time toward a fuller relational representation.

7. Logical IR and Semantic IR for lineage and governance

A lineage engine, SQL Guard, or catalog-aware validator does not necessarily need to estimate costs, choose indexes, or generate executable physical plans. But it does need a faithful semantic representation. In practice, there are two useful levels.

A practical Logical IR should model:

  • relation sources: table, view, CTE, subquery, function table, temporary table;
  • relational operators: project, filter, join, aggregate, window, union, except, intersect, sort, limit;
  • expression trees: column references, literals, function calls, casts, CASE expressions, predicates, and subqueries;
  • scope boundaries: query blocks, CTEs, lateral references, correlated subqueries;
  • catalog binding: normalized table and column identifiers;
  • dialect annotations: syntax or behavior that affects semantics.

A practical Semantic IR for governance should model the answers users actually ask for:

  • relation sources with table, view, CTE, and subquery identity;
  • projections and output column mappings;
  • expression dependencies for value lineage;
  • filter dependencies for row influence;
  • join-condition dependencies for row matching and multiplicity;
  • aggregation and group-key transit;
  • confidence, evidence, diagnostics, and source anchors.

For the sample query, a governance-oriented Semantic IR might produce:

{
  "outputs": [
    {
      "name": "id",
      "expression": "c.id",
      "depends_on": ["sales.customers.id"]
    },
    {
      "name": "revenue",
      "expression": "SUM(o.amount)",
      "depends_on": ["sales.orders.amount"],
      "operation": "aggregate:sum"
    }
  ],
  "filters": [
    {
      "expression": "o.status = 'paid'",
      "depends_on": ["sales.orders.status"]
    }
  ],
  "joins": [
    {
      "type": "inner",
      "condition": "o.customer_id = c.id",
      "depends_on": [
        "sales.orders.customer_id",
        "sales.customers.id"
      ]
    }
  ]
}

This is not an execution plan. It is a semantic dependency plan. It is sufficient for many data governance tasks, and it avoids the complexity of pretending to be a full DBMS.

One useful way to think about the relationship is:

Bound AST
  -> resolves what every identifier means

Logical IR
  -> normalizes the query into relational operators

Semantic IR
  -> packages the semantic facts needed by lineage, validation, policy, risk, and integration APIs

A system may build Semantic IR directly from a bound AST for a first implementation. Over time, it can derive more of that Semantic IR from a richer Logical IR. The key is not the class name; the key is preserving enough binding, dependency, scope, and evidence information to make the answer trustworthy.
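One way to picture the “answer-shaped” packaging step is as a reshaping of bound facts into records that directly support product questions. The input tuples below stand in for a bound representation of the sample query; all names and shapes are illustrative assumptions:

```python
# Hand-written stand-in for bound/logical facts about the sample query.
bound = {
    "outputs": [
        ("id", "c.id", ["sales.customers.id"], None),
        ("revenue", "SUM(o.amount)", ["sales.orders.amount"], "aggregate:sum"),
    ],
    "filters": [("o.status = 'paid'", ["sales.orders.status"])],
    "joins": [("inner", "o.customer_id = c.id",
               ["sales.orders.customer_id", "sales.customers.id"])],
}

def to_semantic_ir(bound):
    """Package operator-shaped facts into answer-shaped records."""
    return {
        "outputs": [
            {"name": n, "expression": e, "depends_on": d,
             **({"operation": op} if op else {})}
            for n, e, d, op in bound["outputs"]
        ],
        "filters": [{"expression": e, "depends_on": d}
                    for e, d in bound["filters"]],
        "joins": [{"type": t, "condition": c, "depends_on": d}
                  for t, c, d in bound["joins"]],
    }

ir = to_semantic_ir(bound)

def passed_through(ir, name, op):
    """Answer a product-level question: did this output transit operation op?"""
    col = next(o for o in ir["outputs"] if o["name"] == name)
    return col.get("operation") == op

print(passed_through(ir, "revenue", "aggregate:sum"))  # True
```

The reshaping itself is trivial; the value lies in choosing records that callers can consume without re-deriving scopes and bindings themselves.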

8. Why AST-only lineage fails

Column-level lineage is a good stress test. Consider:

WITH paid_orders AS (
  SELECT customer_id, amount
  FROM orders
  WHERE status = 'paid'
), ranked AS (
  SELECT
    customer_id,
    amount,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY amount DESC
    ) AS rn
  FROM paid_orders
)
SELECT c.email, r.amount AS top_order_amount
FROM customers c
JOIN ranked r ON r.customer_id = c.id
WHERE r.rn = 1;

An AST can list column references: customer_id, amount, status, email, rn, id. But lineage needs more:

  • r.amount comes from CTE ranked.amount;
  • ranked.amount comes from CTE paid_orders.amount;
  • paid_orders.amount comes from orders.amount;
  • r.rn is derived from a window function over paid_orders.customer_id and paid_orders.amount;
  • c.email comes from customers.email;
  • the join condition depends on orders.customer_id and customers.id through CTE propagation.

Without scopes and bindings, the tool can only guess. With a logical IR, dependency propagation becomes systematic.
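The systematic propagation can be sketched as a transitive closure over per-scope dependency maps. The mappings below are hand-written for the example query; a real engine derives them from the bound AST:

```python
# Each scope maps its output columns to the scope.column references they
# read in the layer below. Closing over the maps yields base-table lineage.
scopes = {
    "paid_orders": {"customer_id": {"orders.customer_id"},
                    "amount": {"orders.amount"}},
    "ranked": {"customer_id": {"paid_orders.customer_id"},
               "amount": {"paid_orders.amount"},
               # window function: depends on partition and order columns
               "rn": {"paid_orders.customer_id", "paid_orders.amount"}},
    "query": {"email": {"customers.email"},
              "top_order_amount": {"ranked.amount"}},
}

def base_deps(ref):
    """Resolve a scope.column reference transitively to base-table columns."""
    scope, _, column = ref.partition(".")
    if scope not in scopes:
        return {ref}                      # reached a base table: stop
    deps = set()
    for upstream in scopes[scope][column]:
        deps |= base_deps(upstream)
    return deps

print(sorted(base_deps("query.top_order_amount")))  # ['orders.amount']
print(sorted(base_deps("query.email")))             # ['customers.email']
print(sorted(base_deps("ranked.rn")))
# ['orders.amount', 'orders.customer_id']
```

The hard part in practice is not this closure but building the per-scope maps correctly, which is exactly the binder's job.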

9. Common pitfalls for parser and lineage engineers

Pitfall 1: treating every ColumnRef as globally resolvable

A column reference is meaningful only inside a scope. CTEs, subqueries, aliases, lateral joins, and correlated references can all change resolution.

Pitfall 2: ignoring dialect-specific alias rules

Some dialects allow select-list aliases in GROUP BY or ORDER BY; others differ in subtle ways. A binder must implement dialect semantics, not generic SQL intuition.

Pitfall 3: expanding * too late—or too early

Star expansion requires catalog knowledge and scope. But preserving * may be necessary for source mapping or for dynamic schemas. Decide whether the bound layer, logical layer, or final lineage output owns expansion.

Pitfall 4: confusing syntactic dependency with data dependency

A filter column may not appear in the output, but it still influences which rows appear. Governance systems may need both projection lineage and predicate lineage.

Pitfall 5: flattening aggregates and windows into ordinary functions

Aggregates and windows change cardinality, scope, and dependency semantics. Treating SUM(x) and ROW_NUMBER() like scalar functions loses important meaning.

Pitfall 6: treating Logical IR and Semantic IR as the same thing

A Logical IR is usually operator-shaped. A Semantic IR is answer-shaped. If a governance product exposes only a logical plan, users still have to compute lineage, policy, risk, and diagnostics themselves. If it exposes only a task-specific semantic summary, engineers may lose a reusable relational layer for deeper analysis. Keep the distinction explicit.

Pitfall 7: losing source locations

Logical and Semantic IRs are normalized, but users still need diagnostics that point back to SQL text. Keep source spans or back-pointers from IR nodes to AST nodes when possible.

10. How this maps to SQL parser products

For a SQL parser library, the raw AST is the entry point, but the higher-value layer is usually semantic analysis. In practical SQL tooling, customers often ask questions such as:

  • Which table does this column belong to?
  • Which source columns produce this output field?
  • Is this generated SQL safe to execute?
  • Does this query access a restricted or sensitive column?
  • What dashboards, reports, or pipelines are affected if this column changes?
  • Can this SQL be translated to another dialect without changing meaning?

These questions require binding and often both a normalized Logical IR and a task-oriented Semantic IR.

In Gudu’s product family, this is the role of the shared SQL analysis and lineage engine. General SQL Parser (GSP) is the embeddable Java library for parsing, semantic resolution, column-to-table resolution, and lineage extraction across many SQL dialects. SQLFlow packages the same lineage capability as an application platform with APIs, visualization, widgets, batch processing, and deployment options. SQL Omni brings offline lineage and SQL inspection into VS Code.

The product distinction matters less than the architectural principle: parser AST, semantic binding, Logical IR, and Semantic IR are separate layers. Exposing those layers clearly makes the engine more useful to platform teams, data governance teams, and AI application teams. It also lets a mature lineage engine coexist with newer semantic APIs: the production lineage path can remain stable while a Semantic IR layer adds evidence, confidence, diagnostics, and modern integration surfaces.

11. Design checklist: what to expose in a serious SQL analyzer

If you are designing a SQL parser or lineage SDK, consider exposing these artifacts explicitly:

  1. Raw AST
    – source spans;
    – dialect-specific nodes;
    – comments and formatting if needed.

  2. Diagnostic stream
    – syntax errors;
    – semantic errors;
    – ambiguous references;
    – unsupported dialect features;
    – partial-analysis warnings.

  3. Catalog interface
    – table lookup;
    – column lookup;
    – function/operator lookup;
    – view expansion policy;
    – case sensitivity and search path rules.

  4. Bound AST or resolved tree
    – relation bindings;
    – column bindings;
    – function bindings;
    – type information;
    – query-block scopes;
    – alias resolution.

  5. Logical IR
    – scan/project/filter/join/aggregate/window/set operators;
    – expression trees;
    – output schema;
    – normalized relation flow;
    – source-to-IR links.

  6. Semantic IR
    – output column mappings;
    – value dependencies;
    – filter dependencies;
    – join-condition dependencies;
    – aggregation and group-key dependencies;
    – confidence, evidence, and diagnostics.

  7. Lineage and governance outputs
    – column-level lineage;
    – table-level dependencies;
    – sensitive field access;
    – policy violations;
    – risk signals;
    – explainable diagnostics.

This interface is more useful than a monolithic “parse result” object because different users need different levels of abstraction.
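As one concrete illustration of item 3, the catalog interface can be kept deliberately thin, for example as a Python Protocol with a test-friendly in-memory implementation (all names here are hypothetical, not any particular SDK's API):

```python
from typing import Optional, Protocol

class Catalog(Protocol):
    """Sketch of a catalog interface the binder depends on."""
    def lookup_table(self, name: str) -> Optional[dict]: ...
    def lookup_columns(self, table: str) -> list[str]: ...

class DictCatalog:
    """In-memory snapshot catalog, convenient for tests and offline lineage."""
    def __init__(self, tables: dict[str, list[str]]):
        self.tables = tables

    def lookup_table(self, name: str) -> Optional[dict]:
        cols = self.tables.get(name.lower())       # naive case folding
        return {"name": name.lower(), "columns": cols} if cols else None

    def lookup_columns(self, table: str) -> list[str]:
        return self.tables.get(table.lower(), [])

catalog = DictCatalog(
    {"sales.orders": ["id", "customer_id", "amount", "status"]}
)
print(catalog.lookup_columns("SALES.ORDERS"))
# ['id', 'customer_id', 'amount', 'status']
```

Putting the catalog behind an interface lets the same binder run against a live database, a metadata snapshot, or fixtures, and makes case folding and search-path policy an explicit, swappable decision rather than an accident of one backend.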

12. Conclusion

A SQL parser AST tells you what the query says. A bound AST tells you what the identifiers mean. A Logical IR or logical plan tells you how relations flow through operations. A Semantic IR packages the semantic facts that lineage, validation, policy, risk, and integration APIs need. Relational algebra gives the system a normalization contract for reasoning about equivalence, transformation, lineage, and governance.

For database engines, these layers lead to execution plans. For SQL parser vendors and data governance tools, they lead to semantic validation, column-level lineage, impact analysis, SQL risk scoring, and safer AI-generated SQL.

The most practical architecture is not “AST versus logical plan.” It is a layered pipeline:

syntax -> binding -> Logical IR -> Semantic IR -> task-specific output

If your tool needs to answer semantic questions, do not stop at the raw AST. Build or expose the binding layer. Then decide which intermediate representation your users need: a Logical IR for normalized relational semantics, a Semantic IR for explainable governance answers, or both. For many SQL lineage and SQL Guard products, a lightweight Semantic IR on top of reliable binding is the fastest practical step toward a full SQL semantic governance engine.

References and further reading

  • E. F. Codd, “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM, 1970.
  • P. Griffiths Selinger et al., “Access Path Selection in a Relational Database Management System,” SIGMOD 1979.
  • Goetz Graefe, “The Volcano Optimizer Generator: Extensibility and Efficient Search,” ICDE 1993.
  • Goetz Graefe, “The Cascades Framework for Query Optimization,” IEEE Data Engineering Bulletin, 1995.
  • PostgreSQL documentation: “The Parser Stage,” “The Rule System,” and “Planner/Optimizer.”
  • Apache Calcite documentation: “Algebra” and planner rules over RelNode relational expressions.
  • Michael Armbrust et al., “Spark SQL: Relational Data Processing in Spark,” SIGMOD 2015.
  • Apache DataFusion documentation: Query Optimizer and logical plan optimizer rules.
  • GoogleSQL / ZetaSQL documentation: parser AST and resolved AST.
  • SQLGlot documentation and AST primer.