Track column lineage by Praveen2112 · Pull Request #7354 · trinodb/trino

Praveen2112 · 2021-03-19T09:11:34Z

This allows us to compute the source column for each field. This is still in a WIP state.

Praveen2112 · 2021-03-19T09:39:59Z

One thing still to be discussed is how we can expose the column lineage details:

As a part of DESCRIBE OUTPUT...
Expose them in the event listener.

kasiafi

A few comments and questions.
Also, could you explain the advantage of this approach over using originTable and originColumnName?

kasiafi · 2021-03-19T11:29:18Z

core/trino-main/src/main/java/io/trino/sql/analyzer/Analysis.java

Adjust the method name.
BTW, I can see no usage of the field referencedFields in Analysis. If you removed it, you could simplify creating originColumnDetails by avoiding one level of mapping by the source node.

core/trino-main/src/main/java/io/trino/sql/analyzer/ExpressionAnalyzer.java

kasiafi · 2021-03-19T11:29:48Z

core/trino-main/src/main/java/io/trino/sql/analyzer/Field.java

Could you explain how it is possible to have more than 1 element in this list?

For example, if we have a field for an expression like func(func2(col1, col2), col3), then we might need to fetch the OriginColumnDetail for col1, col2, and col3.
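
To illustrate why the list can hold more than one element, here is a minimal sketch of collecting the source columns of a nested expression. The Expr, ColumnRef, Call, and LineageSketch types are hypothetical stand-ins invented for this example; they are not Trino's AST or analyzer classes, and the PR itself may collect the references differently.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified expression tree; these are not Trino's AST classes.
interface Expr {}

final class ColumnRef implements Expr
{
    final String name;

    ColumnRef(String name)
    {
        this.name = name;
    }
}

final class Call implements Expr
{
    final String function;
    final List<Expr> arguments;

    Call(String function, List<Expr> arguments)
    {
        this.function = function;
        this.arguments = arguments;
    }
}

final class LineageSketch
{
    private LineageSketch() {}

    // Walks the expression depth-first and gathers every column it references,
    // keeping first-seen order and dropping duplicates.
    static Set<String> sourceColumns(Expr expression)
    {
        Set<String> columns = new LinkedHashSet<>();
        collect(expression, columns);
        return columns;
    }

    private static void collect(Expr expression, Set<String> columns)
    {
        if (expression instanceof ColumnRef) {
            columns.add(((ColumnRef) expression).name);
        }
        else if (expression instanceof Call) {
            for (Expr argument : ((Call) expression).arguments) {
                collect(argument, columns);
            }
        }
    }
}
```

Building func(func2(col1, col2), col3) from these types and passing it to LineageSketch.sourceColumns yields the three columns col1, col2, and col3, so a single output field ends up with three origin entries.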

core/trino-main/src/main/java/io/trino/sql/analyzer/Field.java

core/trino-main/src/main/java/io/trino/sql/analyzer/OriginColumnDetail.java

core/trino-main/src/main/java/io/trino/sql/analyzer/StatementAnalyzer.java

Praveen2112 · 2021-03-19T12:07:57Z

@kasiafi AC

This allows us to track column lineage for each column.

kasiafi

Collecting dependent fields is supported for single-column select expressions.
So, we will get this info for a query like SELECT a, b, c FROM (SELECT ... ).
But we will not get the info for SELECT * FROM (SELECT ...), although they might be equivalent queries. Is this limitation intentional?

I am also concerned about replacing originTable and originColumnName with a list of dependent fields.
In the previous design, we knew that the field is a column reference. Now, we only know that it has a column reference. In some contexts, it might be an important difference.

kasiafi · 2021-03-19T12:20:25Z

core/trino-main/src/main/java/io/trino/sql/analyzer/ExpressionAnalyzer.java

        analyzer.analyze(expression, scope);

        updateAnalysis(analysis, analyzer, session, accessControl);
+        analysis.addReferencedFields(expression, analyzer.getReferencedFields());


referencedFields could probably now be simplified to Set<Field>. I don't think the mapping by source node is used.

kasiafi · 2021-03-19T12:21:02Z

core/trino-main/src/main/java/io/trino/sql/analyzer/Analysis.java

    }

-    public void addReferencedFields(Multimap<NodeRef<Node>, Field> references)
+    public void addReferencedFields(Expression expression, Multimap<NodeRef<Node>, Field> references)


addColumnOriginDetails?

kasiafi · 2021-03-19T12:28:48Z

core/trino-main/src/main/java/io/trino/sql/analyzer/Field.java

        return result.toString();
    }
+
+    public static class OriginColumnDetail


The details are distinct per expression, but if you want to reason about them further, e.g. collect all references from a query, this class will need equals().
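
A minimal sketch of what value-based equality could look like. The class name OriginColumnDetailSketch and the fields tableName and columnName are assumptions made for illustration; the actual fields of OriginColumnDetail are not shown in this thread.

```java
import java.util.Objects;

// Sketch only: tableName and columnName are assumed fields, not taken from the PR.
final class OriginColumnDetailSketch
{
    private final String tableName;
    private final String columnName;

    OriginColumnDetailSketch(String tableName, String columnName)
    {
        this.tableName = Objects.requireNonNull(tableName, "tableName is null");
        this.columnName = Objects.requireNonNull(columnName, "columnName is null");
    }

    // Two details are equal when they point at the same table and column.
    @Override
    public boolean equals(Object o)
    {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        OriginColumnDetailSketch that = (OriginColumnDetailSketch) o;
        return tableName.equals(that.tableName) && columnName.equals(that.columnName);
    }

    // Must be consistent with equals() so instances behave in hash-based sets.
    @Override
    public int hashCode()
    {
        return Objects.hash(tableName, columnName);
    }

    @Override
    public String toString()
    {
        return tableName + "." + columnName;
    }
}
```

With equals() and hashCode() defined this way, collecting the details of several expressions into a HashSet deduplicates repeated references to the same column.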

Praveen2112 · 2021-03-19T13:14:33Z

I am also concerned about replacing originTable and originColumnName with a list of dependent fields.
In the previous design, we knew that the field is a column reference. Now, we only know that it has a column reference. In some contexts, it might be an important difference.

Another approach is to maintain both originTable and originColumnName as well as a list of dependent fields.
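
A rough sketch of that combined shape, to make the "is a column reference" versus "has a column reference" distinction concrete. FieldSketch and its factory methods are hypothetical, and plain strings stand in for the real table and column types; this is not the Field class from the PR.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for Field; strings replace the real table and column types.
final class FieldSketch
{
    // Present only when the field IS a direct column reference.
    private final Optional<String> originTable;
    private final Optional<String> originColumnName;
    // Every column the field depends on, whether direct or through an expression.
    private final List<String> dependentColumns;

    private FieldSketch(Optional<String> originTable, Optional<String> originColumnName, List<String> dependentColumns)
    {
        this.originTable = originTable;
        this.originColumnName = originColumnName;
        this.dependentColumns = List.copyOf(dependentColumns);
    }

    // A field that is exactly one column of one table.
    static FieldSketch columnReference(String table, String column)
    {
        return new FieldSketch(Optional.of(table), Optional.of(column), List.of(table + "." + column));
    }

    // A field computed from an expression; it has column references but is not one.
    static FieldSketch derived(List<String> dependentColumns)
    {
        return new FieldSketch(Optional.empty(), Optional.empty(), dependentColumns);
    }

    boolean isColumnReference()
    {
        return originColumnName.isPresent();
    }

    Optional<String> getOriginTable()
    {
        return originTable;
    }

    Optional<String> getOriginColumnName()
    {
        return originColumnName;
    }

    List<String> getDependentColumns()
    {
        return dependentColumns;
    }
}
```

Callers that only care about lineage read getDependentColumns(), while callers that need the stricter guarantee from the previous design check isColumnReference() first.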

Praveen2112 added the WIP label Mar 19, 2021

Praveen2112 requested review from kasiafi and martint March 19, 2021 09:11

cla-bot bot added the cla-signed label Mar 19, 2021

Praveen2112 force-pushed the praveen/035/column_lineage branch from 61bdf04 to c7fb784 Compare March 19, 2021 09:41

ssheikin self-requested a review March 19, 2021 10:28

kasiafi reviewed Mar 19, 2021

Praveen2112 force-pushed the praveen/035/column_lineage branch from c7fb784 to 2ca4eb0 Compare March 19, 2021 12:06

Praveen2112 added 2 commits March 19, 2021 17:41

Remove unused variable

b74988c

Introduce OriginColumnDetail

9c6e746

This allows us to track column lineage for each column.

Praveen2112 force-pushed the praveen/035/column_lineage branch from 2ca4eb0 to 828dbb8 Compare March 19, 2021 12:11

Track lineage for each field.

ba5b4ef

Praveen2112 force-pushed the praveen/035/column_lineage branch from 828dbb8 to ba5b4ef Compare March 19, 2021 12:13

kasiafi reviewed Mar 19, 2021

Praveen2112 mentioned this pull request Mar 31, 2021

Record column lineage details #7465

Merged

Praveen2112 closed this Mar 31, 2021
