Track column lineage #7354

Closed
Praveen2112 wants to merge 3 commits into trinodb:master from Praveen2112:praveen/035/column_lineage

@Praveen2112
Member

This allows us to compute the source column for each field. This is still in a WIP state.

@Praveen2112
Member Author

One thing to discuss is how we should expose the column lineage details. Two options:

  1. As part of DESCRIBE OUTPUT...
  2. Through the event listener.

@Praveen2112 Praveen2112 force-pushed the praveen/035/column_lineage branch from 61bdf04 to c7fb784 Compare March 19, 2021 09:41
@ssheikin ssheikin self-requested a review March 19, 2021 10:28
Member

@kasiafi kasiafi left a comment

A few comments and questions.
Also, could you explain the advantage of this approach over using originTable and originColumnName?

Member

Adjust the method name.
BTW, I can see no usage of the field referencedFields in Analysis. If you removed it, you could simplify creating originColumnDetails by avoiding one level of mapping by the source node.

Member

Could you explain how it is possible to have more than 1 element in this list?

Member Author

For example, if a field comes from an expression like func(func2(col1, col2), col3), we need to fetch the OriginColumnDetail for col1, col2, and col3.
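
To make the nested case concrete, here is a minimal sketch of collecting every column referenced by such an expression. The `Expr`, `ColumnRef`, and `FunctionCall` types and the `referencedColumns` helper are hypothetical stand-ins invented for illustration; they are not the analyzer's actual AST classes or the code in this PR.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ColumnReferenceCollector {
    // Hypothetical minimal expression tree; these are not Trino's AST classes.
    interface Expr {}

    static final class ColumnRef implements Expr {
        final String name;

        ColumnRef(String name) {
            this.name = name;
        }
    }

    static final class FunctionCall implements Expr {
        final String name;
        final List<Expr> arguments;

        FunctionCall(String name, Expr... arguments) {
            this.name = name;
            this.arguments = Arrays.asList(arguments);
        }
    }

    // Walks the expression depth-first and gathers every referenced column,
    // in first-seen order and without duplicates.
    static Set<String> referencedColumns(Expr expression) {
        Set<String> columns = new LinkedHashSet<>();
        collect(expression, columns);
        return columns;
    }

    private static void collect(Expr expression, Set<String> columns) {
        if (expression instanceof ColumnRef) {
            columns.add(((ColumnRef) expression).name);
        }
        else if (expression instanceof FunctionCall) {
            for (Expr argument : ((FunctionCall) expression).arguments) {
                collect(argument, columns);
            }
        }
    }

    public static void main(String[] args) {
        // func(func2(col1, col2), col3)
        Expr expression = new FunctionCall("func",
                new FunctionCall("func2", new ColumnRef("col1"), new ColumnRef("col2")),
                new ColumnRef("col3"));
        System.out.println(referencedColumns(expression)); // prints [col1, col2, col3]
    }
}
```

A single output field can therefore map to several origin columns, which is why the list may hold more than one element.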

@Praveen2112 Praveen2112 force-pushed the praveen/035/column_lineage branch from c7fb784 to 2ca4eb0 Compare March 19, 2021 12:06
@Praveen2112
Member Author

@kasiafi AC

This allows us to track column lineage for each column.
@Praveen2112 Praveen2112 force-pushed the praveen/035/column_lineage branch from 2ca4eb0 to 828dbb8 Compare March 19, 2021 12:11
@Praveen2112 Praveen2112 force-pushed the praveen/035/column_lineage branch from 828dbb8 to ba5b4ef Compare March 19, 2021 12:13
Member

@kasiafi kasiafi left a comment

Collecting dependent fields is supported for single-column select expressions.
So, we will get this info for a query like SELECT a, b, c FROM (SELECT ... ).
But we will not get the info for SELECT * FROM (SELECT ...), although they might be equivalent queries. Is this limitation intentional?

I am also concerned about replacing originTable and originColumnName with a list of dependent fields.
In the previous design, we knew that the field is a column reference. Now, we only know that it has a column reference. In some contexts, it might be an important difference.

analyzer.analyze(expression, scope);

updateAnalysis(analysis, analyzer, session, accessControl);
analysis.addReferencedFields(expression, analyzer.getReferencedFields());
Member

referencedFields could now probably be simplified to Set<Field>. I don't think the mapping by source node is used.

}

- public void addReferencedFields(Multimap<NodeRef<Node>, Field> references)
+ public void addReferencedFields(Expression expression, Multimap<NodeRef<Node>, Field> references)
Member

addColumnOriginDetails ?

return result.toString();
}

public static class OriginColumnDetail
Member

The details are distinct per expression, but if you want to reason about them further, e.g. collect all references from a query, this class will need equals().
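
As a rough illustration of this point, the sketch below gives a value class `equals()` and `hashCode()` so that references gathered from several expressions collapse into a distinct set. The `tableName` and `columnName` fields and the `distinct` helper are assumptions made for the example; the real `OriginColumnDetail` in this PR may carry different fields.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class OriginColumnDetailExample {
    // Field names are assumed for illustration; the PR's class may differ.
    static final class OriginColumnDetail {
        private final String tableName;
        private final String columnName;

        OriginColumnDetail(String tableName, String columnName) {
            this.tableName = Objects.requireNonNull(tableName, "tableName is null");
            this.columnName = Objects.requireNonNull(columnName, "columnName is null");
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (o == null || getClass() != o.getClass()) {
                return false;
            }
            OriginColumnDetail that = (OriginColumnDetail) o;
            return tableName.equals(that.tableName) && columnName.equals(that.columnName);
        }

        @Override
        public int hashCode() {
            return Objects.hash(tableName, columnName);
        }

        @Override
        public String toString() {
            return tableName + "." + columnName;
        }
    }

    // Collects references from many expressions into a set; duplicates only
    // collapse because equals() and hashCode() are value-based.
    static Set<OriginColumnDetail> distinct(OriginColumnDetail... details) {
        return new HashSet<>(Arrays.asList(details));
    }

    public static void main(String[] args) {
        Set<OriginColumnDetail> all = distinct(
                new OriginColumnDetail("orders", "price"),
                new OriginColumnDetail("orders", "price"),
                new OriginColumnDetail("orders", "tax"));
        System.out.println(all.size()); // prints 2
    }
}
```

Without the overrides, the two `orders.price` instances would be treated as different objects and the set would hold three entries.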

@Praveen2112
Member Author

> I am also concerned about replacing originTable and originColumnName with a list of dependent fields.
> In the previous design, we knew that the field is a column reference. Now, we only know that it has a column reference. In some contexts, it might be an important difference.

Another approach is to keep both originTable and originColumnName and also maintain a list of dependent fields.
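
One way to picture keeping both pieces of information is sketched below. The `FieldLineage` class, its factory methods, and the string-based column names are hypothetical and exist only to show how "is a column reference" and "has column references" could be answered separately; this is not the PR's `Field` class.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

public class FieldLineageExample {
    // Hypothetical holder, not Trino's Field class.
    static final class FieldLineage {
        private final Optional<String> originTable;
        private final Optional<String> originColumnName;
        private final Set<String> dependentColumns;

        private FieldLineage(Optional<String> originTable, Optional<String> originColumnName, Set<String> dependentColumns) {
            this.originTable = originTable;
            this.originColumnName = originColumnName;
            this.dependentColumns = Collections.unmodifiableSet(dependentColumns);
        }

        // The field is exactly one source column, e.g. SELECT price FROM orders.
        static FieldLineage directReference(String table, String column) {
            Set<String> dependencies = new LinkedHashSet<>();
            dependencies.add(table + "." + column);
            return new FieldLineage(Optional.of(table), Optional.of(column), dependencies);
        }

        // The field is computed from source columns, e.g. SELECT price + tax FROM orders.
        static FieldLineage derived(String... qualifiedColumns) {
            return new FieldLineage(Optional.empty(), Optional.empty(), new LinkedHashSet<>(Arrays.asList(qualifiedColumns)));
        }

        // True only when the field IS a column reference.
        boolean isColumnReference() {
            return originTable.isPresent() && originColumnName.isPresent();
        }

        // Every column the field depends on, whether direct or derived.
        Set<String> getDependentColumns() {
            return dependentColumns;
        }
    }

    public static void main(String[] args) {
        FieldLineage direct = FieldLineage.directReference("orders", "price");
        FieldLineage computed = FieldLineage.derived("orders.price", "orders.tax");
        System.out.println(direct.isColumnReference());   // prints true
        System.out.println(computed.isColumnReference()); // prints false
        System.out.println(computed.getDependentColumns().size()); // prints 2
    }
}
```

Under this assumed shape, callers that relied on the previous design can still ask whether a field is a plain column reference, while lineage consumers read the full dependency set.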
