Record column lineage details #7465

Merged
Praveen2112 merged 2 commits into trinodb:master from Praveen2112:praveen/042/column_lineage
Apr 8, 2021

Conversation

@Praveen2112
Member

Overrides #7354

Member

Do you have a test for CREATE TABLE LIKE, where we should see that the columns have the right source columns set?

Member Author

In the case of LIKE we don't capture the data, right? We just reuse the name and type.

@Praveen2112 Praveen2112 force-pushed the praveen/042/column_lineage branch 2 times, most recently from 0814ffc to 0032ddd Compare March 31, 2021 13:24
@Praveen2112 Praveen2112 requested a review from phd3 March 31, 2021 13:24
Member

@skrzypo987 skrzypo987 left a comment

Skimmed, looks ok

@Praveen2112 Praveen2112 force-pushed the praveen/042/column_lineage branch from 0032ddd to 4fffa4c Compare April 1, 2021 12:40
@Praveen2112
Member Author

@kokosing, @skrzypo987 AC

@Praveen2112 Praveen2112 force-pushed the praveen/042/column_lineage branch from 5af1567 to c7bb2dd Compare April 2, 2021 15:27
@Praveen2112
Member Author

@kasiafi Thanks for the review. AC

@Praveen2112 Praveen2112 force-pushed the praveen/042/column_lineage branch 3 times, most recently from b3702ed to 6c36a33 Compare April 6, 2021 07:33
@Praveen2112 Praveen2112 force-pushed the praveen/042/column_lineage branch from e173bb5 to e1e7426 Compare April 6, 2021 12:00
@Praveen2112
Member Author

@kokosing Added tests

@Praveen2112 Praveen2112 requested a review from kokosing April 6, 2021 12:16
@Praveen2112 Praveen2112 force-pushed the praveen/042/column_lineage branch 4 times, most recently from bde7caa to f7951f3 Compare April 7, 2021 10:09
Member

@kasiafi kasiafi left a comment

A couple of comments.

I went through the code. However, I could not verify that every site that should expose source columns is covered. The same goes for the test coverage.

@Praveen2112
Member Author

> I could not verify that every site that should expose source columns is covered. The same goes for the test coverage.

I guess we could cover all the sites incrementally. It is kind of experimental for now. WDYT?

@Praveen2112 Praveen2112 force-pushed the praveen/042/column_lineage branch from f7951f3 to 91d4501 Compare April 8, 2021 06:51
@Praveen2112
Member Author

@kasiafi AC

@Praveen2112 Praveen2112 requested a review from kasiafi April 8, 2021 07:33
@kasiafi
Member

kasiafi commented Apr 8, 2021

Looks good to me, provided that more consistent column reporting is added as a follow-up.

@Praveen2112
Member Author

Praveen2112 commented Apr 8, 2021

> provided that more consistent column reporting is added as a follow-up.

Sure thing

@Praveen2112 Praveen2112 force-pushed the praveen/042/column_lineage branch 2 times, most recently from 588e4c3 to 4d88174 Compare April 8, 2021 11:30
- Remove unused method
@Praveen2112 Praveen2112 force-pushed the praveen/042/column_lineage branch from 4d88174 to 17f5792 Compare April 8, 2021 11:53
@Praveen2112 Praveen2112 force-pushed the praveen/042/column_lineage branch from 17f5792 to ab08046 Compare April 8, 2021 12:58
@Praveen2112 Praveen2112 merged commit 31f5a89 into trinodb:master Apr 8, 2021
@Praveen2112 Praveen2112 mentioned this pull request Apr 8, 2021
7 tasks
@rsaw4

rsaw4 commented Jul 8, 2021

@Praveen2112 How can we view the lineage? What is the sample output for a nested create view, i.e. viewA which selects from viewB, which selects from viewC?

@wangqinghuan

Same question as @rsaw4. How can we view the lineage details? @Praveen2112

@Praveen2112
Member Author

The table-level lineage details can be accessed via

QueryCompletedEvent -> QueryMetadata -> List<TableInfo>

> What is the sample output for a nested create view, i.e. viewA which selects from viewB, which selects from viewC?

For this view we get a List of TableInfo covering viewB, viewC, and the tables that viewC requires. Only viewB would have TableInfo#directlyReferenced set to true; for viewC and its dependent tables that flag would be unset (false).
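
As a rough illustration of the directly-referenced flag, here is a minimal sketch of how a consumer might split the reported relations into direct and indirect ones. The `TableInfo` record below is a hypothetical stand-in with only a name and the flag, not the real Trino SPI class, and `base_table` is an assumed name for whatever viewC reads from.

```java
import java.util.List;
import java.util.stream.Collectors;

public class TableLineageSketch {
    // Hypothetical stand-in for a reported table entry; NOT the real Trino SPI TableInfo.
    record TableInfo(String name, boolean directlyReferenced) {}

    // Returns the names of the entries whose directlyReferenced flag equals 'direct'.
    static List<String> names(List<TableInfo> tables, boolean direct) {
        return tables.stream()
                .filter(table -> table.directlyReferenced() == direct)
                .map(TableInfo::name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Entries as described above for "create view viewA as select ... from viewB".
        // base_table is an assumed name for the table behind viewC.
        List<TableInfo> tables = List.of(
                new TableInfo("viewB", true),
                new TableInfo("viewC", false),
                new TableInfo("base_table", false));

        System.out.println("direct:   " + names(tables, true));
        System.out.println("indirect: " + names(tables, false));
    }
}
```

In a real event listener the list would come from the completed-query event rather than being built by hand.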

Column-level lineage details are available from

QueryCompletedEvent -> QueryOutputMetadata -> OutputColumnMetadata for each column -> List<ColumnDetail>

ColumnDetail will provide us the necessary information about the input columns.
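
To make the shape of that data concrete, here is a minimal sketch that flattens per-column source information into a map from output column to input columns. `OutputColumn` and `ColumnDetail` below are hypothetical stand-in records written for this example, not the real Trino SPI classes, and every table and column name is made up.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ColumnLineageSketch {
    // Hypothetical stand-in for one input column; NOT the real Trino SPI ColumnDetail.
    record ColumnDetail(String catalog, String schema, String table, String column) {
        String qualifiedName() {
            return String.join(".", catalog, schema, table, column);
        }
    }

    // Hypothetical stand-in for one output column together with its input columns.
    record OutputColumn(String name, List<ColumnDetail> sourceColumns) {}

    // Maps each qualified output column name to the qualified names of its inputs.
    static Map<String, List<String>> lineage(String outputTable, List<OutputColumn> columns) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (OutputColumn column : columns) {
            List<String> sources = column.sourceColumns().stream()
                    .map(ColumnDetail::qualifiedName)
                    .collect(Collectors.toList());
            result.put(outputTable + "." + column.name(), sources);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up example: one output column fed by a single input column.
        List<OutputColumn> columns = List.of(
                new OutputColumn("order_total",
                        List.of(new ColumnDetail("hive", "demo", "orders", "total"))));
        lineage("hive.demo.orders_copy", columns)
                .forEach((output, inputs) -> System.out.println(output + " <- " + inputs));
    }
}
```

An output column with an empty input list simply maps to an empty list, which keeps "no lineage reported" visible to the consumer.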

@amalakar
Contributor

amalakar commented Dec 9, 2021

@Praveen2112 we are trying to leverage this lineage and found that source columns are not captured when an expression is used. For a query like:

create table amalakar.new_query_log_1 as
with queries as (
  select *
  from hive.default.event_presto_query_logged p2
  where ds = '2021-12-08' and hr = 3
)
select
  p1.occurred_at as occurred_at,
  substr(p2.query_id, 1, 10) as new_query_id
from queries p1
inner join queries p2
  on p1.query_id = p2.query_id
limit 10

The lineage I am seeing, after a light transformation of what we get via QueryIOMetadata, is:

{
  "hive.amalakar.new_query_log_1.new_query_id": [],
  "hive.amalakar.new_query_log_1.occurred_at": [
    {
      "columnName": "hive.default.event_presto_query_logged.occurred_at"
    }
  ]
}

The transformation code looks like:

  // UpstreamColumn, getQualifiedColumnName and getQualifiedTableName are helpers not shown here.
  public Optional<String> getColumnLineage(QueryIOMetadata ioMetadata) {
    // Maps each fully qualified output column to the source columns reported for it.
    Map<String, List<UpstreamColumn>> lineage = new HashMap<>();

    if (ioMetadata.getOutput().isPresent()) {
      QueryOutputMetadata outputMetadata = ioMetadata.getOutput().get();
      if (outputMetadata.getColumns().isPresent()) {
        List<OutputColumnMetadata> outputColumns = outputMetadata.getColumns().get();
        for (OutputColumnMetadata outputColumn : outputColumns) {
          // An output column with no reported sources ends up with an empty list.
          List<UpstreamColumn> upstreamColumns =
              outputColumn.getSourceColumns().stream()
                  .map(col -> new UpstreamColumn(getQualifiedColumnName(col)))
                  .collect(Collectors.toList());
          String outputColumnName =
              String.format(
                  "%s.%s", getQualifiedTableName(outputMetadata), outputColumn.getColumnName());
          lineage.put(outputColumnName, upstreamColumns);
        }
      }
...
  }

@amalakar
Contributor

Created: #10272
