Add all_files system table to the Iceberg connector #11206

Closed
wants to merge 4 commits into from

Conversation

osscm
Contributor

@osscm osscm commented Feb 26, 2022

Description

  • This PR adds support for the $all_files system table.
  • Spark already supports Iceberg's $data_files and $all_data_files metadata tables, and Trino already supports $files.
  • The $all_files table includes the data files from all available snapshots of a table, whereas the $files metadata table includes only those of the current snapshot.
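
The difference between the two tables can be illustrated with a small, self-contained model. This is only a sketch: the class `AllFilesSketch`, its methods, and the file names are hypothetical and are not part of Trino or Iceberg. It treats a table as a map from snapshot ID to the data files that snapshot references. For readability the sketch de-duplicates files, while the real metadata table may list a file more than once.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model, not Trino or Iceberg code
public class AllFilesSketch
{
    // Models $files: only the data files of the current snapshot
    public static Set<String> files(Map<Long, List<String>> snapshots, long currentSnapshotId)
    {
        return new TreeSet<>(snapshots.getOrDefault(currentSnapshotId, List.of()));
    }

    // Models $all_files: the data files referenced by any available snapshot
    public static Set<String> allFiles(Map<Long, List<String>> snapshots)
    {
        Set<String> result = new TreeSet<>();
        for (List<String> dataFiles : snapshots.values()) {
            result.addAll(dataFiles);
        }
        return result;
    }

    public static void main(String[] args)
    {
        Map<Long, List<String>> snapshots = new LinkedHashMap<>();
        snapshots.put(1L, List.of("a.orc"));
        snapshots.put(2L, List.of("a.orc", "b.orc"));
        // Assumed scenario: a compaction rewrote a.orc and b.orc into c.orc
        snapshots.put(3L, List.of("c.orc"));

        System.out.println(files(snapshots, 3L)); // [c.orc]
        System.out.println(files(snapshots, 2L)); // [a.orc, b.orc], as after a rollback to snapshot 2
        System.out.println(allFiles(snapshots));  // [a.orc, b.orc, c.orc]
    }
}
```

In this model a rollback only changes which snapshot ID is passed to `files`, so the result of `allFiles` is unaffected by it.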

Use cases

  • It can be used when debugging a data issue, for example when Trino or Spark is used for metadata or data optimization and has modified the table's metadata and data.

  • Another case is when a snapshot has been rolled back or forward from Trino or Spark and the user wants to understand which data files are present and whether the optimization or rollback operations have any implications.

  • Together with $all_manifests, it can also be used to add optimization features to Trino, such as purging orphan files or identifying partitions modified since the last run in order to implement a moving-window data compaction feature. detail
    Where we need to identify the
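
The orphan-file use case mentioned above comes down to a set difference, which can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `OrphanFilesSketch`, `orphanCandidates`, and the paths are hypothetical, the storage listing is assumed to be obtained separately, and a real purge would also need to account for metadata, manifest, and delete files as well as writes that are still in flight.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not Trino or Iceberg code
public class OrphanFilesSketch
{
    // Files present in storage that no snapshot references are orphan candidates
    public static Set<String> orphanCandidates(Set<String> filesInStorage, Set<String> referencedByAnySnapshot)
    {
        Set<String> orphans = new TreeSet<>(filesInStorage);
        orphans.removeAll(referencedByAnySnapshot);
        return orphans;
    }

    public static void main(String[] args)
    {
        // Assumed inputs: a storage listing and the file_path values read from $all_files
        Set<String> storage = Set.of("data/a.orc", "data/b.orc", "data/tmp-1.orc");
        Set<String> referenced = Set.of("data/a.orc", "data/b.orc");

        System.out.println(orphanCandidates(storage, referenced)); // [data/tmp-1.orc]
    }
}
```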

Design
Adds a new class, AllFilesTable, and uses it as the parent of FilesTable, since both have similar responsibilities.

Testing
Added a test case to the existing TestIcebergSystemTable.
I considered using a rollback to showcase that $all_files returns all the data files, but the rollback updates the table history, and testHistoryTable depends on the order of the operations performed in the testAllFilesTable test, so I have not used it. I can add it if reviewers think it is worthwhile.

Is this change a fix, improvement, new feature, refactoring, or other?
Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)
NA

How would you describe this change to a non-technical end user or system administrator?
Allows the user to access all the data files referenced by the table, including those of rolled-back and old snapshots. This is helpful for debugging issues such as why a query shows certain data rather than the current data (for example, when snapshots have been rolled back).

syntax:
select * from "table$all_files"

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Adds support for `all_files` metadata table, supported by Spark as well. ({issue}`11172`)

@cla-bot

cla-bot bot commented Feb 26, 2022

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.

Member

@alexjo2144 alexjo2144 left a comment


@cla-bot

cla-bot bot commented Mar 1, 2022

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails


@osscm
Contributor Author

osscm commented Mar 1, 2022

@alexjo2144 thanks for the review!
Please check the refactored code.

@github-actions github-actions bot added the docs label Mar 1, 2022

@osscm
Contributor Author

osscm commented Mar 1, 2022

closing to fix the git email issue

@osscm osscm closed this Mar 1, 2022
@osscm osscm reopened this Mar 1, 2022

@martint
Member

martint commented Mar 2, 2022

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Mar 2, 2022
@cla-bot

cla-bot bot commented Mar 2, 2022

The cla-bot has been summoned, and re-checked this pull request!

@osscm
Contributor Author

osscm commented Mar 3, 2022

thanks a lot @martint!

@osscm
Contributor Author

osscm commented Mar 3, 2022

@alexjo2144 would you mind checking whether the changes look OK now?
Also, I fixed the doc formatting and TestIcebergMetastoreAccessOperations.java as well. Thanks!

Member

@alexjo2144 alexjo2144 left a comment


Besides a few nitpicks, looks good to me.

docs/src/main/sphinx/connector/iceberg.rst: 5 review threads (resolved, outdated)
Member

@mosabua mosabua left a comment


Only reviewed docs and left some suggestions on top of the good ideas from @findinpath

docs/src/main/sphinx/connector/iceberg.rst: 5 review threads (resolved; 4 outdated)
@osscm
Contributor Author

osscm commented Mar 10, 2022

@findinpath @mosabua thanks for the review! will try to add the above changes today/tomorrow.

@findinpath
Contributor

Build is red. Are the trino-iceberg failures related?

@osscm osscm force-pushed the add-all—files branch 5 times, most recently from 90d533d to db3a6e5 Compare September 24, 2022 17:25
@osscm osscm force-pushed the add-all—files branch 3 times, most recently from bed7c17 to caf938e Compare September 25, 2022 01:33
@osscm
Contributor Author

osscm commented Oct 11, 2022 via email

@bitsondatadev
Member

👋 @osscm - this PR has become inactive. If you're still interested in working on it, please let us know, and we can try to get reviewers to help with that.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

@osscm
Contributor Author

osscm commented Nov 20, 2022 via email

@bitsondatadev
Member

Great! We won't close it out then! I figured as such but we just need to gauge the interest in continuing all older PRs. I'll work on getting a maintainer to take a look!

@osscm
Contributor Author

osscm commented Nov 20, 2022 via email

    - List of recommended split locations.
  * - ``equality_ids``
    - ``array(integer)``
    - The set of field IDs used for equality comparison in equality delete files
Member

The above changes are unrelated to a new $all_files system table. Please extract a commit.


SELECT * FROM "test_table$all_files"

The output of the query has the columns, which is similar to the ``$files`` metadata table.
Member

is similar to

I think they're the "same". Is my understanding wrong?

Contributor Author

Yes @ebyhr, it's the same.
Are you suggesting rewording "similar to" to "same as"?

Member

I think so.

        implements SystemTable
{
    private final ConnectorTableMetadata tableMetadata;
    private final Table icebergTable;
Member

Change to protected and use it from FilesTable and AllFilesTable

import static org.apache.iceberg.MetadataTableType.ALL_DATA_FILES;
import static org.apache.iceberg.MetadataTableUtils.createMetadataTableInstance;

public class AllFilesTable
Member

The class name is AllFilesTable, but MetadataTableType is ALL_DATA_FILES. It would be nice to leave a comment.

Comment on lines +35 to +40
Table allFilesTable = createMetadataTableInstance(getIcebergTable(), ALL_DATA_FILES);

TableScan tableScan = allFilesTable
        .newScan()
        .includeColumnStats();
return tableScan;
Member

Suggested change
- Table allFilesTable = createMetadataTableInstance(getIcebergTable(), ALL_DATA_FILES);
- TableScan tableScan = allFilesTable
-         .newScan()
-         .includeColumnStats();
- return tableScan;
+ return createMetadataTableInstance(getIcebergTable(), ALL_DATA_FILES).newScan()
+         .includeColumnStats();


assertEquals(result, expectedStatistics);
}
}
Member

The above changes are unrelated to a new $all_files system table. Please extract a commit.

}

@Test
public void testAllFilesTable() throws IOException
Member

Move throws to the next line.

Comment on lines +391 to +392
MaterializedResult allFilesAfterSingleDelete = computeActual("SELECT content, file_format, file_size_in_bytes, record_count, column_sizes, value_counts," +
"null_value_counts, nan_value_counts, key_metadata, split_offsets, equality_ids, lower_bounds, upper_bounds FROM \"" + table.getName() + "$all_files\" ORDER BY file_path");
Member

I would recommend using SELECT *.

Comment on lines +410 to +411
.row(0, "ORC", fileSizes[0], 1L, null, Map.of(1, Long.valueOf(1), 2, Long.valueOf(1)), Map.of(1, Long.valueOf(0), 2, Long.valueOf(0)), null, null, null, null, Map.of(1, "1", 2, "a"), Map.of(1, "1", 2, "a"))
.row(0, "ORC", fileSizes[1], 1L, null, Map.of(1, Long.valueOf(1), 2, Long.valueOf(1)), Map.of(1, Long.valueOf(0), 2, Long.valueOf(0)), null, null, null, null, Map.of(1, "2", 2, "b"), Map.of(1, "2", 2, "b"))
Member

Remove boxing.
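
One possible reading of this suggestion, sketched outside the test code: explicit `Long.valueOf(...)` calls can be replaced with `long` literals, which the compiler autoboxes to the same `Long` values. The class `BoxingSketch` and its method names are hypothetical and exist only for this illustration.

```java
import java.util.Map;

// Hypothetical sketch, not part of the PR
public class BoxingSketch
{
    // Explicit boxing, in the style of the rows under review
    public static Map<Integer, Long> boxedExplicitly()
    {
        return Map.of(1, Long.valueOf(1), 2, Long.valueOf(1));
    }

    // The same map written with long literals; the compiler boxes them
    public static Map<Integer, Long> autoboxed()
    {
        return Map.of(1, 1L, 2, 1L);
    }

    public static void main(String[] args)
    {
        System.out.println(boxedExplicitly().equals(autoboxed())); // true
    }
}
```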

}
}

private long[] getFileSizes(List<MaterializedRow> materializedRows) throws IOException
Member

Move throws to the next line.

@bitsondatadev
Member

bitsondatadev commented Nov 21, 2022

Thanks @ebyhr! @osscm, reach out to me if you make changes and a few days pass without review please!

@colebow
Member

colebow commented Mar 30, 2023

Hey @osscm, are you still working on this?

@colebow
Member

colebow commented May 23, 2023

I'm going to close this out due to inactivity. Feel free to re-open if you want to pick this back up at any point in the future.

@colebow colebow closed this May 23, 2023
Development

Successfully merging this pull request may close these issues.

Support Iceberg's all_files Metadata table
10 participants