Skip to content

Conversation

@tbaeg
Copy link
Member

@tbaeg tbaeg commented Apr 25, 2025

Description

The update will allows for connectors implementing a SystemTable to distribute and better parallelize work across the cluster. There should be a noticeable speed improvement, as well as a substantial reduction in coordinator resource utilization.

In the original table function implementation, I saw x60 speed up on $files query (4.5 minutes to 4.5 seconds) as well as a massive reduction in CPU/memory utilization of 80 - 95%. I can't deploy this specific iteration of this, but expect similar results.

  • Update ALL_NODES distribution mode for SystemTables by delegating processing to ConnectorPageSourceProvider.
  • FilesTable in trino-iceberg was updated to illustrate the usage.

Additional context and related issues

https://trinodb.slack.com/archives/CJ6UC075E/p1745437019159079

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Iceberg
* Improve performance of queries on $files metadata table. ({issue}`25677`)

@cla-bot cla-bot bot added the cla-signed label Apr 25, 2025
@tbaeg tbaeg changed the title Minimal distributed $files metadata table function Draft: Minimal distributed $files metadata table function Apr 25, 2025
@github-actions github-actions bot added the iceberg Iceberg connector label Apr 25, 2025
@ebyhr
Copy link
Member

ebyhr commented Apr 25, 2025

SystemTable supports ALL_NODES distribution. Could you please explain why we need to add a new table function? I'm not a fan of this approach.

@tbaeg
Copy link
Member Author

tbaeg commented Apr 25, 2025

SystemTable supports ALL_NODES distribution. Could you please explain why we need to add a new table function? I'm not a fan of this approach.

If I recall correctly, I tried to implement this as an ALL_NODES mode SystemTable but found the splits were being cast as SystemSplit. I will note, I was operating off of 468 fork at the time so this may not be true on main.

It also seemed like the paradigm for ALL_NODES was to run the same logic on each node (i.e. - TaskSystemTable). There is also the added flexibility of additional arguments (i.e. - snapshot id) that could be leveraged to view different files for different snapshots.

@tbaeg tbaeg marked this pull request as draft April 26, 2025 00:04
@tbaeg tbaeg requested a review from ebyhr April 26, 2025 00:05
@ebyhr ebyhr removed their request for review April 26, 2025 00:08
@tbaeg tbaeg changed the title Draft: Minimal distributed $files metadata table function Draft: Distributed $files metadata table function Apr 26, 2025
@tbaeg tbaeg force-pushed the distributed-meta-files branch from b397410 to d17720e Compare April 26, 2025 01:08
@tbaeg tbaeg changed the title Draft: Distributed $files metadata table function Distributed $files metadata table function Apr 27, 2025
Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there should be two separate ways of querying $files table. We should improve the existing implementation.
I think you should be able to implement this by overiding io.trino.spi.connector.SystemTable#splitSource for io.trino.plugin.iceberg.FilesTable.

@tbaeg
Copy link
Member Author

tbaeg commented Apr 28, 2025

I don't think there should be two separate ways of querying $files table. We should improve the existing implementation. I think you should be able to implement this by overiding io.trino.spi.connector.SystemTable#splitSource for io.trino.plugin.iceberg.FilesTable.

Yeah, I had originally tried to do that but the SPI expects a SystemSplit source here. I didn't want to introduce changes to the SPI so ended up this route.

https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/connector/system/SystemPageSourceProvider.java#L82

If folks are ok with changing above, I can throw something together for review.

@raunaqmorarka
Copy link
Member

I don't think there should be two separate ways of querying $files table. We should improve the existing implementation. I think you should be able to implement this by overiding io.trino.spi.connector.SystemTable#splitSource for io.trino.plugin.iceberg.FilesTable.

Yeah, I had originally tried to do that but the SPI expects a SystemSplit source here. I didn't want to introduce changes to the SPI so ended up this route.

https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/connector/system/SystemPageSourceProvider.java#L82

If folks are ok with changing above, I can throw something together for review.

Its okay to change the SPI if we need to, we should do whatever changes are neeeded to make this work in the system tables framework

@tbaeg tbaeg force-pushed the distributed-meta-files branch from d17720e to 5689635 Compare May 15, 2025 04:41
@vlad-lyutenko
Copy link
Contributor

I apply last commit 5689635
from @tbaeg - big thanks for your input.

And the problem, which I encountered is in SystemPageSourceProvider:

public ConnectorPageSource createPageSource(
...
SystemTable systemTable = tables.getSystemTable(session, tableName)

On coordinator everything is working.
But on worker tables is StaticSystemTablesProvider which doesn't have our needed system table.

I dived a little bit how it's done on coordinator and I see that to fetch this table we use MetadataManager :

Optional<CatalogMetadata> optionalCatalogMetadata = transactionManager.getOptionalCatalogMetadata(session.getRequiredTransactionId(), catalogName);

But as far as I understand on worker transactionManager is NoOpTransactionManager.
So we kind of not able to get SystemTable

Maybe you have some ideas how we can overcome this issue.
@ebyhr @raunaqmorarka
Thanks.

@tbaeg
Copy link
Member Author

tbaeg commented May 15, 2025

Thanks for validating this @vlad-lyutenko! I came to roughly the same conclusion as well.

I also realized the current commit doesn't work in SINGLE_COORDINATOR code due to the FilesTableSplit not setting the isRemotelyAccessible and getAddresses information (among other things, ha).

I have a few thoughts around possible solutions but probably want to iterate on them a bit/chat over slack with folks if possible.

@tbaeg
Copy link
Member Author

tbaeg commented May 19, 2025

So, I had a two general approaches that I think could work, but unsure if any of them is the "right" thing to do. I wanted to run it by the community before running with either as I'm sure I'm missing something.

  1. Introduce a method in the ConnectorPageSourceProvider that handles a new ConnectorSystemSplit. This would allow distributed dynamic tables to work and for SystemTable not be required during split processing. Instead would be delegated to the Connectors ConnectorPageSourceProvider.createPageSource implementation. It's a smaller change but feels a little incomplete.
  2. I think the more complete approach is a more formal interface to Connector that has things like ConnectorSystemPageSourceProvider/ConnectorSystemPageSourceProviderFactory and the like. In the long run, I think SystemTable could instead be treated the same (or at least similar) to any other "table".

@ebyhr @raunaqmorarka @vlad-lyutenko

@vlad-lyutenko
Copy link
Contributor

  1. Introduce a method in the ConnectorPageSourceProvider that handles a new ConnectorSystemSplit. This would allow distributed dynamic tables to work and for SystemTable not be required during split processing. Instead would be delegated to the Connectors ConnectorPageSourceProvider.createPageSource implementation. It's a smaller change but feels a little incomplete.

Maybe lets start with first one, just as draft, play with it, understand some limitiations, and then we could decide do we want dive deeper (like proper testing e.t.c) or maybe it's even not possible and we should go with option 2.

@tbaeg tbaeg force-pushed the distributed-meta-files branch from 5689635 to d69b5b8 Compare May 20, 2025 04:03
@tbaeg
Copy link
Member Author

tbaeg commented May 21, 2025

I pushed some thing that at least passes one of the $files related tests as a proof of concept for the createPageSource implementation. Still feels a bit disjointed because of the selective use of the Connector provided PageSource, but going to clean it up a bit to see.

@vlad-lyutenko
Copy link
Contributor

I pushed some thing that at least passes one of the $files related tests as a proof of concept for the createPageSource implementation. Still feels a bit disjointed because of the selective use of the Connector provided PageSource, but going to clean it up a bit to see.

Big thx! I will try ti play/experiment with this.

@vlad-lyutenko
Copy link
Contributor

@tbaeg big thx, I forgot that I can not force push on your fork.
So as discussed offline I clean up you PR for better reviewing.

@ebyhr @raunaqmorarka please take a look on this PR - #25861
It basically clean up version of current PR.

Should we move in this direction?

@tbaeg tbaeg force-pushed the distributed-meta-files branch from d69b5b8 to ca9864e Compare May 26, 2025 21:06
@tbaeg
Copy link
Member Author

tbaeg commented May 26, 2025

@vlad-lyutenko I pushed up an update that actually tries not to change too much of the SPI and instead delegate through the existingConnectorPageSourceProvider.createPageSource method. It's minimal in changes and is mostly a documentation change on how the SystemTable.splitSource method actually behaves. That said, it's fairly nuanced and I think the API could use another look in the future.

I need clean up the rest of the actual implementation of the FilesTable portions. But I think this should give maintainers a smaller set of changes to look at for the SystemTable framework.

@tbaeg tbaeg force-pushed the distributed-meta-files branch from ca9864e to eb612dc Compare May 27, 2025 01:06
@tbaeg tbaeg closed this May 27, 2025
@tbaeg tbaeg force-pushed the distributed-meta-files branch from eb612dc to 0ca172d Compare May 27, 2025 01:06
@tbaeg tbaeg reopened this May 27, 2025
@tbaeg tbaeg force-pushed the distributed-meta-files branch 2 times, most recently from 529d39b to c714772 Compare August 2, 2025 16:38

// This means there is no known static table, but that doesn't mean a dynamic table must exist.
// This node could have a different config that causes that table to not exist.
if (!isCoordinatorTransaction(session)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if user enable schedule on coordinator - node-scheduler.include-coordinator, will it be broken?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, there shouldn't be an issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just make sure we have a test which runs a query on $files in both modes. My guess is the existing tests already use the default node-scheduler.include-coordinator of true

@tbaeg tbaeg force-pushed the distributed-meta-files branch 2 times, most recently from 10ce16a to c9c1654 Compare August 4, 2025 13:15
Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please keep your commits squashed
we can use github to compare the pushed changes as long as any rebases to master are not mixed with pushes for addressing review comments


// This means there is no known static table, but that doesn't mean a dynamic table must exist.
// This node could have a different config that causes that table to not exist.
if (!isCoordinatorTransaction(session)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just make sure we have a test which runs a query on $files in both modes. My guess is the existing tests already use the default node-scheduler.include-coordinator of true

@tbaeg tbaeg force-pushed the distributed-meta-files branch 2 times, most recently from 7d2b6ae to f0e2613 Compare August 4, 2025 16:12
Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comments

@tbaeg tbaeg force-pushed the distributed-meta-files branch 3 times, most recently from 956244c to 471a2cb Compare August 4, 2025 16:55
@tbaeg tbaeg force-pushed the distributed-meta-files branch from 471a2cb to f29b93e Compare August 4, 2025 16:58
@tbaeg
Copy link
Member Author

tbaeg commented Aug 4, 2025

Looks like the keyMetadata is read internally in the iceberg libraries. Going to revert some changes.

@tbaeg tbaeg force-pushed the distributed-meta-files branch 2 times, most recently from 4cff52a to 1c1ed90 Compare August 4, 2025 18:06
@raunaqmorarka raunaqmorarka requested a review from Copilot August 4, 2025 18:55

This comment was marked as outdated.

- Add support for SystemTable's to delegate split processing to connectors
- Add support for distributed processing for $files
@raunaqmorarka raunaqmorarka force-pushed the distributed-meta-files branch from 1c1ed90 to e647356 Compare August 4, 2025 19:09
@raunaqmorarka raunaqmorarka requested a review from Copilot August 4, 2025 19:13
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enables distributed execution of system tables implementing the SystemTable interface, with a focus on improving performance and resource utilization for the $files metadata table. The implementation introduces a new split-based approach for distributing work across the cluster.

  • Updates SystemTable distribution mode for ALL_NODES by delegating processing to ConnectorPageSourceProvider
  • Refactors the FilesTable implementation to use distributed splits instead of single-coordinator execution
  • Introduces new split and page source classes for distributed $files table processing

Reviewed Changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/system/files/FilesTable.java New distributed implementation of FilesTable using split-based architecture
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/system/files/FilesTableSplit.java New split class for distributing manifest processing across nodes
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/system/files/FilesTablePageSource.java New page source for processing individual manifest files
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java Updated to handle FilesTableSplit instances
core/trino-spi/src/main/java/io/trino/spi/connector/SystemTable.java Added splitSource() method for distributed system table support
core/trino-main/src/main/java/io/trino/connector/system/SystemTablesProvider.java Refactored from interface to concrete class

@raunaqmorarka raunaqmorarka changed the title Distributed $files SystemTable implementation Implement distributed processing for iceberg $files system table Aug 4, 2025
@raunaqmorarka raunaqmorarka merged commit 170c714 into trinodb:master Aug 4, 2025
95 checks passed
@github-actions github-actions bot added this to the 477 milestone Aug 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

6 participants