Filter Iceberg splits based on $path column predicates by ebyhr · Pull Request #13012 · trinodb/trino

ebyhr · 2022-06-28T10:15:40Z

Description

Filter Iceberg splits based on $path column predicates
Fixes #12785

Documentation

(x) No documentation is needed.

Release notes

(x) Release notes entries required with the following suggested text:

# Iceberg
* Support filtering splits based on `$path` column predicates. ({issue}`12785`)
* Return `$path` column without encoding when the path contains double slashes on S3. ({issue}`13012`)

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java

alexjo2144

Nice

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java

ebyhr · 2022-06-29T09:39:39Z

CI hit #12950

findepi · 2022-06-29T09:49:16Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

This should mark only the $path column as enforced. If the predicate is on any other metadata column, we should still treat it as unsupported.

findepi · 2022-06-29T09:50:38Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java

That doesn't need to return Optional.
It's either Domain.none(), Domain.all(), or some specific Domain.

findepi · 2022-06-29T09:52:52Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java

I don't think hadoopPath is appropriate here. It appends some #... in case of s3://,
but that suffix is not (I hope) returned when doing SELECT "$path" FROM t.

Can we have some test coverage with S3?

Something like:

create a table with two files

get "$path" for one of the files

select where $path = selected_path

verify we get data from that one file only

I think we should use hadoopPath() unless fixing $path result.

IcebergSplitSource

    private IcebergSplit toIcebergSplit(FileScanTask task)
    {
        return new IcebergSplit(
                hadoopPath(task.file().path().toString()),

↓

IcebergPageSourceProvider

    ReaderPageSource dataPageSource = createDataPageSource(
            session,
            hdfsContext,
            new Path(split.getPath()),
    ...
    else if (column.isPathColumn()) {
        columnAdaptations.add(ColumnAdaptation.constantColumn(nativeValueToBlock(FILE_PATH.getType(), utf8Slice(path.toString()))));
    }

It appends # when the bucket starts with /, like s3:///bucket (not always), as you already know (#11998).

I tried creating such a bucket in S3 and MinIO, but it fails with an invalid bucket name error.

    ~ aws s3api create-bucket --bucket /ebyhr-test

    Parameter validation failed:
    Invalid bucket name "/ebyhr-test": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"

https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html

The #... suffix is an internal detail and should not be exposed to users.

findepi · 2022-06-29T09:57:26Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/ExpressionConverter.java

Let's move this logic to the place that applies conditions on metadata columns, so that we know no constraint gets ignored.

Thus, here it would be checkArgument(!isMetadataColumnId(columnHandle.getId())),
and IcebergSplitSource would divide the TupleDomain into:

$path domain --> handled directly there

the rest --> passed to toIcebergExpression
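The division suggested above can be sketched with plain collections. This is a simplified model, not the Trino code: a domain is modelled as a set of allowed string values, the class and method names are hypothetical, and treating any column whose name starts with `$` as a metadata column is an assumption made only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the predicate split described above; not Trino code.
// A "domain" is modelled as the set of allowed string values for a column.
final class PredicateSplitSketch
{
    private PredicateSplitSketch() {}

    // Assumption for this sketch only: metadata column names start with '$'.
    static boolean isMetadataColumn(String column)
    {
        return column.startsWith("$");
    }

    // Keeps only the constraint on "$path"; this part would be evaluated per file.
    static Map<String, Set<String>> pathPredicate(Map<String, Set<String>> predicate)
    {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        if (predicate.containsKey("$path")) {
            result.put("$path", predicate.get("$path"));
        }
        return result;
    }

    // Keeps the constraints on data columns, i.e. "the rest".
    static Map<String, Set<String>> dataColumnPredicate(Map<String, Set<String>> predicate)
    {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        predicate.forEach((column, domain) -> {
            if (!isMetadataColumn(column)) {
                result.put(column, domain);
            }
        });
        return result;
    }

    // Mirrors the suggested checkArgument: the expression converter rejects metadata columns.
    static void checkConvertible(Map<String, Set<String>> predicate)
    {
        for (String column : predicate.keySet()) {
            if (isMetadataColumn(column)) {
                throw new IllegalArgumentException("Constraint on an unexpected column " + column);
            }
        }
    }
}
```

With this shape, only the data column part ever reaches the converter, and a metadata column that slips through fails loudly instead of being dropped.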

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplit.java

findinpath · 2022-07-04T07:55:08Z

The PR seems to be in a better shape now.
Please reorganize the commits to have them without fixups to continue the review process.

phd3 · 2022-07-11T02:44:20Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java

nit: we can defensively check that all metadata column predicates are consumed here, so that any metadata-column predicates added in the future are actually applied in the split source. But I don't feel too strongly, as we'll add tests anyway.
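A defensive check of this kind might look like the sketch below. It is a hypothetical illustration rather than the code in the PR: the class name, the `HANDLED` set, and the convention that metadata column names start with `$` are all assumptions.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a "all metadata predicates are consumed" check.
final class MetadataPredicateCheckSketch
{
    // Assumption: the metadata columns this split source knows how to filter on.
    static final Set<String> HANDLED = Set.of("$path");

    private MetadataPredicateCheckSketch() {}

    // Fails if a predicate names a metadata column that nothing will evaluate.
    static void checkAllMetadataPredicatesConsumed(Set<String> constrainedColumns)
    {
        Set<String> unhandled = new TreeSet<>();
        for (String column : constrainedColumns) {
            if (column.startsWith("$") && !HANDLED.contains(column)) {
                unhandled.add(column);
            }
        }
        if (!unhandled.isEmpty()) {
            throw new IllegalStateException("Unhandled metadata column predicates: " + unhandled);
        }
    }
}
```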

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java

phd3 · 2022-07-11T03:05:38Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java

Q: are we changing the behavior here? I.e., before this PR, effectively hadoopPath() was being returned here since that was propagated from the split, but now it's the actual path without encoding.

Yes, it's changing. I will mention it in a release note. Relates to #13012 (comment)

alexjo2144

@homar added some code in #12704 which relies on an assumption that enforced predicates always fall on partition boundaries. After this change that will no longer be true. We need to figure out how to resolve that before merging this

alexjo2144

@homar how do you feel about changing finishOptimize to only remove delete files if the whole table is being optimized? I know we spent a while trying to figure out exactly when we could remove the files in a partition, but maintaining that logic feels pretty error prone.

alexjo2144 · 2022-07-19T15:03:37Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java

Maybe rename to dataColumnPredicate or nonMetadataPredicate to signal that it's a filtered version

Renamed to dataColumnPredicate.

homar · 2022-07-19T15:15:08Z

@homar how do you feel about changing finishOptimize to only remove delete files if the whole table is being optimized? I know we spent a while trying to figure out exactly when we could remove the files in a partition, but maintaining that logic feels pretty error prone.

I am fine with that, though I am afraid that if we do that, a lot of clients will run into a situation where delete files are never removed, because they never optimize the entire table.

alexjo2144 · 2022-07-19T15:32:23Z

They should eventually get cleaned up by remove_orphan_files, I believe. As long as all of the data files referenced by the delete file have been optimized away.

alexjo2144 · 2022-07-25T18:45:02Z

@ebyhr we can either merge this first, or you can include it in this PR. Whatever you'd like #13343

The initial implementation removed delete files from a partition even if the whole table was not scanned. This was fine, but assumes the enforced predicate describes entire partitions. This assumption will not be true after trinodb#13012

ebyhr · 2022-07-28T03:43:07Z

Just rebased on upstream to resolve conflicts. Let's merge #13343 first.

alexjo2144 · 2022-07-29T14:06:18Z

We should do this for the file_modified_time column as well. Want to do that here, or in a separate PR?

ebyhr · 2022-07-29T23:56:30Z

I'd like to handle the file_modified_time column in a separate PR.

ebyhr · 2022-08-01T12:00:32Z

@findepi Could you please review when you have time?

alexjo2144 · 2022-08-01T14:12:20Z

Can you add one more test? Specifically:

Insert rows 4 times, resulting in files f1, f2, f3, f4
Run an optimize WHERE $path = f1 OR $path = f2
Run an optimize WHERE $path = f3 OR $path = f4
Show table contains files f5 and f6

ebyhr · 2022-08-02T00:30:22Z

@alexjo2144 Added another test case.

findepi · 2022-08-02T11:09:07Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java


 import java.io.IOException;
 import java.io.UncheckedIOException;
+import java.net.URI;


should "Return $path without URL encoding in Iceberg" commit have any test changes/additions?

Ideally it should, but let me handle it in #13457

URL encoding (or the lack of it) should be exercisable independently of double slashes,
e.g. with a path containing %.

Not sure if I understood your suggestion correctly, but just adding % to a file or directory name wouldn't work, because it doesn't satisfy the !path.equals(hadoopPath.toString()) check in hadoopPath().

findepi · 2022-08-02T11:09:57Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/ExpressionConverter.java

Add a message

"Constraint on an unexpected column %s", columnHandle

findepi · 2022-08-02T11:44:45Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java

nit: can inline pathMatchesPredicate

findepi · 2022-08-02T11:49:21Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java

If the effective predicate is NONE, this method returns the ALL domain.
The NONE tuple domain needs to be handled explicitly.

OTOH, NONE is not expected here (it should be filtered out earlier), so something like:

    IcebergColumnHandle pathColumn = pathColumnHandle();
    Domain domain = effectivePredicate.getDomains()
            .orElseThrow(() -> new IllegalArgumentException("Unexpected NONE tuple domain"))
            .get(pathColumn);
    if (domain == null) {
        return Domain.all(pathColumn.getType());
    }
    return domain;
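The same idea, together with the file-level filtering it enables, can be tried out in a self-contained sketch. This is a simplified model and not the Trino API: `Optional.empty()` stands in for the NONE tuple domain, a missing `$path` entry stands in for the ALL domain, a domain is a set of allowed paths, and every name here is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Simplified model of extracting the $path constraint and filtering files with it.
final class PathDomainSketch
{
    private PathDomainSketch() {}

    // Optional.empty() models NONE, which is rejected explicitly.
    // A missing "$path" entry models ALL: no constraint, so every path matches.
    static Predicate<String> pathDomain(Optional<Map<String, Set<String>>> effectivePredicate)
    {
        Map<String, Set<String>> domains = effectivePredicate
                .orElseThrow(() -> new IllegalArgumentException("Unexpected NONE tuple domain"));
        Set<String> allowed = domains.get("$path");
        if (allowed == null) {
            return path -> true;
        }
        return allowed::contains;
    }

    // Keeps only the files whose path satisfies the extracted constraint.
    static List<String> filterFiles(List<String> filePaths, Predicate<String> pathDomain)
    {
        return filePaths.stream()
                .filter(pathDomain)
                .collect(Collectors.toList());
    }
}
```

This mirrors the two-file test scenario discussed earlier in the thread: with a predicate on one file's path, only that file survives the filter.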

cla-bot bot added the cla-signed label Jun 28, 2022

alexjo2144 reviewed Jun 28, 2022

alexjo2144 approved these changes Jun 29, 2022

findepi reviewed Jun 29, 2022

ebyhr force-pushed the ebi/iceberg-path-pushdown branch from dab1097 to c0805b6 Compare July 4, 2022 02:30

findinpath reviewed Jul 4, 2022

ebyhr force-pushed the ebi/iceberg-path-pushdown branch from c0805b6 to 7391dd9 Compare July 4, 2022 06:50

ebyhr force-pushed the ebi/iceberg-path-pushdown branch from 7391dd9 to fef1224 Compare July 4, 2022 08:28

findinpath self-requested a review July 4, 2022 08:56

findinpath approved these changes Jul 4, 2022

ebyhr requested review from electrum and phd3 July 5, 2022 05:40

ebyhr mentioned this pull request Jul 5, 2022

Add $file_modified_time column in Iceberg #13082

Merged

phd3 reviewed Jul 11, 2022

ebyhr force-pushed the ebi/iceberg-path-pushdown branch from fef1224 to 05ce82b Compare July 12, 2022 07:56

ebyhr requested a review from phd3 July 13, 2022 05:28

ebyhr force-pushed the ebi/iceberg-path-pushdown branch from 05ce82b to f54431a Compare July 14, 2022 07:17

ebyhr requested a review from Praveen2112 July 14, 2022 07:56

alexjo2144 suggested changes Jul 14, 2022

ebyhr removed the request for review from Praveen2112 July 19, 2022 08:24

ebyhr force-pushed the ebi/iceberg-path-pushdown branch from f54431a to a3aa655 Compare July 19, 2022 09:30

alexjo2144 reviewed Jul 19, 2022

findepi requested a review from alexjo2144 July 25, 2022 15:10

alexjo2144 mentioned this pull request Jul 25, 2022

Only remove Iceberg delete files when safe to do so #13343

Merged

ebyhr force-pushed the ebi/iceberg-path-pushdown branch from a4cec23 to c984e7d Compare July 28, 2022 03:42

ebyhr force-pushed the ebi/iceberg-path-pushdown branch from c984e7d to 2dacf04 Compare July 29, 2022 03:45

ebyhr requested review from findepi and findinpath July 29, 2022 10:16

alexjo2144 approved these changes Jul 29, 2022

Return $path without URL encoding in Iceberg

85aa77d

ebyhr force-pushed the ebi/iceberg-path-pushdown branch from 2dacf04 to c7d041b Compare August 2, 2022 00:28

findepi approved these changes Aug 2, 2022

Filter Iceberg splits based on $path column predicates

f22fdd4

ebyhr force-pushed the ebi/iceberg-path-pushdown branch from c7d041b to f22fdd4 Compare August 2, 2022 12:27

ebyhr merged commit 3405143 into trinodb:master Aug 3, 2022

ebyhr deleted the ebi/iceberg-path-pushdown branch August 3, 2022 01:34

ebyhr mentioned this pull request Aug 3, 2022

Release notes for 392 #13320

Closed

github-actions bot added this to the 392 milestone Aug 3, 2022

colebow mentioned this pull request Aug 3, 2022

Add Trino 392 release notes #13342

Closed
