Enable failure recovery for Iceberg connector #10622

Merged

losipiuk merged 4 commits into trinodb:master from losipiuk:lo/test-iceberg-failure-recovery

Feb 14, 2022

Member

losipiuk commented Jan 14, 2022

Caveat:

With the current logic, a data file written during a DML operation that does not end up in the committed table snapshot may remain in the table directory on the distributed filesystem.
This should be a rare situation.

If a task that writes a data file fails, the unfinished file is deleted by the task itself.
If there are two concurrently running attempts for a single task, then when one completes, the other is killed. However, killing is subject to a race, and if both attempts manage to complete successfully, the extraneous file will remain in the table directory.

Orphaned files can be cleaned up via the remove_orphan_files routine using Spark (https://iceberg.apache.org/#spark-procedures/).
Eventually we want to have a similar routine in Trino (#10623).

fixes: #10253

cla-bot bot added the cla-signed label

losipiuk requested review from arhimondr and findepi

January 14, 2022 17:27

losipiuk self-assigned this

arhimondr approved these changes

losipiuk force-pushed the lo/test-iceberg-failure-recovery branch 2 times, most recently from 14f2a06 to f4cb94d Compare

January 17, 2022 11:48

losipiuk mentioned this pull request

Flaky TestHiveQueryFailureRecoveryTest(testInsertIntoExistingPartitionBucketed, testInsertIntoNewPartition, testInsertIntoNewPartitionBucketed, testReplaceExistingPartition) #10631

Closed

findepi reviewed

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/IcebergQueryRunner.java Outdated

Member

findepi Jan 18, 2022

closeSuppressing

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergFailureRecovery.java Outdated

Member

findepi Jan 18, 2022

AbstractTest is an old naming convention, these days we call them Base...Test

Member Author

losipiuk Jan 18, 2022

Extracted to #10659

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergFailureRecovery.java Outdated

Comment on lines 79 to 80

Member

findepi Jan 18, 2022

Sounds like you're working against testTableModification abstraction.

Member Author

losipiuk Jan 18, 2022

Yeah - I guess we do not need this test here at all. It would make sense to document the behaviour difference for Iceberg if the test were part of the superclass.
I will drop it for now, but it should be moved to the superclass, and we should start using TestingConnectorBehavior there.

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergFailureRecovery.java Outdated

Member

findepi Jan 18, 2022

newline before .has...

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergFailureRecovery.java Outdated

Comment on lines 90 to 91

Member

findepi Jan 18, 2022

nit: grammar is off

why not enable it for now here?
when is it going to be removed?

Member Author

losipiuk Jan 18, 2022

I will add it here for now, as extracting it out of the Base class requires significant restructuring.

As for timeline: 🤷

Member

findepi commented Jan 18, 2022

Orphaned files can be cleaned up via the remove_orphan_files routine using Spark (https://iceberg.apache.org/#spark-procedures/).
Eventually we want to have a similar routine in Trino (#10623).

While I agree we should implement such a procedure, failure recovery should not leave garbage behind in situations where this can be avoided.

Iceberg writes files with random names. The file name is determined here

trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java

Line 305 in 863da2b

String fileName = fileFormat.addExtension(randomUUID().toString());

We should consider:

- adding queryId to the file name
- in io.trino.plugin.iceberg.IcebergMetadata#finishInsert, etc., when failure recovery is enabled, going over the table destination folder (as determined by LocationProvider) and finding leftover files
  - we could do this only if there were some failures (but currently finish... doesn't know that)
  - we could try to limit ourselves to directories where we write some files -- this way we wouldn't scan all the sub-directories of a partitioned table, but only the ones we insert data into
    - this will work when LocationProvider is deterministic. For a non-deterministic LocationProvider we would need quite a different approach (e.g. registering write locations before creating a file), as a remove_orphan_files-like procedure might have problems as well
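
The naming and directory-limiting ideas in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: `QueryScopedFiles`, `newFileName`, and `directoriesToScan` are hypothetical names rather than Trino code, and the extension is passed as a plain string instead of going through `fileFormat.addExtension`.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;
import java.util.UUID;

// Illustrative only: class and method names are hypothetical, not Trino code.
final class QueryScopedFiles
{
    private QueryScopedFiles() {}

    // Prefix the random UUID with the query id so that the writing query
    // can be recognized from the file name alone.
    static String newFileName(String queryId, String extension)
    {
        return queryId + "-" + UUID.randomUUID() + extension;
    }

    // Collect the parent directories of the files a query is about to commit.
    // Only these directories would need scanning for leftovers, assuming the
    // location provider is deterministic (retried attempts write to the same
    // directories as the attempts that were committed).
    static Set<String> directoriesToScan(Collection<String> committedFilePaths)
    {
        Set<String> directories = new TreeSet<>();
        for (String path : committedFilePaths) {
            int lastSlash = path.lastIndexOf('/');
            if (lastSlash > 0) {
                directories.add(path.substring(0, lastSlash));
            }
        }
        return directories;
    }
}
```

With a non-deterministic location provider the second method would not be enough, since a failed attempt could have written to a directory that no committed file points at.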

cc @phd3 @alexjo2144

findepi approved these changes

Member

findepi left a comment

Current state of the PR - LGTM % comments.

#10622 (comment) fits well as a followup.

losipiuk force-pushed the lo/test-iceberg-failure-recovery branch 3 times, most recently from 25bcc15 to b36010f Compare

January 18, 2022 15:40

Member Author

losipiuk commented Jan 18, 2022

@findepi I added some code for that. PTAL and tell me what you think.

github-actions bot added the tests:hive label

losipiuk force-pushed the lo/test-iceberg-failure-recovery branch from b36010f to 53a739a Compare

January 19, 2022 12:57

findepi approved these changes

Member

findepi left a comment

"Cleanup extranous output files in Iceberg DML"

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/IcebergOptimizeHandle.java Outdated

Member

findepi Jan 20, 2022

could the RetryMode be provided by the engine to the finish*() methods, so that we don't have to embed the info in handles?

Member Author

losipiuk Feb 10, 2022

We could - but I would opt for what we do now. It feels more natural, as we know the information upfront. It also allows for rejecting the request sooner if the given retry mode is not supported by a connector.

Member

findepi Feb 14, 2022

allows for rejecting the request sooner if given retry mode is not supported by a connector.

I did not suggest not providing it to the begin methods.

Member Author

losipiuk Feb 14, 2022

It would be asymmetric vs. other "stuff" we pass to "begin*" methods (layout, list of columns). I will leave it as is.
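
The design being defended in this thread can be illustrated with a small sketch. All names here (`SketchRetryMode`, `SketchWriteHandle`, `SketchMetadata`) are hypothetical stand-ins and not the Trino SPI; the point is only that the retry mode is known when the write begins, can be rejected there, and travels to the finish step inside the handle.

```java
// Illustrative sketch: every type here is a hypothetical stand-in, not the Trino SPI.
enum SketchRetryMode { NO_RETRIES, RETRIES_ENABLED }

final class SketchWriteHandle
{
    private final String tableName;
    private final SketchRetryMode retryMode;

    SketchWriteHandle(String tableName, SketchRetryMode retryMode)
    {
        this.tableName = tableName;
        this.retryMode = retryMode;
    }

    String getTableName()
    {
        return tableName;
    }

    SketchRetryMode getRetryMode()
    {
        return retryMode;
    }
}

final class SketchMetadata
{
    private final boolean retriesSupported;

    SketchMetadata(boolean retriesSupported)
    {
        this.retriesSupported = retriesSupported;
    }

    // The retry mode is known up front, so an unsupported mode is rejected
    // before any data is written.
    SketchWriteHandle beginInsert(String tableName, SketchRetryMode retryMode)
    {
        if (retryMode != SketchRetryMode.NO_RETRIES && !retriesSupported) {
            throw new UnsupportedOperationException("This connector does not support query retries");
        }
        return new SketchWriteHandle(tableName, retryMode);
    }

    // The finish step reads the retry mode back from the handle instead of
    // receiving it as an extra argument from the engine.
    boolean needsCleanupOnFinish(SketchWriteHandle handle)
    {
        return handle.getRetryMode() != SketchRetryMode.NO_RETRIES;
    }
}
```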

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java Outdated

Member

findepi Jan 20, 2022

maybe add a comment that one query id cannot be a prefix of another query id

Member Author

losipiuk Feb 10, 2022

That is a good point actually.

Changed to

        verify(!queryId.contains("-"), "queryId(%s) should not contain hyphens", queryId);
        return fileName.startsWith(queryId + "-");
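
To make the prefix concern concrete, here is a small self-contained sketch. The class name `QueryFileMatcher` and the short query ids in the usage note are made up for illustration, and a plain exception stands in for the `verify` call in the snippet above (assumed to be Guava's).

```java
// Illustrative sketch: the class name is hypothetical.
final class QueryFileMatcher
{
    private QueryFileMatcher() {}

    // Naive check: also matches files of any other query whose id merely
    // starts with the given query id.
    static boolean startsWithQueryId(String queryId, String fileName)
    {
        return fileName.startsWith(queryId);
    }

    // Delimited check, mirroring the snippet above. Requiring a hyphen right
    // after the query id, and forbidding hyphens inside the id, means one
    // query's prefix cannot match another query's files.
    static boolean isFileCreatedByQuery(String queryId, String fileName)
    {
        if (queryId.contains("-")) {
            throw new IllegalArgumentException("queryId(" + queryId + ") should not contain hyphens");
        }
        return fileName.startsWith(queryId + "-");
    }
}
```

For a file named `q12-3f2b.orc`, the naive check with query id `q1` returns true, while the delimited check returns true only for query id `q12`.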

alexjo2144 reviewed

losipiuk added 2 commits

February 10, 2022 14:57


Add clarifying comment

ed0bd03

Add builder for IcebergQueryRunner

3ef12d0

losipiuk force-pushed the lo/test-iceberg-failure-recovery branch from 53a739a to 018609c Compare

February 10, 2022 18:16

losipiuk added 2 commits

February 11, 2022 10:26


Enable failure recovery for Iceberg connector

894c7ad

Cleanup extranous output files in Iceberg DML

955d243

With query/task retries there is a chance that extra files, which
do not make it into the snapshot, are written to the table's directory.
While most such cases should be cleaned up by writers on workers,
there is a slim chance that some of those files will survive query execution
(e.g. if a worker machine is killed).

This commit adds a pre-commit routine on the coordinator which deletes what
remained. This is still opportunistic and not guaranteed to delete
everything, as extra files may still be written after the cleanup routine
has already completed, but we are trying our best. The remaining files do
not affect query correctness.
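
A rough sketch of what such a best-effort pre-commit cleanup could look like, using the local filesystem via `java.nio.file` for simplicity. This is not the code added to IcebergMetadata; `ExtraneousFileCleanup` and `cleanExtraOutputFiles` are hypothetical names, and the sketch assumes file names start with `<queryId>-` as discussed in the review.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch on a local filesystem: names are hypothetical and this
// is not the routine from the commit.
final class ExtraneousFileCleanup
{
    private ExtraneousFileCleanup() {}

    // Deletes files in the given directories that were written by the query
    // (name starts with "<queryId>-") but are not among the files being
    // committed. Returns the paths that were actually deleted.
    static List<Path> cleanExtraOutputFiles(String queryId, Set<Path> directories, Set<Path> committedFiles)
    {
        List<Path> deleted = new ArrayList<>();
        for (Path directory : directories) {
            List<Path> candidates;
            try (Stream<Path> files = Files.list(directory)) {
                candidates = files
                        .filter(Files::isRegularFile)
                        .filter(file -> file.getFileName().toString().startsWith(queryId + "-"))
                        .filter(file -> !committedFiles.contains(file))
                        .collect(Collectors.toList());
            }
            catch (IOException | UncheckedIOException e) {
                // Best effort: a directory that cannot be listed is skipped
                continue;
            }
            for (Path file : candidates) {
                try {
                    if (Files.deleteIfExists(file)) {
                        deleted.add(file);
                    }
                }
                catch (IOException e) {
                    // Best effort: leave the file for a later orphan-file cleanup
                }
            }
        }
        return deleted;
    }
}
```

Failures are swallowed on purpose: as the commit message says, leftover files do not affect correctness, so a failed delete should not fail the commit.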

losipiuk force-pushed the lo/test-iceberg-failure-recovery branch from 018609c to 955d243 Compare

February 11, 2022 09:29

Member Author

losipiuk commented Feb 14, 2022

CI flake: #10631

losipiuk merged commit 6047480 into trinodb:master

github-actions bot added this to the 371 milestone

mosabua mentioned this pull request

Add Trino 371 release notes #10943

Merged

losipiuk mentioned this pull request

Release notes for 371 #10941

Closed

MichaelTiemannOSC mentioned this pull request

Implement Iceberg routine for removing orphaned files #10623

Closed
