Conversation

wypoon (Contributor) commented Feb 10, 2023:

This is a continuation of #5893.

Use the avro compression properties (write.avro.compression-codec and write.avro.compression-level) set in the table properties to determine compression when writing manifests and manifest lists for the table.
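As a reference for what the PR reads, here is a minimal sketch of setting those two table properties. A plain map stands in for the Iceberg table's property API, and the codec names shown (gzip, zstd, snappy, uncompressed) are assumptions about accepted values, not taken from this PR:

```java
import java.util.HashMap;
import java.util.Map;

public class SetAvroCompression {
    public static void main(String[] args) {
        // Sketch only: a plain map stands in for the table's properties.
        Map<String, String> tableProps = new HashMap<>();
        tableProps.put("write.avro.compression-codec", "zstd");
        // The level is optional; omit it to use the codec's own default.
        tableProps.put("write.avro.compression-level", "3");
        System.out.println(tableProps.get("write.avro.compression-codec")); // zstd
    }
}
```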

wypoon (Contributor, PR author) commented Feb 10, 2023:

I am picking up #5893 where @sumeetgajjar left off, as he is now pursuing other projects.
@rdblue I believe Sumeet has addressed your feedback. Compression level is an Integer, to allow for it to be null (for codecs that do not use compression level), and I have used PropertyUtil.propertyAsNullableInt to turn the String into an Integer. Sumeet has removed unnecessary changes from tests and only added new tests.
I have rebased Sumeet's changes on master, fixed some issues, and moved the Flink changes from 1.15 to 1.16 (the current default Flink version).
This is ready for review.
cc @nastra and @amogh-jahagirdar who reviewed the original PR as well.
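The propertyAsNullableInt behavior described above can be approximated like this; a hedged re-implementation for illustration, not the Iceberg source:

```java
import java.util.Map;

public class NullableIntProperty {
    // Approximation of PropertyUtil.propertyAsNullableInt: an absent property
    // yields null rather than a default int, which lets callers skip setting
    // a compression level for codecs that do not take one.
    static Integer propertyAsNullableInt(Map<String, String> properties, String property) {
        String value = properties.get(property);
        return value != null ? Integer.parseInt(value) : null;
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of("write.avro.compression-level", "9");
        System.out.println(propertyAsNullableInt(props, "write.avro.compression-level")); // 9
        System.out.println(propertyAsNullableInt(props, "write.avro.compression-codec")); // null
    }
}
```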

wypoon force-pushed the avro_manifest_compression branch 2 times, most recently from f5457a3 to c81b209 (February 15, 2023 22:30)
wypoon (Contributor, PR author) commented Feb 15, 2023:

@nastra, thank you for your reviews. I have addressed all your feedback.

nastra (Contributor) left a comment:

I believe the only thing left is fixing the breaking API change. @rdblue or @danielcweeks could you guys review this as well please?

wypoon (Contributor, PR author) commented Feb 23, 2023:

@rdblue thank you for reviewing. I have just been on a short vacation. I'll look into your comments.

wypoon (Contributor, PR author) commented Feb 24, 2023:

@rdblue I have tried to address all your feedback. The only comment I did not understand is the one about not exposing Avro.Codec; I'm not sure what you'd like me to do. I'm fine with reverting it to private and simply using plain Strings in TableTestBase.AVRO_CODEC_NAME_MAPPING.

wypoon requested a review from rdblue (February 28, 2023 17:55)
wypoon force-pushed the avro_manifest_compression branch from e7e52c2 to 32a9a3a (March 8, 2023 00:14)
wypoon (Contributor, PR author) commented Mar 8, 2023:

@rdblue can you please review this again? @nastra has approved it.

wypoon force-pushed the avro_manifest_compression branch from f01d1d8 to fb6fd2e (September 21, 2023 05:57)
ConeyLiu (Contributor) left a comment:

+1 from my part

wypoon (Contributor, PR author) commented Sep 21, 2023:

TestStructuredStreamingRead3 > [catalogName = testhive, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hive, default-namespace=default}] > testReadingStreamFromFutureTimetsamp[catalogName = testhive, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hive, default-namespace=default}] FAILED
    org.opentest4j.AssertionFailedError: 
    Expecting value to be true but was false
        at [email protected]/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at [email protected]/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at [email protected]/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at app//org.apache.iceberg.spark.source.TestStructuredStreamingRead3.lambda$testReadingStreamFromFutureTimetsamp$0(TestStructuredStreamingRead3.java:266)
        at [email protected]/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
        at [email protected]/java.util.stream.IntPipeline$Head.forEach(IntPipeline.java:593)
        at app//org.apache.iceberg.spark.source.TestStructuredStreamingRead3.testReadingStreamFromFutureTimetsamp(TestStructuredStreamingRead3.java:263)

Is TestStructuredStreamingRead3 flaky?
spark-3x-scala-2-13-tests (8, 3.4) and spark-3x-java-17-tests (3.4, 2.13) passed; spark-3x-scala-2-13-tests (11, 3.4) failed with this error. It seems unlikely that it would fail only on Java 11.
spark-3x-scala-2-13-tests (8, 3.5) and spark-3x-scala-2-13-tests (11, 3.5) were cancelled due to the above failure, but spark-3x-java-17-tests (3.5, 2.13) passed.

wypoon (Contributor, PR author) commented Sep 22, 2023:

@nastra @ConeyLiu thanks for reviewing. I have updated the PR and CI is green.
@ConeyLiu raises a good point about the new newAppender in ManifestWriter. I put up #8617 as an alternative. If there are no other sticking points, then I hope one of these two can be merged in time for Iceberg 1.4.0.

wypoon (Contributor, PR author) commented Sep 24, 2023:

@nastra if you don't like #8617 can you please merge this instead? I have addressed all your feedback.

nastra (Contributor) commented Sep 25, 2023:

> @nastra if you don't like #8617 can you please merge this instead? I have addressed all your feedback.

I think it would be good to get some additional review from one other committer. /cc @Fokko @amogh-jahagirdar @aokolnychyi could any of you review this one as well please?

aokolnychyi (Contributor) commented:
Will do a review by the end of this week, sorry for the delay.

* @param compressionLevel compression level of the compressionCodec
* @return a manifest writer
*/
public static ManifestWriter<DataFile> write(
A reviewer (Contributor) commented on the snippet above:

I thought I already commented before going on vacation but can't seem to find the old discussion. Sorry if I post the same question again. Have we considered using a builder? My worry with the current approach was that we need to offer an overloaded method every time we add a new parameter.

@wypoon @nastra @rdblue?

wypoon (Contributor, PR author) replied:

@aokolnychyi thanks for reviewing. I'm interested to hear what @rdblue thinks. In the meantime, let me think about how to address your concern. However, using a builder will mean an API break, right?

wypoon (Contributor, PR author) replied:

@sumeetgajjar originally used a Map parameter for this, and @rdblue commented that "we don't want to pass a map of properties around. That's exposing too much where it doesn't need to be, and people tend to misuse generic arguments like this."
What I propose to do then is to introduce a ManifestWriter.Options class and use that here (instead of a Map). I'll also introduce a ManifestListWriter.Options class and use that in ManifestLists.write. These Options classes define what additional parameters are applicable and may be set. If in future, additional parameters are needed, they can be added to these Options classes.
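A minimal sketch of what such an Options class could look like; the names, defaults, and fluent style here are assumptions for illustration, not the API proposed in this PR:

```java
public class ManifestWriterOptionsSketch {
    // Hypothetical Options holder: new knobs become new fields, so
    // ManifestFiles.write never needs another overload when a
    // parameter is added.
    static class Options {
        private String compressionCodec;  // null => fall back to default codec
        private Integer compressionLevel; // null => codec's own default level

        Options compressionCodec(String codec) {
            this.compressionCodec = codec;
            return this;
        }

        Options compressionLevel(Integer level) {
            this.compressionLevel = level;
            return this;
        }

        String compressionCodec() { return compressionCodec; }
        Integer compressionLevel() { return compressionLevel; }
    }

    public static void main(String[] args) {
        Options options = new Options().compressionCodec("zstd").compressionLevel(3);
        System.out.println(options.compressionCodec()); // zstd
        System.out.println(options.compressionLevel()); // 3
    }
}
```

Because unset fields stay null, a writer can distinguish "caller chose nothing" from an explicit setting, which mirrors the nullable compression level discussed earlier in this PR.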

wypoon (Contributor, PR author) commented Oct 30, 2023:

@nastra @aokolnychyi please let me know what you think. I updated the PR with the new approach.

sumeetgajjar and others added 11 commits (November 28, 2023 17:56):
1. remove unnecessary edits to use compression codec and
   compression level from tests
2. move ManifestWriter tests to TestManifestWriter
3. move ManifestListWriter tests to TestManifestListWriter
4. remove unwanted `ManifestLists#write` method
Remove NumberUtil. Use PropertyUtil.propertyAsNullableInt instead.
Add convenience method propertyAsNullableInt to TableMetadata.
Fix TableTestBase#writeManifest to write a file with .avro extension.
Revert Flink 1.15 changes and make them in Flink 1.16 instead.
Retain but deprecate old newAppender method in ManifestWriter.
Make Codec public in Avro so we can use the enum values.
Rename CODEC_METADATA_MAPPING to AVRO_CODEC_NAME_MAPPING in TableTestBase
and provide explanatory comment. Use named constants in the map.
Adopt suggestion to use AssertJ assertions in test verification.
Update zstd-jni version.
Make zstd-jni dependency testRuntimeOnly.
Fix some nits.
Simplify validate methods in TestManifestWriter and TestManifestListWriter.
... and have the static methods in ManifestFiles and ManifestLists
for writing use them.
wypoon force-pushed the avro_manifest_compression branch from 041b7a0 to be048b5 (November 30, 2023 03:07)
wypoon (Contributor, PR author) commented Nov 30, 2023:

Hmm, I think TestExpireSnapshotsAction > dataFilesCleanupWithParallelTasks might be a flaky test?
All Spark 3 tests passed with Java 8 and 11, and even for Java 17, they passed for Scala 2.13, so it seems unlikely that there is a problem just with Scala 2.12 on Java 17.

wypoon (Contributor, PR author) commented Nov 30, 2023:

@nastra @aokolnychyi I have rebased on main and resolved the conflicts with the RewriteManifestsSparkAction refactoring. As I mentioned before, I have introduced ManifestWriter.Options and ManifestListWriter.Options to be passed to ManifestFiles.write and ManifestLists.write. These Options classes provide defined options (currently just compression codec and level but extensible in future) that may be passed.

github-actions (bot) commented:
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions bot added the stale label (Aug 25, 2024)
github-actions (bot) commented Sep 1, 2024:

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot closed this (Sep 1, 2024)