Conversation

@parthchandra
Contributor

Which issue does this PR close?

Closes #876 .

Rationale for this change

Moves the parallel reader's configuration options into CometConf and adds documentation so that end users can easily understand how to configure the reader.

How are these changes tested?

Existing unit test

```diff
 public void testWriteReadMergeScanRange() throws Throwable {
   Configuration conf = new Configuration();
-  conf.set(ReadOptions.COMET_IO_MERGE_RANGES, Boolean.toString(true));
+  conf.set(CometConf.COMET_IO_MERGE_RANGES().key(), Boolean.toString(true));
```
Member

This test is still setting the config values in a Hadoop Configuration rather than in the Spark config. Would it make sense to update the test?

Contributor Author

There is no Spark context in this test. I've added a new test with the configuration set through the Spark config.

```java
public Builder(Configuration conf) {
  this.conf = conf;
  this.parallelIOEnabled =
      conf.getBoolean(
```
Member

This is reading from the Hadoop conf. If I set the new configs on my Spark context, how would they get propagated to the Hadoop conf?

Contributor Author

Spark SQL copies configs that do not come from Spark into the Hadoop config when the SQL context is created. Other settings rely on this as well (e.g. COMET_USE_LAZY_MATERIALIZATION, COMET_SCAN_PREFETCH_ENABLED).
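The copy-over behavior described here can be sketched in miniature. This is a minimal illustration, not Spark's actual logic: `java.util.Properties` stands in for `org.apache.hadoop.conf.Configuration`, and the key name and the "skip `spark.`-prefixed entries" rule are both hypothetical simplifications.

```java
import java.util.Map;
import java.util.Properties;

public class ConfPropagation {
    // Toy model of Spark SQL copying non-Spark session config entries
    // into the Hadoop Configuration when the SQL context is created.
    // The filtering rule here is illustrative only; Spark's real rule
    // is more involved.
    static Properties propagate(Map<String, String> sessionConf) {
        Properties hadoopConf = new Properties();
        for (Map.Entry<String, String> e : sessionConf.entrySet()) {
            if (!e.getKey().startsWith("spark.")) {
                hadoopConf.setProperty(e.getKey(), e.getValue());
            }
        }
        return hadoopConf;
    }

    public static void main(String[] args) {
        // Hypothetical key name, chosen only for the sketch.
        Map<String, String> session = Map.of(
            "comet.parquet.read.io.mergeRanges", "true",
            "spark.sql.shuffle.partitions", "200");
        Properties hadoop = propagate(session);
        // The Comet-style key is now visible to Hadoop-side readers.
        System.out.println(hadoop.getProperty("comet.parquet.read.io.mergeRanges"));
    }
}
```

Because the reader's Builder only ever sees the Hadoop Configuration, any setting that should reach it has to survive this copy step.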

```scala
withSQLConf(
  CometConf.COMET_BATCH_SIZE.key -> batchSize.toString,
  CometConf.COMET_IO_MERGE_RANGES.key -> "true",
  CometConf.COMET_IO_MERGE_RANGES_DELTA.key -> mergeRangeDelta.toString) {
```
Member

Is it possible to test that this config actually makes it into ReadOptions? If I comment out all of the code in ReadOptions that reads these configs, the test still passes. Perhaps we just need a specific test showing that setting the config on a Spark context causes a change in the ReadOptions?

Member

I added some debug logging and I do see that it is working correctly, but would be good to have a test to confirm (and prevent regressions)

```
test is setting COMET_IO_MERGE_RANGES_DELTA = 1048576
ReadOptions ioMergeRangesDelta = 1048576
test is setting COMET_IO_MERGE_RANGES_DELTA = 1024
ReadOptions ioMergeRangesDelta = 1024
```

Contributor Author

I added an additional check for the config. The Configuration passed in to ReadOptions is not accessible, but I tried to simulate the next best thing. See if that makes sense.
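The shape of such a regression check can be sketched as follows. This is a stand-alone illustration under stated assumptions: `java.util.Properties` replaces the Hadoop Configuration, and the key name, default value, and helper are all hypothetical rather than Comet's actual ReadOptions code.

```java
import java.util.Properties;

public class ReadOptionsCheck {
    // Minimal stand-in for the ReadOptions builder: read the merge-range
    // delta from a Hadoop-style key/value conf, falling back to a default
    // when the key is absent. Key and default are illustrative only.
    static long ioMergeRangesDelta(Properties conf) {
        return Long.parseLong(
            conf.getProperty("comet.parquet.read.io.mergeRanges.delta", "8388608"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Simulate the value having been copied over from the Spark config.
        conf.setProperty("comet.parquet.read.io.mergeRanges.delta", "1024");
        long delta = ioMergeRangesDelta(conf);
        // The regression check: the value set on the Spark side must be the
        // value the reader actually sees, so commenting out the config-reading
        // code would make this fail rather than silently pass.
        if (delta != 1024) throw new AssertionError("config did not propagate");
        System.out.println(delta);
    }
}
```

The point of the check is exactly the gap Andy identified: a test that only sets the config proves nothing unless it also observes the value on the reader side.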

Member

Thanks, Parth. LGTM. Looks like you need to run `make format` to fix some import ordering.

Contributor Author

Thanks @andygrove. Fixed style.

Member

@andygrove left a comment

LGTM with some comments on testing.

@andygrove andygrove merged commit 0667c60 into apache:main Oct 7, 2024
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
## Which issue does this PR close?

Closes #.

## Rationale for this change


## What changes are included in this PR?


```
cb3e977 perf: Add experimental feature to replace SortMergeJoin with ShuffledHashJoin (apache#1007)
3df9d5c fix: Make comet-git-info.properties optional (apache#1027)
4033687 chore: Reserve memory for native shuffle writer per partition (apache#1022)
bd541d6 (public/main) remove hard-coded version number from Dockerfile (apache#1025)
e3ac6cf feat: Implement bloom_filter_agg (apache#987)
8d097d5 (origin/main) chore: Revert "chore: Reserve memory for native shuffle writer per partition (apache#988)" (apache#1020)
591f45a chore: Bump arrow-rs to 53.1.0 and datafusion (apache#1001)
e146cfa chore: Reserve memory for native shuffle writer per partition (apache#988)
abd9f85 fix: Fallback to Spark if named_struct contains duplicate field names (apache#1016)
22613e9 remove legacy comet-spark-shell (apache#1013)
d40c802 clarify that Maven central only has jars for Linux (apache#1009)
837c256 docs: Various documentation improvements (apache#1005)
0667c60 chore: Make parquet reader options Comet options instead of Hadoop options (apache#968)
0028f1e fix: Fallback to Spark if scan has meta columns (apache#997)
b131cc3 feat: Support `GetArrayStructFields` expression (apache#993)
3413397 docs: Update tuning guide (apache#995)
afd28b9 Quality of life fixes for easier hacking (apache#982)
18150fb chore: Don't transform the HashAggregate to CometHashAggregate if Comet shuffle is disabled (apache#991)
a1599e2 chore: Update for 0.3.0 release, prepare for 0.4.0 development (apache#970)
```

## How are these changes tested?

Successfully merging this pull request may close these issues:

Add documentation for COMET_PARQUET_PARALLEL_IO_ENABLED