Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Jan 16, 2025

Which issue does this PR close?

This is a fork of the branch created by @parthchandra that merged main into comet-parquet-exec and it also fixes a regression where a merge conflict had resulted in reverting to use DataFusion's FilterExec rather than Comet's forked version which fixed a safety issue around buffer re-use.

Rationale for this change

We want to get these changes into main because it is very time consuming to keep rebasing this work.

What changes are included in this PR?

How are these changes tested?

mbutrovich and others added 30 commits November 8, 2024 13:54
add partial support for multiple parquet files
"filter with string" test now passes
* wip - CometNativeScan

* fix and make config internal
…e debug logging (apache#1080)

* update tests, remove some debug logging

* update tests, remove some debug logging

* update tests, remove some debug logging

* remove unused import
…che#1081)

* I think serde works. Gonna try removing the old stuff.

* Fixes after merging in upstream.

* Remove previous file_config logic. Clippy.

* Temporary assertion for testing.

* Remove old path proto value.

* Selectively generate projection vector.
…stead of FileScanRDD (apache#1088)

* DataSourceRDD handling (seems to be related to prefetching, so maybe not relevant for our ParquetExec).

* Refactor to reduce duplicate code.
…pache#1106)

* init

* more

* more

* fix clippy

* Use Spark and Arrow types for partition schema
* fix: Use RDD partition index (apache#1112)

* fix: Use RDD partition index

* fix

* fix

* fix

* fix style
…e#1138)

* WIP: (POC2) A Parquet reader that uses the arrow-rs Parquet reader directly

* Change default config

---------

Co-authored-by: Parth Chandra <[email protected]>
…rquet (apache#1075)

* implement basic native code for casting struct to struct

* add another test

* rustdoc

* add scala side

* code cleanup

* clippy

* clippy

* add scala test

* improve test

* simple struct case passes

* save progress

* copy schema adapter code from DataFusion

* more tests pass

* save progress

* remove debug println

* remove debug println
…e#1142)

* Serialize original data schema and required schema, generate projection vector on the Java side.

* Sending over more schema info like column names and nullability.

* Using the new stuff in the proto. About to take the old out.

* Remove old logic.

* remove errant print.

* Serialize original data schema and required schema, generate projection vector on the Java side.

* Sending over more schema info like column names and nullability.

* Using the new stuff in the proto. About to take the old out.

* Remove old logic.

* remove errant print.

* Remove commented print. format.

* Remove commented print. format.

* Fix projection_vector to include partition_schema cols correctly.

* Rename variable.
andygrove and others added 13 commits January 7, 2025 12:58
…scan is enabled (apache#1230)

* Disable DPP in stability tests, update plans for Spark 3.4

* update plans for Spark 3.5

* fix scan name

* fix scan name

* fix scan name

* Revert a change
…pache#1237)

* fix regression in
DisableAQECometShuffleSuite

* update comments

* address feedback

* typo
…f Cast expression. (apache#1229)

* Copy cast.rs logic to a new parquet_support.rs. Remove Parquet dependencies on cast.rs.

* Move parquet_support and schema_adapter to parquet folder.

* Add fields to SparkParquetOptions.
…ive_comet (apache#1265)

* fix: fix tests failing in native_recordbatch but not in native_full

* fix: use session timestamp in native scans

* Revert "fix: use session timestamp in native scans"

This reverts commit e601deb472037338a36300992434a987bdb026e8.

* Revert Change to native record batch timezone

* Change stability plans to match original scan.

* fix after rebase

* Update plans; generate distinct plans for full native scan

* generate plans for native_recordbatch

* In struct tests, check Comet operator only for scan types that support complex types

* Revert "Revert Change to native record batch timezone"

This reverts commit 4a147f3.

* Reapply "fix: use session timestamp in native scans"

This reverts commit 370f901.

* Fix previous commit

* Rename configs and default scan impl to 'native_comet'

* add missing change

* fix build

* update plans for spark 3.5

* Add new plans for spark 3.5

* Update plans for Spark 4.0

* Plans updated from Spark 4
@andygrove andygrove changed the title ignore: fork of Comet parquet exec merge 20250114 [comet-parquet-exec] ignore: merge Parth's branch into main Jan 16, 2025
@andygrove andygrove changed the title [comet-parquet-exec] ignore: merge Parth's branch into main chore: Merge comat-parquet-exec into main Jan 16, 2025
@andygrove andygrove marked this pull request as ready for review January 16, 2025 16:07
@andygrove andygrove changed the title chore: Merge comat-parquet-exec into main chore: Merge comet-parquet-exec into main Jan 16, 2025
@andygrove
Copy link
Member Author

One test failure:

2025-01-16T17:21:07.8630758Z CometTPCDSQuerySuite:
2025-01-16T17:21:14.1575727Z 25/01/16 17:21:14 INFO core/src/lib.rs: Comet native library version 0.6.0 initialized
2025-01-16T17:21:21.0957254Z - q1 (5 seconds, 929 milliseconds)
2025-01-16T17:21:22.7696176Z - q2 *** FAILED *** (1 second, 577 milliseconds)
2025-01-16T17:21:22.7698480Z   java.lang.Exception: Expected "struct<[d_week_seq1:int,round((sun_sales1 / sun_sales2), 2):decimal(20,2),round((mon_sales1 / mon_sales2), 2):decimal(20,2),round((tue_sales1 / tue_sales2), 2):decimal(20,2),round((wed_sales1 / wed_sales2), 2):decimal(20,2),round((thu_sales1 / thu_sales2), 2):decimal(20,2),round((fri_sales1 / fri_sales2), 2):decimal(20,2),round((sat_sales1 / sat_sales2), 2):decimal(20,2)]>", but got "struct<[]>" Schema did not match

@andygrove
Copy link
Member Author

Multiple tests failing with same error. Here is one example:

SPARK-33482: Fix FileScan canonicalization *** FAILED *** (510 milliseconds)
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions: ArrayBuffer(2, 0, 0)

@andygrove andygrove marked this pull request as draft January 16, 2025 18:20
@andygrove andygrove changed the title chore: Merge comet-parquet-exec into main IGNORE: chore: Merge comet-parquet-exec into (just to see diff) Jan 17, 2025
@andygrove andygrove closed this Jan 17, 2025
@andygrove andygrove deleted the comet-parquet-exec-merge-20250114 branch June 17, 2025 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants