IGNORE: chore: Merge comet-parquet-exec into (just to see diff) #1296

andygrove · 2025-01-16T15:54:08Z

Which issue does this PR close?

This is a fork of the branch created by @parthchandra that merged main into comet-parquet-exec. It also fixes a regression where a merge conflict had reverted the code to DataFusion's FilterExec instead of Comet's forked version, which fixed a safety issue around buffer re-use.
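To illustrate the kind of hazard a buffer re-use issue can create, here is a minimal sketch. It assumes the problem arises when a filter hands its input through without copying while the producer later overwrites the same buffer; the PR text does not spell out the mechanism. `SharedBuffer`, `filter_passthrough`, and `filter_copying` are hypothetical names for illustration, not Comet or DataFusion APIs.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a batch buffer that a producer overwrites in place.
type SharedBuffer = Rc<RefCell<Vec<i32>>>;

/// Result of a filter: either an alias of the input buffer or an owned copy.
enum Filtered {
    Aliased(SharedBuffer),
    Owned(Vec<i32>),
}

/// Hypothetical filter that skips the copy when every row passes the predicate.
fn filter_passthrough(batch: &SharedBuffer, pred: impl Fn(i32) -> bool) -> Filtered {
    let data = batch.borrow();
    if data.iter().all(|v| pred(*v)) {
        // No copy: the output still points at the producer's buffer.
        Filtered::Aliased(Rc::clone(batch))
    } else {
        let kept: Vec<i32> = data.iter().copied().filter(|v| pred(*v)).collect();
        Filtered::Owned(kept)
    }
}

/// Hypothetical filter that always copies the selected rows.
fn filter_copying(batch: &SharedBuffer, pred: impl Fn(i32) -> bool) -> Vec<i32> {
    let data = batch.borrow();
    let kept: Vec<i32> = data.iter().copied().filter(|v| pred(*v)).collect();
    kept
}

/// Read the current contents of a filter result.
fn values(result: &Filtered) -> Vec<i32> {
    match result {
        Filtered::Aliased(buffer) => buffer.borrow().clone(),
        Filtered::Owned(rows) => rows.clone(),
    }
}

fn main() {
    let buffer: SharedBuffer = Rc::new(RefCell::new(vec![1, 2, 3]));
    let aliased = filter_passthrough(&buffer, |v| v > 0);
    let copied = filter_copying(&buffer, |v| v > 0);

    // The producer re-uses the buffer for the next batch.
    *buffer.borrow_mut() = vec![7, 8, 9];

    println!("pass-through result: {:?}", values(&aliased)); // now shows the next batch
    println!("copying result:      {:?}", copied); // still the first batch
}
```

In this sketch the pass-through result silently changes once the buffer is re-used, while the copying result keeps the rows that were actually filtered.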

Rationale for this change

We want to get these changes into main because it is very time-consuming to keep rebasing this work.

What changes are included in this PR?

How are these changes tested?

add partial support for multiple parquet files

"filter with string" test now passes

* wip - CometNativeScan
* fix and make config internal

This reverts commit 38e32f7.

…e debug logging (apache#1080)
* update tests, remove some debug logging
* update tests, remove some debug logging
* update tests, remove some debug logging
* remove unused import

…che#1081)
* I think serde works. Gonna try removing the old stuff.
* Fixes after merging in upstream.
* Remove previous file_config logic. Clippy.
* Temporary assertion for testing.
* Remove old path proto value.
* Selectively generate projection vector.

…stead of FileScanRDD (apache#1088)
* DataSourceRDD handling (seems to be related to prefetching, so maybe not relevant for our ParquetExec).
* Refactor to reduce duplicate code.

…e#1094)
* init
* more
* fix
* more
* more
* fix

…pache#1106)
* init
* more
* more
* fix clippy
* Use Spark and Arrow types for partition schema

* fix: Use RDD partition index (apache#1112)
* fix: Use RDD partition index
* fix
* fix
* fix
* fix style

…e#1138)
* WIP: (POC2) A Parquet reader that uses the arrow-rs Parquet reader directly
* Change default config
---------
Co-authored-by: Parth Chandra <[email protected]>

…rquet (apache#1075)
* implement basic native code for casting struct to struct
* add another test
* rustdoc
* add scala side
* code cleanup
* clippy
* clippy
* add scala test
* improve test
* simple struct case passes
* save progress
* copy schema adapter code from DataFusion
* more tests pass
* save progress
* remove debug println
* remove debug println

…e#1142)
* Serialize original data schema and required schema, generate projection vector on the Java side.
* Sending over more schema info like column names and nullability.
* Using the new stuff in the proto. About to take the old out.
* Remove old logic.
* remove errant print.
* Serialize original data schema and required schema, generate projection vector on the Java side.
* Sending over more schema info like column names and nullability.
* Using the new stuff in the proto. About to take the old out.
* Remove old logic.
* remove errant print.
* Remove commented print. format.
* Remove commented print. format.
* Fix projection_vector to include partition_schema cols correctly.
* Rename variable.

…scan is enabled (apache#1230)
* Disable DPP in stability tests, update plans for Spark 3.4
* update plans for Spark 3.5
* fix scan name
* fix scan name
* fix scan name
* Revert a change

…pache#1237)
* fix regression in DisableAQECometShuffleSuite
* update comments
* address feedback
* typo

…1231)

…f Cast expression. (apache#1229)
* Copy cast.rs logic to a new parquet_support.rs. Remove Parquet dependencies on cast.rs.
* Move parquet_support and schema_adapter to parquet folder.
* Add fields to SparkParquetOptions.

…ive_comet (apache#1265)
* fix: fix tests failing in native_recordbatch but not in native_full
* fix: use session timestamp in native scans
* Revert "fix: use session timestamp in native scans" This reverts commit e601deb472037338a36300992434a987bdb026e8.
* Revert Change to native record batch timezone
* Change stability plans to match original scan.
* fix after rebase
* Update plans; generate distinct plans for full native scan
* generate plans for native_recordbatch
* In struct tests, check Comet operator only for scan types that support complex types
* Revert "Revert Change to native record batch timezone" This reverts commit 4a147f3.
* Reapply "fix: use session timestamp in native scans" This reverts commit 370f901.
* Fix previous commit
* Rename configs and default scan impl to 'native_comet'
* add missing change
* fix build
* update plans for spark 3.5
* Add new plans for spark 3.5
* Update plans for Spark 4.0
* Plans updated from Spark 4

…eberg_compat scans (apache#1279)

andygrove · 2025-01-16T17:36:43Z

One test failure:

2025-01-16T17:21:07.8630758Z CometTPCDSQuerySuite:
2025-01-16T17:21:14.1575727Z 25/01/16 17:21:14 INFO core/src/lib.rs: Comet native library version 0.6.0 initialized
2025-01-16T17:21:21.0957254Z - q1 (5 seconds, 929 milliseconds)
2025-01-16T17:21:22.7696176Z - q2 *** FAILED *** (1 second, 577 milliseconds)
2025-01-16T17:21:22.7698480Z   java.lang.Exception: Expected "struct<[d_week_seq1:int,round((sun_sales1 / sun_sales2), 2):decimal(20,2),round((mon_sales1 / mon_sales2), 2):decimal(20,2),round((tue_sales1 / tue_sales2), 2):decimal(20,2),round((wed_sales1 / wed_sales2), 2):decimal(20,2),round((thu_sales1 / thu_sales2), 2):decimal(20,2),round((fri_sales1 / fri_sales2), 2):decimal(20,2),round((sat_sales1 / sat_sales2), 2):decimal(20,2)]>", but got "struct<[]>" Schema did not match

andygrove · 2025-01-16T17:56:09Z

Multiple tests failing with same error. Here is one example:

SPARK-33482: Fix FileScan canonicalization *** FAILED *** (510 milliseconds)
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions: ArrayBuffer(2, 0, 0)
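As a rough illustration of what this error means, the sketch below reproduces the precondition behind it, reading the list in the message as the partition count of each input being zipped (here 2, 0, and 0). It is a simplified model, not Spark code: `Partitioned` and `zip_partitions` are hypothetical names, and partitions are modeled as plain vectors.

```rust
/// Hypothetical stand-in for a partitioned dataset: one inner Vec per partition.
type Partitioned = Vec<Vec<i32>>;

/// Zip inputs partition by partition. All inputs must have the same number of
/// partitions; otherwise return an error that lists each input's partition count.
fn zip_partitions(inputs: &[Partitioned]) -> Result<Vec<Vec<i32>>, String> {
    let counts: Vec<usize> = inputs.iter().map(|p| p.len()).collect();
    if counts.windows(2).any(|w| w[0] != w[1]) {
        return Err(format!(
            "Can't zip inputs with unequal numbers of partitions: {:?}",
            counts
        ));
    }
    let n = counts.first().copied().unwrap_or(0);
    let zipped = (0..n)
        .map(|i| {
            // Concatenate partition i of every input into one output partition.
            inputs
                .iter()
                .flat_map(move |p| p[i].iter().copied())
                .collect::<Vec<i32>>()
        })
        .collect();
    Ok(zipped)
}

fn main() {
    let scan: Partitioned = vec![vec![1, 2], vec![3]]; // 2 partitions
    let empty_a: Partitioned = vec![]; // 0 partitions
    let empty_b: Partitioned = vec![]; // 0 partitions

    // Mirrors the shape of the failure above: partition counts 2, 0, 0.
    match zip_partitions(&[scan.clone(), empty_a, empty_b]) {
        Ok(rows) => println!("zipped: {:?}", rows),
        Err(message) => println!("error: {}", message),
    }

    // With matching partition counts the zip succeeds.
    let other: Partitioned = vec![vec![10], vec![20]];
    println!("{:?}", zip_partitions(&[scan, other]));
}
```

Under this reading, the failing tests have one input with two partitions combined with two inputs that have none, so the fix would need to make every input report the same partition count.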

mbutrovich and others added 30 commits November 8, 2024 13:54

Stash changes. Maybe runs TPC-H q1?

9535117

filters?

22b648d

Enable filter pushdown with TableParquetOptions.

8e47562

Clippy.

196311e

Fix Q1.

fb68558

add partial support for multiple parquet files

0b0d6e8

Merge branch 'native_parquet2' into df-parquet-exec

2027755

Merge pull request #1 from andygrove/df-parquet-exec

ef9f8f5

add partial support for multiple parquet files

Clippy

95d69fa

partitioning

0830840

merge

d7396e4

fix

4e525fc

fix

ef54934

Merge pull request apache#2 from andygrove/df-parquet-exec

e52fe77

"filter with string" test now passes

upmerge

16033d9

Merge remote-tracking branch 'apache/main' into comet-parquet-exec

ad46821

wip - CometNativeScan (apache#1076)

38e32f7

* wip - CometNativeScan
* fix and make config internal

Revert "wip - CometNativeScan (apache#1076)"

311bc9e

This reverts commit 38e32f7.

wip - CometNativeScan (apache#1078)

bd68db8

[comet-parquet-exec] Fix compilation errors in Rust tests, remove som…

33d2b23

…e debug logging (apache#1080)
* update tests, remove some debug logging
* update tests, remove some debug logging
* update tests, remove some debug logging
* remove unused import

update some stability plans (apache#1083)

786250a

[comet-parquet-exec] Handle CometNativeScan RDD when DataSourceRDD in…

8a0df9d

…stead of FileScanRDD (apache#1088)
* DataSourceRDD handling (seems to be related to prefetching, so maybe not relevant for our ParquetExec).
* Refactor to reduce duplicate code.

feat: Hook DataFusion Parquet native scan with Comet execution (apach…

1cca8d6

…e#1094)
* init
* more
* fix
* more
* more
* fix

fix: Support partition values in feature branch comet-parquet-exec (a…

c3ad26e

…pache#1106)
* init
* more
* more
* fix clippy
* Use Spark and Arrow types for partition schema

fix: Use filePath instead of pathUri (apache#1124)

4de51a8

fix: [comet-parquet-exec] Use RDD partition index (apache#1120)

29b2b77

* fix: Use RDD partition index (apache#1112)
* fix: Use RDD partition index
* fix
* fix
* fix
* fix style

[comet-parquet-exec] Comet parquet exec 2 (copy of Parth's PR) (apach…

ab09337

…e#1138)
* WIP: (POC2) A Parquet reader that uses the arrow-rs Parquet reader directly
* Change default config
---------
Co-authored-by: Parth Chandra <[email protected]>

andygrove and others added 13 commits January 7, 2025 12:58

[comet-parquet-exec] Disable DPP in stability tests when full native …

78e2820

…scan is enabled (apache#1230)
* Disable DPP in stability tests, update plans for Spark 3.4
* update plans for Spark 3.5
* fix scan name
* fix scan name
* fix scan name
* Revert a change

[comet-parquet-exec] Fix regressions in DisableAQECometShuffleSuite (a…

6ab514f

…pache#1237)
* fix regression in DisableAQECometShuffleSuite
* update comments
* address feedback
* typo

fix: Set scan implementation choice via environment variable (apache#…

01b5917

…1231)

chore: Upgrade to DataFusion 44.0.0-rc2 (apache#1154) (apache#1272)

6df59bd

test: temporarily disable plan stability tests (apache#1274)

274566f

fix: handle loading of complex types into CometVector correctly in ic…

6cafe28

…eberg_compat scans (apache#1279)

Merge branch 'main' into comet-parquet-exec

017963a

Fix build after merge

285396c

Fix tests after merge

b3703f5

Fix plans after merge

2c83bdd

fix partition id in execute plan after merge (from Andy Grove)

79717b8

andygrove changed the title ~~ignore: fork of Comet parquet exec merge 20250114~~ [comet-parquet-exec] ignore: merge Parth's branch into main Jan 16, 2025

fix regression

c137a74

andygrove changed the title ~~[comet-parquet-exec] ignore: merge Parth's branch into main~~ chore: Merge comat-parquet-exec into main Jan 16, 2025

andygrove marked this pull request as ready for review January 16, 2025 16:07

andygrove changed the title ~~chore: Merge comat-parquet-exec into main~~ chore: Merge comet-parquet-exec into main Jan 16, 2025

fix

069099e

andygrove requested review from kazuyukitanimura and viirya January 16, 2025 16:32

andygrove added 3 commits January 16, 2025 09:34

fix merge issue around spark-expr refactoring

5d0c693

Revert config change

e59f25e

update configs.md

1b45a02

andygrove marked this pull request as draft January 16, 2025 18:20

andygrove changed the title ~~chore: Merge comet-parquet-exec into main~~ IGNORE: chore: Merge comet-parquet-exec into (just to see diff) Jan 17, 2025

andygrove closed this Jan 17, 2025

andygrove deleted the comet-parquet-exec-merge-20250114 branch June 17, 2025 17:44

4 participants