[AutoSparkUT] Fix cached table zero-column scan crash (issue #14098) #14446

wjxiz1992 merged 3 commits into NVIDIA:main
Conversation
…o-column scans (issue NVIDIA#14098)

When a cached table is used in a cross join and one side needs zero columns (only row count), ParquetCachedBatchSerializer incorrectly treated empty selectedAttributes as "select all columns". This caused a column count mismatch when the broadcast exchange deserialized the buffer with an empty output schema.

Return row-only batches when selectedAttributes is empty instead of falling back to all cached columns. Fixes both the GPU path (gpuConvertCachedBatchToColumnarBatch) and the CPU fallback path (convertCachedBatchToColumnarBatch).

### Performance

Cold-path only change. Normal column selection (hot path) is unaffected. The zero-column edge case is now faster since it skips unnecessary parquet decoding.

### Checklists

- [ ] This PR has added documentation for new or modified features or behaviors.
- [x] This PR has added new tests or modified existing tests to cover new code paths.
- [x] Performance testing has been performed and its results are added in the PR description.

Closes NVIDIA#14098

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
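The schema mismatch described above can be sketched in plain Scala. This is a toy model with hypothetical names (`CachedBatch`, `selectBuggy`, `selectFixed`); the real code operates on `RDD[CachedBatch]` and Spark schemas inside `ParquetCachedBatchSerializer`:

```scala
// Minimal model of the bug, using stand-in types rather than the plugin's API.
final case class CachedBatch(numRows: Int, columns: Seq[String])

// Buggy behavior: an empty selection falls back to "all cached columns".
def selectBuggy(batch: CachedBatch, selected: Seq[String]): Seq[String] =
  if (selected.isEmpty) batch.columns else selected

// Fixed behavior: an empty selection yields a zero-column, row-only result.
def selectFixed(batch: CachedBatch, selected: Seq[String]): Seq[String] =
  if (selected.isEmpty) Seq.empty else selected

val batch = CachedBatch(numRows = 3, columns = Seq("a", "b"))
// The consumer's output schema has 0 columns; the buggy path hands back 2,
// which is the column count mismatch the broadcast exchange crashed on.
assert(selectBuggy(batch, Seq.empty).size == 2)
assert(selectFixed(batch, Seq.empty).isEmpty)
```

Either way the row count survives, which is all the zero-column consumer needs.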
Keep lines that are under 85 chars on a single line instead of splitting them across multiple lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Pull request overview
Fixes a crash in the Parquet-based cached table serializer when a cached table scan selects zero columns (e.g., count-only/cross-join side row-count usage), by returning row-only ColumnarBatches instead of incorrectly treating “no selected attributes” as “select all columns”.
Changes:
- Add an explicit zero-selected-columns fast-path in `ParquetCachedBatchSerializer` for both the GPU and CPU conversion paths.
- Remove the `RapidsSQLQuerySuite` exclusion for `SPARK-6743: no columns from cache` now that the bug is fixed.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala | Re-enables the Spark-derived unit test previously excluded due to the crash. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/ParquetCachedBatchSerializer.scala | Returns row-only batches on zero-column scans and avoids unnecessary parquet decoding / schema mismatch. |
```scala
// When no columns are selected (e.g., count-only scan or
// cross-join side that needs only row count), return
// row-only batches without decoding parquet data.
if (selectedAttributes.isEmpty) {
  return input.map {
    case parquetCB: ParquetCachedBatch =>
      new ColumnarBatch(Array.empty, parquetCB.numRows)
    case other =>
      throw new IllegalStateException(
        s"Expected ParquetCachedBatch but got ${other.getClass}")
  }
}
```
The zero-column fast-path mapping (selectedAttributes.isEmpty -> map ParquetCachedBatch to new ColumnarBatch(Array.empty, numRows)) is duplicated here and again in convertCachedBatchToColumnarBatch. Consider extracting a small private helper to keep behavior/exception text consistent and reduce the chance of one path diverging in future edits.
The two call sites differ in return type semantics (gpuConvertCachedBatchToColumnarBatch returns GPU-resident batches, convertCachedBatchToColumnarBatch returns host batches) so a shared helper would need to paper over that distinction. Given the logic is just new ColumnarBatch(Array.empty, numRows), the duplication is minimal and a helper would add more abstraction than it saves.
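For reference, the helper the review proposed might look like this sketch. The name `rowOnlyBatches` and the stand-in types are hypothetical; the plugin's real code maps over an `RDD[CachedBatch]` and builds Spark `ColumnarBatch` objects:

```scala
// Stand-ins for the plugin's types, for illustration only.
final case class ParquetCachedBatch(numRows: Int)
final case class RowOnlyBatch(columns: Array[String], numRows: Int)

// The hypothetical extracted helper: one place for the mapping and the
// exception text shared by both conversion paths.
def rowOnlyBatches(input: Seq[Any]): Seq[RowOnlyBatch] =
  input.map {
    case p: ParquetCachedBatch =>
      RowOnlyBatch(Array.empty, p.numRows)
    case other =>
      throw new IllegalStateException(
        s"Expected ParquetCachedBatch but got ${other.getClass}")
  }

val out = rowOnlyBatches(Seq(ParquetCachedBatch(5), ParquetCachedBatch(7)))
assert(out.map(_.numRows) == Seq(5, 7))
assert(out.forall(_.columns.isEmpty))
```

As the reply notes, the helper would still have to reconcile GPU-resident versus host return types at the two call sites, which is why the duplication was kept.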
**Greptile Summary**

This PR fixes a crash in `ParquetCachedBatchSerializer` when a cached table scan selects zero columns. Key changes:

Minor observation: The existing …

Confidence Score: 5/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[convertCachedBatchToColumnarBatch\nor gpuConvertCachedBatchToColumnarBatch] --> B{selectedAttributes.isEmpty?}
    B -- Yes\n NEW path --> C[Return RDD mapping each\nParquetCachedBatch →\nColumnarBatch with Array.empty\nand parquetCB.numRows]
    B -- No\n Existing path --> D[getSupportedSchemaFromUnsupported\ncacheAttributes, selectedAttributes]
    D --> E{isSqlEnabled &&\nisSqlExecuteOnGPU &&\nisSchemaSupportedByCudf?}
    E -- Yes --> F[convertCachedBatchToColumnarInternal\nGPU decode path]
    F --> G[Copy GPU→Host columns\nwrap in CloseableColumnBatchIterator]
    E -- No --> H[CachedBatchIteratorConsumer\nCPU Parquet decode path]
    C --> I[Consumer receives\n0-column batch with\ncorrect row count ✓]
    G --> I
    H --> I
```
Last reviewed commit: "Restore original mul..."
```scala
if (selectedAttributes.isEmpty) {
  return input.map {
    case parquetCB: ParquetCachedBatch =>
      new ColumnarBatch(Array.empty, parquetCB.numRows)
    case other =>
      throw new IllegalStateException(
        s"Expected ParquetCachedBatch but got ${other.getClass}")
  }
}
```
**Inconsistency with `CloseableColumnBatchIterator` wrapping**

The non-empty paths in convertCachedBatchToColumnarBatch both wrap their results in CloseableColumnBatchIterator (GPU path at line 592, CPU path via CachedBatchIteratorConsumer), but the new zero-column early-return does not. While this is functionally safe — ColumnarBatch(Array.empty, n) holds no closeable column vector resources — it is a structural inconsistency. Consider wrapping for uniformity:
```diff
-if (selectedAttributes.isEmpty) {
-  return input.map {
-    case parquetCB: ParquetCachedBatch =>
-      new ColumnarBatch(Array.empty, parquetCB.numRows)
-    case other =>
-      throw new IllegalStateException(
-        s"Expected ParquetCachedBatch but got ${other.getClass}")
-  }
-}
+// When no columns are selected, return row-only batches
+if (selectedAttributes.isEmpty) {
+  return input.mapPartitions { cbIter =>
+    CloseableColumnBatchIterator(cbIter.map {
+      case parquetCB: ParquetCachedBatch =>
+        new ColumnarBatch(Array.empty, parquetCB.numRows)
+      case other =>
+        throw new IllegalStateException(
+          s"Expected ParquetCachedBatch but got ${other.getClass}")
+    })
+  }
+}
```
The same note applies to the analogous block in gpuConvertCachedBatchToColumnarBatch (lines 472–479).
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
As you noted, this is functionally safe — the empty ColumnarBatch holds no closeable resources, so wrapping it in CloseableColumnBatchIterator would be a no-op. Keeping the early return simple makes the intent clearer: no columns → no decoding, just row count.
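The "no closeable resources" argument can be illustrated with a toy model. All names here are hypothetical stand-ins; the real `ColumnarBatch.close()` closes its column vectors, of which a row-only batch has none:

```scala
// Toy model: closing a batch closes each of its column vectors,
// so a zero-column batch has nothing to release.
final class FakeVector {
  var closed: Boolean = false
  def close(): Unit = closed = true
}
final class FakeBatch(val columns: Array[FakeVector], val numRows: Int) {
  def close(): Unit = columns.foreach(_.close())
}

val rowOnly = new FakeBatch(Array.empty, numRows = 42)
rowOnly.close() // no-op: there are no vectors to close
assert(rowOnly.numRows == 42)

val withCols = new FakeBatch(Array(new FakeVector), numRows = 1)
withCols.close() // here the wrapper/close discipline actually matters
assert(withCols.columns.forall(_.closed))
```

Under this model, wrapping the zero-column batches in a close-tracking iterator tracks nothing, which matches the reply's reasoning for keeping the early return simple.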
…rInternal call

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
build
sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/ParquetCachedBatchSerializer.scala
The stash pop three-way merge re-introduced exclusions for NVIDIA#14098, NVIDIA#14110, and NVIDIA#14116 that were already removed by merged PRs NVIDIA#14446, NVIDIA#14398, and NVIDIA#14400. Remove them to match origin/main.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
Summary
- Fixes a `ParquetCachedBatchSerializer` crash when a cached table is scanned with zero selected columns (e.g., cross-join side that only needs row count)
- An empty `selectedAttributes` was incorrectly treated as "select all columns", producing a full-column buffer that mismatched the broadcast exchange's empty output schema
- Fixes both the GPU path (`gpuConvertCachedBatchToColumnarBatch`) and the CPU fallback path (`convertCachedBatchToColumnarBatch`)

Test Plan

- `RapidsSQLQuerySuite` passes with 234 tests, 0 failures (buildver=330)

PR Traceability

- Suite: `RapidsSQLQuerySuite` (inherited)
- Test: `SPARK-6743: no columns from cache`
- Upstream source: `sql/core/.../SQLQuerySuite.scala`

Performance
Cold-path only change. Normal column selection (hot path) is unaffected. The zero-column edge case is now faster since it skips unnecessary parquet decoding.
Checklists
Closes #14098
🤖 Generated with Claude Code