[AutoSparkUT] Fix cached table zero-column scan crash (issue #14098)#14446

Merged
wjxiz1992 merged 3 commits into NVIDIA:main from wjxiz1992:fix/14098-no-columns-from-cache
Mar 24, 2026

Conversation

@wjxiz1992
Collaborator

Summary

  • Fix ParquetCachedBatchSerializer crash when a cached table is scanned with zero selected columns (e.g., cross-join side that only needs row count)
  • Root cause: empty selectedAttributes was incorrectly treated as "select all columns", producing a full-column buffer that mismatched the broadcast exchange's empty output schema
  • Return row-only batches when no columns are selected, fixing both the GPU path (gpuConvertCachedBatchToColumnarBatch) and CPU fallback path (convertCachedBatchToColumnarBatch)

Test Plan

  • RapidsSQLQuerySuite passes with 234 tests, 0 failures (buildver=330)
  • "SPARK-6743: no columns from cache" test now passes on GPU — exclusion removed
  • No new test regressions

PR Traceability

| RAPIDS Test | Spark Original | Spark Source | Lines |
|---|---|---|---|
| RapidsSQLQuerySuite (inherited) | SPARK-6743: no columns from cache | sql/core/.../SQLQuerySuite.scala | 129-144 |

Performance

Cold-path-only change. Normal column selection (hot path) is unaffected. The zero-column edge case is now faster since it skips unnecessary Parquet decoding.

Checklists

- [ ] This PR has added documentation for new or modified features or behaviors.
- [x] This PR has added new tests or modified existing tests to cover new code paths.
- [x] Performance testing has been performed and its results are added in the PR description.

Closes #14098

🤖 Generated with Claude Code

…o-column scans (issue NVIDIA#14098)

When a cached table is used in a cross join and one side needs zero
columns (only row count), ParquetCachedBatchSerializer incorrectly
treated empty selectedAttributes as "select all columns". This caused
a column count mismatch when the broadcast exchange deserialized the
buffer with an empty output schema.

Return row-only batches when selectedAttributes is empty instead of
falling back to all cached columns. Fixes both the GPU path
(gpuConvertCachedBatchToColumnarBatch) and the CPU fallback path
(convertCachedBatchToColumnarBatch).
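The early-return pattern described above can be sketched in miniature. The snippet below is a toy, self-contained Scala model: `ParquetCachedBatch`, `ColumnarBatch`, and `convertCachedBatch` here are simplified stand-ins, not the real Spark or spark-rapids classes, and the normal decode path is deliberately left out.

```scala
// Toy stand-ins for the real Spark / spark-rapids types.
case class ParquetCachedBatch(numRows: Int)
case class ColumnarBatch(columns: Array[AnyRef], numRows: Int)

def convertCachedBatch(
    input: Seq[ParquetCachedBatch],
    selectedAttributes: Seq[String]): Seq[ColumnarBatch] = {
  // New fast path: no columns selected means no Parquet decoding is
  // needed; emit row-only batches that match the empty output schema.
  if (selectedAttributes.isEmpty) {
    return input.map(b => ColumnarBatch(Array.empty, b.numRows))
  }
  // The real implementation decodes the cached Parquet buffers here.
  sys.error("normal decode path not modeled in this sketch")
}

val out = convertCachedBatch(
  Seq(ParquetCachedBatch(10), ParquetCachedBatch(5)), Nil)
println(out.map(_.numRows).sum)        // 15: total row count preserved
println(out.forall(_.columns.isEmpty)) // true: no columns materialized
```

The key point the sketch illustrates is that the row count survives even though no column data is ever touched, which is exactly the shape a zero-column projection expects.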

### Performance

Cold-path only change. Normal column selection (hot path) is
unaffected. The zero-column edge case is now faster since it skips
unnecessary Parquet decoding.

### Checklists

- [ ] This PR has added documentation for new or modified features or behaviors.
- [x] This PR has added new tests or modified existing tests to cover new code paths.
- [x] Performance testing has been performed and its results are added in the PR description.

Closes NVIDIA#14098

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Copilot AI review requested due to automatic review settings March 20, 2026 09:24
Keep lines that are under 85 chars on a single line instead of
splitting them across multiple lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Contributor

Copilot AI left a comment


Pull request overview

Fixes a crash in the Parquet-based cached table serializer when a cached table scan selects zero columns (e.g., count-only/cross-join side row-count usage), by returning row-only ColumnarBatches instead of incorrectly treating “no selected attributes” as “select all columns”.

Changes:

  • Add an explicit zero-selected-columns fast-path in ParquetCachedBatchSerializer for both GPU conversion and CPU conversion paths.
  • Remove the RapidsSQLQuerySuite exclusion for SPARK-6743: no columns from cache now that the bug is fixed.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala | Re-enables the Spark-derived unit test previously excluded due to the crash. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/ParquetCachedBatchSerializer.scala | Returns row-only batches on zero-column scans and avoids unnecessary Parquet decoding / schema mismatch. |


Comment on lines +469 to +479
```scala
// When no columns are selected (e.g., count-only scan or
// cross-join side that needs only row count), return
// row-only batches without decoding parquet data.
if (selectedAttributes.isEmpty) {
  return input.map {
    case parquetCB: ParquetCachedBatch =>
      new ColumnarBatch(Array.empty, parquetCB.numRows)
    case other =>
      throw new IllegalStateException(
        s"Expected ParquetCachedBatch but got ${other.getClass}")
  }
```

Copilot AI Mar 20, 2026


The zero-column fast-path mapping (selectedAttributes.isEmpty -> map ParquetCachedBatch to new ColumnarBatch(Array.empty, numRows)) is duplicated here and again in convertCachedBatchToColumnarBatch. Consider extracting a small private helper to keep behavior/exception text consistent and reduce the chance of one path diverging in future edits.

Collaborator Author


The two call sites differ in return type semantics (gpuConvertCachedBatchToColumnarBatch returns GPU-resident batches, convertCachedBatchToColumnarBatch returns host batches) so a shared helper would need to paper over that distinction. Given the logic is just new ColumnarBatch(Array.empty, numRows), the duplication is minimal and a helper would add more abstraction than it saves.
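For illustration only, the kind of helper under discussion might look like the toy sketch below. `ParquetCachedBatch` and `ColumnarBatch` are simplified stand-ins, not the real Spark types, and `toRowOnlyBatch` is a hypothetical name; the sketch mainly underscores how little shared logic such a helper would actually capture.

```scala
// Toy stand-ins for the real Spark / spark-rapids types.
case class ParquetCachedBatch(numRows: Int)
case class ColumnarBatch(columns: Array[AnyRef], numRows: Int)

// Hypothetical shared helper: the entire logic is mapping a cached
// batch to a row-only batch with the same row count.
def toRowOnlyBatch(cb: Any): ColumnarBatch = cb match {
  case p: ParquetCachedBatch => ColumnarBatch(Array.empty, p.numRows)
  case other => throw new IllegalStateException(
    s"Expected ParquetCachedBatch but got ${other.getClass}")
}

println(toRowOnlyBatch(ParquetCachedBatch(7)).numRows) // 7
```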

@greptile-apps
Contributor

greptile-apps bot commented Mar 20, 2026

Greptile Summary

This PR fixes a crash in ParquetCachedBatchSerializer when a cached table is scanned with zero selected columns (e.g., the broadcast side of a cross-join that only needs a row count). Both gpuConvertCachedBatchToColumnarBatch and convertCachedBatchToColumnarBatch previously "optimized" an empty selectedAttributes by substituting cacheAttributes (all columns), causing a full-column batch to be produced that mismatched the downstream consumer's empty schema. The fix adds an early return in both methods that directly maps each ParquetCachedBatch to ColumnarBatch(Array.empty, parquetCB.numRows), bypassing Parquet decoding entirely.

Key changes:

  • Replaces the misguided "select-all when none requested" pattern in both GPU and CPU paths with an early-exit that returns lightweight row-count-only batches
  • Correctly skips GpuSemaphore.acquireIfNecessary in the new path since no GPU resources are used
  • Removes the SPARK-6743: no columns from cache test exclusion, confirming the bug is resolved

Minor observation: The existing sizeInBytes == 0 branch inside convertCachedBatchToColumnarInternal (lines 497–504) was previously the secondary handler for zero-column cached batches, but is now unreachable via any normal code path: the only write-side producer of empty-buffer ParquetCachedBatch objects (numCols() == 0 caching branch) will also result in empty cacheAttributes, which triggers the new early-return before convertCachedBatchToColumnarInternal is ever called. This is harmless dead code but could be cleaned up in a follow-up.

Confidence Score: 5/5

  • This PR is safe to merge — the fix is minimal, well-targeted, and the normal (non-empty selection) hot path is completely unchanged.
  • Both changed code paths are correct: the early return produces the exact shape (ColumnarBatch with no columns and the right row count) that downstream consumers expect for a zero-column projection. No GPU semaphore is acquired (correct, since no GPU work is done), and existing test coverage directly validates the fix via SPARK-6743: no columns from cache. The only non-critical observation is that the sizeInBytes == 0 branch in convertCachedBatchToColumnarInternal is now dead code, which does not affect correctness.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/ParquetCachedBatchSerializer.scala | Adds zero-column early-return guards to both gpuConvertCachedBatchToColumnarBatch and convertCachedBatchToColumnarBatch; removes the old "optimize" pattern that incorrectly substituted all columns for an empty selection. A minor side effect is that the existing sizeInBytes == 0 branch in convertCachedBatchToColumnarInternal is now unreachable for the zero-column path, making it dead code. |
| tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala | Removes the SPARK-6743: no columns from cache exclusion from RapidsSQLQuerySuite, correctly reflecting that the bug is now fixed. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[convertCachedBatchToColumnarBatch\nor gpuConvertCachedBatchToColumnarBatch] --> B{selectedAttributes.isEmpty?}
    B -- Yes\n NEW path --> C[Return RDD mapping each\nParquetCachedBatch →\nColumnarBatch with Array.empty\nand parquetCB.numRows]
    B -- No\n Existing path --> D[getSupportedSchemaFromUnsupported\ncacheAttributes, selectedAttributes]
    D --> E{isSqlEnabled &&\nisSqlExecuteOnGPU &&\nisSchemaSupportedByCudf?}
    E -- Yes --> F[convertCachedBatchToColumnarInternal\nGPU decode path]
    F --> G[Copy GPU→Host columns\nwrap in CloseableColumnBatchIterator]
    E -- No --> H[CachedBatchIteratorConsumer\nCPU Parquet decode path]
    C --> I[Consumer receives\n0-column batch with\ncorrect row count ✓]
    G --> I
    H --> I
```

Last reviewed commit: "Restore original mul..."

Comment on lines +570 to 578
```scala
if (selectedAttributes.isEmpty) {
  return input.map {
    case parquetCB: ParquetCachedBatch =>
      new ColumnarBatch(Array.empty, parquetCB.numRows)
    case other =>
      throw new IllegalStateException(
        s"Expected ParquetCachedBatch but got ${other.getClass}")
  }
}
```
Contributor


P2 Inconsistency with CloseableColumnBatchIterator wrapping

The non-empty paths in convertCachedBatchToColumnarBatch both wrap their results in CloseableColumnBatchIterator (GPU path at line 592, CPU path via CachedBatchIteratorConsumer), but the new zero-column early-return does not. While this is functionally safe — ColumnarBatch(Array.empty, n) holds no closeable column vector resources — it is a structural inconsistency. Consider wrapping for uniformity:

Suggested change

```scala
if (selectedAttributes.isEmpty) {
  return input.map {
    case parquetCB: ParquetCachedBatch =>
      new ColumnarBatch(Array.empty, parquetCB.numRows)
    case other =>
      throw new IllegalStateException(
        s"Expected ParquetCachedBatch but got ${other.getClass}")
  }
}
```

would become:

```scala
// When no columns are selected, return row-only batches
if (selectedAttributes.isEmpty) {
  return input.mapPartitions { cbIter =>
    CloseableColumnBatchIterator(cbIter.map {
      case parquetCB: ParquetCachedBatch =>
        new ColumnarBatch(Array.empty, parquetCB.numRows)
      case other =>
        throw new IllegalStateException(
          s"Expected ParquetCachedBatch but got ${other.getClass}")
    })
  }
}
```

The same note applies to the analogous block in gpuConvertCachedBatchToColumnarBatch (lines 472–479).


Collaborator Author


As you noted, this is functionally safe — the empty ColumnarBatch holds no closeable resources, so wrapping it in CloseableColumnBatchIterator would be a no-op. Keeping the early return simple makes the intent clearer: no columns → no decoding, just row count.

@wjxiz1992 wjxiz1992 self-assigned this Mar 20, 2026
…rInternal call

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992
Collaborator Author

build

Collaborator

@firestarman firestarman left a comment


LGTM

@wjxiz1992 wjxiz1992 merged commit 6a9f655 into NVIDIA:main Mar 24, 2026
47 checks passed
@sameerz sameerz added the bug Something isn't working label Mar 25, 2026
wjxiz1992 added a commit to wjxiz1992/spark-rapids that referenced this pull request Mar 30, 2026
The stash pop three-way merge re-introduced exclusions for NVIDIA#14098,
NVIDIA#14110, and NVIDIA#14116 that were already removed by merged PRs NVIDIA#14446,
NVIDIA#14398, and NVIDIA#14400. Remove them to match origin/main.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor

Development

Successfully merging this pull request may close these issues.

[AutoSparkUT]"SPARK-6743: no columns from cache" in SQLQuerySuite failed

5 participants