[BUG] ParquetCachedBatchSerializer is crashing on count #8281

Closed
razajafri opened this issue May 11, 2023 · 0 comments · Fixed by #8322
Labels: bug
razajafri commented May 11, 2023

Describe the bug
ParquetCachedBatchSerializer crashes with an ArrayIndexOutOfBoundsException when calling count on a cached DataFrame that contains a timestamp column.

Steps/Code to reproduce bug
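The transcript below assumes a spark-shell session with the RAPIDS plugin and the Parquet cached batch serializer enabled. A minimal launch sketch (jar path and version are placeholders; the serializer class name is taken from the RAPIDS docs and should be verified against your release):

```shell
# Start spark-shell with the RAPIDS Accelerator and PCBS enabled.
# The jar filename/version here is an assumption, adjust to your install.
$SPARK_HOME/bin/spark-shell \
  --jars rapids-4-spark_2.12-23.04.0.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer
```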

scala> val df = Seq(1231989812323l, 21989893421l, 1989823523l, 123122312123l).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> df.selectExpr("cast(value as timestamp)").cache.count
23/05/11 15:25:30 WARN GpuOverrides: 
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec; not all expressions can be replaced
  !Expression <AttributeReference> value#5 cannot run on GPU because expression AttributeReference value#5 produces an unsupported type TimestampType

23/05/11 15:25:30 WARN GpuOverrides: 
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> count(1) will run on GPU
    *Expression <Count> count(1) will run on GPU
  *Expression <Alias> count(1)#13L AS count#14L will run on GPU
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    *Exec <HashAggregateExec> will run on GPU
      *Expression <AggregateExpression> partial_count(1) will run on GPU
        *Expression <Count> count(1) will run on GPU
      !Exec <InMemoryTableScanExec> cannot run on GPU because unsupported data types in output: TimestampType [value]


23/05/11 15:25:30 WARN GpuOverrides: 
*Exec <ShuffleExchangeExec> will run on GPU
  *Partitioning <SinglePartition$> will run on GPU
  *Exec <HashAggregateExec> will run on GPU
    *Expression <AggregateExpression> partial_count(1) will run on GPU
      *Expression <Count> count(1) will run on GPU
    !Exec <InMemoryTableScanExec> cannot run on GPU because unsupported data types in output: TimestampType [value]

23/05/11 15:25:31 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ArrayIndexOutOfBoundsException: 0
	at com.nvidia.spark.rapids.GpuColumnVector$GpuColumnarBatchBuilder.builder(GpuColumnVector.java:409)
	at com.nvidia.spark.rapids.GpuColumnVector$GpuColumnarBatchBuilder.copyColumnar(GpuColumnVector.java:405)
	at com.nvidia.spark.rapids.HostToGpuCoalesceIterator.$anonfun$addBatchToConcat$2(HostColumnarToGpu.scala:259)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
	at com.nvidia.spark.rapids.HostToGpuCoalesceIterator.$anonfun$addBatchToConcat$1(HostColumnarToGpu.scala:258)
	at com.nvidia.spark.rapids.HostToGpuCoalesceIterator.$anonfun$addBatchToConcat$1$adapted(HostColumnarToGpu.scala:256)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.HostToGpuCoalesceIterator.addBatchToConcat(HostColumnarToGpu.scala:256)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.addBatch(GpuCoalesceBatches.scala:590)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.populateCandidateBatches(GpuCoalesceBatches.scala:414)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$next$1(GpuCoalesceBatches.scala:549)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.next(GpuCoalesceBatches.scala:529)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.next(GpuCoalesceBatches.scala:248)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.aggregateInputBatches(aggregate.scala:604)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:556)
	at scala.Option.getOrElse(Option.scala:189)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:553)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:498)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:318)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:340)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Expected behavior
The query should return the row count (4 in this example) instead of crashing.

Additional context
Tested on Spark 3.3.2 and Spark 3.4.0

@razajafri razajafri added bug Something isn't working ? - Needs Triage Need team to review and classify labels May 11, 2023
@razajafri razajafri self-assigned this May 11, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label May 16, 2023