[BUG] ParquetCachedBatchSerializer is crashing on count #8281

Closed
razajafri opened this issue May 11, 2023 · 0 comments · Fixed by #8322
Labels: bug
razajafri commented May 11, 2023

Describe the bug
ParquetCachedBatchSerializer crashes with an ArrayIndexOutOfBoundsException when calling count on a cached DataFrame that contains a timestamp column.

Steps/Code to reproduce bug
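The transcript below assumes a spark-shell session with the RAPIDS plugin and the Parquet cached batch serializer enabled. A minimal launch sketch (jar path and version are placeholders; the serializer class name is taken from the RAPIDS docs and should be verified against your release):

```shell
# Start spark-shell with the RAPIDS Accelerator and PCBS enabled.
# The jar filename/version here is an assumption, adjust to your install.
$SPARK_HOME/bin/spark-shell \
  --jars rapids-4-spark_2.12-23.04.0.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer
```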

scala> val df = Seq(1231989812323l, 21989893421l, 1989823523l, 123122312123l).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> df.selectExpr("cast(value as timestamp)").cache.count
23/05/11 15:25:30 WARN GpuOverrides: 
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec; not all expressions can be replaced
  !Expression <AttributeReference> value#5 cannot run on GPU because expression AttributeReference value#5 produces an unsupported type TimestampType

23/05/11 15:25:30 WARN GpuOverrides: 
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> count(1) will run on GPU
    *Expression <Count> count(1) will run on GPU
  *Expression <Alias> count(1)#13L AS count#14L will run on GPU
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    *Exec <HashAggregateExec> will run on GPU
      *Expression <AggregateExpression> partial_count(1) will run on GPU
        *Expression <Count> count(1) will run on GPU
      !Exec <InMemoryTableScanExec> cannot run on GPU because unsupported data types in output: TimestampType [value]


23/05/11 15:25:30 WARN GpuOverrides: 
*Exec <ShuffleExchangeExec> will run on GPU
  *Partitioning <SinglePartition$> will run on GPU
  *Exec <HashAggregateExec> will run on GPU
    *Expression <AggregateExpression> partial_count(1) will run on GPU
      *Expression <Count> count(1) will run on GPU
    !Exec <InMemoryTableScanExec> cannot run on GPU because unsupported data types in output: TimestampType [value]

23/05/11 15:25:31 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ArrayIndexOutOfBoundsException: 0
	at com.nvidia.spark.rapids.GpuColumnVector$GpuColumnarBatchBuilder.builder(GpuColumnVector.java:409)
	at com.nvidia.spark.rapids.GpuColumnVector$GpuColumnarBatchBuilder.copyColumnar(GpuColumnVector.java:405)
	at com.nvidia.spark.rapids.HostToGpuCoalesceIterator.$anonfun$addBatchToConcat$2(HostColumnarToGpu.scala:259)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
	at com.nvidia.spark.rapids.HostToGpuCoalesceIterator.$anonfun$addBatchToConcat$1(HostColumnarToGpu.scala:258)
	at com.nvidia.spark.rapids.HostToGpuCoalesceIterator.$anonfun$addBatchToConcat$1$adapted(HostColumnarToGpu.scala:256)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.HostToGpuCoalesceIterator.addBatchToConcat(HostColumnarToGpu.scala:256)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.addBatch(GpuCoalesceBatches.scala:590)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.populateCandidateBatches(GpuCoalesceBatches.scala:414)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$next$1(GpuCoalesceBatches.scala:549)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.next(GpuCoalesceBatches.scala:529)
	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.next(GpuCoalesceBatches.scala:248)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.aggregateInputBatches(aggregate.scala:604)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:556)
	at scala.Option.getOrElse(Option.scala:189)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:553)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:498)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:318)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:340)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Expected behavior
The query should return the row count (4 in this example) instead of crashing.

Additional context
Tested on Spark 3.3.2 and Spark 3.4.0

@razajafri razajafri added bug Something isn't working ? - Needs Triage Need team to review and classify labels May 11, 2023
@razajafri razajafri self-assigned this May 11, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label May 16, 2023