
Enable retry for Parquet writes [databricks] #8243

Merged · 8 commits into NVIDIA:branch-23.06 on May 18, 2023

Conversation

@andygrove (Contributor) commented May 8, 2023:

Closes #8028

The first commit enables retries for Parquet writes (it just changes a boolean parameter). This caused test regressions because we were sometimes casting timestamp columns to a cuDF type that Spark does not support, which caused issues when spilling batches. The casts are now performed after batches are spilled.
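As a rough illustration of the ordering change described above, here is a minimal sketch, not the plugin's actual code: `makeSpillable`, `withRetry`, `castTimestampsForWrite`, and `writeTable` are hypothetical stand-ins for the plugin's spill/retry machinery.

```scala
import org.apache.spark.sql.vectorized.ColumnarBatch

object RetryWriteSketch {
  // Hypothetical stand-ins for the plugin's spill/retry machinery.
  def makeSpillable(batch: ColumnarBatch): ColumnarBatch = ???
  def withRetry[T](input: ColumnarBatch)(body: ColumnarBatch => T): T = ???
  def castTimestampsForWrite(batch: ColumnarBatch): ColumnarBatch = ???
  def writeTable(batch: ColumnarBatch): Unit = ???

  // The batch is registered for spilling while it still holds Spark-supported
  // types; the cuDF-specific timestamp cast happens inside the retry body,
  // after any spill/restore, so spilled data never carries an unsupported type.
  def writeWithRetry(batch: ColumnarBatch): Unit = {
    val spillable = makeSpillable(batch)
    withRetry(spillable) { attempt =>
      val casted = castTimestampsForWrite(attempt)
      try writeTable(casted) finally casted.close()
    }
  }
}
```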

@andygrove andygrove self-assigned this May 8, 2023
@andygrove andygrove force-pushed the parquet-write-retry branch from da4cfd6 to ffa2e91 Compare May 8, 2023 20:48
@sameerz sameerz added the reliability label (Features to improve reliability or bugs that severely impact the reliability of the plugin) May 8, 2023
@andygrove andygrove marked this pull request as ready for review May 9, 2023 16:26
@andygrove (Contributor, Author) commented:

build

@revans2 (Collaborator) previously approved these changes May 9, 2023 and left a comment:

Please file a follow-on issue. I don't like the fact that we are creating a ColumnarBatch in GpuParquetFileFormat.transform that would throw assertions if we were not calling the constructor directly. Ideally the assertions would happen in the constructor, not just in the helper methods, and we would have a way to use a Table instead of a ColumnarBatch after the transform.
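A minimal sketch of the suggestion, assuming a hypothetical class and type check (the real types live in the plugin): validating in the constructor means no construction path, including direct `new` calls like the one in `GpuParquetFileFormat.transform`, can bypass the check.

```scala
import org.apache.spark.sql.types.DataType

object ConstructorAssertionSketch {
  // Hypothetical stand-in for whatever check the helper methods perform today.
  def isSupportedBySpark(dt: DataType): Boolean = ???

  // Validating in the constructor covers every construction path, including
  // direct instantiation, instead of only the helper-method path.
  class ValidatedBatch(types: Seq[DataType], numRows: Int) {
    require(types.forall(isSupportedBySpark),
      "batch contains a column type that Spark does not support")
  }
}
```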

@andygrove (Contributor, Author) commented:

build

@pxLi (Member) commented May 10, 2023:

CI failed #8235

@sameerz (Collaborator) commented May 11, 2023:

build

@revans2 (Collaborator) commented May 12, 2023:

The latest CI failures do not appear to be related to #8235.

2023-05-12T01:18:19.7703936Z [2023-05-12T00:35:43.384Z] [gw2] [ 32%] FAILED ../../src/main/python/hive_delimited_text_test.py::test_basic_hive_text_write[hive-delim-text/simple-boolean-values-StructType(List(StructField(number,BooleanType,true)))-{}-TableWriteMode.CTAS][IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] 23/05/12 00:35:43 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider csv. Persisting data source table `default`.`tmp_table_gw4_1415522135_0` into Hive metastore in **** SQL specific format, which is NOT compatible with Hive.

This shows up for a lot of different test failures in ORC, HIVE, and others.

@andygrove (Contributor, Author) replied:

> This shows up for a lot of different test failures in ORC, HIVE, and others.

Yes, I see this locally too. The root cause is a "Close called too many times" error. I am looking into it.

@andygrove (Contributor, Author) commented:

@revans2 @abellina I pushed a commit to resolve the double close issue (f90d375), but I am not confident that this is correct. Could you take a look?
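For context, a double close typically means two owners both believe they must close the same resource. A minimal, hypothetical sketch of one defensive pattern (not the commit's actual fix):

```scala
import java.util.concurrent.atomic.AtomicBoolean

// A wrapper that makes close() idempotent: if both the retry machinery and a
// withResource-style block try to close the same resource, the second call
// becomes a no-op instead of throwing.
class CloseOnce(underlying: AutoCloseable) extends AutoCloseable {
  private val closed = new AtomicBoolean(false)
  override def close(): Unit =
    if (closed.compareAndSet(false, true)) underlying.close()
}
```

With ref-counted cuDF columns, the more idiomatic fix is usually to bump the reference count (`incRefCount`) before handing a column to a second owner, so that each owner's close is balanced.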

@andygrove (Contributor, Author) commented:

build

@andygrove (Contributor, Author) commented:

build

@andygrove (Contributor, Author) commented:

build

@revans2 (Collaborator) previously approved these changes May 17, 2023
@andygrove (Contributor, Author) commented:

The blossom build is failing with:

2023-05-16T20:44:03.4035745Z ******** JOB LOGS with sensitive data redacted ************** 
2023-05-16T20:44:03.4036050Z 
2023-05-16T20:44:03.4036140Z {
2023-05-16T20:44:03.4036452Z   "errors" : [ {
2023-05-16T20:44:03.4036756Z     "status" : 404,
2023-05-16T20:44:03.4037091Z     "message" : "File not found."
2023-05-16T20:44:03.4037484Z   } ]
2023-05-16T20:44:03.4037761Z }
2023-05-16T20:44:04.9285018Z Cleaning up orphan processes

@abellina (Collaborator) replied:

> The blossom build is failing with: [job log quoted above]

I think that's because of recent link changes in CI. I'll send you a direct message, but here's what I see:

[2023-05-16T20:05:47.876Z] java.lang.IllegalStateException: Close called too many times ColumnVector{rows=512, type=LIST, nullCount=Optional.empty, offHeap=(ID: 8889240 0)}
[2023-05-16T20:05:47.876Z]  at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:269)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.GpuColumnVector.close(GpuColumnVector.java:1124)
[2023-05-16T20:05:47.876Z]  at org.apache.spark.sql.vectorized.ColumnarBatch.close(ColumnarBatch.java:48)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:56)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatchWithRetry$1(ColumnarOutputWriter.scala:172)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatchWithRetry$1$adapted(ColumnarOutputWriter.scala:166)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:424)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:528)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:458)
[2023-05-16T20:05:47.876Z]  at scala.collection.Iterator.foreach(Iterator.scala:941)
[2023-05-16T20:05:47.876Z]  at scala.collection.Iterator.foreach$(Iterator.scala:941)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.foreach(RmmRapidsRetryIterator.scala:477)
[2023-05-16T20:05:47.876Z]  at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:162)
[2023-05-16T20:05:47.876Z]  at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:160)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.foldLeft(RmmRapidsRetryIterator.scala:477)
[2023-05-16T20:05:47.876Z]  at scala.collection.TraversableOnce.sum(TraversableOnce.scala:221)
[2023-05-16T20:05:47.876Z]  at scala.collection.TraversableOnce.sum$(TraversableOnce.scala:221)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.sum(RmmRapidsRetryIterator.scala:477)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.ColumnarOutputWriter.writeBatchWithRetry(ColumnarOutputWriter.scala:193)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.ColumnarOutputWriter.writeBatch(ColumnarOutputWriter.scala:153)
[2023-05-16T20:05:47.876Z]  at com.nvidia.spark.rapids.ColumnarOutputWriter.writeAndClose(ColumnarOutputWriter.scala:115)
[2023-05-16T20:05:47.877Z]  at org.apache.spark.sql.rapids.GpuSingleDirectoryDataWriter.write(GpuFileFormatDataWriter.scala:173)
[2023-05-16T20:05:47.877Z]  at org.apache.spark.sql.rapids.GpuFileFormatDataWriter.writeWithIterator(GpuFileFormatDataWriter.scala:87)
[2023-05-16T20:05:47.877Z]  at org.apache.spark.sql.rapids.GpuFileFormatWriter$.$anonfun$executeTask$1(GpuFileFormatWriter.scala:338)
[2023-05-16T20:05:47.877Z]  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
[2023-05-16T20:05:47.877Z]  at org.apache.spark.sql.rapids.GpuFileFormatWriter$.executeTask(GpuFileFormatWriter.scala:345)
[2023-05-16T20:05:47.877Z]  at org.apache.spark.sql.rapids.GpuFileFormatWriter$.$anonfun$write$15(GpuFileFormatWriter.scala:264)
[2023-05-16T20:05:47.877Z]  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2023-05-16T20:05:47.877Z]  at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2023-05-16T20:05:47.877Z]  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
[2023-05-16T20:05:47.877Z]  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
[2023-05-16T20:05:47.877Z]  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
[2023-05-16T20:05:47.877Z]  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2023-05-16T20:05:47.877Z]  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

@andygrove (Contributor, Author) commented:

build

@andygrove andygrove merged commit df54d4a into NVIDIA:branch-23.06 May 18, 2023
@andygrove andygrove deleted the parquet-write-retry branch May 18, 2023 13:19
Labels: reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

Linked issue that merging this pull request may close: [FEA] Retry/SplitAndRetry on Parquet Writes (#8028)