Conversation

@JoshRosen
Contributor

This is a branch-2.0 backport of #15245.

What changes were proposed in this pull request?

This patch addresses a potential cause of resource leaks in data source file scans. As reported in SPARK-17666, tasks that do not fully consume their input may leak file handles and network connections (e.g. S3 connections). Spark's NewHadoopRDD uses a TaskContext callback to close its record readers, but the new data source file scans only close record readers once their iterators are fully consumed.
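
One way to picture the leak (a hypothetical query; the session name and path are illustrative, not from the patch): any operator that stops pulling rows early, such as a LIMIT, leaves the scan's iterator partially consumed, so a reader that is closed only at end-of-iteration never gets closed:

```scala
// Hypothetical illustration (path and table are made up): limit(10) stops
// consuming each file-scan iterator after a handful of rows, so a record
// reader that is closed only when its iterator is exhausted stays open
// until the connection times out or the JVM exits.
val firstRows = spark.read
  .parquet("s3a://some-bucket/some-table")
  .limit(10)
  .collect()
```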

This patch modifies RecordReaderIterator and HadoopFileLinesReader to add close() methods and modifies all six implementations of FileFormat.buildReader() to register TaskContext task completion callbacks to guarantee that cleanup is eventually performed.
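
The pattern described above can be sketched roughly as follows. This is a simplified sketch of the idea, not the actual patched Spark source; the internal state handling in the real `RecordReaderIterator` differs:

```scala
import java.io.Closeable
import org.apache.hadoop.mapreduce.RecordReader
import org.apache.spark.TaskContext

// Sketch: an iterator wrapping a Hadoop RecordReader that exposes close(),
// so cleanup can run even when the iterator is never fully consumed.
class RecordReaderIterator[T](rowReader: RecordReader[_, T])
  extends Iterator[T] with Closeable {

  private var havePair = false
  private var finished = false

  override def hasNext: Boolean = {
    if (!finished && !havePair) {
      finished = !rowReader.nextKeyValue()
      if (finished) {
        // Eagerly release resources once the input is exhausted.
        close()
      }
      havePair = !finished
    }
    !finished
  }

  override def next(): T = {
    if (!hasNext) throw new java.util.NoSuchElementException("End of stream")
    havePair = false
    rowReader.getCurrentValue
  }

  override def close(): Unit = rowReader.close()
}

// Each FileFormat.buildReader() implementation then registers cleanup with
// the task, so close() also runs when the task ends with the iterator only
// partially consumed:
//
//   val iter = new RecordReaderIterator(reader)
//   Option(TaskContext.get())
//     .foreach(_.addTaskCompletionListener(_ => iter.close()))
```

Registering the listener via `TaskContext` makes the cleanup unconditional: it fires whether the task finishes normally, fails, or is killed, mirroring how `NewHadoopRDD` already closes its readers.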

How was this patch tested?

Tested manually for now.

@SparkQA

SparkQA commented Sep 28, 2016

Test build #66009 has finished for PR 15271 at commit c0621db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HadoopFileLinesReader(
    • class RecordReaderIterator[T](

@rxin
Contributor

rxin commented Sep 28, 2016

Merging in branch-2.0. Can you close the pr?

asfgit pushed a commit that referenced this pull request Sep 28, 2016
…e scans (backport)

Author: Josh Rosen <[email protected]>

Closes #15271 from JoshRosen/SPARK-17666-backport.
@JoshRosen JoshRosen closed this Sep 28, 2016
@JoshRosen JoshRosen deleted the SPARK-17666-backport branch September 28, 2016 18:21