Conversation

@JoshRosen
Contributor

This is a branch-2.0 backport of #15245.

What changes were proposed in this pull request?

This patch addresses a potential cause of resource leaks in data source file scans. As reported in SPARK-17666, tasks that do not fully consume their input may leak file handles and network connections (e.g. S3 connections). Spark's NewHadoopRDD uses a TaskContext callback to close its record readers, but the new data source file scans only close record readers once their iterators are fully consumed.
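
One way to picture the leak (a hypothetical query; the session name and path are illustrative, not from the patch): any operator that stops pulling rows early, such as a LIMIT, leaves the scan's iterator partially consumed, so a reader that is closed only at end-of-iteration never gets closed:

```scala
// Hypothetical illustration (path and table are made up): limit(10) stops
// consuming each file-scan iterator after a handful of rows, so a record
// reader that is closed only when its iterator is exhausted stays open
// until the connection times out or the JVM exits.
val firstRows = spark.read
  .parquet("s3a://some-bucket/some-table")
  .limit(10)
  .collect()
```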

This patch modifies RecordReaderIterator and HadoopFileLinesReader to add close() methods and modifies all six implementations of FileFormat.buildReader() to register TaskContext task completion callbacks to guarantee that cleanup is eventually performed.
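
The pattern described above can be sketched roughly as follows. This is a simplified sketch of the idea, not the actual patched Spark source; the internal state handling in the real `RecordReaderIterator` differs:

```scala
import java.io.Closeable
import org.apache.hadoop.mapreduce.RecordReader
import org.apache.spark.TaskContext

// Sketch: an iterator wrapping a Hadoop RecordReader that exposes close(),
// so cleanup can run even when the iterator is never fully consumed.
class RecordReaderIterator[T](rowReader: RecordReader[_, T])
  extends Iterator[T] with Closeable {

  private var havePair = false
  private var finished = false

  override def hasNext: Boolean = {
    if (!finished && !havePair) {
      finished = !rowReader.nextKeyValue()
      if (finished) {
        // Eagerly release resources once the input is exhausted.
        close()
      }
      havePair = !finished
    }
    !finished
  }

  override def next(): T = {
    if (!hasNext) throw new java.util.NoSuchElementException("End of stream")
    havePair = false
    rowReader.getCurrentValue
  }

  override def close(): Unit = rowReader.close()
}

// Each FileFormat.buildReader() implementation then registers cleanup with
// the task, so close() also runs when the task ends with the iterator only
// partially consumed:
//
//   val iter = new RecordReaderIterator(reader)
//   Option(TaskContext.get())
//     .foreach(_.addTaskCompletionListener(_ => iter.close()))
```

Registering the listener via `TaskContext` makes the cleanup unconditional: it fires whether the task finishes normally, fails, or is killed, mirroring how `NewHadoopRDD` already closes its readers.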

How was this patch tested?

Tested manually for now.

@SparkQA

SparkQA commented Sep 28, 2016

Test build #66009 has finished for PR 15271 at commit c0621db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HadoopFileLinesReader(
    • class RecordReaderIterator[T](

@rxin
Contributor

rxin commented Sep 28, 2016

Merging in branch-2.0. Can you close the pr?

asfgit pushed a commit that referenced this pull request Sep 28, 2016
…e scans (backport)

Author: Josh Rosen <[email protected]>

Closes #15271 from JoshRosen/SPARK-17666-backport.
@JoshRosen JoshRosen closed this Sep 28, 2016
@JoshRosen JoshRosen deleted the SPARK-17666-backport branch September 28, 2016 18:21