
Conversation

@iRakson
Contributor

@iRakson iRakson commented May 27, 2020

What changes were proposed in this pull request?

Adds an inputFiles() method to PySpark DataFrame. With it, PySpark users can list all files constituting a DataFrame.

Before changes:

>>> spark.read.load("examples/src/main/resources/people.json", format="json").inputFiles()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/***/***/spark/python/pyspark/sql/dataframe.py", line 1388, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'inputFiles'

After changes:

>>> spark.read.load("examples/src/main/resources/people.json", format="json").inputFiles()
[u'file:///***/***/spark/examples/src/main/resources/people.json']
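
As an illustrative end-to-end sketch (not taken from the PR itself), the new method can be exercised by writing a small dataset and reading it back; the temporary path, partition count, and the expected file count below are assumptions for illustration, not output from this PR:

import shutil
import tempfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a small dataset as three JSON part files, then read it back.
path = tempfile.mkdtemp()
shutil.rmtree(path)  # remove the empty dir so the writer can create the path itself
spark.range(100).repartition(3).write.json(path)

df = spark.read.json(path)
files = df.inputFiles()
print(len(files))  # typically 3, one URI per part file
print(files[0])    # e.g. a 'file:///...' URI ending in '.json'

shutil.rmtree(path)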

Why are the changes needed?

This method is already supported in the Scala and Java DataFrame APIs; this change brings PySpark to parity.

Does this PR introduce any user-facing change?

Yes. Users can now list all files constituting a DataFrame using inputFiles().

How was this patch tested?

A unit test has been added.

@iRakson
Contributor Author

iRakson commented May 27, 2020

cc @HyukjinKwon, kindly review this PR.

        self.spark.range(10).sameSemantics(1)

    def test_input_files(self):
        tmpPath = tempfile.mkdtemp()
Member

Let's do a try-finally, and use this_naming_rule (snake_case) per PEP 8.

tpath = tempfile.mkdtemp()
shutil.rmtree(tpath)  # drop the empty dir so the writer can create the path itself
try:
    ...
finally:
    shutil.rmtree(tpath)
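
A minimal sketch of how the test could look with that pattern, assuming the PySpark SQL test base class (ReusedSQLTestCase) and snake_case names; the data written and the assertions are illustrative, not the PR's final test:

import shutil
import tempfile

from pyspark.testing.sqlutils import ReusedSQLTestCase  # assumed test base class


class InputFilesTests(ReusedSQLTestCase):
    def test_input_files(self):
        tpath = tempfile.mkdtemp()
        shutil.rmtree(tpath)  # let the JSON writer create the directory itself
        try:
            self.spark.range(10).write.json(tpath)
            input_files = self.spark.read.json(tpath).inputFiles()
            self.assertTrue(len(input_files) > 0)
            for input_file in input_files:
                self.assertTrue(tpath in input_file)
        finally:
            shutil.rmtree(tpath)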

        >>> len(df.inputFiles())
        1
        """
        return [f for f in self._jdf.inputFiles()]
Member

You can just return list(self._jdf.inputFiles())
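
For context, a sketch of how the wrapper might read with that simplification; _jdf.inputFiles() returns a Py4J-proxied Java string array, and list(...) materializes it into a plain Python list. The docstring below is illustrative, not the PR's exact text:

# Inside pyspark.sql.dataframe.DataFrame
def inputFiles(self):
    """
    Returns a best-effort snapshot of the files that compose this DataFrame.

    >>> df = spark.read.load("examples/src/main/resources/people.json", format="json")
    >>> len(df.inputFiles())
    1
    """
    return list(self._jdf.inputFiles())

Using list(...) avoids an explicit Python-level loop over the Py4J proxy and makes the intent, converting the Java array into a Python list, obvious at a glance.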

@HyukjinKwon
Member

ok to test

@iRakson iRakson requested a review from HyukjinKwon May 27, 2020 10:00
@SparkQA

SparkQA commented May 27, 2020

Test build #123181 has finished for PR 28652 at commit 6df9b76.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson
Contributor Author

iRakson commented May 27, 2020

retest this please

@SparkQA

SparkQA commented May 27, 2020

Test build #123183 has finished for PR 28652 at commit 729ff00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@iRakson
Contributor Author

iRakson commented May 28, 2020

Thank you, @HyukjinKwon.
Would it be a good idea to add this method to 3.0 as well, since it does not introduce any major change?
Also, inputFiles() has been available in Scala since 2.0.

@HyukjinKwon
Member

Technically this is a new API and we shouldn't backport it.
