
Conversation

@iRakson
Contributor

@iRakson iRakson commented May 27, 2020

What changes were proposed in this pull request?

Adds an inputFiles() method to PySpark DataFrame. With it, PySpark users can list all files constituting a DataFrame.

Before changes:

>>> spark.read.load("examples/src/main/resources/people.json", format="json").inputFiles()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/***/***/spark/python/pyspark/sql/dataframe.py", line 1388, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'inputFiles'

After changes:

>>> spark.read.load("examples/src/main/resources/people.json", format="json").inputFiles()
[u'file:///***/***/spark/examples/src/main/resources/people.json']
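
As an illustrative end-to-end sketch (not taken from the PR itself), the new method can be exercised by writing a small dataset and reading it back; the temporary path, partition count, and the expected file count below are assumptions for illustration, not output from this PR:

import shutil
import tempfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a small dataset as three JSON part files, then read it back.
path = tempfile.mkdtemp()
shutil.rmtree(path)  # remove the empty dir so the writer can create the path itself
spark.range(100).repartition(3).write.json(path)

df = spark.read.json(path)
files = df.inputFiles()
print(len(files))  # typically 3, one URI per part file
print(files[0])    # e.g. a 'file:///...' URI ending in '.json'

shutil.rmtree(path)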

Why are the changes needed?

This method is already supported in the Scala and Java DataFrame APIs; this change brings PySpark to parity.

Does this PR introduce any user-facing change?

Yes. Users can now list all files constituting a DataFrame using inputFiles().

How was this patch tested?

A unit test has been added.

@iRakson
Contributor Author

iRakson commented May 27, 2020

cc @HyukjinKwon, kindly review this PR.

        self.spark.range(10).sameSemantics(1)

    def test_input_files(self):
        tmpPath = tempfile.mkdtemp()
Member

Let's do a try-finally, and use this_naming_rule (snake_case) per PEP 8.

tpath = tempfile.mkdtemp()
shutil.rmtree(tpath)  # drop the empty dir so the writer can create the path itself
try:
    ...
finally:
    shutil.rmtree(tpath)
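
A minimal sketch of how the test could look with that pattern, assuming the PySpark SQL test base class (ReusedSQLTestCase) and snake_case names; the data written and the assertions are illustrative, not the PR's final test:

import shutil
import tempfile

from pyspark.testing.sqlutils import ReusedSQLTestCase  # assumed test base class


class InputFilesTests(ReusedSQLTestCase):
    def test_input_files(self):
        tpath = tempfile.mkdtemp()
        shutil.rmtree(tpath)  # let the JSON writer create the directory itself
        try:
            self.spark.range(10).write.json(tpath)
            input_files = self.spark.read.json(tpath).inputFiles()
            self.assertTrue(len(input_files) > 0)
            for input_file in input_files:
                self.assertTrue(tpath in input_file)
        finally:
            shutil.rmtree(tpath)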

        >>> len(df.inputFiles())
        1
        """
        return [f for f in self._jdf.inputFiles()]
Member

You can just return list(self._jdf.inputFiles())
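
For context, a sketch of how the wrapper might read with that simplification; _jdf.inputFiles() returns a Py4J-proxied Java string array, and list(...) materializes it into a plain Python list. The docstring below is illustrative, not the PR's exact text:

# Inside pyspark.sql.dataframe.DataFrame
def inputFiles(self):
    """
    Returns a best-effort snapshot of the files that compose this DataFrame.

    >>> df = spark.read.load("examples/src/main/resources/people.json", format="json")
    >>> len(df.inputFiles())
    1
    """
    return list(self._jdf.inputFiles())

Using list(...) avoids an explicit Python-level loop over the Py4J proxy and makes the intent, converting the Java array into a Python list, obvious at a glance.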

@HyukjinKwon
Member

ok to test

@iRakson iRakson requested a review from HyukjinKwon May 27, 2020 10:00
@SparkQA

SparkQA commented May 27, 2020

Test build #123181 has finished for PR 28652 at commit 6df9b76.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson
Contributor Author

iRakson commented May 27, 2020

retest this please

@SparkQA

SparkQA commented May 27, 2020

Test build #123183 has finished for PR 28652 at commit 729ff00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@iRakson
Contributor Author

iRakson commented May 28, 2020

Thank you, @HyukjinKwon.
Would it be a good idea to add this method to 3.0 as well, since it does not introduce any major change?
Also, inputFiles() has been available in Scala since 2.0.

@HyukjinKwon
Member

Technically this is a new API and we shouldn't backport it.
