[SPARK-31763][PYSPARK] Add inputFiles method in PySpark DataFrame Class.
#28652
Conversation
cc @HyukjinKwon Kindly review this PR
        self.spark.range(10).sameSemantics(1)

    def test_input_files(self):
        tmpPath = tempfile.mkdtemp()
Let's do a try-finally, and use this_naming_rule per PEP 8.
import shutil
import tempfile

tpath = tempfile.mkdtemp()
shutil.rmtree(tpath)  # remove the created dir right away; only the fresh path is reused below
try:
    ...
finally:
    shutil.rmtree(tpath)  # clean up whatever the test wrote
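For context, here is a minimal standalone sketch of the test this pattern produces; the session setup, partition count, and exact assertions are illustrative assumptions rather than the PR's verbatim test:

import shutil
import tempfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

tpath = tempfile.mkdtemp()
shutil.rmtree(tpath)  # reserve a fresh path; the writer below re-creates it
try:
    # Write 4 partitions so inputFiles() has several part files to report.
    spark.range(10).repartition(4).write.json(tpath)
    input_files = spark.read.json(tpath).inputFiles()
    assert len(input_files) == 4  # one part file per partition in this setup
    assert all(tpath in f for f in input_files)
finally:
    shutil.rmtree(tpath)
    spark.stop()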
python/pyspark/sql/dataframe.py (outdated)
        >>> len(df.inputFiles())
        1
        """
        return [f for f in self._jdf.inputFiles()]
You can just return list(self._jdf.inputFiles())
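For context, a sketch of how the simplified method could look in dataframe.py once this suggestion is applied (docstring abbreviated; not a verbatim copy of the merged code):

def inputFiles(self):
    """
    Returns a best-effort snapshot of the files that compose this DataFrame.

    >>> len(df.inputFiles())
    1
    """
    # _jdf.inputFiles() hands back a py4j proxy for the Java String[];
    # list() materializes it into a plain Python list in one step,
    # replacing the original list comprehension.
    return list(self._jdf.inputFiles())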
ok to test

Test build #123181 has finished for PR 28652 at commit

retest this please

Test build #123183 has finished for PR 28652 at commit

Merged to master.

Thank you, @HyukjinKwon!

Technically this is a new API and we shouldn't backport it.
What changes were proposed in this pull request?
Adds an inputFiles() method to the PySpark DataFrame class. Using this, PySpark users can list all files constituting a DataFrame.

Before and after the change:
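The original before/after snippets did not survive extraction; the following is a hedged reconstruction of the user-visible difference (the input path is PySpark's sample test data, used here illustratively):

>>> df = spark.read.json("python/test_support/sql/people.json")

>>> # Before: the attribute simply does not exist on the Python DataFrame.
>>> df.inputFiles()
Traceback (most recent call last):
  ...
AttributeError: 'DataFrame' object has no attribute 'inputFiles'

>>> # After: a list of fully qualified URIs of the files backing df.
>>> df.inputFiles()
['file:///.../python/test_support/sql/people.json']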
Why are the changes needed?
This method is already supported in Spark's Scala and Java APIs.
Does this PR introduce any user-facing change?
Yes. Users can now list all files of a DataFrame using inputFiles().

How was this patch tested?
A unit test was added.