[SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation #14567
Conversation
|
Although there has generally been some resistance to large style-only changes, we do enforce import order in Scala/Java including checks. So it seems pretty reasonable to do the same in one big go for Python. |
python/pyspark/context.py
Outdated
Expanding these multiple imports seems counter-productive. We don't do it in Scala (and in Java you can only import one thing or everything). Is this important/canonical for PEP8?
Indeed, this is a deviation from the PEP 8 recommendation. I encourage this behavior because it greatly simplifies file maintenance over a file's lifetime: once isort is set up, developers no longer have any latitude in where they place "import" statements.
On our Buildbot-based project, we used to have a lot of conflicts involving changes to import statements. On two different branches (say, prod and main), we often ended up either with the same import added twice (the merge might add the line twice when the two developers placed it in two different places) or with a conflict that was not so easy to resolve (when two developers added imports of two different objects from the same module).
Once we set up the "one import per line" rule, we had no more conflicts on these lines, ever. This helped us a lot in automating the merges from the release branches to the "master" branch.
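To make the "one import per line" rule concrete, here is a minimal sketch in plain Python. It is a toy illustration only, not isort's implementation: the helper name `split_from_import` is hypothetical, and it ignores parenthesised multi-line imports, aliases spread over lines, and trailing comments.

```python
def split_from_import(line):
    """Expand 'from m import a, b' into one statement per imported name.

    Hypothetical helper for illustration; isort handles many more cases.
    """
    head, sep, names = line.strip().partition(" import ")
    if not sep or not head.startswith("from "):
        # Plain 'import x' lines (and anything else) are returned unchanged.
        return [line.strip()]
    # Sorting gives every developer the same placement for a new name.
    parts = sorted(name.strip() for name in names.strip("() ").split(","))
    return ["{} import {}".format(head, name) for name in parts if name]


for statement in split_from_import("from pyspark.sql import Row, DataFrame, Column"):
    print(statement)
```

With each name on its own line, two branches that each add a different name from the same module touch different lines, which is the situation described above.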
Yeah, we get that in the Scala code for sure, although you get conflicts even if you merely change adjacent lines, so it only saves so much conflict. It's a decent open question; I wonder what others think. The downside is the extra duplication of import boilerplate.
As you can see at https://pypi.python.org/pypi/isort, there are several ways to format these multiple-import statements. At the very least, I recommend enforcing a sort order on these multi-import lines so there is no ambiguity about where to place any "import" (and isort will correct the developer's change).
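The fallback option (keep several names on one line, but in an enforced order) could look roughly like the sketch below. The helper `sort_import_names` is hypothetical, and the case-insensitive ordering is an assumption made for the example; isort's real ordering depends on its configuration.

```python
def sort_import_names(line):
    """Return 'from m import ...' with the imported names in a fixed order.

    Hypothetical helper; the case-insensitive key is an assumption.
    """
    head, sep, names = line.strip().partition(" import ")
    if not sep or not head.startswith("from "):
        return line.strip()  # leave plain 'import x' lines alone
    ordered = sorted(
        (name.strip() for name in names.split(",") if name.strip()),
        key=str.lower,
    )
    return "{} import {}".format(head, ", ".join(ordered))


# Two developers who inserted the same names in different places
# converge on the same line once the order is enforced.
print(sort_import_names("from pyspark.sql import Row, DataFrame, Column"))
print(sort_import_names("from pyspark.sql import Column, Row, DataFrame"))
```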
84320dd to
5590796
Compare
|
Rebased, sorry I had to force push this PR. |
5590796 to
9d52994
Compare
|
Can you create a JIRA ticket for this? This is too large to go in without a JIRA ticket. |
|
BTW this is actually a non-trivial change and would require a very careful look, since Python imports are not side-effect free. |
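A small sketch of why reordering imports is not automatically safe: importing a module runs its top-level code, so the final state can depend on import order. The module names `demo_registry`, `demo_a`, and `demo_b` are hypothetical files created in a temporary directory just for this demonstration; nothing here is taken from the PySpark code base.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def final_value_after_imports(order):
    """Import throwaway modules in the given order and return shared state."""
    names = ("demo_registry", "demo_a", "demo_b")
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "demo_registry.py").write_text("value = None\n")
        # Each module mutates shared state as a side effect of being imported.
        (root / "demo_a.py").write_text(
            "import demo_registry\ndemo_registry.value = 'a'\n")
        (root / "demo_b.py").write_text(
            "import demo_registry\ndemo_registry.value = 'b'\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            for name in order:
                importlib.import_module(name)
            return sys.modules["demo_registry"].value
        finally:
            # Clean up so the demonstration can be repeated.
            sys.path.remove(tmp)
            for name in names:
                sys.modules.pop(name, None)


print(final_value_after_imports(["demo_a", "demo_b"]))
print(final_value_after_imports(["demo_b", "demo_a"]))
```

The two calls differ only in import order, yet they leave different values in the shared module, which is why an automated import reorder needs review rather than blind trust.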
|
@stibbons connect this to your JIRA in the title |
|
Done. Linked to #14180. Is it possible to test this PR with SparkPullRequestBuilder? |
9d52994 to
067ab4a
Compare
|
Jenkins test this please |
|
Jenkins add to whitelist |
|
Test build #63533 has finished for PR 14567 at commit
|
|
Test build #63534 has finished for PR 14567 at commit
|
|
Test build #63536 has finished for PR 14567 at commit
|
|
Test build #63544 has finished for PR 14567 at commit
|
python/pep8rc
Outdated
There is another pep8 config file at ./dev/tox.ini. It seems like it would be good to have a single file (and also to unify the ignore lists).
Done.
Is the rename of tox.ini addressing this? I wasn't sure how.
Actually, tox.ini looks more similar to this pep8rc than to isort.cfg, but GitHub doesn't show it that way.
Maybe I'm missing something but should they be in the same file or are they separate?
I don't know if they can be merged. I will try it.
I'm ignorant of the config here of course, but it seems like we've replaced 1 config file (tox.ini) with 4 in total (pep8rc, isort.cfg, style.yapf.ini, .editorconfig). Maybe it would help to explain what the differences are, because maybe only some of them are really worth dealing with.
- tox.ini more or less becomes this pep8rc, indeed.
- isort only deals with sorting "import" statements, which pep8 does not check, although pylint has some checks for them.
- style.yapf.ini is the configuration for yapf, a more aggressive version of autopep8 made by Google for their Chromium projects to strictly format all their Python files. I don't recommend enforcing its use, but anyone who wants to use it on their Python files in the Spark project can run it and review the style before committing. In an ideal world, this would be mandatory formatting (in a post-commit hook, for example), but it may take some time to reach that point.
- .editorconfig only deals with tabs and spaces; like I said, it might be moved out of this PR if you want.
I am not sure whether yapf can sort imports like isort does.
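On the question of whether the settings could live in one file: the sketch below only shows the general idea of one ini file with a section per tool, parsed with Python's standard `configparser`. Whether a given version of pep8 or isort actually reads its section from a shared file is an assumption to verify against each tool's documentation, and the option values are illustrative rather than taken from this PR.

```python
import configparser

# Illustrative content only; the values are not the ones used in this PR.
SHARED_CONFIG = """
[pep8]
max-line-length = 100

[isort]
force_single_line = true
line_length = 100
"""


def load_tool_sections(text):
    """Parse an ini document and return {section name: {option: value}}."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


sections = load_tool_sections(SHARED_CONFIG)
for tool, options in sections.items():
    print(tool, options)
```

Each tool would look only at its own section, so the ignore lists and line length could be kept next to each other in one place.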
b3a176c to
ff549b6
Compare
|
Test build #63689 has finished for PR 14567 at commit
|
ff549b6 to
be39a83
Compare
|
Test build #63690 has finished for PR 14567 at commit
|
be39a83 to
46e83f2
Compare
|
Test build #63691 has finished for PR 14567 at commit
|
46e83f2 to
d3eab9f
Compare
|
Test build #63693 has finished for PR 14567 at commit
|
d3eab9f to
500b659
Compare
1a2aa35 to
b81c38f
Compare
|
Rebased |
|
Test build #65762 has finished for PR 14567 at commit
|
b81c38f to
a02f634
Compare
|
Test build #66043 has finished for PR 14567 at commit
|
a02f634 to
4a3cde7
Compare
|
Test build #66389 has finished for PR 14567 at commit
|
Use a virtualenv for isolation and easy installation.

This basically reverts 85a50a6.
Might have been a solution to SPARK-9385.

A lot of new checks are disabled. I propose to fix the issues in various pull requests (obviously most 'import' order errors should be fixed by my other pull requests such as apache#14830 for documentation examples, which is part of the effort on code style described in apache#14567). Each subsequent pull request will fix one or more errors and re-enable the corresponding pylint check.

List of new disabled checks:
- bad-super-call
- consider-iterating-dictionary
- consider-using-enumerate
- eval-used
- exec-used
- invalid-length-returned
- misplaced-comparison-constant
- raising-bad-type
- redefined-variable-type
- trailing-newlines
- trailing-whitespace
- ungrouped-imports
- unnecessary-pass
- unneeded-not
- wrong-import-order
- wrong-import-position

Signed-off-by: Gaetan Semet <[email protected]>
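For readers unfamiliar with some of the check names in the list, here is a short illustrative sketch of the style a few of them push towards. The functions are hypothetical examples, not code from PySpark, and the exact behaviour of each check can vary between pylint versions.

```python
def index_pairs(items):
    # consider-using-enumerate: pylint flags loops written as
    # 'for i in range(len(items)): ... items[i]'; enumerate() is preferred.
    return [(i, item) for i, item in enumerate(items)]


def upper_keys(mapping):
    # consider-iterating-dictionary: 'for key in mapping.keys()' is flagged;
    # iterating over the dictionary itself yields the same keys.
    return [key.upper() for key in mapping]


def is_different(a, b):
    # unneeded-not: 'not a == b' is flagged in favour of 'a != b'.
    return a != b


def is_zero(x):
    # misplaced-comparison-constant: '0 == x' is flagged in favour of 'x == 0'.
    return x == 0


print(index_pairs(["a", "b"]))
print(upper_keys({"k": 1}))
print(is_different(1, 2), is_zero(0))
```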
+ quiet mode for sphinx
4a3cde7 to
1252d08
Compare
|
Test build #66847 has finished for PR 14567 at commit
|
|
Hi, are you still working on this? |
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017

Closes apache#14085 - [SPARK-16408][SQL] SparkSQL Added file get Exception: is a directory …
Closes apache#14239 - [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism to accelerate shuffle stage.
Closes apache#14567 - [SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation
Closes apache#14579 - [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() should return Python context managers
Closes apache#14601 - [SPARK-13979][Core] Killed executor is re spawned without AWS key…
Closes apache#14830 - [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples
Closes apache#14963 - [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in lint-python
Closes apache#15227 - [SPARK-17655][SQL]Remove unused variables declarations and definations in a WholeStageCodeGened stage
Closes apache#15240 - [SPARK-17556] [CORE] [SQL] Executor side broadcast for broadcast joins
Closes apache#15405 - [SPARK-15917][CORE] Added support for number of executors in Standalone [WIP]
Closes apache#16099 - [SPARK-18665][SQL] set statement state to "ERROR" after user cancel job
Closes apache#16445 - [SPARK-19043][SQL]Make SparkSQLSessionManager more configurable
Closes apache#16618 - [SPARK-14409][ML][WIP] Add RankingEvaluator
Closes apache#16766 - [SPARK-19426][SQL] Custom coalesce for Dataset
Closes apache#16832 - [SPARK-19490][SQL] ignore case sensitivity when filtering hive partition columns
Closes apache#17052 - [SPARK-19690][SS] Join a streaming DataFrame with a batch DataFrame which has an aggregation may not work
Closes apache#17267 - [SPARK-19926][PYSPARK] Make pyspark exception more user-friendly
Closes apache#17371 - [SPARK-19903][PYSPARK][SS] window operator miss the `watermark` metadata of time column
Closes apache#17401 - [SPARK-18364][YARN] Expose metrics for YarnShuffleService
Closes apache#17519 - [SPARK-15352][Doc] follow-up: add configuration docs for topology-aware block replication
Closes apache#17530 - [SPARK-5158] Access kerberized HDFS from Spark standalone
Closes apache#17854 - [SPARK-20564][Deploy] Reduce massive executor failures when executor count is large (>2000)
Closes apache#17979 - [SPARK-19320][MESOS][WIP]allow specifying a hard limit on number of gpus required in each spark executor when running on mesos
Closes apache#18127 - [SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when executing sql statement 'insert into' on hbase table
Closes apache#18236 - [SPARK-21015] Check field name is not null and empty in GenericRowWit…
Closes apache#18269 - [SPARK-21056][SQL] Use at most one spark job to list files in InMemoryFileIndex
Closes apache#18328 - [SPARK-21121][SQL] Support changing storage level via the spark.sql.inMemoryColumnarStorage.level variable
Closes apache#18354 - [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting
Closes apache#18383 - [SPARK-21167][SS] Set kafka clientId while fetch messages
Closes apache#18414 - [SPARK-21169] [core] Make sure to update application status to RUNNING if executors are accepted and RUNNING after recovery
Closes apache#18432 - resolve com.esotericsoftware.kryo.KryoException
Closes apache#18490 - [SPARK-21269][Core][WIP] Fix FetchFailedException when enable maxReqSizeShuffleToMem and KryoSerializer
Closes apache#18585 - SPARK-21359
Closes apache#18609 - Spark SQL merge small files to big files Update InsertIntoHiveTable.scala

Added:
Closes apache#18308 - [SPARK-21099][Spark Core] INFO Log Message Using Incorrect Executor I…
Closes apache#18599 - [SPARK-21372] spark writes one log file even I set the number of spark_rotate_log to 0
Closes apache#18619 - [SPARK-21397][BUILD]Maven shade plugin adding dependency-reduced-pom.xml to …
Closes apache#18667 - Fix the simpleString used in error messages
Closes apache#18782 - Branch 2.1

Added:
Closes apache#17694 - [SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcasts when using multiple threads

Added:
Closes apache#16456 - [SPARK-18994] clean up the local directories for application in future by annother thread
Closes apache#18683 - [SPARK-21474][CORE] Make number of parallel fetches from a reducer configurable
Closes apache#18690 - [SPARK-21334][CORE] Add metrics reporting service to External Shuffle Server

Added:
Closes apache#18827 - Merge pull request 1 from apache/master

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18780 from HyukjinKwon/close-prs.
What changes were proposed in this pull request?
This patch adds a code style validation script named `dev/py-validate.sh` to enforce pep8 recommendations.

Running this script reformats a lot of files. I did not include those changes in this PR; I plan to roll them out gradually, reformatting the code in several pull requests to ease review.
The first one is the automatic formatting of the examples in the documentation: see #14830.
Features of the current patch:
- `python/.editorconfig` file (Scala files use 2-space indentation, while Python files use 4) for compatible editors (almost every editor has a plugin that supports .editorconfig files)
- `autopep8` to fix basic pep8 mistakes
- `isort` to automatically split "import" statements and organise them into a logically linked order (see doc here). The most important thing is that it splits import statements that import more than one object into several lines. This increases the number of lines in the file, but it greatly eases file maintenance and file merges if needed.
- `py-validate.sh` script to automate the correction (it needs isort and autopep8 installed inside a virtualenv that has the right versions of the tools; this is documented in the script)

You can see a similar script in production in the Buildbot project.
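The contents of `py-validate.sh` are not shown in this conversation, so the following is only a rough sketch, under stated assumptions, of what a wrapper around the two tools might assemble. The function name `build_format_commands` and the `venv_bin` layout are hypothetical. The flags used are `autopep8 --in-place --recursive` and `isort --force-single-line-imports`; older isort releases may additionally need an explicit recursive option when given directories, so check the installed version.

```python
import os


def build_format_commands(venv_bin, paths):
    """Return the argv lists a hypothetical formatter wrapper could run.

    venv_bin is assumed to be the 'bin' directory of a virtualenv in
    which autopep8 and isort are installed at pinned versions.
    """
    paths = list(paths)
    autopep8 = os.path.join(venv_bin, "autopep8")
    isort = os.path.join(venv_bin, "isort")
    return [
        # Step 1: fix basic pep8 mistakes in place.
        [autopep8, "--in-place", "--recursive"] + paths,
        # Step 2: rewrite imports, one imported name per line.
        [isort, "--force-single-line-imports"] + paths,
    ]


commands = build_format_commands(os.path.join("venv", "bin"), ["python/pyspark"])
for argv in commands:
    print(" ".join(argv))
```

The sketch only builds the command lines; a real wrapper would pass each list to something like `subprocess.check_call` and stop on the first failure.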
How was this patch tested?
Simple tests were run on my machines (local mode only). There should be no regression or feature change at all with this pull request.