[SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation #14567

gsemet · 2016-08-09T16:44:27Z

What changes were proposed in this pull request?

This patch adds a code style validation script named dev/py-validate.sh to enforce pep8 recommendations.

Running this script reformats a lot of files. I did not include those changes in this PR. I plan to roll this out gradually, reformatting the code across several pull requests to ease review.

The first one is the automatic formatting of the examples in the documentation: see #14830.

Features of the current patch:

add a python/.editorconfig file (Scala files use 2-space indentation, while Python files use 4) for compatible editors (almost every editor has a plugin that supports .editorconfig files)
use autopep8 to fix basic pep8 mistakes
use isort to automatically split "import" statements and organise them into a logical order (see doc here). The most important thing is that it splits import statements that import more than one object into several lines. This increases the number of lines in the file, but it greatly eases file maintenance and merges.
add a py-validate.sh script to automate the corrections (it needs isort and autopep8 installed inside a virtualenv that has the right versions of the tools; this is documented in the script)
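
To make the "one import per line" behaviour concrete, here is a minimal sketch that uses only the standard library. It is not isort and does not reproduce isort's configuration or section handling; the function name `split_and_sort_imports` is hypothetical, and placing plain `import` lines before `from` lines is an ordering choice of this sketch.

```python
import ast


def split_and_sort_imports(source: str) -> str:
    """Rewrite top-level imports as one imported name per line, then sort them.

    Simplified illustration only: comments, non-import code and the
    stdlib/third-party sections that a real tool handles are ignored.
    """
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            for alias in node.names:
                suffix = f" as {alias.asname}" if alias.asname else ""
                lines.append(f"import {alias.name}{suffix}")
        elif isinstance(node, ast.ImportFrom):
            module = "." * node.level + (node.module or "")
            for alias in node.names:
                suffix = f" as {alias.asname}" if alias.asname else ""
                lines.append(f"from {module} import {alias.name}{suffix}")
    # Plain "import x" lines first, then "from x import y", each group sorted.
    return "\n".join(sorted(set(lines), key=lambda l: (l.startswith("from "), l)))


if __name__ == "__main__":
    print(split_and_sort_imports(
        "import sys, os\nfrom pyspark.sql import Row, SparkSession\n"
    ))
```

The source is only parsed, never executed, so the modules named in the input do not need to be installed.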

You can see a similar script in production in the Buildbot project.

How was this patch tested?

Simple tests have been run on my machines (local mode only). There should not be any regression or feature change with this pull request.

srowen · 2016-08-09T16:55:14Z

Although there has generally been some resistance to large style-only changes, we do enforce import order in Scala/Java including checks. So it seems pretty reasonable to do the same in one big go for Python.

srowen · 2016-08-09T16:56:38Z

python/pyspark/context.py

Expanding these multiple imports seems counter-productive. We don't do it in Scala (and in Java you can only import one thing or everything). Is this important/canonical for PEP8?

Indeed, this is a deviation from the PEP recommendation. I encourage this behavior since it greatly simplifies file maintenance over a file's lifetime: once isort is set up, developers no longer have any latitude in the placement of the "import" statements.

On our Buildbot-based project, we used to have a lot of conflicts involving changes to import statements: on two different branches (say, prod and main), we often either added the same import (so a merge might add the line twice when the two developers placed it in two different places) or hit a conflict that was not so easy to resolve (when two developers each imported a different object from the same module).
Once we set up the "one import per line" rule, we had no more conflicts on these lines, ever. This helped us a lot in automating an auto-merger from the release branches to the "master" branch.
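
The merge argument can be illustrated with a deliberately simplified model. The sketch below is not git's merge algorithm: it only checks whether two branches touch overlapping or adjacent ranges of the base file, using `difflib` from the standard library. The helper names and the sample import lines are made up for illustration.

```python
from difflib import SequenceMatcher


def touched_ranges(base, branch):
    """Return the (start, end) ranges of base lines that the branch changed."""
    opcodes = SequenceMatcher(None, base, branch, autojunk=False).get_opcodes()
    return [(i1, i2) for tag, i1, i2, _, _ in opcodes if tag != "equal"]


def would_conflict(base, ours, theirs):
    """Simplified model: conflict when both branches touch overlapping
    or adjacent ranges of the base file."""
    return any(
        a1 <= b2 and b1 <= a2
        for a1, a2 in touched_ranges(base, ours)
        for b1, b2 in touched_ranges(base, theirs)
    )


# Combined import: both branches have to edit the same line.
combined_base = ["from pyspark import SparkConf, SparkContext"]
combined_ours = ["from pyspark import SparkConf, SparkContext, SparkFiles"]
combined_theirs = ["from pyspark import Accumulator, SparkConf, SparkContext"]

# One import per line: each branch inserts its own line at its sorted position.
single_base = ["from pyspark import SparkConf", "from pyspark import SparkContext"]
single_ours = single_base + ["from pyspark import SparkFiles"]
single_theirs = ["from pyspark import Accumulator"] + single_base

if __name__ == "__main__":
    print(would_conflict(combined_base, combined_ours, combined_theirs))
    print(would_conflict(single_base, single_ours, single_theirs))
```

In this model the combined form conflicts and the split form does not. Two insertions at the same or neighbouring positions would still be reported as a conflict, so the rule reduces conflicts rather than eliminating every possible one.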

Yeah, we get that in the Scala code for sure, although you get conflicts even if you change merely adjacent lines anyway, so it only saves so much conflict. It's a decent open question; I wonder what others think. The downside is the extra duplication of import boilerplate.

As you can see at https://pypi.python.org/pypi/isort, there are several ways to format these multiple-import statements. At the very least, I recommend enforcing the sort order of these multi-import lines so there is no ambiguity about where to place any "import" (and isort will correct the developer's change).

gsemet · 2016-08-09T17:10:50Z

Rebased, sorry I had to force push this PR.

rxin · 2016-08-09T17:57:39Z

Can you create a JIRA ticket for this? This is too large to go in without a JIRA ticket.

rxin · 2016-08-09T18:02:36Z

BTW this is actually a non-trivial change and would require very careful look, since Python imports are not side effect free.
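
To show why reordering imports is not automatically safe, here is a small self-contained sketch. The modules `demo_registry`, `demo_a` and `demo_b` are hypothetical throwaway files written to a temporary directory; they have nothing to do with PySpark's own modules. Each of `demo_a` and `demo_b` overwrites a shared value when imported, so the final state depends on the import order.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_in_order(names):
    """Import the hypothetical demo modules in the given order and return
    the value left in demo_registry afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "demo_registry.py").write_text("value = None\n")
        # Both modules mutate shared state as a side effect of being imported.
        (root / "demo_a.py").write_text(
            "import demo_registry\ndemo_registry.value = 'a'\n")
        (root / "demo_b.py").write_text(
            "import demo_registry\ndemo_registry.value = 'b'\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            for name in names:
                importlib.import_module(name)
            return sys.modules["demo_registry"].value
        finally:
            # Clean up so the function can be called again with another order.
            sys.path.remove(tmp)
            for name in ("demo_registry", "demo_a", "demo_b"):
                sys.modules.pop(name, None)


if __name__ == "__main__":
    print(import_in_order(["demo_a", "demo_b"]))
    print(import_in_order(["demo_b", "demo_a"]))
```

The last module imported wins, which is the kind of behaviour change an automatic import sorter could introduce without any visible error.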

srowen · 2016-08-10T09:31:34Z

@stibbons connect this to your JIRA in the title
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

gsemet · 2016-08-10T11:15:53Z

Done. Linked to #14180. Is it possible to test this PR with SparkPullRequestBuilder ?

srowen · 2016-08-10T13:00:39Z

Jenkins test this please

srowen · 2016-08-10T13:00:43Z

Jenkins add to whitelist

SparkQA · 2016-08-10T13:03:44Z

Test build #63533 has finished for PR 14567 at commit 067ab4a.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-10T13:18:40Z

Test build #63534 has finished for PR 14567 at commit 6f17382.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-10T15:33:27Z

Test build #63536 has finished for PR 14567 at commit 239344b.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-10T18:08:04Z

Test build #63544 has finished for PR 14567 at commit b3a176c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-08-11T02:31:08Z

python/pep8rc

There is another pep8 config file at ./dev/tox.ini - seems like it would be good to have a single file (also unify the ignore lists)

Is the rename of tox.ini addressing this -- I wasn't sure how

actually, tox.ini looks more similar to this pep8rc than isort.cfg, but github doesn't show it that way.

Maybe I'm missing something but should they be in the same file or are they separate?

I don't know if they can be merged. Will try it

I'm ignorant of the config here of course, but it seems like we've replaced 1 config file (tox.ini) with 4 in total (pep8rc, isort.cfg, style.yapf.ini, .editorconfig). Maybe it would help to explain what the differences are, because maybe only some of them are really worth dealing with.

tox.ini kind of becomes this pep8rc, indeed.

isort only deals with sorting "import" statements, which pep8 does not check, although pylint has some checks for them.

style.yapf.ini is the configuration for yapf, which is a more aggressive version of 'autopep8' made by Google for their Chromium projects, to strictly format all their Python files. I don't recommend enforcing its use, but anyone who wants to use it on their Python files in the Spark project can run it and review the style before committing. In an ideal world, this would be mandatory formatting (in a post-commit hook, for example), but it may take some time to reach that point.

.editorconfig only deals with tabs and spaces; like I said, it can be moved out of this PR if you want.

I am not sure whether yapf can sort imports like isort does.

SparkQA · 2016-08-12T14:08:32Z

Test build #63689 has finished for PR 14567 at commit ff549b6.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-12T14:18:35Z

Test build #63690 has finished for PR 14567 at commit be39a83.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-12T15:13:28Z

Test build #63691 has finished for PR 14567 at commit 46e83f2.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-12T15:38:36Z

Test build #63693 has finished for PR 14567 at commit d3eab9f.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

gsemet · 2016-09-22T07:32:52Z

Rebased

SparkQA · 2016-09-22T09:49:12Z

Test build #65762 has finished for PR 14567 at commit b81c38f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-28T18:55:47Z

Test build #66043 has finished for PR 14567 at commit a02f634.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-05T14:36:04Z

Test build #66389 has finished for PR 14567 at commit 4a3cde7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Use a virtualenv for isolation and easy installation.

This basically reverts 85a50a6

Might have been a solution to SPARK-9385.

A lot of new tests are disabled. I propose to fix the issues in various pull requests (obviously most 'import' order errors should be fixed by my other pull requests such as apache#14830 for documentation examples, which is part of the effort on code style described in apache#14567). Each subsequent pull request will fix one or more errors and re-enable the corresponding pylint check.

List of new disabled checks:
- bad-super-call
- consider-iterating-dictionary
- consider-using-enumerate
- eval-used
- exec-used
- invalid-length-returned
- misplaced-comparison-constant
- raising-bad-type
- redefined-variable-type
- trailing-newlines
- trailing-whitespace
- ungrouped-imports
- unnecessary-pass
- unneeded-not
- wrong-import-order
- wrong-import-position

Signed-off-by: Gaetan Semet <[email protected]>
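
For readers unfamiliar with the pylint checks named in this commit message, the sketch below pairs a pattern that three of them are meant to flag with an equivalent rewrite. The function names are hypothetical and the code is not taken from the Spark tree; it only illustrates the style of change that re-enabling each check would call for.

```python
def find_index_flagged(items, target):
    # consider-using-enumerate: indexing through range(len(...))
    for i in range(len(items)):
        if items[i] == target:
            return i
    return -1


def find_index_preferred(items, target):
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1


def upper_keys_flagged(mapping):
    # consider-iterating-dictionary: .keys() is redundant when iterating
    result = []
    for key in mapping.keys():
        result.append(key.upper())
    return result


def upper_keys_preferred(mapping):
    result = []
    for key in mapping:
        result.append(key.upper())
    return result


def is_nonzero_flagged(value):
    # unneeded-not: the negation can be folded into the comparison
    return not value == 0


def is_nonzero_preferred(value):
    return value != 0
```

Each pair behaves identically, which is why such fixes can be made check by check without changing functionality.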

+ quiet mode for sphinx

SparkQA · 2016-10-13T00:13:37Z

Test build #66847 has finished for PR 14567 at commit 1252d08.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2017-06-26T23:44:08Z

Hi, are you still working on this?

## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017

Closes apache#14085 - [SPARK-16408][SQL] SparkSQL Added file get Exception: is a directory …
Closes apache#14239 - [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism to accelerate shuffle stage.
Closes apache#14567 - [SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation
Closes apache#14579 - [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() should return Python context managers
Closes apache#14601 - [SPARK-13979][Core] Killed executor is re spawned without AWS key…
Closes apache#14830 - [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples
Closes apache#14963 - [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in lint-python
Closes apache#15227 - [SPARK-17655][SQL]Remove unused variables declarations and definations in a WholeStageCodeGened stage
Closes apache#15240 - [SPARK-17556] [CORE] [SQL] Executor side broadcast for broadcast joins
Closes apache#15405 - [SPARK-15917][CORE] Added support for number of executors in Standalone [WIP]
Closes apache#16099 - [SPARK-18665][SQL] set statement state to "ERROR" after user cancel job
Closes apache#16445 - [SPARK-19043][SQL]Make SparkSQLSessionManager more configurable
Closes apache#16618 - [SPARK-14409][ML][WIP] Add RankingEvaluator
Closes apache#16766 - [SPARK-19426][SQL] Custom coalesce for Dataset
Closes apache#16832 - [SPARK-19490][SQL] ignore case sensitivity when filtering hive partition columns
Closes apache#17052 - [SPARK-19690][SS] Join a streaming DataFrame with a batch DataFrame which has an aggregation may not work
Closes apache#17267 - [SPARK-19926][PYSPARK] Make pyspark exception more user-friendly
Closes apache#17371 - [SPARK-19903][PYSPARK][SS] window operator miss the `watermark` metadata of time column
Closes apache#17401 - [SPARK-18364][YARN] Expose metrics for YarnShuffleService
Closes apache#17519 - [SPARK-15352][Doc] follow-up: add configuration docs for topology-aware block replication
Closes apache#17530 - [SPARK-5158] Access kerberized HDFS from Spark standalone
Closes apache#17854 - [SPARK-20564][Deploy] Reduce massive executor failures when executor count is large (>2000)
Closes apache#17979 - [SPARK-19320][MESOS][WIP]allow specifying a hard limit on number of gpus required in each spark executor when running on mesos
Closes apache#18127 - [SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when executing sql statement 'insert into' on hbase table
Closes apache#18236 - [SPARK-21015] Check field name is not null and empty in GenericRowWit…
Closes apache#18269 - [SPARK-21056][SQL] Use at most one spark job to list files in InMemoryFileIndex
Closes apache#18328 - [SPARK-21121][SQL] Support changing storage level via the spark.sql.inMemoryColumnarStorage.level variable
Closes apache#18354 - [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting
Closes apache#18383 - [SPARK-21167][SS] Set kafka clientId while fetch messages
Closes apache#18414 - [SPARK-21169] [core] Make sure to update application status to RUNNING if executors are accepted and RUNNING after recovery
Closes apache#18432 - resolve com.esotericsoftware.kryo.KryoException
Closes apache#18490 - [SPARK-21269][Core][WIP] Fix FetchFailedException when enable maxReqSizeShuffleToMem and KryoSerializer
Closes apache#18585 - SPARK-21359
Closes apache#18609 - Spark SQL merge small files to big files Update InsertIntoHiveTable.scala

Added:
Closes apache#18308 - [SPARK-21099][Spark Core] INFO Log Message Using Incorrect Executor I…
Closes apache#18599 - [SPARK-21372] spark writes one log file even I set the number of spark_rotate_log to 0
Closes apache#18619 - [SPARK-21397][BUILD]Maven shade plugin adding dependency-reduced-pom.xml to …
Closes apache#18667 - Fix the simpleString used in error messages
Closes apache#18782 - Branch 2.1

Added:
Closes apache#17694 - [SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcasts when using multiple threads

Added:
Closes apache#16456 - [SPARK-18994] clean up the local directories for application in future by annother thread
Closes apache#18683 - [SPARK-21474][CORE] Make number of parallel fetches from a reducer configurable
Closes apache#18690 - [SPARK-21334][CORE] Add metrics reporting service to External Shuffle Server

Added:
Closes apache#18827 - Merge pull request 1 from apache/master

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18780 from HyukjinKwon/close-prs.

gsemet mentioned this pull request Aug 9, 2016

[SPARK-16367][PYSPARK] Support for deploying Anaconda and Virtualenv environments in Spark Executors #14180

Closed

srowen reviewed Aug 9, 2016

gsemet force-pushed the python_import_reorg branch from 84320dd to 5590796 Compare August 9, 2016 17:03

gsemet force-pushed the python_import_reorg branch from 5590796 to 9d52994 Compare August 9, 2016 17:16

gsemet changed the title ~~Python import reorg~~ [SPARK-16992] Python Pep8 formatting and import reorganisation Aug 10, 2016

gsemet changed the title ~~[SPARK-16992] Python Pep8 formatting and import reorganisation~~ [SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation Aug 10, 2016

gsemet force-pushed the python_import_reorg branch from 9d52994 to 067ab4a Compare August 10, 2016 11:24

holdenk reviewed Aug 11, 2016

gsemet force-pushed the python_import_reorg branch from b3a176c to ff549b6 Compare August 12, 2016 14:03

gsemet force-pushed the python_import_reorg branch from ff549b6 to be39a83 Compare August 12, 2016 14:13

gsemet force-pushed the python_import_reorg branch from be39a83 to 46e83f2 Compare August 12, 2016 15:11

gsemet force-pushed the python_import_reorg branch from 46e83f2 to d3eab9f Compare August 12, 2016 15:33

gsemet force-pushed the python_import_reorg branch from d3eab9f to 500b659 Compare August 12, 2016 15:41

gsemet force-pushed the python_import_reorg branch from 1a2aa35 to b81c38f Compare September 22, 2016 07:31

gsemet force-pushed the python_import_reorg branch from b81c38f to a02f634 Compare September 28, 2016 16:35

gsemet force-pushed the python_import_reorg branch from a02f634 to 4a3cde7 Compare October 5, 2016 12:17

gsemet added 4 commits October 12, 2016 23:44

better usage of debug mode

5aa020f

Leave anaconda environment in lint-pylint

cd7184a

new pylint ignore after rebase

1252d08

+ quiet mode for sphinx

gsemet force-pushed the python_import_reorg branch from 4a3cde7 to 1252d08 Compare October 12, 2016 21:45

HyukjinKwon mentioned this pull request Jul 31, 2017

[INFRA] Close stale PRs #18780

Closed

asfgit closed this in 3a45c7f Aug 5, 2017
