Skip to content

Conversation

@gsemet
Copy link
Contributor

@gsemet gsemet commented Aug 9, 2016

What changes were proposed in this pull request?

This patch adds a code style validation script named dev/py-validate.sh to enforce pep8 recommendations.

The execution of this script will format a lot of files. I did not put them into this PR. I plan to execute this smoothly and reformat the code in various Pull Requests to ease reviews.

First one is automatic formatting of the examples in the documentations: See #14830.

Features of the current patch:

  • add a python/.editconfig file (Scala files use 2 space indentation, while Python files uses 4) for compatible editors (almost every editors has a plugin to support .editconfig file)
  • use autopep8 to fix basic pep8 mistakes
  • use isort to automatically split "import" statement and organise them into logically linked order (see doc here. The most important thing is that it split import statements that imports more than one object into several lines. This will increase the number of line of the file, but this facilitates a lot file maintainance and file merges if needed.
  • add a py-validate.sh script in order to automatise the correction (need isort and autopep8 installed inside a virtualenv that has the right version of the tools, this is documented in the script)

You can see similar script in prod in the Buildbot project

How was this patch tested?

Simple tests on my machines has been done (local mode only). There should not have any regression or feature change at all with this pull request.

@srowen
Copy link
Member

srowen commented Aug 9, 2016

Although there has generally been some resistance to large style-only changes, we do enforce import order in Scala/Java including checks. So it seems pretty reasonable to do the same in one big go for Python.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanding these multiple imports seems counter-productive. We don't do it in Scala (and in Java you can only import one thing or everything). Is this important/canonical for PEP8?

Copy link
Contributor Author

@gsemet gsemet Aug 9, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed this is a deviation from Pep recommendation. I encourage this behavior since it simplifies a lot file maintainance over its lifetime, since once this "isort" stuff is setup, developers does not have any more latitude in the placement of the "import" statements.

On our Buildbot based project, we use to have lot of conflict involving changes on import statement: on 2 differents branches (say: prod and main), we often had to import either the same import (so merge might add this line twice when the two developers have placed them in two different places) or not so easy to solve conflict (when two developers add theo import two different object from the same module).
Once we setup the "one import per line" rule, no more conflict on this lines, ever. This helped us a lot automatizing an auto merger from the release branches to the "master" branch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we get that in the Scala code for sure, although you get conflicts even if you change merely adjacent lines anyway, so it only saves so much conflict. It's a decent open question, I wonder what others think? the downside is the extra duplication of import boilerplate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see in https://pypi.python.org/pypi/isort, there a several way to format this multiple import statements. At worst, at least I recommende to enforce the sort of this multi import lines so there is no ambiguity where to place any "import" (and isort will correct the change from the developer)

@gsemet gsemet force-pushed the python_import_reorg branch from 84320dd to 5590796 Compare August 9, 2016 17:03
@gsemet
Copy link
Contributor Author

gsemet commented Aug 9, 2016

Rebased, sorry I had to force push this PR.

@gsemet gsemet force-pushed the python_import_reorg branch from 5590796 to 9d52994 Compare August 9, 2016 17:16
@rxin
Copy link
Contributor

rxin commented Aug 9, 2016

Can you create a JIRA ticket for this? This is too large to go in without a JIRA ticket.

@rxin
Copy link
Contributor

rxin commented Aug 9, 2016

BTW this is actually a non-trivial change and would require very careful look, since Python imports are not side effect free.

@srowen
Copy link
Member

srowen commented Aug 10, 2016

@stibbons connect this to your JIRA in the title
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

@gsemet gsemet changed the title Python import reorg [SPARK-16992] Python Pep8 formatting and import reorganisation Aug 10, 2016
@gsemet
Copy link
Contributor Author

gsemet commented Aug 10, 2016

Done. Linked to #14180. Is it possible to test this PR with SparkPullRequestBuilder ?

@gsemet gsemet changed the title [SPARK-16992] Python Pep8 formatting and import reorganisation [SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation Aug 10, 2016
@gsemet gsemet force-pushed the python_import_reorg branch from 9d52994 to 067ab4a Compare August 10, 2016 11:24
@srowen
Copy link
Member

srowen commented Aug 10, 2016

Jenkins test this please

@srowen
Copy link
Member

srowen commented Aug 10, 2016

Jenkins add to whitelist

@SparkQA
Copy link

SparkQA commented Aug 10, 2016

Test build #63533 has finished for PR 14567 at commit 067ab4a.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 10, 2016

Test build #63534 has finished for PR 14567 at commit 6f17382.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 10, 2016

Test build #63536 has finished for PR 14567 at commit 239344b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 10, 2016

Test build #63544 has finished for PR 14567 at commit b3a176c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

python/pep8rc Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another pep8config file at ./dev/toxi.ini - seems like it would be good to have a single file (also unify the ignore lists)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the rename of tox.ini addressing this -- I wasn't sure how

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, tox.ini looks more similar to this pep8rc than isort.cfg, but github doesn't show it that way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something but should they be in the same file or are they separate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if they can be merged. Will try it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ignorant of the config here of course, but it seems like we've replaced 1 config file (tox.ini) with 4 in total (pep8rc, isort.cfg, style.yapf.ini, .editorconfig). Maybe it would help to explain what the differences are, because maybe only some of them are really worth dealing with.

Copy link
Contributor Author

@gsemet gsemet Aug 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • tox.ini kind of becames this pep8rc, indeed.
  • isort only deals with sorting "import" statement, not checked by pep8, but pylint has some checks on it
  • style.yapf.ini is the configuration for yapf, which is a more aggressive version of 'autopep8' made by google for their chrominum projects, to strictly format all their python files. I don't recommend to enforce the usage of it, but if one want to use it on its python file in the Spark project, he can execute it and review the style before commiting. In an ideal wold, this would be a mandatory formatting (on post commit hook for example), but this may take some times before reaching this point
  • .editorconfig only deals with tab space, like I said, might be put outside of this PR if you want

I am not sure if yapf can sort imports like isort do

@gsemet gsemet force-pushed the python_import_reorg branch from b3a176c to ff549b6 Compare August 12, 2016 14:03
@SparkQA
Copy link

SparkQA commented Aug 12, 2016

Test build #63689 has finished for PR 14567 at commit ff549b6.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet force-pushed the python_import_reorg branch from ff549b6 to be39a83 Compare August 12, 2016 14:13
@SparkQA
Copy link

SparkQA commented Aug 12, 2016

Test build #63690 has finished for PR 14567 at commit be39a83.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet force-pushed the python_import_reorg branch from be39a83 to 46e83f2 Compare August 12, 2016 15:11
@SparkQA
Copy link

SparkQA commented Aug 12, 2016

Test build #63691 has finished for PR 14567 at commit 46e83f2.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet force-pushed the python_import_reorg branch from 46e83f2 to d3eab9f Compare August 12, 2016 15:33
@SparkQA
Copy link

SparkQA commented Aug 12, 2016

Test build #63693 has finished for PR 14567 at commit d3eab9f.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet force-pushed the python_import_reorg branch from d3eab9f to 500b659 Compare August 12, 2016 15:41
@gsemet gsemet force-pushed the python_import_reorg branch from 1a2aa35 to b81c38f Compare September 22, 2016 07:31
@gsemet
Copy link
Contributor Author

gsemet commented Sep 22, 2016

Rebased

@SparkQA
Copy link

SparkQA commented Sep 22, 2016

Test build #65762 has finished for PR 14567 at commit b81c38f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet force-pushed the python_import_reorg branch from b81c38f to a02f634 Compare September 28, 2016 16:35
@SparkQA
Copy link

SparkQA commented Sep 28, 2016

Test build #66043 has finished for PR 14567 at commit a02f634.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet force-pushed the python_import_reorg branch from a02f634 to 4a3cde7 Compare October 5, 2016 12:17
@SparkQA
Copy link

SparkQA commented Oct 5, 2016

Test build #66389 has finished for PR 14567 at commit 4a3cde7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gsemet added a commit to gsemet/spark that referenced this pull request Oct 12, 2016
Use a virtualenv for isolation and easy installation.

This basically reverts 85a50a6

Might have been a solution to SPARK-9385.

Lot of new test disabled. I propose to fix issues in various
pull requests (obviously most 'import' order errors should be
fixed by my other pull requests such as apache#14830 for documentation
examples, which is part of the effort on code style described
in apache#14567). Each subsequent pull request will fix one or more
error and reenable the according pylint check.

List of new disabled checks:
- bad-super-call
- consider-iterating-dictionary
- consider-using-enumerate
- eval-used
- exec-used
- invalid-length-returned
- misplaced-comparison-constant
- raising-bad-type
- redefined-variable-type
- trailing-newlines
- trailing-whitespace
- ungrouped-imports
- unnecessary-pass
- unneeded-not
- wrong-import-order
- wrong-import-position

Signed-off-by: Gaetan Semet <[email protected]>
Use a virtualenv for isolation and easy installation.

This basically reverts 85a50a6

Might have been a solution to SPARK-9385.

Lot of new test disabled. I propose to fix issues in various
pull requests (obviously most 'import' order errors should be
fixed by my other pull requests such as apache#14830 for documentation
examples, which is part of the effort on code style described
in apache#14567). Each subsequent pull request will fix one or more
error and reenable the according pylint check.

List of new disabled checks:
- bad-super-call
- consider-iterating-dictionary
- consider-using-enumerate
- eval-used
- exec-used
- invalid-length-returned
- misplaced-comparison-constant
- raising-bad-type
- redefined-variable-type
- trailing-newlines
- trailing-whitespace
- ungrouped-imports
- unnecessary-pass
- unneeded-not
- wrong-import-order
- wrong-import-position

Signed-off-by: Gaetan Semet <[email protected]>
+ quiet mode for sphinx
@gsemet gsemet force-pushed the python_import_reorg branch from 4a3cde7 to 1252d08 Compare October 12, 2016 21:45
@SparkQA
Copy link

SparkQA commented Oct 13, 2016

Test build #66847 has finished for PR 14567 at commit 1252d08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gsemet added a commit to gsemet/spark that referenced this pull request Dec 22, 2016
Use a virtualenv for isolation and easy installation.

This basically reverts 85a50a6

Might have been a solution to SPARK-9385.

Lot of new test disabled. I propose to fix issues in various
pull requests (obviously most 'import' order errors should be
fixed by my other pull requests such as apache#14830 for documentation
examples, which is part of the effort on code style described
in apache#14567). Each subsequent pull request will fix one or more
error and reenable the according pylint check.

List of new disabled checks:
- bad-super-call
- consider-iterating-dictionary
- consider-using-enumerate
- eval-used
- exec-used
- invalid-length-returned
- misplaced-comparison-constant
- raising-bad-type
- redefined-variable-type
- trailing-newlines
- trailing-whitespace
- ungrouped-imports
- unnecessary-pass
- unneeded-not
- wrong-import-order
- wrong-import-position

Signed-off-by: Gaetan Semet <[email protected]>
@ueshin
Copy link
Member

ueshin commented Jun 26, 2017

Hi, are you still working on this?

@asfgit asfgit closed this in 3a45c7f Aug 5, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017

Closes apache#14085 - [SPARK-16408][SQL] SparkSQL Added file get Exception: is a directory …
Closes apache#14239 - [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism to accelerate shuffle stage.
Closes apache#14567 - [SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation
Closes apache#14579 - [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() should return Python context managers
Closes apache#14601 - [SPARK-13979][Core] Killed executor is re spawned without AWS key…
Closes apache#14830 - [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples
Closes apache#14963 - [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in lint-python
Closes apache#15227 - [SPARK-17655][SQL]Remove unused variables declarations and definations in a WholeStageCodeGened stage
Closes apache#15240 - [SPARK-17556] [CORE] [SQL] Executor side broadcast for broadcast joins
Closes apache#15405 - [SPARK-15917][CORE] Added support for number of executors in Standalone [WIP]
Closes apache#16099 - [SPARK-18665][SQL] set statement state to "ERROR" after user cancel job
Closes apache#16445 - [SPARK-19043][SQL]Make SparkSQLSessionManager more configurable
Closes apache#16618 - [SPARK-14409][ML][WIP] Add RankingEvaluator
Closes apache#16766 - [SPARK-19426][SQL] Custom coalesce for Dataset
Closes apache#16832 - [SPARK-19490][SQL] ignore case sensitivity when filtering hive partition columns
Closes apache#17052 - [SPARK-19690][SS] Join a streaming DataFrame with a batch DataFrame which has an aggregation may not work
Closes apache#17267 - [SPARK-19926][PYSPARK] Make pyspark exception more user-friendly
Closes apache#17371 - [SPARK-19903][PYSPARK][SS] window operator miss the `watermark` metadata of time column
Closes apache#17401 - [SPARK-18364][YARN] Expose metrics for YarnShuffleService
Closes apache#17519 - [SPARK-15352][Doc] follow-up: add configuration docs for topology-aware block replication
Closes apache#17530 - [SPARK-5158] Access kerberized HDFS from Spark standalone
Closes apache#17854 - [SPARK-20564][Deploy] Reduce massive executor failures when executor count is large (>2000)
Closes apache#17979 - [SPARK-19320][MESOS][WIP]allow specifying a hard limit on number of gpus required in each spark executor when running on mesos
Closes apache#18127 - [SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when executing sql statement 'insert into' on hbase table
Closes apache#18236 - [SPARK-21015] Check field name is not null and empty in GenericRowWit…
Closes apache#18269 - [SPARK-21056][SQL] Use at most one spark job to list files in InMemoryFileIndex
Closes apache#18328 - [SPARK-21121][SQL] Support changing storage level via the spark.sql.inMemoryColumnarStorage.level variable
Closes apache#18354 - [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting
Closes apache#18383 - [SPARK-21167][SS] Set kafka clientId while fetch messages
Closes apache#18414 - [SPARK-21169] [core] Make sure to update application status to RUNNING if executors are accepted and RUNNING after recovery
Closes apache#18432 - resolve com.esotericsoftware.kryo.KryoException
Closes apache#18490 - [SPARK-21269][Core][WIP] Fix FetchFailedException when enable maxReqSizeShuffleToMem and KryoSerializer
Closes apache#18585 - SPARK-21359
Closes apache#18609 - Spark SQL merge small files to big files Update InsertIntoHiveTable.scala

Added:
Closes apache#18308 - [SPARK-21099][Spark Core] INFO Log Message Using Incorrect Executor I…
Closes apache#18599 - [SPARK-21372] spark writes one log file even I set the number of spark_rotate_log to 0
Closes apache#18619 - [SPARK-21397][BUILD]Maven shade plugin adding dependency-reduced-pom.xml to …
Closes apache#18667 - Fix the simpleString used in error messages
Closes apache#18782 - Branch 2.1

Added:
Closes apache#17694 - [SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcasts when using multiple threads

Added:
Closes apache#16456 - [SPARK-18994] clean up the local directories for application in future by annother thread
Closes apache#18683 - [SPARK-21474][CORE] Make number of parallel fetches from a reducer configurable
Closes apache#18690 - [SPARK-21334][CORE] Add metrics reporting service to External Shuffle Server

Added:
Closes apache#18827 - Merge pull request 1 from apache/master

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18780 from HyukjinKwon/close-prs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants