[SPARK-20627][PYSPARK] Drop the hadoop distribution name from the Python version #17885

Closed
holdenk wants to merge 2 commits into apache:master from holdenk:SPARK-20627-remove-pip-local-version-string

holdenk commented May 7, 2017

What changes were proposed in this pull request?

Drop the Hadoop distribution name from the Python version (PEP 440 - https://www.python.org/dev/peps/pep-0440/). We've been using the local version string to disambiguate between the different Hadoop versions packaged with PySpark, but PEP 440 states that local versions should not be used when publishing upstream. Since we no longer make PySpark pip packages for different Hadoop versions, we can simply drop the Hadoop information. If we later need to publish for different Hadoop versions, we can look at making separate packages or similar.

How was this patch tested?

Ran make-distribution locally
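
For illustration, the effect of the change on the generated version string can be sketched as below. This is not code from the patch: the function names are hypothetical, `2.2.0-SNAPSHOT` and `hadoop2.7` are assumed example values for `SPARK_VERSION` and `NAME`, and `sed -E` is used here in place of the script's `sed -r`.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers mirroring the version rewrite in the
# distribution script. Example inputs are assumptions, not real release values.

# Previous behaviour: append the distribution name as a PEP 440 local version
# (the part after "+"), then rewrite the first "-" to "." and SNAPSHOT to dev0.
pyspark_version_with_local() {
  echo "$1+$2" | sed -E "s/-/./" | sed -E "s/SNAPSHOT/dev0/"
}

# Proposed behaviour: same rewrites, but no local version segment.
pyspark_version() {
  echo "$1" | sed -E "s/-/./" | sed -E "s/SNAPSHOT/dev0/"
}

pyspark_version_with_local "2.2.0-SNAPSHOT" "hadoop2.7"   # prints 2.2.0.dev0+hadoop2.7
pyspark_version "2.2.0-SNAPSHOT"                          # prints 2.2.0.dev0
```

With the `+$NAME` part gone, the resulting string has no local version segment, which is the property the description is after.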

holdenk commented May 7, 2017

I'll target this for master, branch-2.2, branch-2.1.

SparkQA commented May 7, 2017

Test build #76535 has finished for PR 17885 at commit 99414d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-PYSPARK_VERSION=`echo "$SPARK_VERSION+$NAME" | sed -r "s/-/./" | sed -r "s/SNAPSHOT/dev0/"`
+# Write out the VERSION to PySpark version info we rewrite the - into a . and SNAPSHOT
+# to dev0 to be closer to PEP440.
+PYSPARK_VERSION=`echo "$SPARK_VERSION" | sed -r "s/-/./" | sed -r "s/SNAPSHOT/dev0/"`
Member

This also affects the pyspark-*.tgz artifact name. It seems like this means the same file name will be used for different flavors of the release. If they're identical anyway it's just redundant, but are they? I don't know this part well so might be misunderstanding what this would do.

Contributor Author

So we currently only package Python for one Hadoop version. If we start doing multiple Hadoop versions for Python we can figure out how to handle that again.

holdenk commented May 9, 2017

If there are no other comments I'm going to merge this tomorrow.

@gatorsmile
Member

Are you referring to https://www.python.org/dev/peps/pep-0440/ ?

@gatorsmile
Member

Could you post the changes you made in the PR description and explain why it resolves PEP-0440? It might help more people understand the impacts of this PR by reading the PR description. Thanks!

holdenk commented May 9, 2017

Updated with more explanation of what we changed in the PR description.

asfgit pushed a commit that referenced this pull request May 9, 2017
…hon version

## What changes were proposed in this pull request?

Drop the Hadoop distribution name from the Python version (PEP 440 - https://www.python.org/dev/peps/pep-0440/). We've been using the local version string to disambiguate between the different Hadoop versions packaged with PySpark, but PEP 440 states that local versions should not be used when publishing upstream. Since we no longer make PySpark pip packages for different Hadoop versions, we can simply drop the Hadoop information. If we later need to publish for different Hadoop versions, we can look at making separate packages or similar.

## How was this patch tested?

Ran `make-distribution` locally

Author: Holden Karau <holden@us.ibm.com>

Closes #17885 from holdenk/SPARK-20627-remove-pip-local-version-string.

(cherry picked from commit 1b85bcd)
Signed-off-by: Holden Karau <holden@us.ibm.com>

holdenk commented May 9, 2017

Merged to master, branch-2.2, and branch-2.1.

asfgit pushed a commit that referenced this pull request May 9, 2017
asfgit closed this in 1b85bcd May 9, 2017

@gatorsmile
Member

Could you post the original section stating that local versions should not be used when publishing upstream?

It sounds like PEP 440 does not encourage it. Below is what I found:

> The inclusion of the local version label makes it possible to differentiate upstream releases from potentially altered rebuilds by downstream integrators. The use of a local version identifier does not affect the kind of a release but, when applied to a source distribution, does indicate that it may not contain the exact same code as the corresponding upstream release.
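
To make the term concrete: in PEP 440 the local version label is the part of a version string that follows a `+`. A minimal sketch of a check for it is below; the function name `has_local_version` is hypothetical and the version strings are assumed examples, and the check only looks for a `+` rather than validating the full PEP 440 grammar.

```shell
#!/usr/bin/env bash
# Sketch only: reports whether a version string carries a PEP 440 local
# version label, approximated here as "contains a + character".
has_local_version() {
  case "$1" in
    *+*) return 0 ;;  # a "+" introduces the local version label
    *)   return 1 ;;
  esac
}

for v in "2.2.0.dev0+hadoop2.7" "2.2.0.dev0"; do
  if has_local_version "$v"; then
    echo "$v: has a local version label"
  else
    echo "$v: no local version label"
  fi
done
```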

lycplus pushed a commit to lycplus/spark that referenced this pull request May 24, 2017