-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31167][BUILD] Refactor how we track Python test/build dependencies #27928
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from 4 commits
Commits
Show all changes
18 commits
Select commit
Hold shift + click to select a range
7a20608
missing test dependencies
nchammas 7273851
add docs requirements file
nchammas 0a9cbbc
reference new docs requirements file
nchammas 9c3a7af
pin doc gem requirements in github workflow
nchammas de9fbbd
Merge branch 'master' of https://github.com/apache/spark into SPARK-3…
nchammas 1086d81
merge docs reqs into general dev reqs
nchammas cd05e25
merged reqs
nchammas 4326933
open python reqs to latest patch version
nchammas 75682a6
Unidecode 04 -> 4
nchammas bb83262
Merge branch 'master' of https://github.com/apache/spark into SPARK-3…
nchammas 839669d
add requirements and pinned requirements
nchammas 473c85d
requirements.txt -> requirements-pinned.txt
nchammas 4b540a9
run pip-tools from within dev/
nchammas b088d15
add dev readme
nchammas 9704c60
Merge branch 'master' of https://github.com/apache/spark into SPARK-3…
nchammas 7df0040
Merge branch 'master' of https://github.com/apache/spark into SPARK-3…
nchammas bae2c0c
Merge branch 'master' of https://github.com/apache/spark into SPARK-3…
nchammas e21d97d
Merge branch 'master' of https://github.com/apache/spark into SPARK-3…
nchammas File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,7 @@ | ||
| flake8==3.5.0 | ||
| pycodestyle==2.5.0 | ||
| flake8==3.7.9 | ||
| jira==1.0.3 | ||
| PyGithub==1.26.0 | ||
| Unidecode==0.04.19 | ||
| sphinx | ||
| sphinx==2.3.1 | ||
| numpy==1.18.1 | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| sphinx==2.3.1 | ||
| mkdocs==1.0.4 | ||
| numpy==1.18.1 | ||
nchammas marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I quickly googled and skimmed other projects about the requirements here and here. I still think it's more usual to specify the range, rather than the specific version.
I am still not sure if it's right to pin the version yet. There's trade-off on pinning. I think Spark community is big enough to handle these issue from using the latest versions too. I would only pin the version when an issue difficult to fix is found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc somewhat arbitrary committers who might be interested in here: @srowen, @holdenk, @dongjoon-hyun, @BryanCutler. WDYT about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the trade-off mentioned by @HyukjinKwon . Actually, I tried to pin once before and dropped my PR due to that. +1 for @HyukjinKwon 's advice,
I would only pin the version when an issue difficult to fix is found..The only exception is
spark-rm/Dockerfile. We need to use specific version and to manager explicitly there.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon - When looking at project dependencies, there is an important distinction between projects that are used as libraries and projects that are used as stand-alone applications.
If your project is a library, then you know others are importing you alongside other dependencies too. To minimize the chance of transitive dependency conflicts, you want to be flexible in how you specify your dependencies.
When your project is a stand-alone application, you don't have to worry about such things. You can pin every dependency to a specific version to get the most predictable and reliable build and runtime behavior.
In our case, the Spark build environment is more akin to a stand-alone application than a library. We don't need to worry about downstream users struggling with dependency conflicts. We can get the most stable build behavior by pinning everything, and there is no downside as far as I can tell.
I'll use Trio as an example again to illustrate my point:
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's the trade-off here as well - it's the most stable but there might be multiple bugs existing fixed in the upstream. It will still mess developer's environment. Arguably there are not so many Python-dedicated dev people in Spark community who fluently use pyenv, virtualenv or conda. I think most of them just use pip and local installation.
I quickly skimmed requirements.txt for dev or docs at here. I skimmed top 20 projects, and found 6 instances.
I agree the dev envs more tend to specify the versions; however, I think it's still prevailing to don't pin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it help then if we expanded
dev/README.mdto show how to setup a virtual environment? I'm willing to do that.If we don't want to ask devs to use virtual environments at all, then perhaps we need to fork
dev/requirements.txtand have a version that pins everything, for use in CI and releases, and a version that pins nothing, for use by devs who don't use virtual environments.Another alternative is the compromise currently standing in this PR, with some versions specified as
major.minor.*.And yet another alternative (which I personally wouldn't favor, but I know it's common) is to Dockerize the whole development environment, but that's a lot of work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest a note in README to ensure people know how they can isolate this env if needed, and pinning minor version only? is that close enough for consensus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll do that. I'll recommend that users create a virtual environment in
dev/README.mdand demonstrate how to do that. I'll also update the version specifiers to all be in the form ofmajor.minor.*.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, on second thought, perhaps the README should be left to a separate PR. We already have advice on setting up a development environment in at least a couple of places, like Useful Developer Tools and
docs/README.md.Perhaps we should consolidate that advice over on the Useful Developer Tools page, since it fits in with the information already there.
Either that, or let's agree on some other approach to take. But I think we can defer any dev documentation changes to a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also would prefer not to pin specific versions and agree with #27928 (comment). It is good to have the community try with the latest to surface any issues, but we should be very clear what versions have been used in our CI, which could be from a virtualenv or just in a readme, so there is always an obvious fallback version.