6 changes: 3 additions & 3 deletions .github/workflows/master.yml
```diff
@@ -88,7 +88,7 @@ jobs:
       run: ./dev/lint-java
     - name: Python
       run: |
-        pip install flake8 sphinx numpy
+        pip install -r ./dev/requirements.txt
         ./dev/lint-python
     - name: License
       run: ./dev/check-license
@@ -147,8 +147,8 @@ jobs:
         sudo apt-get install -y r-base r-base-dev libcurl4-openssl-dev pandoc
     - name: Install packages
       run: |
-        pip install sphinx mkdocs numpy
-        gem install jekyll jekyll-redirect-from rouge
+        pip install -r ./docs/requirements.txt
+        gem install jekyll:4.0.0 jekyll-redirect-from:0.16.0 rouge:3.15.0
         sudo Rscript -e "install.packages(c('curl', 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 'e1071', 'survival'), repos='https://cloud.r-project.org/')"
     - name: Run jekyll build
       run: |
```
3 changes: 3 additions & 0 deletions dev/create-release/do-release-docker.sh
```diff
@@ -28,6 +28,7 @@
 
 set -e
 SELF=$(cd $(dirname $0) && pwd)
+SPARK_ROOT="$SELF/../.."
 . "$SELF/release-util.sh"
 
 function usage {
@@ -91,6 +92,8 @@ for f in "$SELF"/*; do
   fi
 done
 
+cp "$SPARK_ROOT/docs/requirements.txt" "$WORKDIR/docs-requirements.txt"
+
 GPG_KEY_FILE="$WORKDIR/gpg.key"
 fcreate_secure "$GPG_KEY_FILE"
 $GPG --export-secret-key --armor "$GPG_KEY" > "$GPG_KEY_FILE"
```
4 changes: 2 additions & 2 deletions dev/create-release/spark-rm/Dockerfile
```diff
@@ -33,7 +33,7 @@ ENV DEBCONF_NONINTERACTIVE_SEEN true
 # These arguments are just for reuse and not really meant to be customized.
 ARG APT_INSTALL="apt-get install --no-install-recommends -y"
 
-ARG PIP_PKGS="sphinx==2.3.1 mkdocs==1.0.4 numpy==1.18.1"
+COPY ./docs-requirements.txt /docs-requirements.txt
 ARG GEM_PKGS="jekyll:4.0.0 jekyll-redirect-from:0.16.0 rouge:3.15.0"
 
 # Install extra needed repos and refresh.
@@ -61,7 +61,7 @@ RUN pyenv global 3.7.6
 RUN python --version
 RUN pip install --upgrade pip
 RUN pip --version
-RUN pip install $PIP_PKGS
+RUN pip install -r docs-requirements.txt
 
 ENV PATH "$PATH:/root/.rbenv/bin:/root/.rbenv/shims"
 RUN curl -fsSL https://github.com/rbenv/rbenv-installer/raw/108c12307621a0aa06f19799641848dde1987deb/bin/rbenv-installer | bash
```
6 changes: 4 additions & 2 deletions dev/requirements.txt
```diff
@@ -1,5 +1,7 @@
-flake8==3.5.0
+pycodestyle==2.5.0
+flake8==3.7.9
 jira==1.0.3
 PyGithub==1.26.0
 Unidecode==0.04.19
-sphinx
+sphinx==2.3.1
+numpy==1.18.1
```

Member:

I quickly googled and skimmed how other projects handle their requirements, here and here. I still think it's more usual to specify a range rather than a specific version.

I am still not sure it's right to pin the versions yet. There's a trade-off to pinning. I think the Spark community is big enough to handle issues arising from using the latest versions too. I would only pin a version when an issue that is difficult to fix is found.

Member:

cc'ing some (somewhat arbitrarily chosen) committers who might be interested here: @srowen, @holdenk, @dongjoon-hyun, @BryanCutler. WDYT about this?

Member:

I understand the trade-off mentioned by @HyukjinKwon. Actually, I tried to pin versions once before and dropped my PR because of that. +1 for @HyukjinKwon's advice: I would only pin a version when an issue that is difficult to fix is found.

The only exception is spark-rm/Dockerfile. We need to use specific versions and manage them explicitly there.

Contributor Author:

@HyukjinKwon - When looking at project dependencies, there is an important distinction between projects that are used as libraries and projects that are used as stand-alone applications.

If your project is a library, then you know others are importing you alongside other dependencies too. To minimize the chance of transitive dependency conflicts, you want to be flexible in how you specify your dependencies.

When your project is a stand-alone application, you don't have to worry about such things. You can pin every dependency to a specific version to get the most predictable and reliable build and runtime behavior.

In our case, the Spark build environment is more akin to a stand-alone application than a library. We don't need to worry about downstream users struggling with dependency conflicts. We can get the most stable build behavior by pinning everything, and there is no downside as far as I can tell.
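
To make this concrete, here is a hypothetical sketch (the package name and version ranges are illustrative only, not taken from this PR):

```sh
# A library declares flexible specifiers so it can coexist with
# other packages' constraints during dependency resolution:
pip install 'sphinx>=2.0,<3.0'

# A stand-alone application pins exact versions for a reproducible,
# predictable environment:
pip install 'sphinx==2.3.1'
```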

I'll use Trio as an example again to illustrate my point:

  • Trio is a library that others will typically import alongside many other dependencies. So in Trio's setup.py they are very flexible in how they specify their dependencies.
  • Trio's test environment, on the other hand, is only used by Trio contributors. So Trio locks down every test requirement using pip-tools.
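
For reference, a minimal pip-tools workflow looks roughly like this (assuming a hypothetical requirements.in that lists only the loose top-level dependencies):

```sh
# Compile the loose top-level requirements into a fully pinned
# requirements.txt, including all transitive dependencies.
pip install pip-tools
pip-compile requirements.in --output-file requirements.txt

# Make the current environment exactly match the pinned file.
pip-sync requirements.txt
```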

@HyukjinKwon (Member), Mar 19, 2020:

There's a trade-off here as well: it's the most stable option, but we could be stuck with bugs that have already been fixed upstream. It can still mess up developers' environments. Arguably there are not many Python-dedicated developers in the Spark community who fluently use pyenv, virtualenv, or conda; I think most of them just use pip with a local installation.

I quickly skimmed the requirements.txt files for dev or docs here. I skimmed the top 20 projects and found 6 instances:

  • 3 of them were not pinned at all.
  • 2 of them were partially pinned.
  • 1 of them was completely pinned.

I agree that dev environments tend more toward specifying versions; however, I think not pinning is still the prevailing practice.

Contributor Author:

Would it help, then, if we expanded dev/README.md to show how to set up a virtual environment? I'm willing to do that.

If we don't want to ask devs to use virtual environments at all, then perhaps we need to fork dev/requirements.txt and have a version that pins everything, for use in CI and releases, and a version that pins nothing, for use by devs who don't use virtual environments.

Another alternative is the compromise currently standing in this PR, with some versions specified as major.minor.*.
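
For illustration, such compromise specifiers would look something like this (the versions mirror the ones in this PR but are only an example):

```sh
# major.minor.* pins the release series while still allowing patch updates.
pip install 'sphinx==2.3.*' 'mkdocs==1.0.*' 'numpy==1.18.*'
```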

And yet another alternative (which I personally wouldn't favor, but I know it's common) is to Dockerize the whole development environment, but that's a lot of work.

Member:

I would suggest a note in the README to ensure people know how they can isolate this environment if needed, plus pinning the minor version only. Is that close enough for consensus?

Contributor Author:

OK, I'll do that. In dev/README.md, I'll recommend that users create a virtual environment and show how to do so. I'll also update the version specifiers to all be of the form major.minor.*.
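
A minimal sketch of what that README recommendation might look like (the .venv directory name is arbitrary):

```sh
# Create and activate an isolated environment at the repo root,
# then install the pinned dev dependencies into it.
python3 -m venv .venv
source .venv/bin/activate
pip install -r dev/requirements.txt
```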

@nchammas (Contributor Author), Mar 20, 2020:

Hmm, on second thought, perhaps the README should be left to a separate PR. We already have advice on setting up a development environment in at least a couple of places, like Useful Developer Tools and docs/README.md.

Perhaps we should consolidate that advice over on the Useful Developer Tools page, since it fits in with the information already there.

Either that, or let's agree on some other approach to take. But I think we can defer any dev documentation changes to a follow-up PR.

Member:

I would also prefer not to pin specific versions, and I agree with #27928 (comment). It is good to have the community try the latest versions to surface any issues, but we should be very clear about which versions are used in our CI, whether via a virtualenv or just a note in a readme, so there is always an obvious fallback version.

2 changes: 1 addition & 1 deletion docs/README.md
````diff
@@ -88,7 +88,7 @@ Note: Other versions of roxygen2 might work in SparkR documentation generation b
 To generate API docs for any language, you'll need to install these libraries:
 
 ```sh
-pip install sphinx==2.3.1 mkdocs==1.0.4 numpy==1.18.1
+pip install -r ./docs/requirements.txt
 ```
 
 ## Generating the Documentation HTML
````
3 changes: 3 additions & 0 deletions docs/requirements.txt
```diff
@@ -0,0 +1,3 @@
+sphinx==2.3.1
+mkdocs==1.0.4
+numpy==1.18.1
```