4 changes: 3 additions & 1 deletion .github/workflows/build_and_test.yml
@@ -250,10 +250,12 @@ jobs:
       with:
         python-version: '3.9'
         architecture: x64
+        cache: 'pip'
+        cache-dependency-path: 'dev/py-tests/requirements-sql.txt'
     - name: Install Python packages (Python 3.9)
       if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-')) || contains(matrix.modules, 'connect')
       run: |
-        python3.9 -m pip install 'numpy>=1.20.0' pyarrow pandas scipy unittest-xml-reporting 'lxml==4.9.4' 'grpcio==1.59.3' 'grpcio-status==1.59.3' 'protobuf==4.25.1'
+        python3.9 -m pip install -r dev/py-tests/requirements-sql.txt
         python3.9 -m pip list
     # Run the tests.
     - name: Run tests
11 changes: 11 additions & 0 deletions dev/py-tests/requirements-sql.txt
@@ -0,0 +1,11 @@
# PySpark dependencies for SQL tests

numpy==1.26.2
Contributor Author

> The requirements file format allows for specifying dependency versions using logical operators (for example chardet>=3.0.4) or specifying dependencies without any versions. In this case the pip install -r requirements.txt command will always try to install the latest available package version. To be sure that the cache will be used, please stick to a specific dependency version and update it manually if necessary.

specify versions according to the suggestion in https://github.com/actions/setup-python?tab=readme-ov-file#caching-packages-dependencies

actually, I think maybe we should always specify the versions
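To illustrate the contrast (a small sketch only; chardet is the example package from the quoted docs, not a dependency proposed for this file):

# Flexible specifier: pip resolves the newest matching release at install time,
# so a new upstream release can change what gets installed even though this
# file (and the cache key hashed from it) has not changed.
# chardet>=3.0.4

# Pinned specifier: pip always installs the same version, so the cached
# packages stay in sync with what CI installs until this file is deliberately
# updated.
chardet==3.0.4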

Contributor

Related prior discussion on pinning development dependencies: #27928 (review)

Contributor

> actually, I think maybe we should always specify the versions

I agree with this, and this is something I tried to do in the PR I linked to just above, but several committers were against it.

When I look at the number of PRs related to pinning dev dependencies over the past three years, I wonder if committers still feel the same way today.

Not pinning development dependencies creates constant breakages that can pop up whenever an upstream library releases a new version. When we pin dependencies, by contrast, we choose when to upgrade and deal with the potential breakage.

Member

Yeah, I don't have a great solution for this. Maybe we could have CI-dedicated dependency files, since we now have too many dependencies in CI across too many matrix combinations, but I'm not sure. At least, I am not super against this idea now.

Contributor

Would you be open to my making another attempt at the approach in #27928? (@zhengruifeng can also take this on if they prefer, of course.)

Basically, we have two sets of development dependencies:

  • requirements.txt: direct dependencies only, kept as flexible as possible; this is what devs install on their laptops
  • requirements-pinned.txt: pinned dependencies derived automatically from requirements.txt using pip-tools; this is used for CI

I know this adds a new tool that non-Python developers may not be familiar with (pip-tools), but it's extremely easy to use, has been around a long time, and is in use by many large projects, the most notable of which is Warehouse, the project that backs PyPI.
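A rough sketch of how that flow could look with pip-tools (file names as proposed above; the exact paths and flags would be part of the proposal, not settled here):

# One-time setup for whoever regenerates the pinned file.
python3 -m pip install pip-tools

# Resolve the flexible, direct-dependency list into a fully pinned file
# (including transitive dependencies).
pip-compile requirements.txt --output-file requirements-pinned.txt

# Developers keep installing the flexible set locally:
python3 -m pip install -r requirements.txt

# CI installs the pinned set, so its environment only changes when the
# pinned file is regenerated:
python3 -m pip install -r requirements-pinned.txt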

Contributor Author

@nchammas I just noticed the previous discussion in #27928.

I personally prefer using requirements.txt files with pinned versions. One reason is that the dependencies are cached in the Docker image, and I was confused about the versions actually used in CI from time to time. For example, we previously used the cached RUN python3.9 -m pip install numpy pyarrow ... layer, and when PyArrow 13 was released on 2023-08-23, I didn't realize it had broken PySpark until the cached image was refreshed (on 2023-09-13).

But I don't feel very strongly about it and defer to @HyukjinKwon and @dongjoon-hyun on this.

pyarrow==14.0.2
pandas==2.1.4
scipy==1.11.4
unittest-xml-reporting==3.2.0
lxml==4.9.4
grpcio==1.59.3
grpcio-status==1.59.3
protobuf==4.25.1