Skip to content

Use latest pickle#5826

Merged
crusaderky merged 3 commits intodask:mainfrom
jakirkham:use_latest_pickle
Nov 5, 2022
Merged

Use latest pickle#5826
crusaderky merged 3 commits intodask:mainfrom
jakirkham:use_latest_pickle

Conversation

@jakirkham
Copy link
Member

@jakirkham jakirkham commented Feb 16, 2022

This was specified as Python 3.7 had native pickle protocol 4 support with the option for pickle protocol 5 via a backport package. If there were issues with the availability of the backport package, using the latest pickle protocol might run into issues.

However Python 3.8+ has native pickle protocol 5 support. No backport required. So the kinds of issues Python 3.7 had won't occur. Given this, just use the latest pickle protocol.


  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

@jakirkham jakirkham mentioned this pull request Feb 16, 2022
@jakirkham jakirkham force-pushed the use_latest_pickle branch 2 times, most recently from eee6afc to 9b9e85d Compare February 17, 2022 00:04
@jakirkham jakirkham marked this pull request as ready for review February 17, 2022 00:04
@jakirkham
Copy link
Member Author

cc @jrbourbeau

@github-actions
Copy link
Contributor

github-actions bot commented Feb 17, 2022

Unit Test Results

       12 files  ±0         12 suites  ±0   7h 22m 24s ⏱️ + 7m 33s
  2 621 tests ±0    2 539 ✔️ ±0    81 💤 ±0  1 ±0 
15 650 runs  ±0  14 760 ✔️ ±0  889 💤 ±0  1 ±0 

For more details on these failures, see this check.

Results for commit df66b2d. ± Comparison against base commit 94ebd57.

♻️ This comment has been updated with latest results.

@jakirkham
Copy link
Member Author

@dask/maintenance any thoughts on this? 🙂

@martindurant
Copy link
Member

Isn't the default protocol level for pickle.dumps 4 even in py3.8+?

@jakirkham
Copy link
Member Author

If we are using stdlib, yes.

In many cases though we use our own pickle.dumps, which defaults to HIGHEST_PROTOCOL.

@martindurant
Copy link
Member

OK, sorry - never mind. In that case, we might fall foul of any protocol 6 that might come along!

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might fall foul of any protocol 6 that might come along

Yeah, I wonder if we should have a compatiblity variable we use to control these protocol levels. So instead of protocol=4 we have something like protocol=distributed.compatibility._PICKLE_PROTOCOL to centralize control a bit more

@jakirkham
Copy link
Member Author

Think that is somewhat unlikely.

Pickle protocol 4 was added in Python 3.4, which came out in Mar 2014.

Pickle protocol 5 was added in Python 3.8, which came out in Oct 2019.

So it took 5 years between those protocols and involved a PEP, reaching out to the community, a backport package, NumPy integration, etc.. Think we have plenty of time to respond if we are concerned (as was the case even around pickle protocol 5). Likely we would be engaged directly since we are seen as a group that cares about performant pickle serialization.

Doubt there will be as big of a change as pickle protocol 5 (namely out-of-band communication). Though we may see coverage expanded to more builtin types.


That all being said, we do a handshake to determine what the maximum commonly supported pickle version is, which is used to set the protocol in things like pickle_dumps. Maybe there are more places we want to do this?

This was specified as Python 3.7 had native pickle protocol 4 support
with the option for pickle protocol 5 via a backport package. If there
were issues with the availability of the backport package, using the
latest pickle protocol might run into issues.

However Python 3.8+ has native pickle protocol 5 support. No backport
required. So the kinds of issues Python 3.7 had won't occur. Given this,
just use the latest pickle protocol.
@jakirkham
Copy link
Member Author

Putting this another way, this should not be an issue as long as users have the same Python version used across Workers as they will all have access to the same pickle protocols.

The only reason this was an issue before is we adopted a backport package that may or may not be installed. We are no longer in that situation and are unlikely to be again.

@github-actions
Copy link
Contributor

github-actions bot commented Nov 5, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0         15 suites  ±0   6h 19m 1s ⏱️ -4s
  3 169 tests ±0    3 084 ✔️  - 1    84 💤 +1  1 ±0 
23 448 runs  ±0  22 543 ✔️  - 3  902 💤 +1  3 +2 

For more details on these failures, see this check.

Results for commit ac44705. ± Comparison against base commit 2dab85a.

@crusaderky
Copy link
Collaborator

crusaderky commented Nov 5, 2022

Sorry for the extremely long wait. This LGTM.

That all being said, we do a handshake to determine what the maximum commonly supported pickle version is, which is used to set the protocol in things like pickle_dumps. Maybe there are more places we want to do this?

Are we even willing to support mismatched major Python versions between clients, scheduler, and workers? Everything will fall apart as soon as you hit a mismatched serializer/deserializer couple, a renamed implementation module, etc. etc.
Also, having to dynamically downgrade the pickle protocol depending on the interlocutor will cause big headaches in #5996.

@crusaderky crusaderky merged commit 821ecd8 into dask:main Nov 5, 2022
@jakirkham jakirkham deleted the use_latest_pickle branch November 6, 2022 01:48
@jakirkham
Copy link
Member Author

Thanks @crusaderky! 🙏

Sorry for the extremely long wait. This LGTM.

Not at all. Was unsure whether we were ready for this change based on the discussion so far. Happy to hear that we are.

Are we even willing to support mismatched major Python versions between clients, scheduler, and workers? Everything will fall apart as soon as you hit a mismatched serializer/deserializer couple, a renamed implementation module, etc. etc. Also, having to dynamically downgrade the pickle protocol depending on the interlocutor will cause big headaches in #5996.

Yeah think we were being more careful while still supporting Python 3.7 and optionally pickle5. Given we are passed that point (since Python 3.8 is a minimum requirement and pickle protocol 5 is builtin), agree the risk here is pretty low.

It's a good point on sendfile.

On a related point, even pickling directly to disk via protocol 5 sees some performance improvements (better memory usage) ( pandas-dev/pandas#37056 ).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

Comments