[Storage] Fix spill/restore error when using Arrow S3FS #24196
Conversation
Nice find! I think you might be able to test this in CI with the `ray_storage_object_spilling_config` fixture. Can you try that out on one of the object spilling tests? If not, it's okay for now.
@stephanie-wang I think this is already covered by existing tests? The issue occurs quite rarely, even when using S3. I can add a test for ray_storage + S3, since it looks like that path is not tested yet.

Hmm, what do you mean that it was already covered? Do you mean we have existing CI tests that failed with the same error? We probably don't want to depend on S3 in CI tests, but I think I saw this error before using ray_storage and the local filesystem.

@stephanie-wang I see. I can add a test using the local filesystem then. I originally thought the […]

Triggering rebuild.

`test_object_spilling` seems to be failing on the Windows build. Maybe either try to fix it or disable it there?
Why are these changes needed?
This PR fixes two issues.

1. Spill/restore to S3 could fail with:

   `ValueError: Obtained data has a size of 100102955, although it is supposed to have the size of 100115855.`

   This is usually because the offset in the object URI is wrong. The fix is twofold: first, manually call `f.flush()`, because Arrow's S3FS is not guaranteed to flush the output when closing (https://github.com/apache/arrow/blob/master/python/pyarrow/io.pxi#L99); second, replace `f.tell()` with manually adding the written bytes to the offset, because `f.write()` is asynchronous in S3FS, and as such the `tell()` position has not advanced at this point in the code.

2. Unbuffered reads in `ExternalStorageRayStorageImpl` cause unnecessary extra S3 requests, costing time and money. By opening the input stream with a buffer size set (https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L618), Arrow will buffer the reads and increase I/O efficiency. The solution is not very pretty right now, because we use a private Arrow API, but this is unavoidable until Arrow provides a `seek()` method for `BufferedInputStream`. I opened an issue for that here: https://issues.apache.org/jira/projects/ARROW/issues/ARROW-16351

UPDATE: Reverted (2) for now since it was not running stably.
Related issue number
Checks
I've run `scripts/format.sh` to lint the changes in this PR.