Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Storage] Fix spill/restore error when using Arrow S3FS #24196

Merged
merged 6 commits into from
May 5, 2022

Conversation

frank-lsf
Copy link
Contributor

@frank-lsf frank-lsf commented Apr 26, 2022

Why are these changes needed?

This PR fixes two issues.

  1. Occasionally when restoring objects from S3 we get ValueError: Obtained data has a size of 100102955, although it is supposed to have the size of 100115855. This is usually because the offset in the object URI is wrong.

The fix is twofold: 1. manually call f.flush() because Arrow's S3FS is not guaranteed to flush the output when closing (https://github.com/apache/arrow/blob/master/python/pyarrow/io.pxi#L99), and 2. replace f.tell() with manually adding the written bytes to the offset, because f.write() is asynchronous in S3FS and as such the tell() position is not advanced at this point in the code.

  1. When restoring objects from S3, the restore worker makes a lot of small reads which are currently not buffered in ExternalStorageRayStorageImpl. This causes unnecessary extra S3 requests, costing time and money.

By opening the input stream with a buffer size set (https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L618), Arrow will buffer the reads and increase I/O efficiency.

The solution is not very pretty right now because we use a private Arrow API. But this is not avoidable until Arrow provides a seek() method for BufferedInputStream. I opened an issue here https://issues.apache.org/jira/projects/ARROW/issues/ARROW-16351.

UPDATE: Reverted (2) for now since it is not running stably.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@frank-lsf frank-lsf changed the title Buffered input for S3 for Ray Storage [Storage] Buffered input for S3 for Ray Storage Apr 26, 2022
@frank-lsf frank-lsf marked this pull request as draft April 26, 2022 00:59
@frank-lsf frank-lsf changed the title [Storage] Buffered input for S3 for Ray Storage [Storage] Fix spill/restore error when using Arrow S3FS Apr 27, 2022
@frank-lsf frank-lsf marked this pull request as ready for review April 27, 2022 05:10
@stephanie-wang stephanie-wang self-assigned this Apr 27, 2022
Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find! I think you might be able to test this in CI with the ray_storage_object_spilling_config fixture. Can you try that out on one of the object spilling tests? If not, it's okay for now.

@frank-lsf
Copy link
Contributor Author

@stephanie-wang I think this is already covered by existing tests? The issue occurs quite rarely even when using S3. I can add a test for ray_storage + S3 since it looks like that path is not tested yet.

@stephanie-wang
Copy link
Contributor

@stephanie-wang I think this is already covered by existing tests? The issue occurs quite rarely even when using S3. I can add a test for ray_storage + S3 since it looks like that path is not tested yet.

Hmm what do you mean that it was already covered? Do you mean we have existing CI tests that failed with the same error?

Probably we don't want to depend on S3 in CI tests, but I think I saw this error before using ray_storage and the local filesystem.

@frank-lsf
Copy link
Contributor Author

@stephanie-wang I see. I can add a test using the local filesystem then. I originally thought the tell() and flush() issues only happen with Arrow S3FS since it is asynchronous; but looks like Arrow's local filesystem might see this issue too.

@ericl
Copy link
Contributor

ericl commented Apr 28, 2022

Triggering rebuild.

@ericl
Copy link
Contributor

ericl commented Apr 30, 2022

Test_object_spilling seems to be failing on the windows build. Maybe either try to fix or disable it there?

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 30, 2022
@stephanie-wang stephanie-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 4, 2022
@ericl ericl merged commit af1684a into ray-project:master May 5, 2022
@frank-lsf frank-lsf deleted the arrow-s3-fix branch May 5, 2022 02:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants