Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request]: Decrease the number of GET requests GCSIO makes #28398

Closed
2 of 15 tasks
BjornPrime opened this issue Sep 11, 2023 · 5 comments
Closed
2 of 15 tasks

[Feature Request]: Decrease the number of GET requests GCSIO makes #28398

BjornPrime opened this issue Sep 11, 2023 · 5 comments

Comments

@BjornPrime
Copy link
Contributor

What would you like to happen?

The new implementation of GCSIO (see PR #28079, if not already merged), contains several uses of GET requests to GCS when it is possible that simply instantiating the objects of interest would provide better performance, since GCS has some capability of figuring things out once a potent request (be it copy, delete, etc.) is made, based on bucket and blob names.

See here for an example of how this can be done.

Using this pattern was attempted during the initial migration to the GCS client, but floundered on concerns that confusing errors may be thrown in some instances. In particular, attempting to access a non-existent object through an operation involving a instantiated version sometimes threw permissions errors instead of 404s. To resolve this issue investigate this design pattern more thoroughly and see if the errors issues are real and can't be overcome. If they can be, swap out any unnecessary GETs for instantiations to improve performance. If they can't, connect with the GCS client team to share these concerns and see if they have any further info or workarounds.

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@BjornPrime
Copy link
Contributor Author

.take-issue

@lgeiger
Copy link

lgeiger commented Feb 2, 2024

@BjornPrime It would be great if the the amount of GET requests in GCSIO could be reduced.

In particular the code seems to often call client.get_bucket() or client.lookup_bucket() which will do an unnecessary GET request compared to just using client.bucket() or storage.Blob.from_string() where possible. Often this doubles the amount of requests needed to e.g. read a blob which is significantly slower.

Furthermore get_bucket and lookup_bucket require storage.buckets.get permissions. These are not part of the "Storage Object Viewer" role but only available in the "Storage Object Admin" role which also grants write permission. This means people running in environments that only allow read access to certain buckets will not be able use the new GCSIO implementation to read blobs. This currently broke one of our pipelines when trying to upgrade from 2.52 to 2.53.

As far as I can tell switching from get_bucket to bucket is also recommended in the docs, so not sure where the permission error that you mention is coming from.

Out of interest, are there any advantages of using the beams builtin GCSIO over a generic implementation like fsspec in pipeline?

@shunping
Copy link
Contributor

shunping commented Feb 3, 2024

.take-issue

@shunping
Copy link
Contributor

shunping commented Feb 3, 2024

@lgeiger, do you mind taking a look at #30205 and see if it satisfies your request?

@lgeiger
Copy link

lgeiger commented Feb 4, 2024

@lgeiger, do you mind taking a look at #30205 and see if it satisfies your request?

Looks great, thanks for the fast fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants