Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet/async: Default to suffix requests on supporting readers/object stores #6157

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

H-Plus-Time
Copy link
Contributor

Which issue does this PR close?

Closes #5979 .

Rationale for this change

Described in #5979, but more or less the point is to make parquet::async::AsyncFileReader agnostic toward the size of the file (when suffix requests are supported, notably excluding Azure blob storage).

What changes are included in this PR?

  • get_bytes range parameter type change -> GetRange
  • MetadataLoader::load file_size parameter removed (file_size aware load stubbed out as load_absolute, pending a better name)
  • fetch_parquet_metadata file_size parameter shifted to an Option
  • ParquetObjectReader ObjectMeta swapped for object_store::Path (likely ObjectMeta will be re-added as an optional property provided via the builder interface)

Are there any user-facing changes?

More or less all of the above.

…requests, swap get_bytes to take a GetRange (equivalent to object-store's)
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 30, 2024
@H-Plus-Time H-Plus-Time changed the base branch from master to 53.0.0-dev July 30, 2024 09:40
@H-Plus-Time H-Plus-Time changed the base branch from 53.0.0-dev to master July 30, 2024 09:41
@H-Plus-Time
Copy link
Contributor Author

H-Plus-Time commented Jul 30, 2024

Aspects that I suspect are just bad:

  • GetRange - not exactly a fitting name 🤷
  • get_byte_ranges - no longer Vec wrt get_bytes. Intended to avoid dealing with throwing out GetRange::Suffix before passing into ObjectStore::get_ranges / reimplementing that.

Aspects that need consensus:

  • which of the two options (suffix, or non-suffix) gets the load method name (i.e. which is the default), and what the non-default method name should be.
  • How explicit should switching on (or if it's the default, off) suffix requests be?

Copy link
Contributor

@kylebarron kylebarron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting this @H-Plus-Time!

}
};
// TODO: figure out a better alternative for Offset ranges
let mut buffer = Vec::with_capacity(to_read.unwrap_or(1_024usize));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut buffer = Vec::with_capacity(to_read.unwrap_or(1_024usize));
let mut buffer = Vec::with_capacity(to_read.unwrap_or(1024));

The usize type here can probably be inferred?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it can, oops

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that's worth checking - is there ever a situation where it's sensible/vaguely useful for a parquet reader to do an unbounded read (the GetRange::Offset variant)?

I would think you'd always have a known upper bound, since knowing where to start a read (e.g. read all row groups starting from row group N) implies you have the FileMetadata already.

@tustvold tustvold added the api-change Changes to the arrow API label Jul 31, 2024
@H-Plus-Time H-Plus-Time force-pushed the parquet-metadata-suffix-requests branch 2 times, most recently from 9e573ab to 059b4e4 Compare August 1, 2024 10:46
@H-Plus-Time H-Plus-Time force-pushed the parquet-metadata-suffix-requests branch from 059b4e4 to e9c2800 Compare August 1, 2024 10:54
@kylebarron
Copy link
Contributor

@H-Plus-Time do you need any help with this? Or is this at a state where it needs API decisions from maintainers?

@H-Plus-Time
Copy link
Contributor Author

@H-Plus-Time do you need any help with this? Or is this at a state where it needs API decisions from maintainers?

Yep, the main one I suppose is whether Offset ranges are ever used in parquet readers - if they're not, there's a few variants and optionals we can throw out.

Other than that, naming and the get_bytes vs get_byte_ranges deviations.

@kylebarron
Copy link
Contributor

Maybe it's worth taking this out of draft mode then?

@H-Plus-Time H-Plus-Time marked this pull request as ready for review August 21, 2024 03:58
@@ -52,7 +51,44 @@ impl<F: MetadataFetch> MetadataLoader<F> {
/// Create a new [`MetadataLoader`] by reading the footer information
///
/// See [`fetch_parquet_metadata`] for the meaning of the individual parameters
pub async fn load(mut fetch: F, file_size: usize, prefetch: Option<usize>) -> Result<Self> {
pub async fn load(mut fetch: F, prefetch: Option<usize>) -> Result<Self> {
let suffix = fetch.fetch(GetRange::Suffix(prefetch.unwrap_or(8))).await?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks a lot for this effort. However, using suffix fetch by default may break users on storage services that don't support this, like azblob. Have you considered not changing load, but adding a new API called load_without_size, load_with_suffix_fetch, or something similar?

Copy link
Contributor

@kylebarron kylebarron Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described here, I believe the default implementation for azure is to make two requests: one for the the length and another for the suffix data. But then this is much more performant on all other platforms

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described here, I believe the default implementation for azure is to make two requests: one for the the length and another for the suffix data. But then this is much more performant on all other platforms

Hi, I disagree with this because, in most cases, we already have the file size from ListObjects or other metadata services. So, this change doesn't perform better on other platforms and has a negative effect on azblob.

Would you reconsider the choice by adding a new function instead of changing the existing one? This would also allow us to include it in the next minor version, avoiding a breaking change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a preference myself whether the default load uses a suffix request or not. As @H-Plus-Time noted above, we're looking for consensus on this.

Aspects that need consensus:

  • which of the two options (suffix, or non-suffix) gets the load method name (i.e. which is the default), and what the non-default method name should be.

Separately,

in most cases, we already have the file size from ListObjects or other metadata services

this seems to depend heavily on your use case. In my case I rarely have this information already.

include it in the next minor version, avoiding a breaking change

This is moot anyways, because the next release is breaking, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moot anyways, because the next release is breaking, right?

Thanks for pointing out this, let's ignore this part.

@tustvold
Copy link
Contributor

tustvold commented Aug 24, 2024

I'm pretty apprehensive about changing the default behaviour to use suffix requests given there are stores that are known to not support them.

I wonder what you think of instead performing the initial suffix read of the prefetch size manually, and then constructing an AsyncFileReader with these bytes cached? This initial fetch request would tell you the total size of the object and so no changes would be necessary to traits, allowing us to keep the current simple interface and avoid potential compatibility footguns, but also allow you to avoid making additional HEAD requests for your use case?

@H-Plus-Time
Copy link
Contributor Author

H-Plus-Time commented Aug 26, 2024

I'm pretty apprehensive about changing the default behaviour to use suffix requests given there are stores that are known to not support them.

I wonder what you think of instead performing the initial suffix read of the prefetch size manually, and then constructing an AsyncFileReader with these bytes cached? This initial fetch request would tell you the total size of the object and so no changes would be necessary to traits, allowing us to keep the current simple interface and avoid potential compatibility footguns, but also allow you to avoid making additional HEAD requests for your use case?

I agree with defaulting to non-suffix requests (consensus seems to be reached on that).

The manual approach isn't an option in my main use-case (browser side fetch as the underlying IO method) - usually resources are cross-origin, and Content-Range is not CORS-safelisted (the target origin must include it in Access-Control-Expose-Headers), so you can't determine the object's size from a ranged GET request alone without considerable intervention.

Shifting the suffix request path to a secondary load_without_size, and probably re-emphasising the file_size parameter (that is, make it an explicit choice on the users' part to not give the reader that information), I think that would be the least disruptive change.

Perhaps the following would be sufficient:

/// store.rs
impl ParquetObjectReader {
  pub fn new(store: Arc<dyn ObjectStore>, meta: ObjectMeta) -> Self
  pub fn new_without_size(store: Arc<dyn ObjectStore>, location: Path) -> Self
}

/// metadata.rs

impl<F: MetadataFetch> MetadataLoader<F> {
  pub async fn load(mut fetch: F, file_size: usize, prefetch: Option<usize>) -> Result<Self>
  pub async fn load_without_size(mut fetch: F, prefetch: Option<usize>) -> Result<Self>
}

// fetch_parquet_metadata elided for brevity - file_size: usize -> Option<usize>

With the above, the remaining footguns (calling fetch_parquet_metadata with file_size=None, calling load_without_size, or supplying a suffix range directly to get_bytes) would require quite a lot of intent.

@tustvold
Copy link
Contributor

Can you not add the header to https://docs.aws.amazon.com/AmazonS3/latest/userguide/ManageCorsUsing.html#cors-expose-headers. Given you have to configure CORS for anything to work, it seems requiring an additional header isn't that burdensome?

@H-Plus-Time
Copy link
Contributor Author

These are typically origins not controlled by the requester, configured with boilerplate CORS - unlikely to ever change.

@tustvold
Copy link
Contributor

So long as we don't break existing workloads, I don't see an issue. However, I will observe that the code commonality between a suffix and range based MetadataLoader will likely be relatively small... Rather than adding code to support what is a fairly niche use-case, it might be worth this living in a repository dedicated to WASM support? Given the first-party S3 implementation doesn't support WASM32, let alone CORS-filtered responses that are missing expected headers, I suspect you must be rolling your own anyway?

@kylebarron
Copy link
Contributor

kylebarron commented Aug 26, 2024

I don't quite understand why non-WASM APIs wouldn't want first-class suffix-based loading support as well? I'm building APIs for both WASM/JS and Python, and in Python it seems a suffix-based loading approach would always be preferable when the user passes in a string URL or object store path (whenever the API doesn't already have the content length).

  • For all object stores that support suffix requests, we save one request by getting the content length and metadata length at the same time.
  • In the Azure case that doesn't support suffix requests, the ObjectStore fallback does a head then range, which is no worse than now whenever the ObjectMeta is not already known.

@tustvold
Copy link
Contributor

tustvold commented Aug 26, 2024

I don't quite understand why non-WASM APIs wouldn't want first-class suffix-based loading support as well

This is already supported as I articulated above - #6157 (comment), WASM is a particular unique case where CORS enforcement strips the Content-Range header from the response

@H-Plus-Time
Copy link
Contributor Author

Downstream consumers in Rust or Rust+PyO3 using the cache-populating approach you (@tustvold) suggested would still need to roll their own cache, right?

As for the wasm32 usecase - yes, the object_store implementation(/s) are 3rd party, but not the parquet machinery. The volume of code required to do what this PR proposes externally is significant - a full clone of ParquetObjectReader, more or less full reimplementation of the MetadataLoader (due to a combination of remainder being (correctly) private, and the bumpalloc-like nature of wasm's linear memory).

@tustvold
Copy link
Contributor

tustvold commented Aug 27, 2024

Right, I'll try to find some time to work out a better way to proceed here over the next few days. It seems unfortunate to add suffix fetches to AsynFileReader when they're only applicable to metadata fetching, which is already a separate method

Edit: see also #6002 for some other discussion on parquet metadata

@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read Parquet metadata via suffix requests
5 participants