Skip to content

Conversation

@RussellSpitzer
Copy link
Member

There is a worry that certain Azure file location will not be parsed correctly by the Java URI methods so we are reverting this PR pending more test cases. WASB Support can occur with a different implementation while we wait

Reverts #11294

@danielcweeks
Copy link
Contributor

I'm not actually opposed to the WASB path support, just concerned about the introduction of the URI class for parsing locations. Is it possible to just revert to the old parsing or is it too closely tied to the URI handling?

@RussellSpitzer
Copy link
Member Author

@mrcnc Can chime in on that, but I think that's fine. Can we have some examples though to guard against these changes in the future? Strings that won't parse correctly?

@mrcnc
Copy link
Contributor

mrcnc commented Oct 17, 2024

I'll follow up with another PR that doesn't use java.net.URI for parsing. I'm happy to have more 👀 on this

@danielcweeks
Copy link
Contributor

danielcweeks commented Oct 17, 2024

Can we have some examples though to guard against these changes in the future? Strings that won't parse correctly?

There are a lot of subtle issues like hashCode and equality being inconsistent for escaped characters, handling of casing in escaped characters, it can represent things that are technically not URIs and then you get inconsistent behaviors for the raw values. One of the main ones is hostname handling, which is a problem for GCS (not sure if Azure is affected as well) because many systems allow _ in the hostname (bucket/container) but then it cannot be parsed accurately (host is included as part of path).

I don't have a comprehensive list of the issues, but have run into them enough be vary wary of relying on the URI implementation, which is why we specifically avoid its usage as it leads to unsafe/incompatible edge cases. You might look at the Trino's TestAzureLocations as they have a similar approach to handling URIs. I know their S3 tests have one or two examples that don't parse with Java's URI class.

@nastra nastra merged commit fd06438 into main Oct 18, 2024
@nastra nastra deleted the revert-11294-support-wasb-for-adlsfileio branch October 18, 2024 09:31
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants