Skip to content

Conversation

@mrcnc
Copy link
Contributor

@mrcnc mrcnc commented Oct 25, 2024

After reviewing the concerns raised in #11344 about using java.net.URI for parsing in ADLSLocation, I contrived an example of a location that does not parse correctly. It also fails in the current implementation, so this PR adds a test and fix for the parsing code. Additionally it removes test cases that are invalid, since they don't test valid ABFS syntax

Motivation

The main reason to avoid using java.net.URI is that it parses according to RFC 2396 but object storage providers do not strictly follow this specification. Specifically, in standard URI syntax, the question mark ? separates the path component from the query component. However, Azure Blob Storage allows question marks in blob/file names, making these names incompatible with the RFC 2396 URI specification.

Another important point is that Azure Storage APIs are accessed via HTTP APIs, so the abfs and wasb location syntax serve as identifiers to blobs accessed through HTTP URLs. This is the motivation behind removing the tests that included query and fragment components, since they would only be used in the HTTP URLs and not in the ABFS URI-like syntax.

@github-actions github-actions bot added the AZURE label Oct 25, 2024
@mrcnc mrcnc marked this pull request as ready for review October 25, 2024 14:46
@RussellSpitzer
Copy link
Member

LGTM. @danielcweeks This adds in that test I was looking for where URI would fail, although looks like we have a bug in the current implementation anyway.

@danielcweeks
Copy link
Contributor

Thanks @mrcnc , though overall it's really unfortunate that we have notably different behavior between S3 and ADLS in the URI handling. S3 allows for query params (though they're not considered part of the key) them while ADLS appears to have a non-standard handling.

The one think I'm not clear about is the linked documentation doesn't actually go into what the valid path characters are. Is that documented somewhere that we can reference? It would be great to include that in the javadoc for future reference.

Copy link
Member

@jbonofre jbonofre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to me avoiding using URI/split there.

@bryanck
Copy link
Contributor

bryanck commented Oct 28, 2024

I originally stripped off the query params to be on the safe side and to be consistent w/ S3FileIO, but given query params aren't specified in the URI format this change looks OK to me, +1 to Dan's comment about adding a Javadoc w/ the URI spec reference.

@mrcnc mrcnc force-pushed the fix-adls-location branch from dab1b25 to b39772b Compare October 29, 2024 01:29
@RussellSpitzer
Copy link
Member

@danielcweeks Could you check to see if the Javadoc change is what you were looking for?

@danielcweeks danielcweeks merged commit 9be7f00 into apache:main Nov 5, 2024
bryanck pushed a commit that referenced this pull request Nov 19, 2024
* Azure: Fix ADLSLocation file parsing

* Azure: Remove invalid test cases from ADLSLocationTest

* Update Javadocs with reference to ADLS URI
@bryanck bryanck added this to the Iceberg 1.7.1 milestone Nov 19, 2024
@bryanck bryanck mentioned this pull request Nov 19, 2024
bryanck pushed a commit that referenced this pull request Nov 20, 2024
* Azure: Fix ADLSLocation file parsing

* Azure: Remove invalid test cases from ADLSLocationTest

* Update Javadocs with reference to ADLS URI
bryanck pushed a commit that referenced this pull request Nov 20, 2024
* Azure: Fix ADLSLocation file parsing

* Azure: Remove invalid test cases from ADLSLocationTest

* Update Javadocs with reference to ADLS URI
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
* Azure: Fix ADLSLocation file parsing

* Azure: Remove invalid test cases from ADLSLocationTest

* Update Javadocs with reference to ADLS URI
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants