-
Notifications
You must be signed in to change notification settings - Fork 3k
Support wasb[s] paths in ADLSFileIO #11294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support wasb[s] paths in ADLSFileIO #11294
Conversation
228b731 to
3095ff6
Compare
| * <p>Locations follow the conventions used by Hadoop's Azure support, i.e. | ||
| * | ||
| * <pre>{@code abfs[s]://[<container>@]<storage account host>/<file path>}</pre> | ||
| * <pre>{@code abfs[s]://[<container>@]<storageAccount>.dfs.core.windows.net/<path>}</pre> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a change? or is Storage Account Host equivalent to storageAccount.dfs.core.windows.net
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think storageAccount is more commonly used than storageHost. (ref)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the ambiguity here so I changed the variable to use storageEndpoint to clarify. Previously we were using the storageAccount variable to store the entire endpoint's hostname and not the storage account's name
azure/src/main/java/org/apache/iceberg/azure/adlsv2/ADLSLocation.java
Outdated
Show resolved
Hide resolved
azure/src/test/java/org/apache/iceberg/azure/adlsv2/ADLSLocationTest.java
Show resolved
Hide resolved
|
@ashvina Can you please review as well? |
azure/src/main/java/org/apache/iceberg/azure/adlsv2/ADLSLocation.java
Outdated
Show resolved
Hide resolved
azure/src/main/java/org/apache/iceberg/azure/adlsv2/ADLSLocation.java
Outdated
Show resolved
Hide resolved
azure/src/main/java/org/apache/iceberg/azure/adlsv2/ADLSLocation.java
Outdated
Show resolved
Hide resolved
RussellSpitzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me although I have a few nits, I am not a ADLS expert though so I'm hoping we can get someone with more expertise in that system to give a confirmation that this is correct.
| * <p>Locations follow the conventions used by Hadoop's Azure support, i.e. | ||
| * | ||
| * <pre>{@code abfs[s]://[<container>@]<storage account host>/<file path>}</pre> | ||
| * <pre>{@code abfs[s]://[<container>@]<storageAccount>.dfs.core.windows.net/<path>}</pre> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think storageAccount is more commonly used than storageHost. (ref)
| // storage account name is the first part of the host | ||
| int accountSplit = uri.getHost().indexOf('.'); | ||
| String storageAccountName = uri.getHost().substring(0, accountSplit); | ||
| this.storageAccount = String.format("%s.dfs.core.windows.net", storageAccountName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For wasb, the original host URL is a blob.core.windows.net endpoint (see the sample above). Could you clarify if the change from blob to dfs is necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe this is required b/c it appears the blob host would be changed in the ADLS client when setting the endpoint here. I'll make an update for this b/c I think we may also want to change some of the nomenclature as well
|
@ashvina could you do another pass? |
ashvina
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
I ran basic tests on my storage account (wasb) with the changes in this PR, and they all passed successfully.
| uriPath = uriPath == null ? "" : uriPath.startsWith("/") ? uriPath.substring(1) : uriPath; | ||
| this.path = uriPath.split("\\?", -1)[0].split("#", -1)[0]; | ||
| try { | ||
| URI uri = new URI(location); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not using java.net.URI before was intentional, in the past we have avoided using java.net.URI for parsing as it can cause some unexpected results. For example, underscores in host names...
jshell> new java.net.URI("abfs://my_endpoint/path").getHost()
$1 ==> null
There are some other quirks also IIRC.
(cc @rdblue since you've pointed this out to me)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's fine, but if this is the case we need to add in some explicit tests for this scenario. @mrcnc would you like to add a follow-on pr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we will not encounter this error b/c underscores aren't valid characters for storage accounts or container names. But maybe we should add more input validation or update the regex before attempting to use use java.net.URI? I was thinking using URI would be more idiomatic but if we have reasons to avoid using it then we can revert or refactor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish we had caught this before now, but I think we need to revert the URI additions. There are a number of issues related to the Java implementation which is why we specifically avoid using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have some examples of where this doesn't work?
|
Hi @mrcnc, @RussellSpitzer . To confirm this PR solution to 10127: any existing (or new) iceberg tables stored in azure with That being said, does this mean iceberg api supports tables with possibly mixed schemes? (i.e. some paths in a table's metadata were written with 'wasbs' while others were with 'abfss')? |
This reverts commit 11a8a78.
The path for new files in existing tables is determined by the table's base location or LocationProvider so tables that have paths using the |
…" (apache#11344) This reverts commit 11a8a78.
Resolves #10127 by delegating wasb[s] paths to the ADLSFileIO via ResolvingFileIO.
Additionally, I refactored the parsing logic to use java.net.URI because I believe it's clearer. This also led me to change the invalid URI exception to
IllegalArgumentExceptioninstead of aValidationExceptionb/c this seems more appropriate.