Contributor

@mrcnc mrcnc commented Oct 10, 2024

Resolves #10127 by delegating wasb[s] paths to the ADLSFileIO via ResolvingFileIO.

Additionally, I refactored the parsing logic to use java.net.URI because I believe it's clearer. This also led me to change the invalid-URI exception from ValidationException to IllegalArgumentException, because that seems more appropriate.

@mrcnc mrcnc marked this pull request as draft October 10, 2024 12:47
@mrcnc mrcnc force-pushed the support-wasb-for-adlsfileio branch from 228b731 to 3095ff6 Compare October 10, 2024 12:53
@mrcnc mrcnc marked this pull request as ready for review October 10, 2024 14:25
* <p>Locations follow the conventions used by Hadoop's Azure support, i.e.
*
* <pre>{@code abfs[s]://[<container>@]<storage account host>/<file path>}</pre>
* <pre>{@code abfs[s]://[<container>@]<storageAccount>.dfs.core.windows.net/<path>}</pre>
Member

Is this a change, or is "storage account host" equivalent to storageAccount.dfs.core.windows.net?

I think storageAccount is more commonly used than storageHost. (ref)

Contributor Author

I see the ambiguity here, so I changed the variable to storageEndpoint to clarify. Previously we were using the storageAccount variable to store the endpoint's entire hostname, not the storage account's name.
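To make the distinction in this thread concrete, here is a minimal sketch of the difference between the endpoint hostname and the storage account name. The class and method names are hypothetical and are not taken from the PR; the host value is a made-up example.

```java
// Illustrative only: AzureHostParts and storageAccountName are hypothetical
// names, not part of Iceberg's API.
final class AzureHostParts {
  private AzureHostParts() {}

  /** Returns the account name, i.e. the part of the host before the first dot. */
  static String storageAccountName(String endpointHost) {
    int dot = endpointHost.indexOf('.');
    // A host without a dot is returned unchanged.
    return dot < 0 ? endpointHost : endpointHost.substring(0, dot);
  }

  public static void main(String[] args) {
    // The endpoint is the full hostname; the account name is only its first label.
    String storageEndpoint = "myaccount.dfs.core.windows.net";
    String accountName = storageAccountName(storageEndpoint);
    System.out.println(storageEndpoint + " -> " + accountName);
  }
}
```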

@RussellSpitzer
Member

RussellSpitzer commented Oct 10, 2024

@ashvina Can you please review as well?

Member

@RussellSpitzer RussellSpitzer left a comment

This looks good to me, although I have a few nits. I am not an ADLS expert, though, so I'm hoping we can get someone with more expertise in that system to confirm that this is correct.

// storage account name is the first part of the host
int accountSplit = uri.getHost().indexOf('.');
String storageAccountName = uri.getHost().substring(0, accountSplit);
this.storageAccount = String.format("%s.dfs.core.windows.net", storageAccountName);

For wasb, the original host URL is a blob.core.windows.net endpoint (see the sample above). Could you clarify if the change from blob to dfs is necessary?

Contributor Author

I don't believe this is required, because it appears the blob host would be changed in the ADLS client when setting the endpoint here. I'll make an update for this, because I think we may also want to change some of the nomenclature.
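For readers following the question, the host rewrite in the excerpt under review can be sketched in isolation as below. This is a standalone approximation of those four lines, not the merged code; the class name, method name, and the missing-dot check are assumptions added for illustration.

```java
// Standalone approximation of the host rewrite discussed above.
// EndpointRewrite and toDfsHost are hypothetical names.
final class EndpointRewrite {
  private EndpointRewrite() {}

  /** Keeps the first host label (the account name) and appends the dfs suffix. */
  static String toDfsHost(String host) {
    int accountSplit = host.indexOf('.');
    if (accountSplit < 0) {
      // Assumption: reject hosts that carry no account prefix at all.
      throw new IllegalArgumentException("Host has no account prefix: " + host);
    }
    String storageAccountName = host.substring(0, accountSplit);
    return String.format("%s.dfs.core.windows.net", storageAccountName);
  }

  public static void main(String[] args) {
    // A wasb-style blob host is mapped onto the dfs endpoint of the same account.
    System.out.println(toDfsHost("myaccount.blob.core.windows.net"));
  }
}
```

Note that a rewrite of this shape discards everything after the first dot, so any non-default domain suffix in the original host would be replaced as well.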

@nastra nastra requested a review from bryanck October 11, 2024 10:26
@RussellSpitzer
Member

@ashvina could you do another pass?

@ashvina ashvina left a comment

LGTM 👍
I ran basic tests on my storage account (wasb) with the changes in this PR, and they all passed successfully.

@RussellSpitzer RussellSpitzer merged commit 11a8a78 into apache:main Oct 16, 2024
@RussellSpitzer
Member

Thanks @mrcnc for the PR and @ashvina for review!

uriPath = uriPath == null ? "" : uriPath.startsWith("/") ? uriPath.substring(1) : uriPath;
this.path = uriPath.split("\\?", -1)[0].split("#", -1)[0];
try {
URI uri = new URI(location);
Contributor

@bryanck bryanck Oct 16, 2024

Not using java.net.URI before was intentional. In the past we have avoided java.net.URI for parsing because it can produce unexpected results. For example, underscores in host names:

jshell> new java.net.URI("abfs://my_endpoint/path").getHost()
$1 ==> null

There are some other quirks also IIRC.

(cc @rdblue since you've pointed this out to me)
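The quirk from the jshell session can be reproduced in a small program, next to a string-based alternative that does not depend on java.net.URI. The authorityOf helper is a hypothetical sketch written for this example; it is not the parsing code Iceberg used before or after this PR.

```java
import java.net.URI;
import java.net.URISyntaxException;

// UriHostQuirk and authorityOf are hypothetical names used for illustration.
final class UriHostQuirk {
  private UriHostQuirk() {}

  /** Returns the text between "://" and the next "/" using plain string operations. */
  static String authorityOf(String location) {
    int schemeEnd = location.indexOf("://");
    if (schemeEnd < 0) {
      throw new IllegalArgumentException("Missing scheme: " + location);
    }
    int start = schemeEnd + 3;
    int end = location.indexOf('/', start);
    return end < 0 ? location.substring(start) : location.substring(start, end);
  }

  public static void main(String[] args) throws URISyntaxException {
    String location = "abfs://my_endpoint/path";
    URI uri = new URI(location);
    // getHost() is null here, matching the jshell output quoted above.
    System.out.println("URI.getHost():  " + uri.getHost());
    // The string-based helper still sees the underscore-containing authority.
    System.out.println("authorityOf():  " + authorityOf(location));
  }
}
```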

Member

I think that's fine, but if this is the case we need to add some explicit tests for this scenario. @mrcnc, would you like to open a follow-up PR?

Contributor Author

I believe we will not encounter this error, because underscores aren't valid characters in storage account or container names. But maybe we should add more input validation, or update the regex, before attempting to use java.net.URI? I was thinking that using URI would be more idiomatic, but if we have reasons to avoid it then we can revert or refactor.
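As a rough illustration of the input validation suggested here, the sketch below checks names against Azure's documented naming rules as I understand them: storage account names are 3 to 24 lowercase letters and digits, and container names are 3 to 63 lowercase letters, digits, and single hyphens that sit between alphanumerics. The class and method names are hypothetical, and special containers such as $root are deliberately not handled.

```java
import java.util.regex.Pattern;

// AzureNameValidation is a hypothetical helper, not part of the PR.
final class AzureNameValidation {
  private AzureNameValidation() {}

  // 3-24 characters, lowercase letters and digits only.
  private static final Pattern ACCOUNT = Pattern.compile("[a-z0-9]{3,24}");

  // 3-63 characters; alphanumeric runs separated by single hyphens.
  private static final Pattern CONTAINER =
      Pattern.compile("(?=.{3,63}$)[a-z0-9]+(-[a-z0-9]+)*");

  static boolean isValidAccount(String name) {
    return name != null && ACCOUNT.matcher(name).matches();
  }

  static boolean isValidContainer(String name) {
    return name != null && CONTAINER.matcher(name).matches();
  }

  public static void main(String[] args) {
    // An underscore is rejected by both checks, so such a name would be
    // refused before any URI parsing is attempted.
    System.out.println(isValidAccount("my_endpoint"));
    System.out.println(isValidContainer("my-container"));
  }
}
```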

Contributor

I wish we had caught this before now, but I think we need to revert the URI additions. There are a number of issues related to the Java implementation which is why we specifically avoid using it.

Member

Could we have some examples of where this doesn't work?

@Jordano-Dremio
Contributor

Jordano-Dremio commented Oct 16, 2024

Hi @mrcnc, @RussellSpitzer. To confirm this PR's solution to #10127: will any existing (or new) Iceberg tables stored in Azure with wasbs + .blob. be interpreted interchangeably with the abfss prefix + .dfs. endpoint, regardless of whether the table's existing metadata shows wasbs + .blob. or the new abfss + .dfs.?

That being said, does this mean the Iceberg API supports tables with possibly mixed schemes (i.e., some paths in a table's metadata were written with 'wasbs' while others were written with 'abfss')?

@mrcnc
Contributor Author

mrcnc commented Oct 17, 2024

The path for new files in an existing table is determined by the table's base location or its LocationProvider, so tables whose paths use the wasbs scheme would keep using that scheme for new files. For new tables, the location defaults to the default location for the warehouse or namespace, but each table can explicitly provide a location when it is created. Mixed schemes seem possible (for example, if you changed the implementation of a custom LocationProvider), so I think it's worthwhile to understand whether mixed paths are supported, but I'm not sure currently. The intent of this PR was to use ADLSFileIO for wasbs paths by default, instead of delegating them to HadoopFileIO as a fallback.
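The stated intent can be pictured as a scheme lookup with a fallback. The sketch below is a simplified model written for this discussion, not ResolvingFileIO's actual code: the class, the map, and the implFor method are hypothetical, and the implementation class names appear only as strings.

```java
import java.util.Locale;
import java.util.Map;

// Simplified, hypothetical model of scheme-based FileIO selection.
final class SchemeResolutionSketch {
  private SchemeResolutionSketch() {}

  static final String FALLBACK_IMPL = "org.apache.iceberg.hadoop.HadoopFileIO";
  static final String ADLS_IMPL = "org.apache.iceberg.azure.adlsv2.ADLSFileIO";

  // Assumption for illustration: wasb and wasbs are mapped next to abfs and abfss.
  static final Map<String, String> SCHEME_TO_IMPL =
      Map.of("abfs", ADLS_IMPL, "abfss", ADLS_IMPL, "wasb", ADLS_IMPL, "wasbs", ADLS_IMPL);

  /** Picks an implementation name from the location's scheme, else the fallback. */
  static String implFor(String location) {
    int idx = location.indexOf("://");
    if (idx < 0) {
      return FALLBACK_IMPL;
    }
    String scheme = location.substring(0, idx).toLowerCase(Locale.ROOT);
    return SCHEME_TO_IMPL.getOrDefault(scheme, FALLBACK_IMPL);
  }

  public static void main(String[] args) {
    System.out.println(implFor("wasbs://container@myaccount.blob.core.windows.net/tbl/data.parquet"));
    System.out.println(implFor("hdfs://namenode/tbl/data.parquet"));
  }
}
```

In a model like this, each path is resolved on its own, so a table with both wasbs and abfss paths would select the same implementation for every file. Whether mixed paths work end to end is the open question raised above and is not answered by this sketch.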

@mrcnc mrcnc deleted the support-wasb-for-adlsfileio branch October 17, 2024 21:59
nastra pushed a commit that referenced this pull request Oct 18, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024

Successfully merging this pull request may close these issues.

Enable reading WASB and WASBS file paths with ABFS and ABFSS
