43 changes: 21 additions & 22 deletions hadoop-tools/hadoop-azure/src/site/markdown/abfs.md
@@ -69,7 +69,7 @@ with Hierarchical Namespaces.
## <a name="namespaces"></a> Hierarchical Namespaces (and WASB Compatibility)

A key aspect of ADLS Gen 2 is its support for
[hierachical namespaces](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace)
[hierarchical namespaces](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace).
These are effectively directories and offer high-performance rename and delete operations,
which significantly improves performance for query engines
that write data to the store, including MapReduce, Spark, and Hive, as well as DistCp.
@@ -297,7 +297,7 @@ This is shown in the Authentication section.

## <a name="authentication"></a> Authentication

Authentication for ABFS is ultimately granted by [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/develop/authentication-scenarios).
Authentication for ABFS is ultimately granted by [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/develop/authentication-scenarios) (now Microsoft Entra ID).

The concepts covered there are beyond the scope of this document;
developers are expected to have read and understood the concepts therein
@@ -332,7 +332,7 @@ possible

### <a name="aad-token-fetch-retry-logic"></a> AAD Token fetch retries

The exponential retry policy used for the AAD token fetch retries can be tuned
The exponential retry policy used for the AAD (now Entra ID) token fetch retries can be tuned
with the following configurations.
* `fs.azure.oauth.token.fetch.retry.max.retries`: Sets the maximum number of
retries. Default value is 5.
@@ -652,8 +652,7 @@ CustomDelegationTokenManager interface.
<value>{fully-qualified-class-name-for-implementation-of-CustomDelegationTokenManager-interface}</value>
</property>
```
In case delegation token is enabled, and the config `fs.azure.delegation.token
.provider.type` is not provided then an IlleagalArgumentException is thrown.
If delegation tokens are enabled and the config `fs.azure.delegation.token.provider.type` is not provided, an IllegalArgumentException is thrown.

### Shared Access Signature (SAS) Token Provider

@@ -663,7 +662,7 @@ To know more about how SAS Authentication works refer to
[Grant limited access to Azure Storage resources using shared access signatures (SAS)](https://learn.microsoft.com/en-us/azure/storage/common/storage-sas-overview)

There are three types of SAS supported by Azure Storage:
- [User Delegation SAS](https://learn.microsoft.com/en-us/rest/api/storageservices/create-user-delegation-sas): Recommended for use with ABFS Driver with HNS Enabled ADLS Gen2 accounts. It is Identity based SAS that works at blob/directory level)
- [User Delegation SAS](https://learn.microsoft.com/en-us/rest/api/storageservices/create-user-delegation-sas): Recommended for use with ABFS Driver with HNS Enabled ADLS Gen2 accounts. It is an identity-based SAS that works at blob/directory level.
- [Service SAS](https://learn.microsoft.com/en-us/rest/api/storageservices/create-service-sas): Global and works at container level.
- [Account SAS](https://learn.microsoft.com/en-us/rest/api/storageservices/create-account-sas): Global and works at account level.

@@ -754,16 +753,16 @@ requests. User can specify them as fixed SAS Token to be used across all the req
</property>
```

1. Fixed SAS Token:
```xml
<property>
<name>fs.azure.sas.fixed.token</name>
<value>FIXED_SAS_TOKEN</value>
</property>
```
2. Account SAS (Fixed SAS Token at Account Level):
```xml
<property>
<name>fs.azure.sas.fixed.token</name>
<value>FIXED_SAS_TOKEN</value>
</property>
```

Replace `FIXED_SAS_TOKEN` with fixed Account/Service SAS. You can also
generate SAS from Azure portal. Account -> Security + Networking -> Shared Access Signature
- Replace `FIXED_SAS_TOKEN` with fixed Account/Service SAS. You can also
generate SAS from Azure portal. Account -> Security + Networking -> Shared Access Signature

- **Security**: Account/Service SAS requires account keys to be used which makes
them less secure. There is no scope of having delegated access to different users.
@@ -864,16 +863,16 @@ Azure OAuth tokens.
Consult the source in `org.apache.hadoop.fs.azurebfs.extensions`
and all associated tests to see how to make use of these extension points.

_Warning_ These extension points are unstable.
_Warning_: These extension points are unstable.

### <a href="networking"></a>Networking Layer:

ABFS Driver can use the following networking libraries:
- ApacheHttpClient:
- <a href = "https://hc.apache.org/httpcomponents-client-4.5.x/index.html">Library Documentation</a>.
- Default networking library.
- JDK networking library:
- <a href="https://docs.oracle.com/javase/8/docs/api/java/net/HttpURLConnection.html">Library documentation</a>.
- Default networking library.

The networking library can be configured using the configuration `fs.azure.networking.library`
while initializing the filesystem.
@@ -1007,13 +1006,13 @@ greater than or equal to 0.
retries of IO operations. Currently this is used only for the server call retry
logic. Used within `AbfsClient` class as part of the ExponentialRetryPolicy. This
value indicates the smallest interval (in milliseconds) to wait before retrying
an IO operation. The default value is 3000 (3 seconds).
an IO operation. The default value is 500 milliseconds.

`fs.azure.io.retry.max.backoff.interval`: Sets the maximum backoff interval for
retries of IO operations. Currently this is used only for the server call retry
logic. Used within `AbfsClient` class as part of the ExponentialRetryPolicy. This
value indicates the largest interval (in milliseconds) to wait before retrying
an IO operation. The default value is 30000 (30 seconds).
an IO operation. The default value is 25000 (25 seconds).

`fs.azure.io.retry.backoff.interval`: Sets the default backoff interval for
retries of IO operations. Currently this is used only for the server call retry
@@ -1023,7 +1022,7 @@ value. This random delta is then multiplied by an exponent of the current IO
retry number (i.e., the default is multiplied by `2^(retryNum - 1)`) and then
constrained within the range of [`fs.azure.io.retry.min.backoff.interval`,
`fs.azure.io.retry.max.backoff.interval`] to determine the amount of time to
wait before the next IO retry attempt. The default value is 3000 (3 seconds).
wait before the next IO retry attempt. The default value is 500 milliseconds.
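
The interval rule described above can be sketched as follows. This is a minimal illustration of the documented behaviour, not the `ExponentialRetryPolicy` source: the class `BackoffSketch`, the method `retryIntervalMillis`, and the `jitter` parameter are hypothetical names, and the 0.8 to 1.2 jitter range used in `main` is an assumption, since this hunk does not show how the random delta is drawn.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative only; names here are hypothetical and not part of hadoop-azure. */
final class BackoffSketch {
    // Defaults as documented after this change, in milliseconds.
    static final long MIN_BACKOFF_MS = 500;
    static final long MAX_BACKOFF_MS = 25_000;
    static final long BACKOFF_MS = 500;

    /**
     * @param retryNum  1-based number of the current IO retry
     * @param jitter    multiplier that turns the configured backoff into the
     *                  "random delta" (hypothetical parameter)
     * @param backoffMs value of fs.azure.io.retry.backoff.interval
     * @param minMs     value of fs.azure.io.retry.min.backoff.interval
     * @param maxMs     value of fs.azure.io.retry.max.backoff.interval
     */
    static long retryIntervalMillis(int retryNum, double jitter,
                                    long backoffMs, long minMs, long maxMs) {
        double randomDelta = backoffMs * jitter;
        // Multiply by 2^(retryNum - 1), as stated in the documentation.
        long scaled = Math.round(randomDelta * Math.pow(2, retryNum - 1));
        // Constrain the result to the [min, max] range.
        return Math.max(minMs, Math.min(maxMs, scaled));
    }

    public static void main(String[] args) {
        for (int retryNum = 1; retryNum <= 8; retryNum++) {
            // Assumed jitter range; the real policy may draw it differently.
            double jitter = ThreadLocalRandom.current().nextDouble(0.8, 1.2);
            long waitMs = retryIntervalMillis(
                retryNum, jitter, BACKOFF_MS, MIN_BACKOFF_MS, MAX_BACKOFF_MS);
            System.out.printf("retry %d -> wait %d ms%n", retryNum, waitMs);
        }
    }
}
```

Under this sketch, with a jitter of exactly 1.0 and the values 500, 500, and 25000 for the default, minimum, and maximum intervals, retries 1, 2, and 3 wait 500, 1000, and 2000 ms, and every retry from the seventh onward is capped at 25000 ms.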

`fs.azure.write.request.size`: To set the write buffer size. Specify the value
in bytes. The value should be between 16384 and 104857600, both inclusive (16 KB
@@ -1361,9 +1360,9 @@ Operation failed: "Server failed to authenticate the request.
Causes include:

* Your credentials are incorrect.
* Your shared secret has expired. in Azure, this happens automatically
* Your shared secret has expired. In Azure, this happens automatically.
* Your shared secret has been revoked.
* host/VM clock drift means that your client's clock is out of sync with the
* Host/VM clock drift means that your client's clock is out of sync with the
Azure servers —the call is being rejected as it is either out of date (considered a replay)
or from the future. Fix: Check your clocks, etc.

12 changes: 6 additions & 6 deletions hadoop-tools/hadoop-azure/src/site/markdown/blobEndpoint.md
@@ -15,7 +15,7 @@
# Azure Blob Storage REST API (Blob Endpoint)

## Introduction
The REST API for Blob Storage defines HTTP operations against the storage account, containers(filesystems), and blobs.(files)
The REST API for Blob Storage defines HTTP operations against the storage account, containers (filesystems), and blobs (files).
The API includes the operations listed in the following table.

| Operation | Resource Type | Description |
@@ -27,8 +27,8 @@ The API includes the operations listed in the following table.
| [List Blobs](#list-blobs) | Filesystem | Lists the paths under the specified directory inside container acting as hadoop filesystem. |
| [Put Blob](#put-blob) | Path | Creates a new path or updates an existing path under the specified filesystem (container). |
| [Lease Blob](#lease-blob) | Path | Establishes and manages a lease on the specified path. |
| [Put Block](#put-block) | Path | Appends Data to an already created blob at specified path. |
| [Put Block List](#put-block-list) | Path | Flushes The Appended Data to the blob at specified path. |
| [Put Block](#put-block) | Path | Appends data to an already created blob at specified path. |
| [Put Block List](#put-block-list) | Path | Flushes the appended data to the blob at specified path. |
| [Set Blob Metadata](#set-blob-metadata) | Path | Sets the user-defined attributes of the blob at specified path. |
| [Get Blob Properties](#get-blob-properties) | Path | Gets the user-defined attributes of the blob at specified path. |
| [Get Blob](#get-blob) | Path | Reads data from the blob at specified path. |
@@ -43,7 +43,7 @@ already exists, the operation fails.
Rest API Documentation: [Create Container](https://docs.microsoft.com/en-us/rest/api/storageservices/create-container)

## Delete Container
The Delete Container operation marks the specified container for deletion. The container and any blobs contained within it.
The Delete Container operation marks the specified container and any blobs contained within it for deletion.
Rest API Documentation: [Delete Container](https://docs.microsoft.com/en-us/rest/api/storageservices/delete-container)

## Set Container Metadata
@@ -67,7 +67,7 @@ Partial updates are not supported with Put Blob
Rest API Documentation: [Put Blob](https://docs.microsoft.com/en-us/rest/api/storageservices/put-blob)

## Lease Blob
The Lease Blob operation creates and manages a lock on a blob for write and delete operations. The lock duration can be 15 to 60 seconds, or can be infinite.
The Lease Blob operation creates and manages a lock on a blob for file creation, opening a file for write, and rename operations. The lock duration can be 15 to 60 seconds, or can be infinite.
Rest API Documentation: [Lease Blob](https://docs.microsoft.com/en-us/rest/api/storageservices/lease-blob)

## Put Block
@@ -104,4 +104,4 @@ Rest API Documentation: [Copy Blob](https://docs.microsoft.com/en-us/rest/api/st

## Append Block
The Append Block operation commits a new block of data to the end of an existing append blob.
Rest API Documentaion: [Append Block](https://learn.microsoft.com/en-us/rest/api/storageservices/append-block)
Rest API Documentation: [Append Block](https://learn.microsoft.com/en-us/rest/api/storageservices/append-block)
59 changes: 26 additions & 33 deletions hadoop-tools/hadoop-azure/src/site/markdown/fns_blob.md
@@ -18,10 +18,10 @@
The ABFS driver is recommended for use only with HNS Enabled ADLS Gen-2 accounts
for big data analytics because they are more performant and scalable.

However, to enable users of legacy WASB Driver to migrate to ABFS driver without
needing them to upgrade their general purpose V2 accounts (HNS-Disabled), Support
However, to allow users of legacy WASB Driver to migrate to ABFS driver without
requiring them to upgrade their general purpose V2 accounts (HNS-Disabled), support
for FNS accounts is being added to ABFS driver.
Refer to [WASB Deprication](./wasb.html) for more details.
Refer to [WASB Deprecation](./wasb.html) documentation for more details.

## Azure Service Endpoints Used by ABFS Driver
Azure Services offers two sets of endpoints for interacting with storage accounts:
@@ -38,7 +38,7 @@ HNS Enabled accounts will still use DFS Endpoint which continues to be the
recommended stack based on performance and feature capabilities.

## Configuring ABFS Driver for FNS Accounts
Following configurations will be introduced to configure ABFS Driver for FNS Accounts:
The following configurations have been introduced to configure ABFS Driver for FNS Accounts:
1. Account Type: Must be set to `false` to indicate FNS Account
```xml
<property>
@@ -47,31 +47,31 @@ Following configurations will be introduced to configure ABFS Driver for FNS Acc
</property>
```

2. Account Url: It is the URL used to initialize the file system. It is either passed
directly to file system or configured as default uri using "fs.DefaultFS" configuration.
In both the cases the URL used must be the blob endpoint url of the account.
2. Account Url: It is the URL used to initialize the file system. It is either passed
directly to the file system or configured as the default URI using the `fs.defaultFS` configuration.
In both cases the URL used must be the blob endpoint URL of the account.
```xml
<property>
<name>fs.defaultFS</name>
<value>abfss://CONTAINER_NAME@ACCOUNT_NAME.blob.core.windows.net</value>
</property>
```
3. Service Type for FNS Accounts: This will allow an override to choose service
type specially in cases where any local DNS resolution is set for the account and driver is
unable to detect the intended endpoint from above configured URL. If this is set
to blob for HNS Enabled Accounts, FS init will fail with InvalidConfiguration error.
3. Service Type for FNS Accounts: This allows an override to choose the service
type, especially in cases where local DNS resolution is set for the account and the driver is
unable to detect the intended endpoint from the URL configured above. If this is set
to blob for HNS-enabled accounts, FS initialization will fail with an InvalidConfiguration error.
```xml
<property>
<name>fs.azure.fns.account.service.type</name>
<value>BLOB</value>
</property>
```

4. Service Type for Ingress Operations: This will allow an override to choose service
type only for Ingress Related Operations like [Create](./blobEndpoint.html#put-blob),
[Append](./blobEndpoint.html#put-block),
and [Flush](./blobEndpoint.html#put-block-list). All other operations will still use the
configured service type.
4. Service Type for Ingress Operations: This allows an override to choose service
type only for Ingress related operations like [Create](./blobEndpoint.html#put-blob),
[Append](./blobEndpoint.html#put-block),
and [Flush](./blobEndpoint.html#put-block-list). All other operations will still use the
configured service type.
```xml
<property>
<name>fs.azure.ingress.service.type</name>
@@ -106,40 +106,33 @@ The following configs are related to rename and delete operations.
- `fs.azure.blob.copy.max.wait.millis`: Maximum time to wait for a blob copy
operation to complete. The default value is 5 minutes.

- `fs.azure.blob.atomic.rename.lease.refresh.duration`: Blob rename lease
refresh
- `fs.azure.blob.atomic.rename.lease.refresh.duration`: Blob rename lease refresh
duration in milliseconds. This setting ensures that the lease on the blob is
periodically refreshed during a rename operation to prevent other operations
periodically refreshed during a rename operation, preventing other operations
from interfering.
The default value is 60 seconds.

- `fs.azure.blob.dir.list.producer.queue.max.size`: Maximum number of blob
entries
- `fs.azure.blob.dir.list.producer.queue.max.size`: Maximum number of blob entries
enqueued in memory for rename or delete orchestration. The default value is 2
times the default value of list max results, which is 5000, making the current
value 10000.

- `fs.azure.blob.dir.list.consumer.max.lag`: It sets a limit on how much blob
information can be waiting to be processed (consumer lag) during a blob
listing
operation. If the amount of unprocessed blob information exceeds this limit,
the
producer will pause until the consumer catches up and the lag becomes
listing operation. If the amount of unprocessed blob information exceeds this limit,
the producer will pause until the consumer catches up and the lag becomes
manageable. The default value is equal to the default value of list
max
results which is 5000 currently.
max results, which is currently 5000.

- `fs.azure.blob.dir.rename.max.thread`: Maximum number of threads per blob
rename
orchestration. The default value is 5.
rename orchestration. The default value is 5.

- `fs.azure.blob.dir.delete.max.thread`: Maximum number of thread per
blob-delete
orchestration. The default value currently is 5.
- `fs.azure.blob.dir.delete.max.thread`: Maximum number of threads per blob
delete orchestration. The default value currently is 5.
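
The interaction between the producer queue size and the consumer lag limit can be pictured with a small single-threaded sketch. It assumes the producer adds one listing page at a time and pauses whenever the number of unprocessed entries exceeds the lag limit. The class `ListingQueueSketch` and its methods are hypothetical and do not mirror the driver's internal classes.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.List;
import java.util.Queue;

/** Illustrative only; a simplified model of a bounded listing queue. */
final class ListingQueueSketch {
    private final Queue<String> pending = new ArrayDeque<>();
    private final int maxQueueSize;   // fs.azure.blob.dir.list.producer.queue.max.size
    private final int maxConsumerLag; // fs.azure.blob.dir.list.consumer.max.lag

    ListingQueueSketch(int maxQueueSize, int maxConsumerLag) {
        this.maxQueueSize = maxQueueSize;
        this.maxConsumerLag = maxConsumerLag;
    }

    /** Producer side: returns false when the producer has to pause. */
    boolean offerPage(List<String> page) {
        if (pending.size() > maxConsumerLag) {
            return false; // consumer is too far behind
        }
        if (pending.size() + page.size() > maxQueueSize) {
            return false; // page would not fit in memory budget
        }
        pending.addAll(page);
        return true;
    }

    /** Consumer side: takes the next entry, or null when the queue is empty. */
    String poll() {
        return pending.poll();
    }

    /** Number of entries listed but not yet processed. */
    int lag() {
        return pending.size();
    }

    public static void main(String[] args) {
        ListingQueueSketch queue = new ListingQueueSketch(10_000, 5_000);
        List<String> page = Collections.nCopies(5_000, "blob");
        System.out.println("page 1 accepted: " + queue.offerPage(page));
        System.out.println("page 2 accepted: " + queue.offerPage(page));
        System.out.println("page 3 accepted: " + queue.offerPage(page));
        System.out.println("lag: " + queue.lag());
    }
}
```

Under this model, with a queue size of 10000, a lag limit of 5000, and pages of at most 5000 entries, the queue never holds more than 10000 entries, which is consistent with the queue size default being twice the list max results.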

## Features currently not supported

1. **User Delegation SAS** feature is currently not supported but we
1. **User Delegation SAS** feature is currently not supported, but we
plan to bring support for it in the future.
Jira to track this
work item: https://issues.apache.org/jira/browse/HADOOP-19406.
5 changes: 2 additions & 3 deletions hadoop-tools/hadoop-azure/src/site/markdown/wasb.md
@@ -16,7 +16,7 @@

## Introduction
WASB Driver is a legacy Hadoop File System driver that was developed to support
[FNS(FlatNameSpace) Azure Storage accounts](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction)
[FNS (FlatNameSpace) Azure Storage accounts](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction)
that do not honor File-Folder syntax.
HDFS Folder operations hence are mimicked at client side by WASB driver and
certain folder operations like Rename and Delete can lead to a lot of IOPs with
@@ -93,5 +93,4 @@ Refer to [ABFS Authentication](abfs.html/authentication) for more details.

### ABFS Features Not Available for migrating Users
Certain features of ABFS Driver will be available only to users using HNS accounts with ABFS driver.
1. ABFS Driver's SAS Token Provider plugin for UserDelegation SAS and Fixed SAS.
2. Client Provided Encryption Key (CPK) support for Data ingress and egress.
1. Client Provided Encryption Key (CPK) support for Data ingress and egress.