core-default.xml
@@ -1522,6 +1522,19 @@
</description>
</property>

<property>
<name>fs.s3a.metadatastore.fail.on.write.error</name>

Comment:
We should use authoritative mode instead of adding a new property for this. There's no case when you want this off when auth mode is on, and on when auth mode is off.

Comment (Contributor Author):
I've been looking at it a little differently. Primarily, I've been struggling with whether anyone would really ever want this setting off (authoritative mode or not). Why bother using S3Guard if you're OK with the idea that it won't even necessarily prevent seeing list-after-create inconsistencies? You're effectively getting the same behavior as if you didn't use S3Guard - most of the time you see all the files, and sometimes you don't.

I considered crafting this patch without there even being a configuration option for this, or with the default set to true, but figured the current behavior must have been intentional and there is an argument I'm not seeing. Can you guys help me see the side of this where someone would want this false?

Comment (Contributor):
Ben - I think you are right. That "swallow the failures" behavior probably follows on from the strategy of handling failures in the delete-fake-directories phase, where we don't want a failure there to escalate. Here we do, as it is a sign that the store has gone inconsistent.

If we mandate that update operations always fail the job, well, at least you get to find out when something has gone wrong. In which case: no need for an option, no need for extra complications in testing.

Is there anyone watching this who disagrees? If so, please justify.


<value>true</value>
<description>
When true (default), FileSystem write operations generate
org.apache.hadoop.fs.s3a.MetadataPersistenceException if the metadata
cannot be saved to the metadata store. When false, failures to save to the
metadata store are logged at ERROR level, but the overall FileSystem
write operation succeeds.
</description>
</property>


<property>
<name>fs.s3a.s3guard.cli.prune.age</name>
<value>86400000</value>

Constants.java
@@ -415,6 +415,17 @@ private Constants() {
public static final String S3_METADATA_STORE_IMPL =
"fs.s3a.metadatastore.impl";

Comment:
Same applies here as in core-default. We don't need a new property for this.

/**
* Whether to fail when there is an error writing to the metadata store.
*/
public static final String FAIL_ON_METADATA_WRITE_ERROR =
"fs.s3a.metadatastore.fail.on.write.error";

/**
* Default value ({@value}) for FAIL_ON_METADATA_WRITE_ERROR.
*/
public static final boolean FAIL_ON_METADATA_WRITE_ERROR_DEFAULT = true;

/** Minimum period of time (in milliseconds) to keep metadata (may only be
* applied when a prune command is manually run).
*/

MetadataPersistenceException.java (new file)
@@ -0,0 +1,40 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.hadoop.fs.s3a;

import org.apache.hadoop.fs.PathIOException;

/**
* Indicates the metadata associated with the given Path could not be persisted
* to the metadata store (e.g. S3Guard / DynamoDB). When this occurs, the
* file itself has been successfully written to S3, but the metadata may be out
* of sync. The metadata can be corrected with the "s3guard import" command
* provided by {@link org.apache.hadoop.fs.s3a.s3guard.S3GuardTool}.
*/
public class MetadataPersistenceException extends PathIOException {

Comment (Contributor):
Add an entry for this in org.apache.hadoop.fs.s3a.S3ARetryPolicy for whatever policy we think matters. For now, set it to fail.

/**
* Constructs a MetadataPersistenceException.
* @param path path of the affected file
* @param cause cause of the issue
*/
public MetadataPersistenceException(String path, Throwable cause) {
super(path, cause);
}
}

Retries.java
@@ -26,11 +26,22 @@
import org.apache.hadoop.classification.InterfaceStability;

/**
* Declaration of retry policy for documentation only.
* This is purely for visibility in source and is currently package-scoped.
* Compare with {@link org.apache.hadoop.io.retry.AtMostOnce}
* and {@link org.apache.hadoop.io.retry.Idempotent}; these are real
* markers used by Hadoop RPC.
* <p>
* Annotations to inform the caller of an annotated method whether

Comment (Contributor Author):
I tried to improve the documentation of this class here. I'm curious whether I have captured the real intended meaning/purpose of these annotations. If so, hopefully it will help the next developer understand the goal better than I did initially.

Comment (Contributor):
Yes, though I'd be stricter about the wrapping of retries. You mustn't retry around a retrying operation, as you'll just cause things to take so long that operations will time out anyway. The raw/translated flags are markers about whether to expect AWS-library runtime exceptions or translated ones, or a mixture. It's less critical, but once everything is pure IOE, there's no need to catch and convert again.

Comment (Contributor Author):
Tweaked in ef91a67.

* the method performs retries and/or exception translation internally.
* Callers should use this information to inform their own decisions about
* performing retries or exception translation when calling the method. For
* example, if a method is annotated {@code RetryTranslated}, the caller
* MUST NOT perform another layer of retries. Similarly, the caller shouldn't
* perform another layer of exception translation.
* </p>
* <p>
* Declaration for documentation only.
* This is purely for visibility in source and is currently package-scoped.
* Compare with {@link org.apache.hadoop.io.retry.AtMostOnce}
* and {@link org.apache.hadoop.io.retry.Idempotent}; these are real
* markers used by Hadoop RPC.
* </p>
*/
@InterfaceAudience.Private
@InterfaceStability.Unstable
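
Illustrative sketch (not part of this patch) of how a caller is expected to read
these annotations; the class and method names below are made up for the example:

import java.io.IOException;

class RetryAnnotationExample {

  // Imagine this method is marked @Retries.RetryTranslated: it already
  // retries internally and only throws translated IOExceptions.
  void saveMetadata(String key) throws IOException {
    // ... internal retry loop and exception translation ...
  }

  // The caller adds no second retry layer and no second translation layer:
  // wrapping saveMetadata() in another retry loop would nest retries and
  // stretch a transient failure into a long stall, and catching and
  // re-wrapping its IOExceptions would only obscure the original cause.
  void finishedWrite(String key) throws IOException {
    saveMetadata(key);
  }
}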

S3AFileSystem.java
@@ -204,6 +204,7 @@ public class S3AFileSystem extends FileSystem implements StreamCapabilities,
LoggerFactory.getLogger("org.apache.hadoop.fs.s3a.S3AFileSystem.Progress");
private LocalDirAllocator directoryAllocator;
private CannedAccessControlList cannedACL;
private boolean failOnMetadataWriteError;

/**
* This must never be null; until initialized it just declares that there
@@ -306,6 +307,9 @@ public void initialize(URI name, Configuration originalConf)
onRetry);
writeHelper = new WriteOperationHelper(this, getConf());

failOnMetadataWriteError = conf.getBoolean(FAIL_ON_METADATA_WRITE_ERROR,
FAIL_ON_METADATA_WRITE_ERROR_DEFAULT);

maxKeys = intOption(conf, MAX_PAGING_KEYS, DEFAULT_MAX_PAGING_KEYS, 1);
listing = new Listing(this);
partSize = getMultipartSizeProperty(conf,
@@ -1784,10 +1788,13 @@ public UploadInfo putObject(PutObjectRequest putObjectRequest) {
* @param putObjectRequest the request
* @return the upload initiated
* @throws AmazonClientException on problems
* @throws MetadataPersistenceException if metadata about the write could
* not be saved to the metadata store and
* fs.s3a.metadatastore.fail.on.write.error=true
*/
@Retries.OnceRaw("For PUT; post-PUT actions are RetriesExceptionsSwallowed")
@Retries.OnceRaw("For PUT; post-PUT actions are RetryTranslated")
PutObjectResult putObjectDirect(PutObjectRequest putObjectRequest)
throws AmazonClientException {
throws AmazonClientException, MetadataPersistenceException {
long len = getPutRequestLength(putObjectRequest);
LOG.debug("PUT {} bytes to {}", len, putObjectRequest.getKey());
incrementPutStartStatistics(len);
@@ -2710,11 +2717,14 @@ private void innerCopyFromLocalFile(boolean delSrc, boolean overwrite,
* @param progress optional progress callback
* @return the upload result
* @throws InterruptedIOException if the blocking was interrupted.
* @throws MetadataPersistenceException if metadata about the write could
* not be saved to the metadata store and
* fs.s3a.metadatastore.fail.on.write.error=true
*/
@Retries.OnceRaw("For PUT; post-PUT actions are RetriesExceptionsSwallowed")
@Retries.OnceRaw("For PUT; post-PUT actions are RetryTranslated")
UploadResult executePut(PutObjectRequest putObjectRequest,
Progressable progress)
throws InterruptedIOException {
throws InterruptedIOException, MetadataPersistenceException {
String key = putObjectRequest.getKey();
UploadInfo info = putObject(putObjectRequest);
Upload upload = info.getUpload();
@@ -3034,10 +3044,15 @@ private Optional<SSECustomerKey> generateSSECustomerKey() {
* </ol>
* @param key key written to
* @param length total length of file written
* @throws MetadataPersistenceException if metadata about the write could
* not be saved to the metadata store and
* fs.s3a.metadatastore.fail.on.write.error=true
*/
@InterfaceAudience.Private
@Retries.RetryExceptionsSwallowed
void finishedWrite(String key, long length) {
@Retries.RetryTranslated("Except if failOnMetadataWriteError=false, in which"
+ " case RetryExceptionsSwallowed")
void finishedWrite(String key, long length)
throws MetadataPersistenceException {
LOG.debug("Finished write to {}, len {}", key, length);
Path p = keyToQualifiedPath(key);
Preconditions.checkArgument(length >= 0, "content length is negative");
@@ -3053,8 +3068,12 @@ void finishedWrite(String key, long length) {
S3Guard.putAndReturn(metadataStore, status, instrumentation);
}
} catch (IOException e) {
LOG.error("S3Guard: Error updating MetadataStore for write to {}:",
key, e);
if (failOnMetadataWriteError) {
throw new MetadataPersistenceException(p.toString(), e);
} else {
LOG.error("S3Guard: Error updating MetadataStore for write to {}",
p, e);
}
instrumentation.errorIgnored();
}
}

S3ARetryPolicy.java
@@ -172,6 +172,10 @@ protected Map<Class<? extends Exception>, RetryPolicy> createExceptionMap() {
policyMap.put(FileNotFoundException.class, fail);
policyMap.put(InvalidRequestException.class, fail);

// metadata stores should do retries internally when it makes sense
// so there is no point doing another layer of retries after that
policyMap.put(MetadataPersistenceException.class, fail);

// once the file has changed, trying again is not going to help
policyMap.put(RemoteFileChangedException.class, fail);


WriteOperationHelper.java
@@ -247,22 +247,21 @@ private CompleteMultipartUploadResult finalizeMultipartUpload(
throw new IOException(
"No upload parts in multipart upload to " + destKey);
}
return invoker.retry("Completing multipart commit", destKey,
CompleteMultipartUploadResult uploadResult = invoker.retry("Completing multipart commit", destKey,
true,
retrying,
() -> {
// a copy of the list is required, so that the AWS SDK doesn't
// attempt to sort an unmodifiable list.
CompleteMultipartUploadResult result =
owner.getAmazonS3Client().completeMultipartUpload(
new CompleteMultipartUploadRequest(bucket,
destKey,
uploadId,
new ArrayList<>(partETags)));
owner.finishedWrite(destKey, length);
return result;
return owner.getAmazonS3Client().completeMultipartUpload(
new CompleteMultipartUploadRequest(bucket,
destKey,
uploadId,
new ArrayList<>(partETags)));
}
);
owner.finishedWrite(destKey, length);
return uploadResult;
}

/**

MetadataStore.java
@@ -29,6 +29,7 @@
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.Retries.RetryTranslated;

/**
* {@code MetadataStore} defines the set of operations that any metadata store
@@ -165,6 +166,7 @@ void move(Collection<Path> pathsToDelete,
* @param meta the metadata to save
* @throws IOException if there is an error
*/
@RetryTranslated
void put(PathMetadata meta) throws IOException;

/**

S3Guard.java
@@ -40,6 +40,7 @@
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.Retries;
import org.apache.hadoop.fs.s3a.Retries.RetryTranslated;
import org.apache.hadoop.fs.s3a.S3AFileStatus;
import org.apache.hadoop.fs.s3a.S3AInstrumentation;
import org.apache.hadoop.fs.s3a.Tristate;
@@ -144,6 +145,7 @@ static Class<? extends MetadataStore> getMetadataStoreClass(
* @return The same status as passed in
* @throws IOException if metadata store update failed
*/
@RetryTranslated
public static S3AFileStatus putAndReturn(MetadataStore ms,
S3AFileStatus status,
S3AInstrumentation instrumentation) throws IOException {

s3guard.md
@@ -98,7 +98,10 @@ This offers no metadata storage, and effectively disables S3Guard.

More settings may be added in the future.
Currently the only Metadata Store-independent setting, besides the
implementation class above, is the *allow authoritative* flag.
implementation class above, are the *allow authoritative* and *fail-on-error*
flags.

#### Allow Authoritative

The _authoritative_ expression in S3Guard is present in two different layers, for
two different reasons:
@@ -183,6 +186,46 @@ removed on `S3AFileSystem` level.
</property>
```

#### Fail on Error

By default, S3AFileSystem write operations will fail when updates to
S3Guard metadata fail. S3AFileSystem first writes the file to S3 and then
updates the metadata in S3Guard. If the metadata write fails,
`MetadataPersistenceException` is thrown. The file in S3 **is not** rolled
back.

If the write operation cannot be programmatically retried, the S3Guard metadata
for the given file can be corrected with a command like the following:

```bash
hadoop s3guard import [-meta URI] s3a://my-bucket/file-with-bad-metadata
```

Programmatic retries of the original operation would require `overwrite=true`.
Suppose the original operation was `FileSystem.create(myFile, overwrite=false)`.
If this operation failed with `MetadataPersistenceException`, a repeat of the
same operation would result in `FileAlreadyExistsException`, since the original
operation successfully created the file in S3 and only failed in writing the
metadata to S3Guard.
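
For example, a minimal sketch of retrying after a metadata persistence failure
(illustrative only; the path, data, and class name here are placeholders):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.MetadataPersistenceException;

public class RetryAfterMetadataPersistenceFailure {
  public static void main(String[] args) throws IOException {
    Path file = new Path("s3a://my-bucket/file-with-bad-metadata");
    FileSystem fs = file.getFileSystem(new Configuration());
    byte[] data = "example".getBytes(StandardCharsets.UTF_8);
    try (FSDataOutputStream out = fs.create(file, false)) {
      out.write(data);
    } catch (MetadataPersistenceException e) {
      // The object is already in S3; only the S3Guard update failed.
      // Repeating create() with overwrite=false would now throw
      // FileAlreadyExistsException, so the retry must overwrite.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write(data);
      }
    }
  }
}
```

If re-running the write is too expensive, the `s3guard import` command shown
above avoids re-uploading the data and only repairs the metadata.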

Metadata update failures can be downgraded from an exception to ERROR-level
logging by setting the following configuration:

```xml
<property>
<name>fs.s3a.metadatastore.fail.on.write.error</name>
<value>false</value>
</property>
```

Setting this to false is dangerous, as it could result in the type of issue
S3Guard is designed to avoid. For example, a reader may see an inconsistent
listing after a recent write, since S3Guard may not contain metadata about the
recently written file due to a metadata write error.

As with the default setting, the new/updated file is still in S3 and **is not**
rolled back. The S3Guard metadata is likely to be out of sync.

### 3. Configure the Metadata Store.

Here are the `DynamoDBMetadataStore` settings. Other Metadata Store
@@ -1152,7 +1195,7 @@ java.io.IOException: Invalid region specified "iceland-2":

The region specified in `fs.s3a.s3guard.ddb.region` is invalid.

# "Neither ReadCapacityUnits nor WriteCapacityUnits can be specified when BillingMode is PAY_PER_REQUEST"
### "Neither ReadCapacityUnits nor WriteCapacityUnits can be specified when BillingMode is PAY_PER_REQUEST"

Comment (Contributor Author):
I think the h1-level heading here was a mistake made when this section was added. I fixed it while addressing the merge conflict with my changes.

Comment (Contributor):
Yeah, that's happened in a couple of places. Thanks.


```
ValidationException; One or more parameter values were invalid:
```

@@ -1164,6 +1207,14 @@
On-Demand DynamoDB tables do not have any fixed capacity - it is an error
to try to change it with the `set-capacity` command.

### `MetadataPersistenceException`

A filesystem write operation failed to persist metadata to S3Guard. The file was
successfully written to S3 and now the S3Guard metadata is likely to be out of
sync.

See [Fail on Error](#fail-on-error) for more detail.

## Other Topics

For details on how to test S3Guard, see [Testing S3Guard](./testing.html#s3guard)