Possible corrupted table cache info on application start #3520

Sussumu · 2024-10-19T17:13:46Z

Describe the bug

We recently hit a bug in production where the application would not load any document, reporting that the number of hash keys was not exactly one. The application had been running for a few months with no changes whatsoever, so we suspected some kind of unwanted infrastructure change. After a restart, everything went back to normal.

I didn't spend much time investigating the AWS SDK code, but from what I could see, the code checks that the number of hash keys declared by the application is exactly one. It gets this data from a previously cached value, which may come from a DescribeTable call or from the code itself, depending on the value of DisableFetchingTableMetadata. Our code didn't explicitly set this property, so the value presumably came from a DescribeTable call. Please correct me if I'm wrong.

Is it possible that this call returned corrupt data?

Regression Issue

Select this option if this issue appears to be a regression.

Expected Behavior

The application was supposed to load a document by its partition key and sort key, as it had been doing for a few months.

Current Behavior

System.InvalidOperationException: Must have one hash key defined for the table <<TABLE_NAME>>
at Amazon.DynamoDBv2.DataModel.DynamoDBContext.MakeKey(Object hashKey, Object rangeKey, ItemStorageConfig storageConfig, DynamoDBFlatConfig flatConfig)
at Amazon.DynamoDBv2.DataModel.DynamoDBContext.LoadHelperAsync[T](Object hashKey, Object rangeKey, DynamoDBOperationConfig operationConfig, CancellationToken cancellationToken)
at Amazon.DynamoDBv2.DataModel.DynamoDBContext.LoadAsync[T](Object hashKey, Object rangeKey, DynamoDBOperationConfig operationConfig, CancellationToken cancellationToken)

We inject an IDynamoDBContext and load the document like this:

await _context.LoadAsync<TModel>(partitionKey, sortKey, configuration);

Since the restart we haven't seen any more errors like this.

Reproduction Steps

I've copied only the most important parts. There's nothing special about this configuration, and we basically copy/paste it into other projects with no problem. I can't reproduce the error now. Maybe if some background call, like the DescribeTable call I mentioned, returns altered data, we would get the same error.

using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.DataModel;
using Amazon.Runtime;

const string serviceUrl = "http://localhost:8000/";
const string authenticationRegion = "us-west-1";

var localstackCredentials = new BasicAWSCredentials("local", "local");
var dynamoDbConfig = new AmazonDynamoDBConfig
{
    ServiceURL = serviceUrl,
    AuthenticationRegion = authenticationRegion
};

var dynamoDbClient = new AmazonDynamoDBClient(localstackCredentials, dynamoDbConfig);
var dynamoDbContext = new DynamoDBContext(dynamoDbClient);

var configuration = new DynamoDBOperationConfig
{
    Conversion = DynamoDBEntryConversion.V2,
    ConsistentRead = true,
    RetrieveDateTimeInUtc = true
};

// Exception is thrown here
// The document exists in dynamo
// Query is on the database itself, not in any GSI or LSI
var document = await dynamoDbContext.LoadAsync<Model>("partitionKey", "sortKey", configuration);

// Table name is correct
// We don't configure any other attribute like [DynamoDBHashKey]
[DynamoDBTable(TableNames.SOME_CONSTANT)]
public class Model
{
    public string PartitionKey { get; set; }
    public string SortKey { get; set; }
}

Possible Solution

As I said, I think it's related to the underlying DescribeTable call. I assume that setting DisableFetchingTableMetadata to true and manually specifying the keys may fix this, since it's one less moving part.
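
A minimal sketch of that workaround, assuming an SDK version that exposes DynamoDBContextConfig.DisableFetchingTableMetadata. The table name "SomeTable" and the default-credential client are placeholders for illustration, not the production setup. With metadata fetching disabled, the context has to rely on the attributes on the model, so the hash and range keys are declared explicitly:

```csharp
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.DataModel;

// Assumption: credentials and region come from the default provider chain.
var client = new AmazonDynamoDBClient();

// Ask the context not to call DescribeTable; key information must then
// come from the attributes on the model class below.
var context = new DynamoDBContext(client, new DynamoDBContextConfig
{
    DisableFetchingTableMetadata = true
});

var document = await context.LoadAsync<Model>("partitionKey", "sortKey");

// "SomeTable" is a hypothetical table name.
[DynamoDBTable("SomeTable")]
public class Model
{
    // Explicit key declarations replace the metadata that DescribeTable
    // would otherwise have supplied.
    [DynamoDBHashKey]
    public string PartitionKey { get; set; }

    [DynamoDBRangeKey]
    public string SortKey { get; set; }
}
```

If the model also queries secondary indexes, those index keys would presumably need their own attributes as well, since nothing is read from the table description anymore.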

Additional Information/Context

The bug started when a Kubernetes pod restarted following a node change. All other pods restarted as well, including others that query DynamoDB on the same account, but only this one hit the bug.

AWS .NET SDK and/or Package version used

AWSSDK.DynamoDBv2 Version="3.7.300.12"
AWSSDK.Extensions.NETCore.Setup Version="3.7.300"
AWSSDK.SecretsManager Version="3.7.301.11"
AWSSDK.SecurityToken Version="3.7.300.22"

Targeted .NET Platform

.NET 7.0

Operating System and version

Custom Alpine x64 image

Sussumu · 2024-10-23T14:59:04Z

Unfortunately, we got the same error on another table. This application was ported from ECS to EKS last week and runs in a different K8s namespace.

Versions:
AWSSDK.Core 3.7.303.27
AWSSDK.Core.SecretsManager 3.7.301
AWSSDK.SecurityToken 3.7.300.27
AWSSDK.DynamoDBv2 3.7.300.26
AWSSDK.Extensions.NETCore.Setup 3.7.1

bhoradc · 2024-10-23T17:59:36Z

Hello @Sussumu,

Thank you for reporting the issue. I tried to reproduce it at my end using the code snippet you provided, but unfortunately, I was unable to do so.

However, you rightly pointed out that the issue appears to be non-reproducible at will. I will discuss and review this further with the team to understand the root cause of the problem.

Thanks again for bringing this to our attention.

Regards,
Chaitanya

normj · 2024-11-08T20:07:14Z

@Sussumu Your first example looks like it is using LocalStack. Was your second incident also using LocalStack? You are correct that by default the DescribeTable call is used, and since the table is not expected to change, the metadata is cached. I've never heard of the DynamoDB service returning inaccurate information from the DescribeTable call. I'm wondering if there was some issue on the LocalStack side.
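
One way to check what DescribeTable reports for a table is to call it directly and print the key schema. This is only a diagnostic sketch, not the SDK's internal caching logic; the table name "SomeTable" is a placeholder, and credentials and region are assumed to come from the default provider chain:

```csharp
using System;
using System.Linq;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

// Hypothetical table name, for illustration only.
const string tableName = "SomeTable";

var client = new AmazonDynamoDBClient();

// The same service call the context relies on by default for table metadata.
DescribeTableResponse response = await client.DescribeTableAsync(tableName);

// Print each key attribute with its role (HASH or RANGE).
foreach (var element in response.Table.KeySchema)
{
    Console.WriteLine($"{element.AttributeName}: {element.KeyType}");
}

// The exception in this issue is raised when this count is not exactly one.
int hashKeyCount = response.Table.KeySchema.Count(e => e.KeyType == KeyType.HASH);
Console.WriteLine($"Hash keys reported: {hashKeyCount}");
```

Logging this output at startup, next to the failing LoadAsync call, might help tell a bad service response apart from a problem in the cached metadata.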

Sussumu · 2024-11-18T20:46:33Z

@normj Oh, it was just an example; in production we are using a regular AWS account in both cases. We're still experiencing this issue, though it's not common: maybe less than once a week.

Sussumu added the bug and needs-triage labels on Oct 19, 2024

bhoradc added the needs-reproduction, dynamodb, and p2 labels and removed the needs-triage label on Oct 21, 2024

bhoradc self-assigned this on Oct 21, 2024

bhoradc added the needs-review label and removed the needs-reproduction label on Oct 23, 2024

bhoradc removed their assignment on Oct 23, 2024

dscpinheiro added the response-requested label and removed the needs-review label on Nov 8, 2024

github-actions bot removed the response-requested label on Nov 19, 2024
