AWS: add DynamoDB implementation of Glue lock manager #2034
Conversation
aws/src/main/java/org/apache/iceberg/aws/glue/DynamoLockManager.java
        .key(toKey(entityId))
        .conditionExpression(CONDITION_LOCK_ID_MATCH)
        .expressionAttributeValues(toLockIdValues(entityId, ownerId))
        .build());
I think this method is a good candidate to use ExceptionUtil.runSafely.
Cancelling the heartbeat should be done in a finally block in case there is a connection issue that causes this to throw an exception. Otherwise, if the delete throws but the heartbeat keeps succeeding, the heartbeat thread could accidentally hold the lock forever.
The reason I would say to use runSafely is that I think it would be better to automatically suppress any exceptions in the finally block and that's what runSafely does. I doubt that heartbeat.cancel(false) would throw any exceptions, but it's good to be careful in case the cancellation code changes in the future.
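A minimal sketch of this suggestion, written as a plain try/finally rather than ExceptionUtil.runSafely; the dynamo, lockTableName, and heartbeats names are assumed here and do not appear in the snippets on this page. The reply below refines this further with an owner check.

```java
// Sketch only, not the code merged in this PR. Assumed members:
// DynamoDbClient dynamo, String lockTableName, Map<String, ScheduledFuture<?>> heartbeats.
public boolean release(String entityId, String ownerId) {
  boolean released = false;
  try {
    dynamo.deleteItem(DeleteItemRequest.builder()
        .tableName(lockTableName)
        .key(toKey(entityId))
        .conditionExpression(CONDITION_LOCK_ID_MATCH)
        .expressionAttributeValues(toLockIdValues(entityId, ownerId))
        .build());
    released = true;
  } catch (ConditionalCheckFailedException e) {
    LOG.error("Failed to release lock for entity: {}, owner: {}", entityId, ownerId, e);
  } finally {
    // runs even if deleteItem throws because of a connection problem, so the
    // heartbeat cannot keep renewing a lock the caller has already given up on
    ScheduledFuture<?> heartbeat = heartbeats.remove(entityId);
    try {
      if (heartbeat != null) {
        heartbeat.cancel(false);
      }
    } catch (RuntimeException e) {
      // suppressed, the way ExceptionUtil.runSafely treats failures in its finally block
      LOG.warn("Failed to cancel heartbeat for entity: {}", entityId, e);
    }
  }
  return released;
}
```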
I think we cannot always cancel the heartbeat in a finally block. If the caller has a different owner, the heartbeat should not be canceled. The correct behavior is to check the owner of the lock when trying to release, so that:
- if release succeeds, cancel the heartbeat
- if release fails and the heartbeat owner is the same, cancel the heartbeat; something must be wrong with Dynamo, and canceling the heartbeat lets the lock expire
- if release fails and the heartbeat owner is different, the caller is wrong and we should not cancel the ongoing heartbeat.
To achieve this, I introduced an internal class DynamoHeartbeat that also saves the owner in the heartbeats map.
I also added retries to release using Tasks, so it retries on retryable exceptions.
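A rough sketch of that flow, assuming DynamoHeartbeat holds the owner id and the scheduled future; the ownerId() and cancel() accessors and the deleteRequest helper are illustrative names, not the actual patch:

```java
// Illustrative only: the three cases described above.
public boolean release(String entityId, String ownerId) {
  boolean released = false;
  DynamoHeartbeat heartbeat = heartbeats.get(entityId);
  try {
    // per the comment above, the patch also wraps this call in Tasks
    // so it retries on retryable Dynamo exceptions
    dynamo.deleteItem(deleteRequest(entityId, ownerId));
    released = true;
  } catch (ConditionalCheckFailedException e) {
    LOG.error("Failed to release lock for entity: {}, owner: {}", entityId, ownerId, e);
  } finally {
    if (heartbeat != null && (released || heartbeat.ownerId().equals(ownerId))) {
      // case 1: release succeeded, or
      // case 2: release failed but this caller owns the heartbeat, so something is
      //         wrong with Dynamo and stopping the heartbeat lets the lock expire
      heartbeat.cancel();
      heartbeats.remove(entityId);
    }
    // case 3: release failed and the heartbeat has a different owner; the caller
    //         is wrong, so the ongoing heartbeat is left untouched
  }
  return released;
}
```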
This looks good. Nice catch with the owner check.
            .item(toNewItem(entityId, ownerId))
            .conditionExpression(CONDITION_LOCK_ID_MATCH)
            .expressionAttributeValues(toLockIdValues(entityId, ownerId))
            .build()), 0, heartbeatIntervalMs(), TimeUnit.MILLISECONDS));
I think this task needs to handle exceptions and keep running.
If there is a connection issue that causes this to fail, the task will not be rescheduled; instead, the future will contain the error. Other concurrent processes could then acquire the lock even though this process still thinks it holds it.
At a minimum, I think that this thread should log any exception thrown by dynamo and keep trying to update its lease.
It would be great if we could also detect if the lock is no longer held (any put fails with ConditionalCheckFailedException) and stop any commit attempts, but I don't see a way to do that. This should definitely log an error if ConditionalCheckFailedException is thrown and note that two processes may think they both hold the lock.
Yes, that is definitely a risk. I also thought about what you described, but I think the only way to achieve it is to let the lock manager caller explicitly call a heartbeat method, and that call would need to be passed all the way down to every place a commit happens to ensure the heartbeat does not time out.
So, compared with the current Hive lock, we are basically choosing between:
1. lock without heartbeat: a very long timeout is needed to guarantee the commit can finish, and if the lock holder dies, all processes are blocked for that long.
2. lock with heartbeat: no long timeout is needed, but we have to choose between:
   2.1. heartbeating in the background, which keeps the interface decoupled but leads to this issue of a potentially idle heartbeat thread
   2.2. heartbeating from the caller, which would require changing the commit interface all over the place to pass the heartbeat around.
I think 2.1 is still the best approach for now, since we have already made sure release must be called after a commit, so the heartbeat will be stopped. If release is not called, the process must have crashed and the heartbeat thread would be dead with it.
      } catch (ConditionalCheckFailedException e) {
        LOG.error("Fail to heartbeat for entity: {}, owner: {} due to conditional check failure, " +
            "unsafe concurrent commits might be going on", entityId, ownerId, e);
      } catch (DynamoDbException e) {
This is one of the few times I think we actually need to broaden the catch clause to Exception or RuntimeException instead of DynamoDbException. If anything causes the heartbeat to fail, like an unrelated NullPointerException inside the AWS SDK, then we want this to keep trying as long as the process thinks it holds the lock. Catching RuntimeException ensures that no exception can stop this task from being scheduled again.
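Putting the heartbeat comments together, the scheduled task would presumably end up looking roughly like the sketch below; scheduler and lockTableName are assumed names, while toNewItem(), the condition constants, and heartbeatIntervalMs() come from the hunks above. With ScheduledExecutorService.scheduleAtFixedRate, an uncaught exception suppresses all subsequent executions, which is why the body has to swallow and log everything itself.

```java
// Sketch only: heartbeat task that logs failures and keeps running.
ScheduledFuture<?> heartbeat = scheduler.scheduleAtFixedRate(() -> {
  try {
    dynamo.putItem(PutItemRequest.builder()
        .tableName(lockTableName)
        .item(toNewItem(entityId, ownerId))
        .conditionExpression(CONDITION_LOCK_ID_MATCH)
        .expressionAttributeValues(toLockIdValues(entityId, ownerId))
        .build());
  } catch (ConditionalCheckFailedException e) {
    // this process no longer holds the lock; two processes may both think they own it
    LOG.error("Fail to heartbeat for entity: {}, owner: {} due to conditional check failure, " +
        "unsafe concurrent commits might be going on", entityId, ownerId, e);
  } catch (RuntimeException e) {
    // broad on purpose: even an unrelated SDK failure must not stop the next scheduled run
    LOG.error("Failed to heartbeat for entity: {}, owner: {}", entityId, ownerId, e);
  }
}, 0, heartbeatIntervalMs(), TimeUnit.MILLISECONDS);
```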
Looks great, thanks @jackye1995!
As discussed, this is the DynamoDB part of #1823.
It also updates the documentation for LockManager.release().