AWS: add DynamoDB implementation of Glue lock manager #2034
Conversation
aws/src/main/java/org/apache/iceberg/aws/glue/DynamoLockManager.java
        .key(toKey(entityId))
        .conditionExpression(CONDITION_LOCK_ID_MATCH)
        .expressionAttributeValues(toLockIdValues(entityId, ownerId))
        .build());
I think this method is a good candidate to use ExceptionUtil.runSafely.
Cancelling the heartbeat should be done in a finally block in case there is a connection issue that causes this to throw an exception. Otherwise, if the delete throws but the heartbeat keeps succeeding, the heartbeat thread could accidentally hold the lock forever.
The reason I would say to use runSafely is that I think it would be better to automatically suppress any exceptions in the finally block and that's what runSafely does. I doubt that heartbeat.cancel(false) would throw any exceptions, but it's good to be careful in case the cancellation code changes in the future.
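A minimal sketch of this suggestion, written as a plain try/finally rather than ExceptionUtil.runSafely; the dynamo, lockTableName, and heartbeats names are assumed here and do not appear in the snippets on this page. The reply below refines this further with an owner check.

```java
// Sketch only, not the code merged in this PR. Assumed members:
// DynamoDbClient dynamo, String lockTableName, Map<String, ScheduledFuture<?>> heartbeats.
public boolean release(String entityId, String ownerId) {
  boolean released = false;
  try {
    dynamo.deleteItem(DeleteItemRequest.builder()
        .tableName(lockTableName)
        .key(toKey(entityId))
        .conditionExpression(CONDITION_LOCK_ID_MATCH)
        .expressionAttributeValues(toLockIdValues(entityId, ownerId))
        .build());
    released = true;
  } catch (ConditionalCheckFailedException e) {
    LOG.error("Failed to release lock for entity: {}, owner: {}", entityId, ownerId, e);
  } finally {
    // runs even if deleteItem throws because of a connection problem, so the
    // heartbeat cannot keep renewing a lock the caller has already given up on
    ScheduledFuture<?> heartbeat = heartbeats.remove(entityId);
    try {
      if (heartbeat != null) {
        heartbeat.cancel(false);
      }
    } catch (RuntimeException e) {
      // suppressed, the way ExceptionUtil.runSafely treats failures in its finally block
      LOG.warn("Failed to cancel heartbeat for entity: {}", entityId, e);
    }
  }
  return released;
}
```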
I think we cannot always cancel the heartbeat in a finally block. If the caller has a different owner, the heartbeat should not be canceled. The correct behavior is to check the owner of the lock when trying to release, so that:
- if release succeeds, cancel the heartbeat
- if release fails and the heartbeat owner is the same, cancel the heartbeat; something must be wrong with Dynamo, and canceling the heartbeat lets the lock expire
- if release fails and the heartbeat owner is different, the caller is wrong and we should not cancel the ongoing heartbeat.
To achieve this, I introduced an internal class DynamoHeartbeat that also saves the owner in the heartbeats map.
I also added retries to release using Tasks, so it retries on retryable exceptions.
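A rough sketch of that flow, assuming DynamoHeartbeat holds the owner id and the scheduled future; the ownerId() and cancel() accessors and the deleteRequest helper are illustrative names, not the actual patch:

```java
// Illustrative only: the three cases described above.
public boolean release(String entityId, String ownerId) {
  boolean released = false;
  DynamoHeartbeat heartbeat = heartbeats.get(entityId);
  try {
    // per the comment above, the patch also wraps this call in Tasks
    // so it retries on retryable Dynamo exceptions
    dynamo.deleteItem(deleteRequest(entityId, ownerId));
    released = true;
  } catch (ConditionalCheckFailedException e) {
    LOG.error("Failed to release lock for entity: {}, owner: {}", entityId, ownerId, e);
  } finally {
    if (heartbeat != null && (released || heartbeat.ownerId().equals(ownerId))) {
      // case 1: release succeeded, or
      // case 2: release failed but this caller owns the heartbeat, so something is
      //         wrong with Dynamo and stopping the heartbeat lets the lock expire
      heartbeat.cancel();
      heartbeats.remove(entityId);
    }
    // case 3: release failed and the heartbeat has a different owner; the caller
    //         is wrong, so the ongoing heartbeat is left untouched
  }
  return released;
}
```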
This looks good. Nice catch with the owner check.
            .item(toNewItem(entityId, ownerId))
            .conditionExpression(CONDITION_LOCK_ID_MATCH)
            .expressionAttributeValues(toLockIdValues(entityId, ownerId))
            .build()), 0, heartbeatIntervalMs(), TimeUnit.MILLISECONDS));
I think this task needs to handle exceptions and keep running.
If there is a connection issue that causes this to fail, the task will not be rescheduled; instead, the future will contain the error. Other concurrent processes could then acquire the lock even though this process still thinks it holds it.
At a minimum, I think that this thread should log any exception thrown by dynamo and keep trying to update its lease.
It would be great if we could also detect if the lock is no longer held (any put fails with ConditionalCheckFailedException) and stop any commit attempts, but I don't see a way to do that. This should definitely log an error if ConditionalCheckFailedException is thrown and note that two processes may think they both hold the lock.
Yes, that is definitely a risk. I also thought about what you described, but I think the only way to achieve it is to let the lock manager caller explicitly call a heartbeat method, and that call would need to be passed all the way down to every place a commit happens to ensure the heartbeat does not time out.
So, compared with the current Hive lock, we are basically choosing between:
1. lock without heartbeat: a very long timeout is needed to guarantee the commit can finish, and if the lock holder dies, all processes are blocked for that long.
2. lock with heartbeat: no long timeout is needed, but we have to choose between:
   2.1. heartbeating in the background, which keeps the interface decoupled but leads to this issue of a potentially idle heartbeat thread
   2.2. heartbeating from the caller, which would require changing the commit interface all over the place to pass the heartbeat around.
I think 2.1 is still the best approach for now, since we have already made sure release must be called after a commit, so the heartbeat will be stopped. If release is not called, the process must have crashed and the heartbeat thread would be dead with it.
      } catch (ConditionalCheckFailedException e) {
        LOG.error("Fail to heartbeat for entity: {}, owner: {} due to conditional check failure, " +
            "unsafe concurrent commits might be going on", entityId, ownerId, e);
      } catch (DynamoDbException e) {
This is one of the few times I think we actually need to broaden the catch clause to Exception or RuntimeException instead of DynamoDbException. If anything causes the heartbeat to fail, like an unrelated NullPointerException inside the AWS SDK, then we want this to keep trying as long as the process thinks it holds the lock. Catching RuntimeException ensures that no exception can stop this task from being scheduled again.
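Putting the heartbeat comments together, the scheduled task would presumably end up looking roughly like the sketch below; scheduler and lockTableName are assumed names, while toNewItem(), the condition constants, and heartbeatIntervalMs() come from the hunks above. With ScheduledExecutorService.scheduleAtFixedRate, an uncaught exception suppresses all subsequent executions, which is why the body has to swallow and log everything itself.

```java
// Sketch only: heartbeat task that logs failures and keeps running.
ScheduledFuture<?> heartbeat = scheduler.scheduleAtFixedRate(() -> {
  try {
    dynamo.putItem(PutItemRequest.builder()
        .tableName(lockTableName)
        .item(toNewItem(entityId, ownerId))
        .conditionExpression(CONDITION_LOCK_ID_MATCH)
        .expressionAttributeValues(toLockIdValues(entityId, ownerId))
        .build());
  } catch (ConditionalCheckFailedException e) {
    // this process no longer holds the lock; two processes may both think they own it
    LOG.error("Fail to heartbeat for entity: {}, owner: {} due to conditional check failure, " +
        "unsafe concurrent commits might be going on", entityId, ownerId, e);
  } catch (RuntimeException e) {
    // broad on purpose: even an unrelated SDK failure must not stop the next scheduled run
    LOG.error("Failed to heartbeat for entity: {}, owner: {}", entityId, ownerId, e);
  }
}, 0, heartbeatIntervalMs(), TimeUnit.MILLISECONDS);
```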
Looks great, thanks @jackye1995!
As discussed, this is the DynamoDB part of #1823.
It also updates the documentation for LockManager.release().