Skip to content

Comments

Prevent race conditions in concurrent MLS commit requests#2525

Merged
stefanwire merged 15 commits intodevelopfrom
FS-437-atomic-commits
Jul 1, 2022
Merged

Prevent race conditions in concurrent MLS commit requests#2525
stefanwire merged 15 commits intodevelopfrom
FS-437-atomic-commits

Conversation

@stefanwire
Copy link
Contributor

https://wearezeta.atlassian.net/browse/FS-437

Checklist

  • The PR Title explains the impact of the change.
  • The PR description provides context as to why the change should occur and what the code contributes to that effect. This could also be a link to a JIRA ticket or a Github issue, if there is one.
  • If a cassandra schema migration has been added, I ran make git-add-cassandra-schema to update the cassandra schema documentation.
  • changelog.d contains the following bits of information (details):
    • A file with the changelog entry in one or more suitable sub-sections. The sub-sections are marked by directories inside changelog.d.
    • If a cassandra schema migration is backwards incompatible (see also these docs), measures to be taken by on-premise instance operators are explained.

@stefanwire stefanwire temporarily deployed to cachix June 30, 2022 09:00 Inactive
@stefanwire stefanwire force-pushed the FS-437-atomic-commits branch from 3e4487c to 5ce9adc Compare June 30, 2022 09:32
@stefanwire stefanwire temporarily deployed to cachix June 30, 2022 09:32 Inactive
. zUser claimant
)

-- TODO(SB) generalise such that two prepared commits can be sent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO left

@stefanwire stefanwire temporarily deployed to cachix June 30, 2022 12:20 Inactive
Copy link
Contributor

@pcapriotti pcapriotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good if CI agrees. Minor comments follow.

lockAcquired <- acquireCommitLock gid epoch ttl
when (lockAcquired == NotAcquired) $
throwS @'MLSStaleMessage
bracket
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not acquire the lock on the opening side of the bracket?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

addRemoteMLSClients = "update member_remote_user set mls_clients = mls_clients + ? where conv = ? and user_remote_domain = ? and user_remote_id = ?"

acquireCommitLock :: PrepQuery W (GroupId, Epoch, Int32) Row
acquireCommitLock = "insert into mls_commit_locks (group_id, epoch) values (?, ?) if not exists using ttl ?"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that this works. I was under the impression that you can't have TTL values as "question mark" parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tried it locally by temporarily removing the finishing part of the bracket and observed the entries appearing and finally disappearing after the TTL expired. Would this manual test count or should I look deeper into it?

@pcapriotti
Copy link
Contributor

Ah, one more thing. What happens if the TTL expires before commit processing is finished? I think in that case there's a race between the thread that started the lock, and some potential threads receiving a competing commit, since the former behaves as if it is still holding the lock. For extra safety, maybe we should add a timeout T to the commit processing block, and correspondingly set the TTL for the lock to slightly more than T. This should minimise the probability of such failures.

@stefanwire stefanwire temporarily deployed to cachix June 30, 2022 13:47 Inactive
@smatting
Copy link
Contributor

Not sure if this improves things. If we add a timeout and the commit processing can't meet it, then it's quite likely we're left with a broken state. If we don't add a timeout and the commit processing doesn't finish in time we're losing the guarantee the lock gives, but then there's still the chance that no commit arrives before the commit processing is finished.

Alternatively we could increase the TTL of the lock to make this problem less likely to occur.

@stefanwire stefanwire force-pushed the FS-437-atomic-commits branch from 9f420de to 06bd07d Compare June 30, 2022 14:47
@stefanwire stefanwire temporarily deployed to cachix June 30, 2022 14:49 Inactive
@stefanwire stefanwire merged commit 4024071 into develop Jul 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants