fix: mempool locking mechanism in v1 and cat #1582

Merged · 8 commits into v0.34.x-celestia · Jan 8, 2025

Conversation

cmwaters
Contributor

@cmwaters cmwaters commented Jan 2, 2025

This is a residual issue from #1553.

The problem is now even more subtle. Because the mempool mutexes didn't cover both CheckTx and the actual addition of the transaction to the mempool, we occasionally hit a situation like the following:

  • CheckTx a tx with nonce 2
  • Before saving it to the store, collect all transactions and recheck them. This excludes the last tx with nonce 2, so the nonce in the state machine is still 1
  • Save the tx to the pool
  • A new tx comes in with nonce 3. The application is still at nonce 1, so it rejects the tx, expecting the next nonce to be 2

This PR fixes this pattern; however, it won't be watertight for the CAT pool until we can order on both gas fee (priority) and nonce.

@cmwaters cmwaters marked this pull request as ready for review January 2, 2025 20:43
@cmwaters cmwaters requested a review from a team as a code owner January 2, 2025 20:43
@cmwaters cmwaters requested review from rootulp and rach-id and removed request for a team January 2, 2025 20:43
@rootulp rootulp requested review from ninabarbakadze and removed request for rach-id January 2, 2025 20:51
@ninabarbakadze ninabarbakadze changed the title fix mempool locking mechanism in v1 and v2 fix mempool locking mechanism in v1 and cat Jan 3, 2025
@rootulp rootulp changed the title fix mempool locking mechanism in v1 and cat fix: mempool locking mechanism in v1 and cat Jan 3, 2025
rootulp
rootulp previously approved these changes Jan 3, 2025
Collaborator

@rootulp rootulp left a comment


LGTM with two comments:

  1. Consider removing the panic
  2. Consider a unit test for deleteOrderedTx

(two outdated review threads on mempool/cat/pool.go, resolved)
@@ -58,6 +63,9 @@ func (s *store) remove(txKey types.TxKey) bool {
		return false
	}
	s.bytes -= tx.size()
	if err := s.deleteOrderedTx(tx); err != nil {
		panic(err)
Collaborator


I'm concerned about the panic here. Should we consider:

  1. logging this error and return false instead?
  2. modifying the remove function signature to propagate the error?

Contributor Author

@cmwaters cmwaters Jan 6, 2025


The panic is in place because this would imply a developer error, i.e. this shouldn't error as long as the method is implemented correctly (the tx should always be in the ordered list if it exists in the map).

	if idx >= len(s.orderedTxs) || s.orderedTxs[idx] != tx {
		return fmt.Errorf("transaction %X not found in ordered list", tx.key)
	}
	s.orderedTxs = append(s.orderedTxs[:idx], s.orderedTxs[idx+1:]...)
Collaborator


If the idx is the last transaction in the ordered list, I expect s.orderedTxs[idx+1:] to fail.

Perhaps this needs a unit test where deleteOrderedTx is invoked on the last tx in the ordered list to verify there is no bug here.

Contributor Author


TestStoreSimple actually tests this by adding a tx and removing it. Since it's the only transaction, it is de facto the last tx in the ordered list.

Member


If the idx is the last transaction, s.orderedTxs[idx+1:] will be empty.

Member

@ninabarbakadze ninabarbakadze left a comment


LGTM with a few questions:

  • should we be upstreaming this?
  • did you bump celestia-core in celestia-app to verify that the failed test passes?
  • should we move the tests catching concurrency bugs to core?

	require.Empty(t, store.txs)
}

func TestStoreOrdering(t *testing.T) {
Member


Should we be testing this with a higher transaction count?

Contributor Author


What's the advantage?

Member

@ninabarbakadze ninabarbakadze left a comment


I'll approve once I understand exactly how sorting works in the store, since Go slices can be deceitful. More unit tests would be useful to make it foolproof.

@cmwaters
Contributor Author

cmwaters commented Jan 6, 2025

did you bump celestia-core in celestia-app to verify that the failed test passes?

Yes I did (locally)

should we be upstreaming this?

Upstream doesn't have a version of cat or the v1 priority mempool, and this locking mechanism isn't an issue in v0.

should we move the tests catching concurrency bugs to core?

No, I don't think so. The test can catch a wider range of bugs in celestia-app.

ninabarbakadze
ninabarbakadze previously approved these changes Jan 6, 2025
rootulp
rootulp previously approved these changes Jan 6, 2025
Collaborator

@rootulp rootulp left a comment


Added another test case for deleteOrderedTx

@rootulp rootulp enabled auto-merge (squash) January 7, 2025 20:01
@celestia-bot celestia-bot requested a review from rootulp January 7, 2025 20:01
@Wondertan Wondertan disabled auto-merge January 7, 2025 21:50
@Wondertan
Member

Removed the auto-merge by accident

@rootulp rootulp merged commit bb7ec10 into v0.34.x-celestia Jan 8, 2025
18 of 20 checks passed
@rootulp rootulp deleted the cal/fix-locking-in-mempool branch January 8, 2025 14:38
mergify bot pushed a commit to celestiaorg/celestia-app that referenced this pull request Jan 13, 2025
Upgrade to
https://github.com/celestiaorg/celestia-core/releases/tag/v1.44.2-tm-v0.34.35
to pick up the fix in
celestiaorg/celestia-core#1582

(cherry picked from commit 885c317)

# Conflicts:
#	test/interchain/go.mod
#	test/interchain/go.sum