Skip to content

Fix race condition in InternalResourceManager initialization#19067

Merged
rschlussel merged 1 commit intoprestodb:masterfrom
rschlussel:resource-manager-race
Feb 16, 2023
Merged

Fix race condition in InternalResourceManager initialization#19067
rschlussel merged 1 commit intoprestodb:masterfrom
rschlussel:resource-manager-race

Conversation

@rschlussel
Copy link
Contributor

@rschlussel rschlussel commented Feb 16, 2023

Throw an exception if a query tries to create resource groups before the configuration manager has been loaded. This fixes a race condition where a query can start constructing resource groups before the ResourceGroupConfigurationManager is loaded. This would cause some resource groups to get created in the legacy manager and never recreated when the new resource manager is loaded. This is one theory about why we are seeing some clusters get into a state on restart where all queries fail with a NullPointerException in ReloadingResourceGroupConfigurationManager.

Test plan - CI, new unit tests, validated fix on test cluster

== RELEASE NOTES ==

General Changes
* Fix a bug where if a query was submitted before the resource group configuration had been loaded, all queries to the cluster would fail with a NullPointerException

@rschlussel rschlussel force-pushed the resource-manager-race branch 3 times, most recently from 9939ea3 to 1a25304 Compare February 16, 2023 17:07
@rschlussel rschlussel marked this pull request as ready for review February 16, 2023 18:14
@rschlussel rschlussel requested a review from a team as a code owner February 16, 2023 18:14
@jainxrohit jainxrohit self-requested a review February 16, 2023 18:21
@swapsmagic
Copy link
Contributor

Can we add a test to verify query failure behavior when configuration manager is not fully initialized?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this local variable? initializingConfigurationManager
We should be able to initialize member variable initializingConfigurationManager directly?

@rschlussel rschlussel force-pushed the resource-manager-race branch from 1a25304 to 6cd3177 Compare February 16, 2023 19:25
@rschlussel
Copy link
Contributor Author

added tests

@rschlussel rschlussel force-pushed the resource-manager-race branch from 6cd3177 to bca80f2 Compare February 16, 2023 19:41
@rschlussel
Copy link
Contributor Author

rschlussel commented Feb 16, 2023

validated on a test cluster that throws lots of queries at the cluster right a way. a bunch of queries failed with "Presto server is still initializing", and then had a successful query and future queries succeeded.

Copy link
Contributor

@jainxrohit jainxrohit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@ajaygeorge ajaygeorge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % a few nits

Throw an exception if a query tries to create resource groups before the
configuration manager has been loaded.  This fixes a race condition
where a query can start constructing resource groups before the
ResourceGroupConfigurationManager is loaded.  This would cause some
resource groups to get created in the legacy manager and never recreated
when the new resource manager is loaded. This is one theory about why we
are seeing some clusters get into a state on restart where all queries
fail with a NullPointerException in
ReloadingResourceGroupConfigurationManager.
@rschlussel rschlussel force-pushed the resource-manager-race branch from bca80f2 to 4ce055e Compare February 16, 2023 22:22
@rschlussel rschlussel merged commit cd6a84d into prestodb:master Feb 16, 2023
@wanglinsong wanglinsong mentioned this pull request Feb 25, 2023
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants