Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #38169 - Fix operatingsystem race condition with concurrent registrations #10425

Merged

Conversation

jeremylenz
Copy link
Contributor

@jeremylenz jeremylenz commented Jan 28, 2025

While testing concurrent Katello registrations in batches we see the following error leading to some hosts (2, 3 usually) with the same OS version failing to register:

PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_operatingsystems_on_title"
DETAIL: Key (title)=(RedHat 9.2) already exists.

After these errors, the rest of registrations work without problems.
The error occurs with 50 concurrent registrations but does not occur with >120 concurrent registrations.

This PR switches from find_or_create_by to create_or_find_by. Per the Active Record documentation:

Attempts to create a record with the given attributes in a table that has a unique database constraint on one or several of its columns. If a row already exists with one or several of these unique constraints, the exception such an insertion would normally raise is caught, and the existing record with those attributes is found using #find_by!.

This is similar to #find_or_create_by, but avoids the problem of stale reads between the SELECT and the INSERT, as that method needs to first query the table, then attempt to insert a row if none is found.

In our testing we found create_or_find_by would successfully avoid the unique constraint error, but then because

This method will return a record if all given attributes are covered by unique constraints (unless the INSERT -> DELETE -> SELECT race condition is triggered), but if creation was attempted and failed due to validation errors it won’t be persisted, you get what #create returns in such situation.

the Rails validation for :name would cause it to return an unpersisted object with errors, and cause

ERROR: ERF42-2587 [Foreman::Exception]: Must provide an operating system 

We get around this with the second code change here, doing another find_by.

With these two changes, Pablo says he ran the test 15 times and did not get any errors.

Considerations

The tradeoff with this solution is that you may end up with some (harmless) errors in your Postgres logs.

Another possible approach is to use active_record_retry from Katello: https://github.com/Katello/katello/blob/882b257afce348d1795db28fce81611543287b9e/app/lib/katello/util/support.rb#L96

But that would require moving that code to Foreman, and we have not yet tested that it would work as well as this.

@jeremylenz
Copy link
Contributor Author

I also just noticed this:

The primary key may auto-increment on each create, even if it fails. This can accelerate the problem of running out of integers, if the underlying table is still stuck on a primary key of type int (note: All Rails apps since 5.1+ have defaulted to bigint, which is not liable to this problem).

If this solution is accepted, I probably will need to add a migration to change the operatingsystem id column to bigint.

Copy link
Contributor

@parthaa parthaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APJ with a test and migration

@jeremylenz jeremylenz force-pushed the 38169-operatingsystem-concurrent branch from 9df388e to 6c6953a Compare January 31, 2025 20:05
@jeremylenz
Copy link
Contributor Author

Updated with test and also with an additional find_by as Partha and I discussed offline. This should hopefully avoid the need for a migration.

@jeremylenz jeremylenz force-pushed the 38169-operatingsystem-concurrent branch from e245b4e to d298858 Compare January 31, 2025 21:44
@ofedoren
Copy link
Member

ofedoren commented Feb 3, 2025

The test failures are related :/

@jeremylenz jeremylenz force-pushed the 38169-operatingsystem-concurrent branch from d298858 to 0db5fb2 Compare February 3, 2025 15:37
@jeremylenz
Copy link
Contributor Author

🍏

Copy link
Contributor

@parthaa parthaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

Copy link
Contributor

@parthaa parthaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

@jeremylenz jeremylenz force-pushed the 38169-operatingsystem-concurrent branch from 0db5fb2 to 4af85c4 Compare February 4, 2025 14:54
Copy link
Member

@ofedoren ofedoren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @jeremylenz and @parthaa !

@ofedoren ofedoren merged commit ef8fec5 into theforeman:develop Feb 4, 2025
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants