[PP-1491] Introduce retries to prevent ObjectDeletedError #1948
Conversation
src/palace/manager/api/axis.py
Outdated
def process_book(
    self, bibliographic: Metadata, circulation: CirculationData
) -> tuple[Edition, LicensePool]:
    if self._db.dirty:
I don't love this check here, but I couldn't think of a cleaner way to do it and still use tenacity, which is a nice library. Any ideas to make this cleaner would be appreciated.
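For context, a minimal sketch of the pattern being discussed, assuming tenacity decorates process_book directly and the dirty check guards each retry. The class name, constant value, rollback call, and retry predicate are illustrative assumptions, not the PR's exact code:

    from sqlalchemy.orm import Session
    from sqlalchemy.orm.exc import ObjectDeletedError
    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

    MAXIMUM_BOOK_RETRIES = 3  # hypothetical value, for illustration only


    class BookProcessor:
        """Stand-in for the monitor that owns the session (self._db in the diff)."""

        def __init__(self, db: Session) -> None:
            self._db = db

        @retry(
            retry=retry_if_exception_type(ObjectDeletedError),
            stop=stop_after_attempt(MAXIMUM_BOOK_RETRIES),
            wait=wait_exponential(multiplier=1, min=1, max=60),
            reraise=True,
        )
        def process_book(self, bibliographic, circulation):
            # The check discussed above: if a previous attempt left unflushed
            # changes in the session, discard them before trying again.
            if self._db.dirty:
                self._db.rollback()
            ...  # actual processing of the bibliographic and circulation data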
I don't think this catches all the conditions you want to catch here... dirty just means that changes haven't been flushed, not that there are changes within the transaction.
Does it make more sense to take a similar approach to overdrive and not bring in the tenacity library here?
434e5b4 to f10a544
src/palace/manager/api/overdrive.py
Outdated
progress.exception = e
else:
    time.sleep(1)
    self.log.warning(
-       f"retrying book {book} (attempt {attempt} of {OverdriveCirculationMonitor.MAXIMUM_BOOK_RETRIES})"
+       f"retrying book {book} (attempt {attempt} of {max_retries})"
Why take a different approach to retrying here than in the other two monitors?
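For comparison, a rough reconstruction of the fixed-interval loop this Overdrive hunk implies; the surrounding function, the exception type caught, and the default retry count are assumptions, not the PR's actual code:

    import logging
    import time

    from sqlalchemy.orm.exc import ObjectDeletedError

    log = logging.getLogger(__name__)


    def process_with_manual_retries(book, process, progress, max_retries=3):
        """Hand-rolled retry loop in the style of the hunk above."""
        for attempt in range(1, max_retries + 1):
            try:
                process(book)
            except ObjectDeletedError as e:
                if attempt == max_retries:
                    # Out of attempts: record the failure on the progress object.
                    progress.exception = e
                else:
                    time.sleep(1)
                    log.warning(
                        f"retrying book {book} (attempt {attempt} of {max_retries})"
                    )
            else:
                break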
src/palace/manager/core/monitor.py
Outdated
items = self.fetch_batch(offset).all()
if items:
    self.process_items(items)
    self._db.commit()
I'm a bit worried about what the performance impact of all the additional commits is going to be here.
src/palace/manager/core/monitor.py
Outdated
# I've put the rollback here rather than in a @retry before_sleep hook because I could neither
# figure out how to cleanly reference the db session from within class or module level method on
# the one hand, nor pass an instance method reference to the hook.
self._db.rollback()
I'm especially concerned with the rollback we are doing here being in the base monitor class. It would be really easy for this to unexpectedly roll back work that happened before calling process_batch.
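One way to bound that blast radius, sketched under the assumption that the batch work can be wrapped in a SQLAlchemy SAVEPOINT; the helper name and call shape are hypothetical, though the author later mentions switching to a nested transaction:

    from sqlalchemy.orm import Session


    def run_batch_in_savepoint(db: Session, process_batch, offset):
        """Run one batch inside a SAVEPOINT so a failure only undoes that batch.

        A plain db.rollback() here would also discard any work done earlier in
        the enclosing transaction, which is the concern raised above.
        """
        nested = db.begin_nested()  # opens a SAVEPOINT
        try:
            result = process_batch(offset)
        except Exception:
            nested.rollback()  # rolls back to the SAVEPOINT only
            raise
        else:
            nested.commit()  # releases the SAVEPOINT; outer transaction unaffected
            return result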
stop=stop_after_attempt(MAXIMUM_BATCH_RETRIES),
wait=wait_exponential(multiplier=1, min=1, max=60),
reraise=True,
)
Why add this code to both the monitor base class and to the axis monitor directly?
Not sure I understand the question. If I understand it correctly, the reason was that they don't share the same "process" methods. SweepMonitor and Axis360CirculationMonitor share a common ancestor (CollectionMonitor), but the processing methods that need the retry behavior are defined in the subclasses.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##             main    #1948   +/- ##
=====================================
  Coverage   90.53%   90.54%
=====================================
  Files         338      338
  Lines       39943    39963   +20
  Branches     8642     8641    -1
=====================================
+ Hits        36164    36186   +22
+ Misses       2501     2500    -1
+ Partials     1278     1277    -1
☔ View full report in Codecov by Sentry.
@jonathangreen: I'm going to rework the PR not to use tenacity as you suggested. I liked tenacity because it seemed like it would be cleaner, but it seems to be more complicated in this case.
@jonathangreen: I decided to stick with tenacity (especially because it offers nice methods for doing exponential backoff, and I think it makes the code easier to read). I was able to get around the issue I was running into by using a nested transaction. I believe this solution will perform more or less as well as what came before. The only dramatic increase in commits will come with the Axis360CirculationMonitor, since its commits were being batched 50 at a time. I don't think this will result in significant performance degradation, because almost all of the latency with Axis appears to be due to HTTP communications with the service (about 2-3 per book). The SweepMonitor commits are still being batched in groups of 100, and the OverdriveCirculationMonitor still does one commit per book.
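A hedged sketch of how the two pieces mentioned here, tenacity retries and a nested transaction, might compose. The function shape and the retried exception types are assumptions; the wait and stop parameters mirror the snippet shown earlier in the thread, and the context-manager savepoint assumes SQLAlchemy 1.4+ behavior (commit on success, rollback on error):

    from sqlalchemy.orm import Session
    from sqlalchemy.orm.exc import ObjectDeletedError, StaleDataError
    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

    MAXIMUM_BATCH_RETRIES = 10  # constant name from the diff; 10 matches the description below


    @retry(
        retry=retry_if_exception_type((ObjectDeletedError, StaleDataError)),
        stop=stop_after_attempt(MAXIMUM_BATCH_RETRIES),
        wait=wait_exponential(multiplier=1, min=1, max=60),
        reraise=True,
    )
    def process_batch_with_retries(db: Session, process_batch, offset):
        # Each attempt runs inside a SAVEPOINT: a failed attempt rolls back only
        # its own changes, so the next attempt starts from a clean slate without
        # discarding work done earlier in the outer transaction.
        with db.begin_nested():
            return process_batch(offset)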
Add test coverage for axis
…attempting to retry.
…CirculationMonitor and SweepMonitor.
… ObjectDeletedError
6bed746 to b29a4c9
Looks good
Description
This update attempts to solve the problem of occasional monitor failures due to stale data. I believe the problem arises when two different scripts update rows at the same time. Because the SweepMonitor processes updates in batches of 50-100 books, one of the circulation monitors can change a database resource that is a member of a batch that has been retrieved from the database but not yet modified. This can result in either a StaleDataError (not observed but possible) or an ObjectDeletedError (observed).
The following monitors were showing this behavior: OverdriveFormatSweep, AxisCollectionReaper, and Axis360CirculationMonitor.
This update adds retry functionality to the SweepMonitor (and all of its subclasses that do not override process_batch). The retrier will try a maximum of 10 times, each time exponentially backing off until the sleep time hits 60s.
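As an illustration of the retry behavior described above, a hedged sketch of how such a policy could be exercised in a unit test without real sleeps; the test and function names are hypothetical and are not the PR's actual test coverage:

    from sqlalchemy.orm.exc import ObjectDeletedError
    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_none


    def test_retries_until_the_object_reappears():
        attempts = []

        @retry(
            retry=retry_if_exception_type(ObjectDeletedError),
            stop=stop_after_attempt(10),
            wait=wait_none(),  # skip the exponential backoff while testing
            reraise=True,
        )
        def flaky_process_batch(offset):
            attempts.append(offset)
            if len(attempts) < 3:
                # Simulate the stale row seen in production.
                raise ObjectDeletedError(None, "row was deleted by another script")
            return "processed"

        assert flaky_process_batch(0) == "processed"
        assert len(attempts) == 3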
Similarly, I added retry functionality to the Axis360CirculationMonitor with the same retry parameters applied to process_book, and removed the batched commit calls since batching the commits complicates the retry logic.
Motivation and Context
https://ebce-lyrasis.atlassian.net/browse/PP-1491
How Has This Been Tested?
Checklist