[PP-1491] Introduce retries to prevent ObjectDeletedError #1948

Merged
merged 8 commits into from
Jul 29, 2024

Conversation

dbernstein
Contributor

Description

This update attempts to solve the problem of occasional monitor failures due to stale data. I believe the problem arises when two different scripts update the same rows at the same time. Because the SweepMonitor processes updates in batches of 50-100 books, one of the circulation monitors can change a database resource that belongs to a batch that has been retrieved from the database but not yet modified. This can result in either a StaleDataError (not observed but possible) or an ObjectDeletedError (observed).

The following monitors were showing the behavior: OverdriveFormatSweep, AxisCollectionReaper, and the Axis360CirculationMonitor.

This update adds retry functionality to the SweepMonitor (and all of its subclasses not overriding process_batch).
The retrier will try a maximum of 10 times, each time exponentially backing off until the sleep time hits 60s.

Similarly, I added retry functionality with the same retry parameters to process_book in the Axis360CirculationMonitor and removed the batched commit calls, since batching the commits complicates the retry logic.
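
For readers unfamiliar with the pattern, here is a minimal stdlib-only sketch of the retry behavior described above: at most 10 attempts, with the sleep time doubling from 1s and capped at 60s. It is not the code from this PR, which uses tenacity; `backoff_delays` and `retry_with_backoff` are hypothetical names, and the exact delay schedule tenacity produces may differ from this one.

```python
import time
from typing import Callable


def backoff_delays(max_attempts: int = 10, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Sleep times between attempts: base * 2**n, never longer than `cap`."""
    # There is one fewer sleep than there are attempts.
    return [min(cap, base * 2**n) for n in range(max_attempts - 1)]


def retry_with_backoff(
    func: Callable[[], object],
    retry_on: type[BaseException],
    max_attempts: int = 10,
    sleep: Callable[[float], object] = time.sleep,
) -> object:
    """Call func() until it succeeds or max_attempts is reached.

    Only exceptions of type `retry_on` trigger a retry; the last
    failure is re-raised to the caller.
    """
    delays = backoff_delays(max_attempts)
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(delays[attempt - 1])
    raise RuntimeError("unreachable when max_attempts >= 1")


# Demo: fail twice, then succeed. Sleeps are recorded instead of performed.
attempts = {"count": 0}
recorded_sleeps: list[float] = []


def flaky_batch() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("simulated stale row")
    return "processed"


result = retry_with_backoff(flaky_batch, LookupError, sleep=recorded_sleeps.append)
```

Passing `sleep` as a parameter is only a convenience for testing the schedule without waiting; a real monitor would use the default `time.sleep`.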

Motivation and Context

https://ebce-lyrasis.atlassian.net/browse/PP-1491

How Has This Been Tested?

Checklist

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

def process_book(
    self, bibliographic: Metadata, circulation: CirculationData
) -> tuple[Edition, LicensePool]:
    if self._db.dirty:
Contributor Author

@dbernstein dbernstein Jul 24, 2024

I don't love this check here, but I couldn't think of a cleaner way to do it and still use tenacity, which is a nice library. Any ideas to make this cleaner would be appreciated.

Member

I don't think this catches all the conditions you want to catch here... dirty just means that changes haven't been flushed, not that there are changes within the transaction.

Does it make more sense to take a similar approach to overdrive and not bring in the tenacity library here?

@dbernstein dbernstein changed the title Pp 1491 introduce retries on object deleted error [PP-1491] Introduce retries to prevent ObjectDeletedError Jul 24, 2024
@dbernstein dbernstein force-pushed the PP-1491-introduce-retries-on-object-deleted-error branch 2 times, most recently from 434e5b4 to f10a544 Compare July 24, 2024 19:35
progress.exception = e
else:
time.sleep(1)
self.log.warning(
f"retrying book {book} (attempt {attempt} of {OverdriveCirculationMonitor.MAXIMUM_BOOK_RETRIES})"
f"retrying book {book} (attempt {attempt} of {max_retries})"
Member

Why take a different approach to retrying here than in the other two monitors?

items = self.fetch_batch(offset).all()
if items:
self.process_items(items)
self._db.commit()
Member

I'm a bit worried about what the performance impact of all the additional commits is going to be here.

# I've put the rollback here rather than in a @retry before_sleep hook because I could neither
# figure out how to cleanly reference the db session from within class or module level method on
# the one hand, nor pass an instance method reference to the hook.
self._db.rollback()
Member

I'm especially concerned with the rollback we are doing here being in the base monitor class. It would be really easy for this to roll back work unexpectedly that happened before calling process_batch.

stop=stop_after_attempt(MAXIMUM_BATCH_RETRIES),
wait=wait_exponential(multiplier=1, min=1, max=60),
reraise=True,
)
Member

Why add this code to both the monitor base class and to the axis monitor directly?

Contributor Author

@dbernstein dbernstein Jul 24, 2024

Not sure I understand the question. If I understand it, the reason was that they don't share the same "process" methods. SweepMonitor and Axis360CirculationMonitor share a common ancestor (CollectionMonitor), but the processing methods that need the retry behavior are handled by subclasses.


codecov bot commented Jul 24, 2024

Codecov Report

Attention: Patch coverage is 98.18182% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.54%. Comparing base (d2e0f66) to head (b29a4c9).

Files                            Patch %   Lines
src/palace/manager/api/axis.py   94.11%    0 Missing and 1 partial ⚠️

@@           Coverage Diff           @@
##             main    #1948   +/-   ##
=======================================
  Coverage   90.53%   90.54%           
=======================================
  Files         338      338           
  Lines       39943    39963   +20     
  Branches     8642     8641    -1     
=======================================
+ Hits        36164    36186   +22     
+ Misses       2501     2500    -1     
+ Partials     1278     1277    -1     


@dbernstein
Contributor Author

dbernstein commented Jul 24, 2024

@jonathangreen : I'm going to rework the PR not to use tenacity as you suggested. I liked tenacity because it seemed like it would be cleaner but it seems to be more complicated in this case.

@dbernstein
Contributor Author

@jonathangreen : I decided to stick with tenacity (especially because it offers nice methods for doing exponential backoff and I think it makes the code easier to read). I was able to get around the issue I was running into by using a nested transaction. I believe this solution will perform about as well as what came before. The only dramatic increase in commits will come with the Axis360CirculationMonitor, since commits were being batched 50 at a time. I don't think this will result in significant performance degradation because almost all of the latency with Axis appears to be due to HTTP communications with the service (about 2-3 per book). The SweepMonitor commits are still being batched in groups of 100 and the OverdriveCirculationMonitor still does one commit per book.
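
The nested transaction idea can be illustrated with a plain SQL SAVEPOINT. This is only a sketch using Python's built-in sqlite3 module, not the SQLAlchemy code in this PR; the `books` table and the `process_with_savepoint` helper are made up for illustration. It shows the property that matters for the earlier rollback concern: when the batch fails, only the work done inside the savepoint is undone, and work done earlier in the same transaction survives.

```python
import sqlite3
from typing import Callable


def process_with_savepoint(conn: sqlite3.Connection, work: Callable[[sqlite3.Connection], None]) -> None:
    """Run work(conn) inside a SAVEPOINT and undo only that work on failure."""
    conn.execute("SAVEPOINT batch")
    try:
        work(conn)
    except Exception:
        # Undo changes made since the savepoint, then drop the savepoint.
        conn.execute("ROLLBACK TO SAVEPOINT batch")
        conn.execute("RELEASE SAVEPOINT batch")
        raise
    else:
        conn.execute("RELEASE SAVEPOINT batch")


# isolation_level=None puts sqlite3 in autocommit mode so that
# BEGIN / COMMIT are issued explicitly below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE books (title TEXT)")

conn.execute("BEGIN")
conn.execute("INSERT INTO books VALUES ('earlier work')")


def failing_batch(c: sqlite3.Connection) -> None:
    c.execute("INSERT INTO books VALUES ('partial batch')")
    raise RuntimeError("simulated failure partway through a batch")


try:
    process_with_savepoint(conn, failing_batch)
except RuntimeError:
    pass  # a retry loop would call process_with_savepoint again here

conn.execute("COMMIT")
titles = [row[0] for row in conn.execute("SELECT title FROM books")]
```

After the commit, `titles` contains only the row inserted before the savepoint; the partially processed batch has been discarded without touching the enclosing transaction.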

@dbernstein dbernstein force-pushed the PP-1491-introduce-retries-on-object-deleted-error branch from 6bed746 to b29a4c9 Compare July 26, 2024 16:58
Member

@jonathangreen jonathangreen left a comment

Looks good

@dbernstein dbernstein merged commit 2963329 into main Jul 29, 2024
20 checks passed
@dbernstein dbernstein deleted the PP-1491-introduce-retries-on-object-deleted-error branch July 29, 2024 14:31
@dbernstein dbernstein added the bug Something isn't working label Aug 1, 2024