[PP-1491] Introduce retries to prevent ObjectDeletedError #1948

dbernstein · 2024-07-24T18:14:10Z

Description

This update attempts to solve the problem of occasional monitor failures due to stale data. I believe the problem arises when two different scripts update the same rows at the same time. Because the SweepMonitor processes updates in batches of 50-100 books, one of the circulation monitors can change a database resource that belongs to a batch that has already been retrieved from the database but not yet modified. This can result in either a StaleDataError (not observed but possible) or an ObjectDeletedError (observed).
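The race described above can be reproduced in miniature. The following is a sketch under assumptions, not code from this PR: it uses the stdlib `sqlite3` module instead of the SQLAlchemy session, a single connection plays both scripts, and `RowGoneError`, `make_db`, `fetch_batch`, `process_item`, and `simulate_race` are hypothetical names.

```python
import sqlite3


class RowGoneError(Exception):
    """Hypothetical stand-in for the ORM's stale-row errors."""


def make_db() -> sqlite3.Connection:
    # In-memory table with three books (assumed schema, for illustration only).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, swept INTEGER DEFAULT 0)")
    conn.executemany("INSERT INTO book (id) VALUES (?)", [(1,), (2,), (3,)])
    conn.commit()
    return conn


def fetch_batch(conn: sqlite3.Connection, size: int = 100) -> list[int]:
    # The sweep reads a whole batch of ids before modifying any of them.
    rows = conn.execute("SELECT id FROM book ORDER BY id LIMIT ?", (size,))
    return [row[0] for row in rows]


def process_item(conn: sqlite3.Connection, book_id: int) -> None:
    cursor = conn.execute("UPDATE book SET swept = 1 WHERE id = ?", (book_id,))
    if cursor.rowcount != 1:
        # The row vanished between the fetch and the update.
        raise RowGoneError(f"book {book_id} disappeared after the batch was fetched")


def simulate_race() -> list[int]:
    conn = make_db()
    batch = fetch_batch(conn)
    # The "other script" deletes a row that is already part of the fetched batch.
    conn.execute("DELETE FROM book WHERE id = 2")
    failed = []
    for book_id in batch:
        try:
            process_item(conn, book_id)
        except RowGoneError:
            failed.append(book_id)
    return failed
```

Calling `simulate_race()` returns the ids whose rows were gone by the time they were processed; in a real monitor this is the point where a retry on a freshly fetched batch would help.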

The following monitors were showing the behavior: OverdriveFormatSweep, AxisCollectionReaper, and the Axis360CirculationMonitor.

This update adds retry functionality to the SweepMonitor (and all of its subclasses not overriding process_batch).
The retrier will try a maximum of 10 times, each time exponentially backing off until the sleep time hits 60s.
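
As a rough illustration of that retry policy, here is a stdlib-only sketch. It is not the PR's code, which uses tenacity's `@retry` decorator. The names `backoff_delays` and `retry_with_backoff`, the doubling schedule starting at 1s, and the local `ObjectDeletedError` stand-in are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ObjectDeletedError(Exception):
    """Local stand-in for the ORM exception (assumption, keeps the sketch stdlib-only)."""


def backoff_delays(max_attempts: int = 10, base: float = 1.0, cap: float = 60.0) -> list[float]:
    # One delay between each pair of attempts: doubles each time, capped at `cap`.
    return [min(base * 2**i, cap) for i in range(max_attempts - 1)]


def retry_with_backoff(
    func: Callable[[], T],
    *,
    max_attempts: int = 10,
    retry_on: tuple[type[BaseException], ...] = (ObjectDeletedError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    delays = backoff_delays(max_attempts)
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: re-raise the last error to the caller.
                raise
            sleep(delays[attempt - 1])
    raise AssertionError("unreachable when max_attempts >= 1")
```

Under these assumptions the waits between the 10 attempts are 1, 2, 4, 8, 16, 32, 60, 60, 60 seconds, so a batch that never succeeds gives up after roughly three minutes of sleeping. The `sleep` parameter is injectable so that tests do not have to wait.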

Similarly, I added retry functionality to the Axis360CirculationMonitor's process_book, with the same retry parameters, and removed the batched commit calls, since batching the commits complicates the retry logic.

Motivation and Context

https://ebce-lyrasis.atlassian.net/browse/PP-1491

How Has This Been Tested?

Checklist

I have updated the documentation accordingly.
All new and existing tests passed.

dbernstein · 2024-07-24T18:19:49Z

src/palace/manager/api/axis.py

    def process_book(
        self, bibliographic: Metadata, circulation: CirculationData
    ) -> tuple[Edition, LicensePool]:
+        if self._db.dirty:


I don't love this check here, but I couldn't think of a cleaner way to do it and still use tenacity which is a nice library. Any ideas to make this cleaner would be appreciated.

jonathangreen · 2024-07-24T18:58:06Z

src/palace/manager/api/overdrive.py

                        progress.exception = e
                    else:
                        time.sleep(1)
                        self.log.warning(
-                            f"retrying book {book} (attempt {attempt} of {OverdriveCirculationMonitor.MAXIMUM_BOOK_RETRIES})"
+                            f"retrying book {book} (attempt {attempt} of {max_retries})"


Why take a different approach to retrying here than in the other two monitors?

jonathangreen · 2024-07-24T19:34:05Z

src/palace/manager/api/axis.py

    def process_book(
        self, bibliographic: Metadata, circulation: CirculationData
    ) -> tuple[Edition, LicensePool]:
+        if self._db.dirty:


I don't think this catches all the conditions you want to catch here... dirty just means that changes haven't been flushed, not that there are changes within the transaction.

Does it make more sense to take a similar approach to overdrive and not bring in the tenacity library here?

jonathangreen · 2024-07-24T19:35:12Z

src/palace/manager/core/monitor.py

        items = self.fetch_batch(offset).all()
        if items:
            self.process_items(items)
+            self._db.commit()


I'm a bit worried about the performance impact all the additional commits are going to have here.

jonathangreen · 2024-07-24T19:37:47Z

src/palace/manager/core/monitor.py

+            # I've put the rollback here rather than in a @retry before_sleep hook because I could neither
+            # figure out how to cleanly reference the db session from within class or module  level method on
+            # the one hand, nor pass an instance method reference to the hook.
+            self._db.rollback()


I'm especially concerned with the rollback we are doing here being in the base monitor class. It would be really easy for this to roll back work unexpectedly that happened before calling process_batch.

jonathangreen · 2024-07-24T19:37:53Z

src/palace/manager/core/monitor.py

+        stop=stop_after_attempt(MAXIMUM_BATCH_RETRIES),
+        wait=wait_exponential(multiplier=1, min=1, max=60),
+        reraise=True,
+    )


Why add this code to both the monitor base class and to the axis monitor directly?

Not sure I understand the question. If I understand it, the reason is that they don't share the same "process" methods. SweepMonitor and Axis360CirculationMonitor share a common ancestor (CollectionMonitor), but the processing methods that need the retry behavior are implemented by the subclasses.

codecov · 2024-07-24T19:44:25Z

Codecov Report

Attention: Patch coverage is 98.18182% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.54%. Comparing base (d2e0f66) to head (b29a4c9).

Files	Patch %	Lines
src/palace/manager/api/axis.py	94.11%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1948   +/-   ##
=======================================
  Coverage   90.53%   90.54%           
=======================================
  Files         338      338           
  Lines       39943    39963   +20     
  Branches     8642     8641    -1     
=======================================
+ Hits        36164    36186   +22     
+ Misses       2501     2500    -1     
+ Partials     1278     1277    -1

dbernstein · 2024-07-24T20:12:32Z

@jonathangreen : I'm going to rework the PR not to use tenacity as you suggested. I liked tenacity because it seemed like it would be cleaner but it seems to be more complicated in this case.

dbernstein · 2024-07-25T17:32:41Z

@jonathangreen : I decided to stick with tenacity (especially because it offers nice methods for doing exponential backoff, and I think it makes the code easier to read). I was able to get around the issue I was running into by using a nested transaction. I believe this solution will perform more or less as well as what came before. The only dramatic increase in commits will come with the Axis360CirculationMonitor, since its commits were being batched 50 at a time. I don't think this will result in significant performance degradation, because almost all of the latency with Axis appears to be due to HTTP communications with the service (about 2-3 per book). The SweepMonitor commits are still being batched in groups of 100, and the OverdriveCirculationMonitor still does one commit per book.
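
The nested-transaction idea mentioned above can be sketched with plain SQL savepoints. This is a stdlib `sqlite3` illustration under assumptions, not the SQLAlchemy code from this PR; `process_batch`, `demo`, and the `book` table are hypothetical. SQLAlchemy's `Session.begin_nested()` is built on the same SAVEPOINT mechanism.

```python
import sqlite3
from typing import Optional


def process_batch(conn: sqlite3.Connection, titles: list[str], fail_on: Optional[str] = None) -> None:
    # Open a savepoint so a failure undoes only this batch's writes.
    conn.execute("SAVEPOINT batch")
    try:
        for title in titles:
            if title == fail_on:
                raise RuntimeError(f"simulated failure on {title!r}")
            conn.execute("INSERT INTO book (title) VALUES (?)", (title,))
    except Exception:
        # Undo the partial batch, then drop the savepoint; the outer
        # transaction and any earlier work in it stay intact.
        conn.execute("ROLLBACK TO SAVEPOINT batch")
        conn.execute("RELEASE SAVEPOINT batch")
        raise
    conn.execute("RELEASE SAVEPOINT batch")


def demo() -> list[str]:
    # isolation_level=None: transactions are managed explicitly with BEGIN/COMMIT.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("BEGIN")
    conn.execute("INSERT INTO book (title) VALUES ('earlier work')")
    try:
        process_batch(conn, ["a", "b"], fail_on="b")  # first attempt fails midway
    except RuntimeError:
        pass
    process_batch(conn, ["a", "b"])  # the retry succeeds
    conn.execute("COMMIT")
    return [row[0] for row in conn.execute("SELECT title FROM book ORDER BY id")]
```

In this sketch the failed attempt leaves no partial rows behind, and the row written before `process_batch` survives the rollback, which is the property a plain session-wide rollback in a base class would not give.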

jonathangreen

Looks good

dbernstein changed the title ~~Pp 1491 introduce retries on object deleted error~~ [PP-1491] Introduce retries to prevent ObjectDeletedError Jul 24, 2024

dbernstein requested a review from jonathangreen July 24, 2024 18:40

dbernstein force-pushed the PP-1491-introduce-retries-on-object-deleted-error branch 2 times, most recently from 434e5b4 to f10a544 Compare July 24, 2024 19:35

jonathangreen reviewed Jul 24, 2024

dbernstein requested a review from jonathangreen July 25, 2024 22:30

dbernstein added 8 commits July 26, 2024 09:58

797da6e  [PP-1491] Introduce retries on SweepMonitor and Axis360CirculationMonitor

7e9dc81  Refactor retries to use tenacity retry library
         Add test coverage for axis

00479f1  Ensure that sessions containing stale objects are rolled back before attempting to retry.

b257474  Clean up.

d6d82a6  Update poetry lock.

427b425  Use nested transactions.

3390b85  Align OverdriveCirculationMonitor retry approach with that of Axis360CirculationMonitor and SweepMonitor.

b29a4c9  Fix test, reduce test execution time, add test for StaleDataError and ObjectDeletedError

dbernstein force-pushed the PP-1491-introduce-retries-on-object-deleted-error branch from 6bed746 to b29a4c9 Compare July 26, 2024 16:58

jonathangreen approved these changes Jul 29, 2024

dbernstein merged commit 2963329 into main Jul 29, 2024
20 checks passed

dbernstein deleted the PP-1491-introduce-retries-on-object-deleted-error branch July 29, 2024 14:31

dbernstein added the bug Something isn't working label Aug 1, 2024

jonathangreen mentioned this pull request Sep 18, 2024

Fix uncommitted nested transaction in SweepMonitor (PP-1717) #2074

Merged

2 tasks

jonathangreen mentioned this pull request Nov 21, 2024

[PP-1956] Ensure the context transaction manager is being used for cr… #2181

Open

2 tasks