Bulk closing many objects may cause dor-services-app to crash #4514
Comments
I've been thinking about how to reproduce this. I still have many batches of 50,000 items to update, so if someone is able to investigate, we can coordinate to see how a few jobs could be monitored. To avoid disrupting other users, I've been running bulk actions after hours, so the errors have been happening mainly at night.
My read of the code suggests that the … Unclear how to deal with this (perhaps something in Sidekiq to throttle job execution?)
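To make the throttling idea concrete, here is a minimal sketch of client-side rate limiting. It is not taken from Argo or dor-services-app and does not use Sidekiq. The names `Throttle`, `close_all`, and `close_version` are hypothetical, and the rate of 10 calls per second is an arbitrary assumption.

```python
import time
from typing import Callable, Iterable, List, Optional


class Throttle:
    """Enforce a minimum interval between successive calls (hypothetical helper)."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed: Optional[float] = None

    def wait(self) -> None:
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


def close_all(
    druids: Iterable[str],
    close_version: Callable[[str], None],
    throttle: Throttle,
) -> List[str]:
    """Close each druid in turn, never exceeding the throttle's rate.

    `close_version` stands in for whatever call actually closes a version;
    it is an assumption, not a real client method.
    """
    closed = []
    for druid in druids:
        throttle.wait()
        close_version(druid)
        closed.append(druid)
    return closed
```

For example, `close_all(druids, close_version, Throttle(max_per_second=10))` would spread 50,000 closes over roughly 83 minutes. Whether that trade-off is acceptable depends on what the actual failure mode turns out to be.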
There are many possible changes we can make to deal with an apparent DOS, and I'm wondering what the specific failure mode is on the DSA side. So, I agree with everything ☝🏻 re: the importance of reproducing this. It'd be good to know the root problem first. E.g.:
Based on what we learn, we could consider adding nodes to the load-balancer or the database cluster, or look into robustifying the Apache and/or Passenger configuration to allow more connections, etc. We could also consider changes to the code on either or both the Argo and DSA sides: the Argo job could send over more information in bulk; the DSA work could be made async; we could consider using messaging instead of synchronous HTTP API calls; etc. There are many ways to address this, whatever "this" is. 😄
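As one illustration of the "send over more information in bulk" option, the sketch below groups druids into fixed-size batches so that one request carries many druids instead of one. It assumes a bulk-capable endpoint that DSA may not have. `send_batch` and the batch size of 100 are hypothetical placeholders.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def close_in_bulk(
    druids: Sequence[str],
    send_batch: Callable[[List[str]], None],
    batch_size: int = 100,
) -> int:
    """Send druids in batches and return the number of requests made.

    `send_batch` stands in for a bulk close request; it is an assumption,
    not an existing API.
    """
    requests = 0
    for batch in chunked(druids, batch_size):
        send_batch(batch)
        requests += 1
    return requests
```

With a batch size of 100, a 50,000-druid bulk action would make 500 requests rather than 50,000. This only helps if the problem is the number of connections rather than the total work per druid.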
While the close version bulk action involves a lot of activity, it should all be serial / synchronous, so this is a bit surprising. Agree with @mjgiarlo that we need to reproduce to understand how to best address the problem. Note that recent changes to DSA VersionService (21d760e) make closes more efficient by not requiring the cocina object to be loaded.
Andrew says he has a test object for this; it is an object with tons of small files. (?)
Not quite that easy. Testing this needs thousands of objects, but the changes being made must be small so that accessioning runs at a high rate.
A few observations:
Noting:
To bulk create objects:
I was unable to reproduce this in stage. Given that (1) there have been multiple changes to SDR/DSA since the original problem and (2) there are significant differences between stage and production, it is hard to know what to conclude from this. @andrewjbtw How would you like to proceed?
Describe the problem
I've been bulk updating objects in batches between 10,000 and 50,000 druids in size. The end of this process involves running the bulk action for closing objects. I've noticed that when I run bulk action close on a batch of 50,000, almost inevitably there will be a problem with dor-services-app at some point in the process.
Usually, this shows up as Honeybadger (HB) errors in common-accessioning that just say "unable to reach dor-services-app", like these:
https://app.honeybadger.io/projects/52894/faults/86609367
https://app.honeybadger.io/projects/52894/faults/95143946
The bulk action will still complete but then I have to follow up by going through the workflow grid and addressing whatever errors occurred when dor-services-app was "unreachable." I've been trying to run the close actions at night in order to avoid too much disruption to regular accessioning.
Additional context
In the Fedora era, there were similar problems when running a bulk close, but they were triggered by much smaller batches (fewer than 5,000 druids), and cleanup afterwards was much more difficult. So the system has improved considerably. But a sustained high load of accessioning still appears to cause disruptions.