Open
Description
Is your feature request related to a problem? Please describe.
We are running R2R with Hatchet. Under high load from a client, we see repeated "THE TIME TO START THE STEP RUN IS TOO LONG, THE MAIN THREAD MAY BE BLOCKED" messages.
- The Hatchet scheduler is swamped by too many CPU-bound parse steps running at once.
- When the system tips over, R2R cancels and retries en masse (“on_failure” churn), which makes the queue grow faster and stalls everything, including embeddings and KG.
Describe the solution you'd like
It would be useful to have an API call that reports the health of the queue. A client could query it before uploading further documents.
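To illustrate how a client might use such a call, here is a minimal sketch. No queue-health endpoint exists in R2R today, so `get_waiting_steps` is a hypothetical callable standing in for whatever that request would look like; the threshold of 80 and the poll/timeout values are placeholders too.

```python
import time
from typing import Callable


def wait_for_queue(
    get_waiting_steps: Callable[[], int],  # hypothetical: wraps the requested health API
    max_waiting: int = 80,                 # placeholder threshold
    poll_s: float = 5.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until the queue reports at most `max_waiting` waiting steps.

    Returns True once the queue looks healthy, False if `timeout_s` of
    polling elapses first.
    """
    waited = 0.0
    while True:
        if get_waiting_steps() <= max_waiting:
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s
```

A client would call `wait_for_queue(...)` before each batch and only upload when it returns `True`.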
Describe alternatives you've considered
- Circuit breaker inside R2R: a 60-second poller that checks “Waiting Steps” from Hatchet; if >N (e.g., 80), flip a process flag that makes ingest endpoints return 429 with Retry-After. This is a 30-line FastAPI middleware.
- Observability: Prometheus metrics for step queue length (a gauge) and step durations (a histogram); alert when more than 60 steps have been waiting for 2 minutes.
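The circuit-breaker alternative could look roughly like the sketch below. It is written as a plain ASGI middleware so it needs no imports beyond the standard library; FastAPI should accept it through `app.add_middleware(IngestBackpressureMiddleware, breaker=breaker)`. Everything named here is an assumption for illustration: `fetch_waiting_steps` is a hypothetical coroutine that returns Hatchet's "Waiting Steps" count (the actual Hatchet call is not shown), and the `/v3/documents` prefix is a guess at the ingest route.

```python
import asyncio
from typing import Awaitable, Callable


class QueueCircuitBreaker:
    """Process-wide flag: open means ingest requests should be rejected."""

    def __init__(self, threshold: int = 80, retry_after_s: int = 60) -> None:
        self.threshold = threshold
        self.retry_after_s = retry_after_s
        self.open = False

    def update(self, waiting_steps: int) -> None:
        # Open the breaker while the queue is above the threshold.
        self.open = waiting_steps > self.threshold

    async def poll_forever(
        self,
        fetch_waiting_steps: Callable[[], Awaitable[int]],  # hypothetical Hatchet query
        interval_s: float = 60.0,
    ) -> None:
        while True:
            try:
                self.update(await fetch_waiting_steps())
            except Exception:
                pass  # keep the last known state if the poll itself fails
            await asyncio.sleep(interval_s)


class IngestBackpressureMiddleware:
    """Answers 429 with Retry-After on ingest POSTs while the breaker is open."""

    def __init__(self, app, breaker: QueueCircuitBreaker,
                 ingest_prefix: str = "/v3/documents") -> None:  # assumed route
        self.app = app
        self.breaker = breaker
        self.ingest_prefix = ingest_prefix

    async def __call__(self, scope, receive, send) -> None:
        reject = (
            scope["type"] == "http"
            and scope.get("method") == "POST"
            and scope["path"].startswith(self.ingest_prefix)
            and self.breaker.open
        )
        if not reject:
            await self.app(scope, receive, send)
            return
        body = b'{"detail": "ingestion queue is saturated, retry later"}'
        await send({
            "type": "http.response.start",
            "status": 429,
            "headers": [
                (b"content-type", b"application/json"),
                (b"content-length", str(len(body)).encode()),
                (b"retry-after", str(self.breaker.retry_after_s).encode()),
            ],
        })
        await send({"type": "http.response.body", "body": body})
```

The poller would be started once at application startup, for example with `asyncio.create_task(breaker.poll_forever(fetch_waiting_steps))`. Non-ingest routes and non-POST requests pass through untouched, so searches keep working while uploads are throttled.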