API call for queue health #2251

@ga-it

Is your feature request related to a problem? Please describe.
We are running R2R with Hatchet. Under high load from a client, we see repeated "THE TIME TO START THE STEP RUN IS TOO LONG, THE MAIN THREAD MAY BE BLOCKED" messages.

  • The Hatchet scheduler is swamped by too many concurrent CPU-bound parse steps.
  • When the system tips over, R2R cancels and retries en masse (“on_failure” churn), which makes the queue grow faster and stalls everything, including embeddings and KG.

Describe the solution you'd like
It would be useful to have an API call that reports the health of the queue. A client could query it before uploading further documents and hold off while the queue is overloaded.
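To make the request concrete, here is a minimal sketch of what such a call might return and how a client could act on it. `QueueHealth`, `assess_queue`, and `client_should_upload` are hypothetical names, not part of R2R or Hatchet, and the thresholds (60 and 80 waiting steps) are only the illustrative values mentioned elsewhere in this issue.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class QueueHealth:
    """Hypothetical payload for a queue-health endpoint."""
    status: str          # "ok", "degraded", or "overloaded"
    waiting_steps: int   # steps queued but not yet started
    retry_after_s: int   # suggested client back-off; 0 when not overloaded


def assess_queue(
    waiting_steps: int,
    degraded_at: int = 60,     # assumption: illustrative threshold
    overloaded_at: int = 80,   # assumption: illustrative threshold
    retry_after_s: int = 60,
) -> QueueHealth:
    """Classify the queue from the number of waiting steps."""
    if waiting_steps > overloaded_at:
        return QueueHealth("overloaded", waiting_steps, retry_after_s)
    if waiting_steps > degraded_at:
        return QueueHealth("degraded", waiting_steps, 0)
    return QueueHealth("ok", waiting_steps, 0)


def client_should_upload(health: dict) -> bool:
    """Client-side check: pause uploads only while the queue is overloaded."""
    return health.get("status") != "overloaded"


# Example: what the endpoint body could look like once serialized.
print(asdict(assess_queue(95)))
```

A client would fetch this payload before each batch and sleep for `retry_after_s` seconds when `client_should_upload` returns `False`.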

Describe alternatives you've considered

  1. Circuit breaker inside R2R: a poller that checks “Waiting Steps” from Hatchet every 60 seconds; if the count exceeds N (e.g., 80), it flips a process-wide flag that makes ingest endpoints return 429 with Retry-After. This would be roughly a 30-line FastAPI middleware.
  2. Observability: Prometheus counters on step queue length and durations, with an alert when more than 60 steps have been waiting for 2 minutes.
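The circuit breaker in alternative 1 could look roughly like the sketch below. It is written as plain ASGI middleware so it has no dependencies. Everything here is an assumption for illustration: the class names, the `/v3/documents` ingest prefix, and especially `fetch_waiting`, a hypothetical coroutine that would return the "Waiting Steps" count from Hatchet (no specific Hatchet API is implied).

```python
import asyncio
import json
from typing import Awaitable, Callable


class QueueCircuitBreaker:
    """Holds the last observed queue depth and decides whether to shed load."""

    def __init__(self, max_waiting: int = 80, retry_after_s: int = 60) -> None:
        self.max_waiting = max_waiting
        self.retry_after_s = retry_after_s
        self.waiting = 0  # last observed number of waiting steps

    @property
    def is_open(self) -> bool:
        return self.waiting > self.max_waiting

    async def poll_forever(
        self,
        fetch_waiting: Callable[[], Awaitable[int]],  # hypothetical Hatchet query
        interval_s: float = 60.0,
    ) -> None:
        while True:
            try:
                self.waiting = await fetch_waiting()
            except Exception:
                pass  # design choice: keep the last known value on failure
            await asyncio.sleep(interval_s)


class IngestCircuitBreakerMiddleware:
    """ASGI middleware: answers ingest POSTs with 429 while the breaker is open."""

    def __init__(self, app, breaker: QueueCircuitBreaker,
                 ingest_prefix: str = "/v3/documents") -> None:  # assumed path
        self.app = app
        self.breaker = breaker
        self.ingest_prefix = ingest_prefix

    async def __call__(self, scope, receive, send) -> None:
        is_ingest = (
            scope["type"] == "http"
            and scope.get("method") == "POST"
            and scope.get("path", "").startswith(self.ingest_prefix)
        )
        if is_ingest and self.breaker.is_open:
            body = json.dumps({"detail": "Ingestion queue overloaded"}).encode()
            await send({
                "type": "http.response.start",
                "status": 429,
                "headers": [
                    (b"content-type", b"application/json"),
                    (b"content-length", str(len(body)).encode()),
                    (b"retry-after", str(self.breaker.retry_after_s).encode()),
                ],
            })
            await send({"type": "http.response.body", "body": body})
            return
        # Breaker closed, or not an ingest request: pass through unchanged.
        await self.app(scope, receive, send)
```

With FastAPI this could be attached via `app.add_middleware(IngestCircuitBreakerMiddleware, breaker=breaker)`, with the poller started in a startup hook using `asyncio.create_task(breaker.poll_forever(fetch_waiting))`. Only ingest POSTs are rejected, so reads and searches keep working while the queue drains.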
