QVAC-13536 Fixing benchmark workflow errors (Embeddings + LLM) #690

Merged
gianni-cor merged 53 commits into tetherto:main from maxim-smotrov:fix/benchmark-pipeline-errors
Mar 5, 2026

Conversation

@maxim-smotrov

Contributor

Note

The embed datasets TRECCOVID and FiQA2018 still fail with a generic error message and no meaningful server logs. This suggests a crash somewhere in the C++ layer (possibly a segfault).

For now, we are disabling these datasets.

🎯 What problem does this PR solve?

  • Benchmark pipeline runs were failing or producing noisy output because context-overflow errors were not returned in a structured, retryable form.
  • Workflow defaults and results-summary collection made benchmark runs less stable and less informative, both in CI and on manual dispatch.

📝 How does it solve it?

  • Adds structured context-overflow handling in embed benchmark server responses (HTTP 422, code: CONTEXT_OVERFLOW, retryable details including sequence/context metadata).
  • Adds adaptive retry logic in embed benchmark client/model wrapper:
    • catches overflow responses,
    • applies targeted per-sequence truncation when details are available,
    • falls back to global truncation tightening when details are unavailable,
    • retries up to a bounded number of attempts with clearer logging.
  • Normalizes model config and device handling in the embed benchmark server to avoid inconsistent model-key/config behavior (cpu only when requested explicitly, otherwise gpu).
  • Updates workflow defaults and summary publishing:
    • embed benchmark default datasets narrowed to ArguAna,NFCorpus,SciFact,SCIDOCS,
    • llm benchmark default model pair switched to Qwen3 values,
    • both workflows now guard summary extraction and append only when fresh markdown results exist.
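
For readers skimming the description, the overflow handling and adaptive retry described above could look roughly like the sketch below. This is not the code from this PR. Only HTTP 422 and `code: CONTEXT_OVERFLOW` come from the description; the detail field names (`sequence_index`, `tokens`, `context_size`), the `embeddings` response key, the `send` callable, the default limits, and the character-based truncation (standing in for token-based truncation) are all assumptions made for illustration.

```python
import logging

log = logging.getLogger("embed-bench-sketch")


def is_context_overflow(status, body):
    """True if the response looks like a structured overflow (422 + CONTEXT_OVERFLOW)."""
    return status == 422 and body.get("code") == "CONTEXT_OVERFLOW"


def embed_with_retry(texts, send, max_chars=2000, max_attempts=4):
    """Call send(texts) -> (status, body), truncating and retrying on overflow.

    `send` is a hypothetical transport function; `max_chars` and
    `max_attempts` are illustrative defaults, not values from the PR.
    """
    texts = [t[:max_chars] for t in texts]
    for attempt in range(1, max_attempts + 1):
        status, body = send(texts)
        if not is_context_overflow(status, body):
            if status != 200:
                raise RuntimeError(f"embed request failed with status {status}")
            return body["embeddings"]

        details = body.get("details") or {}
        idx = details.get("sequence_index")
        tokens = details.get("tokens")
        ctx = details.get("context_size")
        if idx is not None and tokens and ctx:
            # Targeted truncation: shrink only the offending sequence, scaled
            # by context_size / tokens with a 10% safety margin (integer math).
            keep = max(1, len(texts[idx]) * ctx * 9 // (tokens * 10))
            texts[idx] = texts[idx][:keep]
            log.warning("attempt %d: sequence %d cut to %d chars", attempt, idx, keep)
        else:
            # No details available: tighten the global limit by 20%.
            max_chars = max(1, max_chars * 8 // 10)
            texts = [t[:max_chars] for t in texts]
            log.warning("attempt %d: global limit cut to %d chars", attempt, max_chars)

    raise RuntimeError(f"context overflow persisted after {max_attempts} attempts")
```

The retry count is bounded so that a request the server can never accept surfaces as an error instead of looping, which matches the "bounded number of attempts" wording above.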

🧪 How was it tested?

  • Ran the workflows in CI from a temp branch and verified they passed.

maxim-smotrov and others added 29 commits February 28, 2026 16:19
Try #1. Adding tokenizer proxy to provide vocab size.
Improve error handling on the server. Added retry in case of context …
Adding some more checks and limiting the datasets temporarily
@maxim-smotrov maxim-smotrov requested a review from a team as a code owner March 4, 2026 16:50
@maxim-smotrov maxim-smotrov force-pushed the fix/benchmark-pipeline-errors branch from 57ef730 to f9e3d83 Compare March 4, 2026 16:50
@maxim-smotrov maxim-smotrov changed the title QVAC-13536 Fixing benchmark workflow errors QVAC-13536 Fixing benchmark workflow errors (LLM, Embeddings) Mar 4, 2026
@maxim-smotrov maxim-smotrov changed the title QVAC-13536 Fixing benchmark workflow errors (LLM, Embeddings) QVAC-13536 Fixing benchmark workflow errors (Embeddings) Mar 4, 2026
Comment thread packages/qvac-lib-infer-llamacpp-embed/benchmarks/client/model_handler.py Outdated
Comment thread .github/workflows/benchmark-qvac-lib-infer-llamacpp-embed.yml Outdated
Comment thread .github/workflows/benchmark-qvac-lib-infer-llamacpp-llm.yml Outdated
Comment thread .github/workflows/benchmark-qvac-lib-infer-llamacpp-embed.yml Outdated
@maxim-smotrov maxim-smotrov changed the title QVAC-13536 Fixing benchmark workflow errors (Embeddings) QVAC-13536 Fixing benchmark workflow errors (Embeddings + LLM) Mar 4, 2026
Comment thread packages/qvac-lib-infer-llamacpp-embed/benchmarks/client/model_handler.py Outdated
@gianni-cor

Contributor

/review

@github-actions

github-actions Bot commented Mar 5, 2026

Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

@gianni-cor

Contributor

/review

@gianni-cor gianni-cor merged commit 061526d into tetherto:main Mar 5, 2026
46 of 50 checks passed
Proletter pushed a commit that referenced this pull request May 24, 2026
* Try #1. Adding tokenizer proxy to provide vocab size.

* Try #2. More fixes and logs.

* Try #3. Limit device to only cpu or gpu.

* Revert "Try #2. More fixes and logs."

This reverts commit a461e69.

* Revert "Try #1. Adding tokenizer proxy to provide vocab size."

This reverts commit 9951195.

* Fixing pipeline logging

* Add more logs

* Fixing bench logging

* Add more error handling and logging

* Improve error handling on the server. Added retry in case of context overflow.

* Make retries self-adjustable

* Adding some more checks and limiting the datasets temporarily

* Test: trying to narrow down the error

* Exclude failing datasets from embed benchmark

* Clean up the code

* Changing bench model for LLM

* Minor fixes for clarity

* Removing unused vars

* Removing unused imports

* Removing unused python deps

---------

Co-authored-by: gianni <gianfranco.cordella@tether.io>