QVAC-13536 Fixing benchmark workflow errors (Emdeddings + LLM) by maxim-smotrov · Pull Request #690 · tetherto/qvac

maxim-smotrov · 2026-03-04T16:50:29Z

Note

Embed datasets TRECCOVID and FiQA2018 still fail with a generic message and no meaningful server logs, which suggests a crash somewhere in the C++ layer (possibly a segfault).

For now, we are disabling these datasets.

🎯 What problem does this PR solve?

- Benchmark pipeline runs were failing or producing noisy output because context overflow errors were not returned in a structured, retryable form.
- Workflow defaults and results-summary collection made benchmark runs less stable and less informative, both in CI and under manual dispatch.

📝 How does it solve it?

- Adds structured context-overflow handling to embed benchmark server responses (HTTP 422, code: CONTEXT_OVERFLOW, retryable details including sequence/context metadata).
- Adds adaptive retry logic to the embed benchmark client/model wrapper, which:
  - catches overflow responses,
  - applies targeted per-sequence truncation when details are available,
  - falls back to tightening global truncation when details are unavailable,
  - retries up to a bounded number of attempts, with clearer logging.
- Normalizes model config/device handling in the embed benchmark server to avoid inconsistent model-key/config behavior (cpu when explicitly requested, otherwise gpu).
- Updates workflow defaults and summary publishing:
  - embed benchmark default datasets are narrowed to ArguAna,NFCorpus,SciFact,SCIDOCS,
  - the LLM benchmark default model pair is switched to Qwen3 values,
  - both workflows now guard summary extraction and append to the summary only when fresh markdown results exist.
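The retry behavior described above can be illustrated with a minimal sketch. This is not the code from the PR: the `ContextOverflow` exception stands in for an HTTP 422 response with code CONTEXT_OVERFLOW, the `sequence_index` and `context_size` detail fields are hypothetical names, and truncation is done by characters rather than tokens purely to keep the example self-contained.

```python
from typing import Callable, Dict, List, Optional


class ContextOverflow(Exception):
    """Stand-in for an HTTP 422 response with code CONTEXT_OVERFLOW."""

    def __init__(self, details: Optional[Dict[str, int]] = None):
        super().__init__("CONTEXT_OVERFLOW")
        # Hypothetical detail fields; the real payload shape is not shown in the PR text.
        self.details = details


def embed_with_retry(
    texts: List[str],
    embed: Callable[[List[str]], List[List[float]]],
    max_chars: int = 2000,   # assumed global truncation limit (characters, not tokens)
    max_attempts: int = 4,   # assumed retry bound
    shrink: float = 0.5,     # assumed shrink factor applied on each overflow
) -> List[List[float]]:
    batch = [t[:max_chars] for t in texts]
    for attempt in range(1, max_attempts + 1):
        try:
            return embed(batch)
        except ContextOverflow as err:
            if attempt == max_attempts:
                raise  # bounded: give up after the last attempt
            if err.details and "sequence_index" in err.details:
                # Targeted: shorten only the sequence the server pointed at.
                i = err.details["sequence_index"]
                batch[i] = batch[i][: max(1, int(len(batch[i]) * shrink))]
            else:
                # Fallback: tighten the global limit for every sequence.
                max_chars = max(1, int(max_chars * shrink))
                batch = [t[:max_chars] for t in batch]
            print(f"context overflow on attempt {attempt}/{max_attempts}, retrying")
    raise RuntimeError("max_attempts must be at least 1")


def fake_embed(batch: List[str], limit: int = 100) -> List[List[float]]:
    """Toy server: rejects any sequence longer than `limit` characters."""
    for i, text in enumerate(batch):
        if len(text) > limit:
            raise ContextOverflow({"sequence_index": i, "context_size": limit})
    return [[float(len(text))] for text in batch]


vectors = embed_with_retry(["a" * 300, "b" * 50], fake_embed)
```

In this toy run only the first sequence is shortened (300 to 150 to 75 characters) while the second is left untouched, which is the point of preferring per-sequence truncation over a global limit when the server supplies details.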

🧪 How was it tested?

Ran the workflows in CI from a temp branch and verified they passed.

gianni-cor · 2026-03-05T12:05:27Z

/review

github-actions · 2026-03-05T12:05:54Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

gianni-cor · 2026-03-05T12:28:37Z

/review

* Try #1. Adding tokenizer proxy to provide vocab size.
* Try #2. More fixes and logs.
* Try #3. Limit device to only cpu or gpu.
* Revert "Try #2. More fixes and logs." This reverts commit a461e69.
* Revert "Try #1. Adding tokenizer proxy to provide vocab size." This reverts commit 9951195.
* Fixing pipeline logging
* Add more logs
* Fixing bench logging
* Add more error handling and logging
* Improve error handling on the server. Added retry in case of context overflow.
* Make retries self-adjustable
* Adding some more checks and limiting the datasets temporarily
* Test: trying to narrow down the error
* Exclude failing datasets from embed benchmark
* Clean up the code
* Changing bench model for LLM
* Try #1. Adding tokenizer proxy to provide vocab size.
* Try #2. More fixes and logs.
* Try #3. Limit device to only cpu or gpu.
* Revert "Try #2. More fixes and logs." This reverts commit a461e69.
* Revert "Try #1. Adding tokenizer proxy to provide vocab size." This reverts commit 9951195.
* Fixing pipeline logging
* Add more logs
* Fixing bench logging
* Add more error handling and logging
* Improve error handling on the server. Added retry in case of context overflow.
* Make retries self-adjustable
* Adding some more checks and limiting the datasets temporarily
* Test: trying to narrow down the error
* Exclude failing datasets from embed benchmark
* Clean up the code
* Changing bench model for LLM
* Minor fixes for clarity
* Removing unused vars
* Removing unused imports
* Removing unused python deps

---------

Co-authored-by: gianni <gianfranco.cordella@tether.io>

maxim-smotrov and others added 29 commits February 28, 2026 16:19

Try #1. Adding tokenizer proxy to provide vocab size.

9951195

Merge pull request #611 from maxim-smotrov/fix/benchmark-pipeline-errors

c3456ed

Try #1. Adding tokenizer proxy to provide vocab size.

Try #2. More fixes and logs.

a461e69

Merge pull request #612 from maxim-smotrov/fix/benchmark-pipeline-errors

05b303a

Try #2. More fixes and logs.

Try #3. Limit device to only cpu or gpu.

b40ee2a

Merge pull request #614 from maxim-smotrov/fix/benchmark-pipeline-errors

3807b27

Try #3. Limit device to only cpu or gpu.

Revert "Try #2. More fixes and logs."

4a43dcc

This reverts commit a461e69.

Revert "Try #1. Adding tokenizer proxy to provide vocab size."

b85679e

This reverts commit 9951195.

Merge pull request #616 from maxim-smotrov/fix/benchmark-pipeline-errors

6925a42

Fix/benchmark pipeline errors

Fixing pipeline logging

9800050

Add more logs

87ad80b

Merge pull request #617 from maxim-smotrov/fix/benchmark-pipeline-errors

7daea3e

Fix/benchmark pipeline errors

Fixing bench logging

b9461a8

Add more error handling and logging

0359c56

Merge pull request #619 from maxim-smotrov/fix/benchmark-pipeline-errors

8b4a20d

Fix/benchmark pipeline errors

Improve error handling on the server. Added retry in case of context …

67413a1

…overflow.

Merge pull request #624 from maxim-smotrov/fix/benchmark-pipeline-errors

8cceb6e

Improve error handling on the server. Added retry in case of context …

Make retries self-adjustable

ba52896

Merge pull request #626 from maxim-smotrov/fix/benchmark-pipeline-errors

2953a58

Make retries self-adjustable

Adding some more checks and limiting the datasets temporarily

abbbce7

Merge pull request #628 from maxim-smotrov/fix/benchmark-pipeline-errors

3990d03

Adding some more checks and limiting the datasets temporarily

Test: trying to narrow down the error

4556082

Merge pull request #632 from maxim-smotrov/fix/benchmark-pipeline-errors

9ea84f5

Test: trying to narrow down the error

Exclude failing datasets from embed benchmark

f66c755

Merge pull request #683 from maxim-smotrov/fix/benchmark-pipeline-errors

3b57033

Exclude failing datasets from embed benchmark

Clean up the code

7eae7a6

Merge pull request #686 from maxim-smotrov/fix/benchmark-pipeline-errors

4268cd9

Clean up the code

Changing bench model for LLM

57ef730

Merge pull request #688 from maxim-smotrov/fix/benchmark-pipeline-errors

3f519d9

Changing bench model for LLM

maxim-smotrov requested a review from a team as a code owner March 4, 2026 16:50

maxim-smotrov added 9 commits March 4, 2026 17:50

Fixing bench logging

621d7e2

Add more error handling and logging

6158579

Improve error handling on the server. Added retry in case of context …

d477f33

…overflow.

Make retries self-adjustable

828f508

Adding some more checks and limiting the datasets temporarily

dc0e138

Test: trying to narrow down the error

9a92a37

Exclude failing datasets from embed benchmark

a70860d

Clean up the code

711562d

Changing bench model for LLM

f9e3d83

maxim-smotrov force-pushed the fix/benchmark-pipeline-errors branch from 57ef730 to f9e3d83 Compare March 4, 2026 16:50

maxim-smotrov changed the title ~~QVAC-13536 Fixing benchmark workflow errors~~ QVAC-13536 Fixing benchmark workflow errors (LLM, Emdeddings) Mar 4, 2026

maxim-smotrov changed the title ~~QVAC-13536 Fixing benchmark workflow errors (LLM, Emdeddings)~~ QVAC-13536 Fixing benchmark workflow errors (Emdeddings) Mar 4, 2026

gianni-cor requested changes Mar 4, 2026

maxim-smotrov and others added 3 commits March 5, 2026 00:41

Minor fixes for clarity

febb229

Merge pull request #696 from tetherto/main

29381c6

merge main

Merge branch 'temp-benchmark-fix-llm-embed' into fix/benchmark-pipeli…

992b740

…ne-errors

maxim-smotrov changed the title ~~QVAC-13536 Fixing benchmark workflow errors (Emdeddings)~~ QVAC-13536 Fixing benchmark workflow errors (Emdeddings + LLM) Mar 4, 2026

gianni-cor and others added 3 commits March 5, 2026 10:59

Merge branch 'main' into fix/benchmark-pipeline-errors

37ebc24

Removing unused vars

9fce290

Merge branch 'main' into fix/benchmark-pipeline-errors

a6e7cde

gianni-cor requested changes Mar 5, 2026

Comment thread packages/qvac-lib-infer-llamacpp-embed/benchmarks/client/model_handler.py Outdated

maxim-smotrov added 2 commits March 5, 2026 12:48

Removing unused imports

523f318

Removing unused python deps

5062c10

gianni-cor approved these changes Mar 5, 2026

dev-nid approved these changes Mar 5, 2026

gianni-cor merged commit 061526d into tetherto:main Mar 5, 2026
46 of 50 checks passed

gianni-cor merged 53 commits into tetherto:main from maxim-smotrov:fix/benchmark-pipeline-errors
