Add dump_metric to MMMU, lm-eval, and NeMo Skills eval paths#22147
Conversation
Code Review
This pull request integrates metric dumping into several test kits, including the accuracy test runner, LM evaluation kit, and MMMU VLM kit, to track evaluation scores. A review comment suggests standardizing the labeling schema in the LM evaluation kit to match the other modules by using the task name for the 'eval' label and adding an 'api' label for the framework.
```python
labels={
    "model": eval_config.get("model_name", ""),
    "eval": "lm-eval",
    "task": task["name"],
},
```
The labeling schema here is inconsistent with the other evaluation paths modified in this PR. In `mmmu_vlm_kit.py` and `accuracy_test_runner.py`, the `eval` label holds the benchmark/dataset name (e.g., "mmmu" or "mmmu-pro") and the `api` label holds the framework/runner (e.g., "lmms-eval" or "nemo-skills").
In this file, `eval` is set to "lm-eval" and the benchmark is stored in a separate `task` label. To keep the metrics collected from the different kits consistent, consider using the task name for the `eval` label and adding an `api` label set to "lm-eval".
Suggested change:

```diff
 labels={
     "model": eval_config.get("model_name", ""),
-    "eval": "lm-eval",
-    "task": task["name"],
+    "eval": task["name"],
+    "api": "lm-eval",
 },
```
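To make the suggested schema concrete, here is a minimal sketch; the `build_labels` helper is hypothetical (it does not exist in the repo) and only illustrates how the same label shape could be produced by every kit:

```python
# Hypothetical helper illustrating the suggested label schema: "eval" holds
# the benchmark/dataset name and "api" holds the framework/runner, matching
# mmmu_vlm_kit.py and accuracy_test_runner.py. Not the repo's actual API.
def build_labels(model_name: str, benchmark: str, framework: str) -> dict:
    return {
        "model": model_name,
        "eval": benchmark,   # e.g. "gsm8k", "mmmu", "mmmu-pro"
        "api": framework,    # e.g. "lm-eval", "lmms-eval", "nemo-skills"
    }

# With this shape, the lm-eval kit and the VLM kit label metrics identically:
lm_eval_labels = build_labels("my-model", "gsm8k", "lm-eval")
mmmu_labels = build_labels("my-model", "mmmu", "lmms-eval")
print(lm_eval_labels)
```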
/tag-and-rerun-ci
Summary
- Adds `dump_metric` calls to three eval paths that were missing them: MMMU (lmms-eval), lm-eval harness, and NeMo Skills (mmmu-pro)
- `run_eval.py` already had `dump_metric`; this covers the remaining ones
- `dump_metric` is silent on failure, so there is no risk to existing tests

This is Phase 2 of the eval unification plan started in #21667 (Phase 1: GSM8K unification). The goal is to ensure all eval paths emit `dump_metric` output, laying the groundwork for regression detection infrastructure in future phases.

Changes

| File | Call site |
| --- | --- |
| `kits/mmmu_vlm_kit.py` | `MMMUMixin.test_mmmu` + `MMMUMultiModelTestBase._run_vlm_mmmu_test` |
| `kits/lm_eval_kit.py` | `LMEvalMixin.test_lm_eval`, per task/metric |
| `accuracy_test_runner.py` | `_run_nemo_skills_eval`, after score parsing |

Test plan