
Add dump_metric to MMMU, lm-eval, and NeMo Skills eval paths#22147

Merged
hnyls2002 merged 1 commit into main from lsyin/add-dump-metric on Apr 5, 2026

Conversation

@hnyls2002 (Collaborator) commented Apr 5, 2026

Summary

  • Add dump_metric calls to three eval paths that were missing them: MMMU (lmms-eval), lm-eval harness, and NeMo Skills (mmmu-pro)
  • All eval paths through run_eval.py already had dump_metric; this covers the remaining ones
  • dump_metric is silent on failure — no risk to existing tests

This is Phase 2 of the eval unification plan started in #21667 (Phase 1: GSM8K unification). The goal is to ensure all eval paths emit dump_metric output, laying the groundwork for regression detection infrastructure in future phases.

Changes

| File | What |
| --- | --- |
| `kits/mmmu_vlm_kit.py` | `MMMUMixin.test_mmmu` + `MMMUMultiModelTestBase._run_vlm_mmmu_test` |
| `kits/lm_eval_kit.py` | `LMEvalMixin.test_lm_eval`, per task/metric |
| `accuracy_test_runner.py` | `_run_nemo_skills_eval`, after score parsing |

Test plan

  • CI passes (no functional change — dump_metric is additive and silent-fail)

@github-actions github-actions bot added the Multi-modal multi-modal language model label Apr 5, 2026
@gemini-code-assist (Contributor) bot left a comment

Code Review

This pull request integrates metric dumping into several test kits, including the accuracy test runner, LM evaluation kit, and MMMU VLM kit, to track evaluation scores. A review comment suggests standardizing the labeling schema in the LM evaluation kit to match the other modules by using the task name for the 'eval' label and adding an 'api' label for the framework.

Comment on lines +77 to +81:

```python
labels={
    "model": eval_config.get("model_name", ""),
    "eval": "lm-eval",
    "task": task["name"],
},
```
Severity: medium

The labeling schema here is inconsistent with the other evaluation paths modified in this PR. In mmmu_vlm_kit.py and accuracy_test_runner.py, the eval label represents the benchmark/dataset name (e.g., "mmmu" or "mmmu-pro") and the api label represents the framework/runner (e.g., "lmms-eval" or "nemo-skills").

In this file, eval is set to "lm-eval" and the benchmark is stored in a separate task label. To maintain consistency across the metrics collected from different kits, consider using the task name for the eval label and adding an api label set to "lm-eval".

Suggested change:

```diff
 labels={
     "model": eval_config.get("model_name", ""),
-    "eval": "lm-eval",
-    "task": task["name"],
+    "eval": task["name"],
+    "api": "lm-eval",
 },
```
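Under the reviewer's proposal, all three call sites would converge on one label schema: `eval` names the benchmark, `api` names the harness. A sketch with illustrative model and benchmark values (the actual values come from each kit's config):

```python
# Unified schema the review proposes: "eval" = benchmark, "api" = runner.
COMMON_KEYS = {"model", "eval", "api"}

# Illustrative label dicts for the three call sites touched by this PR.
call_sites = {
    "kits/mmmu_vlm_kit.py": {"model": "example-vlm", "eval": "mmmu", "api": "lmms-eval"},
    "kits/lm_eval_kit.py": {"model": "example-llm", "eval": "gsm8k", "api": "lm-eval"},
    "accuracy_test_runner.py": {"model": "example-llm", "eval": "mmmu-pro", "api": "nemo-skills"},
}

for path, labels in call_sites.items():
    # Every call site emits exactly the same label keys.
    assert set(labels) == COMMON_KEYS, f"{path} deviates from the schema"
```

A shared key set makes it possible to group and compare metrics across harnesses without per-kit special cases, which is the point of the review comment.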

@hnyls2002 (Collaborator, Author)

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Apr 5, 2026
@hnyls2002 hnyls2002 merged commit aeff9fb into main Apr 5, 2026
124 of 161 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/add-dump-metric branch April 5, 2026 10:23
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
