Add dump_metric to MMMU, lm-eval, and NeMo Skills eval paths#22147
Conversation
Code Review
This pull request integrates metric dumping into several test kits, including the accuracy test runner, LM evaluation kit, and MMMU VLM kit, to track evaluation scores. A review comment suggests standardizing the labeling schema in the LM evaluation kit to match the other modules by using the task name for the 'eval' label and adding an 'api' label for the framework.
```python
labels={
    "model": eval_config.get("model_name", ""),
    "eval": "lm-eval",
    "task": task["name"],
},
```
The labeling schema here is inconsistent with the other evaluation paths modified in this PR. In `mmmu_vlm_kit.py` and `accuracy_test_runner.py`, the `eval` label holds the benchmark/dataset name (e.g., "mmmu" or "mmmu-pro") and the `api` label holds the framework/runner (e.g., "lmms-eval" or "nemo-skills").
In this file, `eval` is set to "lm-eval" and the benchmark is stored in a separate `task` label. To keep the metrics collected from the different kits consistent, consider using the task name for the `eval` label and adding an `api` label set to "lm-eval".
Suggested change:

```diff
 labels={
     "model": eval_config.get("model_name", ""),
-    "eval": "lm-eval",
-    "task": task["name"],
+    "eval": task["name"],
+    "api": "lm-eval",
 },
```
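To make the suggested schema concrete, here is a minimal sketch; the `build_labels` helper is hypothetical (it does not exist in the repo) and only illustrates how the same label shape could be produced by every kit:

```python
# Hypothetical helper illustrating the suggested label schema: "eval" holds
# the benchmark/dataset name and "api" holds the framework/runner, matching
# mmmu_vlm_kit.py and accuracy_test_runner.py. Not the repo's actual API.
def build_labels(model_name: str, benchmark: str, framework: str) -> dict:
    return {
        "model": model_name,
        "eval": benchmark,   # e.g. "gsm8k", "mmmu", "mmmu-pro"
        "api": framework,    # e.g. "lm-eval", "lmms-eval", "nemo-skills"
    }

# With this shape, the lm-eval kit and the VLM kit label metrics identically:
lm_eval_labels = build_labels("my-model", "gsm8k", "lm-eval")
mmmu_labels = build_labels("my-model", "mmmu", "lmms-eval")
print(lm_eval_labels)
```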
/tag-and-rerun-ci
Summary
- Adds `dump_metric` calls to three eval paths that were missing them: MMMU (lmms-eval), lm-eval harness, and NeMo Skills (mmmu-pro)
- `run_eval.py` already had `dump_metric`; this covers the remaining ones
- `dump_metric` is silent on failure, so there is no risk to existing tests

This is Phase 2 of the eval unification plan started in #21667 (Phase 1: GSM8K unification). The goal is to ensure all eval paths emit `dump_metric` output, laying the groundwork for regression detection infrastructure in future phases.

Changes

| File | Call site |
| --- | --- |
| `kits/mmmu_vlm_kit.py` | `MMMUMixin.test_mmmu` + `MMMUMultiModelTestBase._run_vlm_mmmu_test` |
| `kits/lm_eval_kit.py` | `LMEvalMixin.test_lm_eval`, per task/metric |
| `accuracy_test_runner.py` | `_run_nemo_skills_eval`, after score parsing |

Test plan