diff --git a/docs/evaluation/eval-kit.md b/docs/evaluation/eval-kit.md new file mode 100644 index 0000000000..f5658edae3 --- /dev/null +++ b/docs/evaluation/eval-kit.md @@ -0,0 +1,282 @@ +# VLMEvalKit Integration (eval_kit) + +This page explains how to run VLMEvalKit benchmarks through NeMo Skills using the `eval_kit` generation module. This enables evaluating Megatron multimodal models on VLMEvalKit's benchmark collection (MMBench, LibriSpeech, TedLium, etc.) without leaving the NeMo Skills pipeline. + +## Overview + +Two inference modes are available: + +| Mode | How it works | When to use | +|------|-------------|-------------| +| **mcore** | Megatron model loaded in-process via `torchrun` (no HTTP server) | Megatron checkpoints | +| **vllm** | NeMo Skills starts a vLLM server, VLMEvalKit connects as client | HF models served by vLLM | + +Both modes use the same pipeline command — the only difference is the `++model_type` flag. + +## Prerequisites + +Before running eval_kit benchmarks, you need four things set up: + +### 1. VLMEvalKit source code (local) + +The `vlmeval/` directory from VLMEvalKit gets packaged and shipped to the cluster automatically. You need a local clone: + +```bash +# Clone VLMEvalKit (NVIDIA internal fork with MultiModalMCore support) +git clone VLMEvalKitMcore /path/to/VLMEvalKitMcore +``` + +Then set the environment variable **before running any `ns eval` command**: + +```bash +export NEMO_SKILLS_VLMEVALKIT_PATH=/path/to/VLMEvalKitMcore +``` + +!!! important + This path is read **locally at submission time**. The pipeline packages the `vlmeval/` subdirectory and rsyncs it to the cluster. It does NOT need to exist on the cluster. + +### 2. eval_kit container on the cluster + +The eval_kit container must have PyTorch, Megatron, and VLMEvalKit dependencies pre-installed. Add it to your cluster config: +This container can be found in container storage + +```yaml +# cluster_configs/my_cluster.yaml +containers: + eval_kit: /path/to/eval-kit-nemo-skills.sqsh + # ... other containers +``` + +### 3. Megatron path (for mcore mode) + +The container needs access to a Megatron-LM installation. Set it in your cluster config: + +```yaml +env_vars: + - MEGATRON_PATH=/path/to/megatron-lm + - PYTHONPATH=/path/to/megatron-lm +``` + +And ensure the path is mounted: + +```yaml +mounts: + - /host/path/to/megatron-lm:/host/path/to/megatron-lm +``` + +### 4. VLMEvalKit dataset cache (for benchmarks that download from HuggingFace) + +VLMEvalKit downloads benchmark data on first use. Set a persistent cache directory: + +```yaml +env_vars: + - LMUData=/path/to/vlmevalkit_cache +``` + +## Running eval_kit Benchmarks + +### Mode 1: Megatron in-process (mcore) + +This is the primary mode. The model runs directly inside the `torchrun` process — no separate server. 
+
+```bash
+export NEMO_SKILLS_VLMEVALKIT_PATH=/path/to/VLMEvalKitMcore
+
+ns eval \
+    --cluster=my_cluster \
+    --output_dir=/path/to/results \
+    --benchmarks=eval_kit.LibriSpeech_test_clean \
+    --server_type=megatron \
+    --server_gpus=8 \
+    --server_container=/path/to/eval-kit-nemo-skills.sqsh \
+    ++model_type=mcore \
+    ++model_config=/path/to/config.yaml \
+    ++load_dir=/path/to/checkpoint/TP_1/
+```
+
+Key parameters:
+
+| Parameter | Purpose |
+|-----------|---------|
+| `--benchmarks=eval_kit.<dataset>` | VLMEvalKit dataset name (e.g., `LibriSpeech_test_clean`, `MMBench_DEV_EN`, `TedLium_ASR_Test`) |
+| `++model_type=mcore` | Triggers self-contained mode (no HTTP server, model loaded in-process) |
+| `++model_config=<path>` | Path to the Megatron model YAML config on the cluster |
+| `++load_dir=<path>` | Path to the Megatron checkpoint directory on the cluster |
+| `--server_gpus=8` | Number of GPUs allocated to the torchrun process |
+| `--server_container=<container>` | Container with Megatron + VLMEvalKit dependencies |
+
+!!! note
+    `--server_gpus` controls GPU allocation even though no server is started. In mcore mode, these GPUs go directly to the `torchrun` main task.
+
+!!! note
+    `--model` is **not needed** for mcore mode — the model is specified via `++model_config` and `++load_dir`.
+
+### Mode 2: vLLM server
+
+The pipeline starts a vLLM server, and VLMEvalKit's `VLLMLocal` client connects to it.
+
+```bash
+export NEMO_SKILLS_VLMEVALKIT_PATH=/path/to/VLMEvalKitMcore
+
+ns eval \
+    --cluster=my_cluster \
+    --output_dir=/path/to/results \
+    --benchmarks=eval_kit.MMBench_DEV_EN \
+    --model=Qwen/Qwen2-Audio-7B-Instruct \
+    --server_type=vllm \
+    --server_gpus=2 \
+    --server_container=/path/to/vllm-audio.sqsh \
+    --main_container=/path/to/eval-kit-nemo-skills.sqsh \
+    --server_args="--max-model-len 8192 --trust-remote-code" \
+    ++model_type=vllm \
+    ++model_name=qwen2-audio-7b
+```
+
+Key differences from mcore mode:
+
+| Parameter | Purpose |
+|-----------|---------|
+| `--model=<hf_model>` | HuggingFace model name or path (vLLM downloads/loads it) |
+| `++model_type=vllm` | VLMEvalKit uses its `VLLMLocal` client |
+| `++model_name=<name>` | Model identifier used by VLMEvalKit for result naming |
+| `--main_container=<container>` | Container for the eval_kit client (must have `vlmeval`). Separate from the vLLM server container |
+| `--server_container=<container>` | Container for the vLLM server |
+
+!!! warning
+    The vLLM server container and the eval_kit client container are different. Use `--server_container` for vLLM and `--main_container` for the eval_kit client that needs `vlmeval`.
+
+## Available Benchmarks
+
+Any VLMEvalKit dataset can be used with the `eval_kit.` prefix. Examples:
+
+### Audio / ASR
+
+| Benchmark name | Dataset |
+|---|---|
+| `eval_kit.LibriSpeech_test_clean` | LibriSpeech test-clean (2,620 samples) |
+| `eval_kit.LibriSpeech_test_other` | LibriSpeech test-other |
+| `eval_kit.TedLium_ASR_Test` | TED-LIUM |
+| `eval_kit.GigaSpeech_ASR_test` | GigaSpeech |
+| `eval_kit.VoxPopuli_ASR_test` | VoxPopuli |
+| `eval_kit.AMI_ASR_Test` | AMI meeting transcription |
+| `eval_kit.SPGISpeech_ASR_test` | SPGISpeech |
+| `eval_kit.Earnings22_ASR_Test` | Earnings22 |
+
+### Vision-Language
+
+| Benchmark name | Dataset |
+|---|---|
+| `eval_kit.MMBench_DEV_EN` | MMBench English dev |
+| `eval_kit.MME` | MME perception + cognition |
+| `eval_kit.MMMU_DEV_VAL` | MMMU dev+val |
+| `eval_kit.MathVista_MINI` | MathVista mini |
+
+The full list depends on your VLMEvalKit version. Check `vlmeval/dataset/` for all supported datasets.
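+To confirm a dataset name before submitting a job, you can inspect the local clone directly. A quick sketch (it only assumes the `NEMO_SKILLS_VLMEVALKIT_PATH` layout described above; the exact file layout can differ between VLMEvalKit versions):
+
+```bash
+# List the dataset implementations shipped with your clone
+ls "$NEMO_SKILLS_VLMEVALKIT_PATH/vlmeval/dataset/"
+
+# Or search for a specific dataset name, e.g. LibriSpeech
+grep -ril "librispeech" "$NEMO_SKILLS_VLMEVALKIT_PATH/vlmeval/dataset/"
+```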
+ +## mcore_skills: NeMo Skills Data + Megatron In-Process + +For benchmarks that already have NeMo Skills JSONL data (like `asr-leaderboard`), you can use the `mcore_skills` generation type. This reads NeMo Skills data and prompts but uses MultiModalMCore for inference (no server). + +```bash +export NEMO_SKILLS_VLMEVALKIT_PATH=/path/to/VLMEvalKitMcore + +ns eval \ + --cluster=my_cluster \ + --output_dir=/path/to/results \ + --benchmarks=asr-leaderboard \ + --split=librispeech_clean \ + --data_dir=/data \ + --generation_type=mcore_skills \ + --server_type=megatron \ + --server_gpus=8 \ + --server_container=/path/to/eval-kit-nemo-skills.sqsh \ + ++model_config=/path/to/config.yaml \ + ++load_dir=/path/to/checkpoint/TP_1/ \ + ++tokenizer=/path/to/tokenizer +``` + +Key differences from eval_kit: + +| | eval_kit | mcore_skills | +|---|---|---| +| Data source | VLMEvalKit downloads from HuggingFace | NeMo Skills JSONL from `--data_dir` | +| Prompts | VLMEvalKit's built-in prompts | NeMo Skills prompt templates | +| Evaluation | VLMEvalKit's `dataset.evaluate()` | ASR WER via VLMEvalKit's `asr_wer()` | +| Benchmarks | Any VLMEvalKit dataset | Any NeMo Skills benchmark with JSONL | +| Flag | `--benchmarks=eval_kit.` | `--generation_type=mcore_skills` | + +## Cluster Config Example + +Here is a complete cluster config section for eval_kit support: + +```yaml +containers: + eval_kit: /path/to/eval-kit-nemo-skills.sqsh + megatron: /path/to/megatron-container.sqsh + vllm: /path/to/vllm-container.sqsh + # ... other containers + +mounts: + - /path/to/megatron-lm:/path/to/megatron-lm + - /path/to/data:/data + - /path/to/hf_cache:/workspace_hf/hf_cache + - /path/to/vlmevalkit_cache:/path/to/vlmevalkit_cache + +env_vars: + - MEGATRON_PATH=/path/to/megatron-lm + - PYTHONPATH=/path/to/megatron-lm + - LMUData=/path/to/vlmevalkit_cache + - HF_HOME=/workspace_hf/hf_cache + - HYDRA_FULL_ERROR=1 + - CUDA_DEVICE_MAX_CONNECTIONS=1 +``` + +## Understanding Results + +After evaluation completes, results are in `/eval-results/`: + +``` +/ +└── eval-results/ + └── eval_kit.LibriSpeech_test_clean/ + ├── output.jsonl # Per-sample results (generation + expected_answer) + ├── eval_kit_metrics.json # Aggregate metrics from VLMEvalKit + └── metrics.json # NeMo Skills summary +``` + +The `eval_kit_metrics.json` contains VLMEvalKit's computed metrics. For ASR benchmarks this is typically: + +```json +{ + "result": " Dataset WER (%) Metric\n0 LibriSpeechDataset 1.555811 WER" +} +``` + +## Troubleshooting + +### `No module named 'megatron.core'` + +The `MEGATRON_PATH` or `PYTHONPATH` is not set correctly in the cluster config `env_vars`. Ensure both point to a Megatron-LM installation that contains `megatron/core/`. + +### `env variable RD_TABLEBENCH_SRC is missing` + +Some VLMEvalKit versions have a hard assert on this environment variable at import time. Fix: use the stable VLMEvalKitMcore version, or set `RD_TABLEBENCH_SRC=/tmp` in your cluster config env_vars. + +### `ModuleNotFoundError: No module named 'vlmeval'` + +The `NEMO_SKILLS_VLMEVALKIT_PATH` was not set when you ran `ns eval`, so the `vlmeval/` directory was not packaged. Set it and re-run: + +```bash +export NEMO_SKILLS_VLMEVALKIT_PATH=/path/to/VLMEvalKitMcore +ns eval ... +``` + +### Installation command for missing dependencies + +If the eval_kit container is missing some Python packages, use `--installation_command`: + +```bash +--installation_command "pip install --no-deps pylatexenc==2.10" +``` + +This runs inside the container before the main task starts. 
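+If several packages are missing, the same flag can carry a chained shell command. A sketch (package names are only examples; `func-timeout` is a dependency the eval_kit run script installs the same way, and this assumes the installation command is executed as a shell string):
+
+```bash
+--installation_command "pip install --no-deps pylatexenc==2.10 && pip install func-timeout"
+```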
diff --git a/docs/evaluation/index.md b/docs/evaluation/index.md index 32f885d4d9..6a54960b39 100644 --- a/docs/evaluation/index.md +++ b/docs/evaluation/index.md @@ -12,6 +12,7 @@ We support many popular benchmarks and it's easy to add new in the future. The f - [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox), [flores-200](./multilingual.md#flores-200), [wmt24pp](./multilingual.md#wmt24pp) - [**Speech & Audio**](./speech-audio.md): e.g. [asr-leaderboard](./speech-audio.md#asr-leaderboard), [mmau-pro](./speech-audio.md#mmau-pro) - [**Vision-Language Models (VLM)**](./vlm.md): e.g. [mmmu-pro](./vlm.md#mmmu-pro) +- [**VLMEvalKit Integration (eval_kit)**](./eval-kit.md): Run VLMEvalKit benchmarks via Megatron in-process or vLLM - [**Speculative Decoding (SD)**](./speculative-decoding.md): e.g. [SPEED-Bench](./speculative-decoding.md#SPEED-Bench) See [nemo_skills/dataset](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset) where each folder is a benchmark we support. diff --git a/nemo_skills/dataset/eval_kit/__init__.py b/nemo_skills/dataset/eval_kit/__init__.py new file mode 100644 index 0000000000..a8c29e8977 --- /dev/null +++ b/nemo_skills/dataset/eval_kit/__init__.py @@ -0,0 +1,45 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# VLMEvalKit integration module. +# Benchmarks are referenced as eval_kit., e.g. eval_kit.MMBench_DEV_EN +# The sub-benchmark name after eval_kit. is dynamically resolved and passed to VLMEvalKit. + +GENERATION_MODULE = "nemo_skills.inference.eval.eval_kit" +METRICS_TYPE = "eval_kit" +GENERATION_ARGS = "" +NUM_SAMPLES = 0 # VLMEvalKit inference is deterministic; no random seeds + +# No JSONL input file; VLMEvalKit manages its own data via build_dataset() +SKIP_INPUT_FILE = True + +# Note: SELF_CONTAINED_TASK is NOT set here because it depends on model_type. +# For mcore mode (Megatron in-process), the pipeline sets self_contained_task=True +# at runtime based on ++model_type=mcore in extra_arguments. +# For vllm mode, the standard NeMo Skills server/client flow is used. + + +def get_extra_generation_args(benchmark): + """Return extra generation args for the given benchmark name. + + Extracts the VLMEvalKit dataset name from the dotted benchmark name + (e.g. eval_kit.MMBench_DEV_EN -> ++vlm_dataset=MMBench_DEV_EN). + """ + if "." not in benchmark: + raise ValueError( + f"eval_kit benchmark must be in 'eval_kit.' format, got '{benchmark}'. 
" + f"Example: eval_kit.MMBench_DEV_EN, eval_kit.LibriSpeech_test_clean" + ) + sub = benchmark.split(".", 1)[1] + return f" ++vlm_dataset={sub} " diff --git a/nemo_skills/dataset/utils.py b/nemo_skills/dataset/utils.py index d918ba086e..5239dd3c0e 100644 --- a/nemo_skills/dataset/utils.py +++ b/nemo_skills/dataset/utils.py @@ -161,6 +161,13 @@ def _load_external_dataset(dataset_path): def get_default_dataset_module(dataset): data_path = "/nemo_run/code/nemo_skills/dataset" + + # For dotted names like eval_kit.MMBench_DEV_EN, import the parent package. + # The sub-benchmark part is handled by the module's get_extra_generation_args(). + if dataset.startswith("eval_kit."): + dataset_module = importlib.import_module("nemo_skills.dataset.eval_kit") + return dataset_module, data_path + dataset_module = importlib.import_module(f"nemo_skills.dataset.{dataset}") return dataset_module, data_path diff --git a/nemo_skills/evaluation/evaluator/audio.py b/nemo_skills/evaluation/evaluator/audio.py index f212859ff1..6d8af5b1b6 100644 --- a/nemo_skills/evaluation/evaluator/audio.py +++ b/nemo_skills/evaluation/evaluator/audio.py @@ -505,13 +505,35 @@ def evaluate_sample(sample: dict[str, Any], config: AudioEvaluatorConfig) -> dic """Evaluate single sample based on task_type. Returns dict of updates to merge.""" updates = {} task_type = sample.get("task_type", "unknown") - generation = sample["generation"].strip() + generation_raw = sample.get("generation") + generation = generation_raw.strip() if isinstance(generation_raw, str) else "" expected_answer = sample.get("expected_answer", "").strip() # Strip helpful prefixes for ASR tasks (e.g., "The audio says: ...") if config.strip_helpful_prefixes: generation = strip_helpful_prefixes(generation) + # Normalise AudioBench speech-translation task types (ST-EN-ZH -> Translation) + _ASR_TYPES = {"ASR", "ASR-ZH", "ASR-PC", "ASR_LEADERBOARD"} + _TRANSLATION_TYPES = {"AST", "Translation"} + # AudioBench speech translation types: ST-{src}-{tgt} + if task_type.startswith("ST-"): + _TRANSLATION_TYPES.add(task_type) + + if task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not generation: + base = { + "is_correct": False, + "error": "missing_generation", + } + if task_type in _TRANSLATION_TYPES: + return {**base, "bleu": 0.0} + if task_type == "CER": + return {**base, "cer": 1.0} + if task_type == "ASR-PC": + return {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0} + # ASR / ASR-ZH / ASR_LEADERBOARD + return {**base, "wer": 1.0} + if task_type == "ASR-PC": mode = resolve_asr_normalization_mode(config) metrics = evaluate_asr_pc( @@ -522,7 +544,7 @@ def evaluate_sample(sample: dict[str, Any], config: AudioEvaluatorConfig) -> dic ) updates.update(metrics) - elif task_type == "ASR": + elif task_type in {"ASR", "ASR-ZH"}: mode = resolve_asr_normalization_mode(config) metrics = evaluate_asr(expected_answer, generation, normalization_mode=mode) updates.update(metrics) @@ -544,7 +566,7 @@ def evaluate_sample(sample: dict[str, Any], config: AudioEvaluatorConfig) -> dic updates[f"wer_{metric_suffix}"] = ref_metrics["wer"] updates[f"is_correct_{metric_suffix}"] = ref_metrics["is_correct"] - elif task_type in ["AST", "Translation"]: + elif task_type in _TRANSLATION_TYPES: metrics = evaluate_translation(expected_answer, generation) updates.update(metrics) @@ -561,6 +583,13 @@ def evaluate_sample(sample: dict[str, Any], config: AudioEvaluatorConfig) -> dic metrics = evaluate_pc_rate(expected_answer, generation) updates.update(metrics) + elif task_type == 
"MathQA": + # AudioBench MathQA: exact string match after normalization + gen_norm = generation.strip().lower() + ref_norm = expected_answer.strip().lower() + updates["is_correct"] = gen_norm == ref_norm + updates["predicted_answer"] = generation + else: if "requires_judge" not in sample: updates["requires_judge"] = True diff --git a/nemo_skills/evaluation/metrics/eval_kit_metrics.py b/nemo_skills/evaluation/metrics/eval_kit_metrics.py new file mode 100644 index 0000000000..ffa760826a --- /dev/null +++ b/nemo_skills/evaluation/metrics/eval_kit_metrics.py @@ -0,0 +1,95 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +from pathlib import Path + +from nemo_skills.evaluation.metrics.base import BaseMetrics + + +class EvalKitMetrics(BaseMetrics): + """Metrics class for VLMEvalKit benchmarks. + + VLMEvalKit computes its own aggregate metrics during evaluation. + This class reads pre-computed aggregates from eval_kit_metrics.json + (written by EvalKitGenerationTask) rather than computing per-sample metrics. + The per-sample JSONL is still read by ComputeMetrics for the update() loop, + but we only count entries here -- the real metrics come from the JSON file. + + Note: ComputeMetrics only calls setup() on the "_all_" calculator. When + the data contains ``subset_for_metrics``, additional per-subset calculator + instances are created but never receive a setup() call. We use a + class-level ``_shared_metrics_file`` so that those subset instances can + still locate the eval_kit_metrics.json discovered by the "_all_" instance. + """ + + # Shared across all instances so subset calculators can find the file + # even though only the "_all_" calculator receives setup(). + _shared_metrics_file: Path | None = None + + def __init__(self, **kwargs): + super().__init__(compute_no_answer=False) + self.eval_kit_metrics_file = None + + def setup(self, input_files): + """Find the eval_kit_metrics.json in the same directory as the input files.""" + if input_files: + # input_files are like ['/path/to/eval-results/eval_kit.MMBench_DEV_EN/output.jsonl'] + metrics_dir = Path(input_files[0]).parent + candidate = metrics_dir / "eval_kit_metrics.json" + if candidate.exists(): + self.eval_kit_metrics_file = candidate + EvalKitMetrics._shared_metrics_file = candidate + else: + # Reset stale shared path so a previous run's file isn't reused. + EvalKitMetrics._shared_metrics_file = None + + def update(self, predictions): + """Count entries but don't compute per-sample metrics.""" + self.total += 1 + + def get_metrics(self): + """Return pre-computed VLMEvalKit aggregate metrics.""" + metrics_dict = {} + + # Load pre-computed metrics from VLMEvalKit. + # Fall back to the class-level shared file for subset calculators + # that never received a setup() call. 
+ eval_kit_results = {} + effective_file = self.eval_kit_metrics_file or EvalKitMetrics._shared_metrics_file + if effective_file and effective_file.exists(): + with open(effective_file, "rt", encoding="utf-8") as f: + eval_kit_results = json.load(f) + + # Build the metrics in NeMo Skills format + agg_dict = {"num_entries": self.total} + + # Flatten VLMEvalKit results into the metrics dict + for key, value in eval_kit_results.items(): + if isinstance(value, dict): + # Nested results (e.g., per-category scores) + for sub_key, sub_value in value.items(): + if isinstance(sub_value, (int, float)): + agg_dict[f"{key}_{sub_key}"] = sub_value + elif isinstance(value, (int, float)): + agg_dict[key] = value + + metrics_dict["greedy"] = agg_dict + return metrics_dict + + def metrics_to_print(self): + return None + + def evaluations_to_print(self): + return ["greedy"] diff --git a/nemo_skills/evaluation/metrics/map_metrics.py b/nemo_skills/evaluation/metrics/map_metrics.py index 92f9f3282c..00332bde61 100644 --- a/nemo_skills/evaluation/metrics/map_metrics.py +++ b/nemo_skills/evaluation/metrics/map_metrics.py @@ -30,6 +30,7 @@ SweBenchMetrics, ) from nemo_skills.evaluation.metrics.critpt_metrics import CritPtMetrics +from nemo_skills.evaluation.metrics.eval_kit_metrics import EvalKitMetrics from nemo_skills.evaluation.metrics.gradingbench_metrics import GradingBenchMetrics from nemo_skills.evaluation.metrics.hleaa_metrics import HLEAAMetrics from nemo_skills.evaluation.metrics.icpc_metrics import ICPCMetrics @@ -87,6 +88,7 @@ "compute-eval": ComputeEvalMetrics, "gradingbench": GradingBenchMetrics, "critpt": CritPtMetrics, + "eval_kit": EvalKitMetrics, "specdec": SpecdecMetrics, } diff --git a/nemo_skills/evaluation/metrics/translation_metrics.py b/nemo_skills/evaluation/metrics/translation_metrics.py index 5a819152cd..8f5be0bdeb 100644 --- a/nemo_skills/evaluation/metrics/translation_metrics.py +++ b/nemo_skills/evaluation/metrics/translation_metrics.py @@ -16,7 +16,6 @@ from collections import defaultdict import numpy as np -from sacrebleu import corpus_bleu from nemo_skills.evaluation.metrics.base import BaseMetrics, as_float @@ -35,6 +34,8 @@ class TranslationMetrics(BaseMetrics): # TODO: add support for other translation metrics, such as MetricX def get_metrics(self): + from sacrebleu import corpus_bleu + metrics_dict = {} for key in self.translation_dict: src_lang, tgt_lang = key.split("->") diff --git a/nemo_skills/inference/eval/eval_kit.py b/nemo_skills/inference/eval/eval_kit.py new file mode 100644 index 0000000000..420250d55d --- /dev/null +++ b/nemo_skills/inference/eval/eval_kit.py @@ -0,0 +1,561 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""VLMEvalKit integration for NeMo Skills. + +This module implements a self-contained generation task that uses VLMEvalKit's +inference and evaluation pipeline. Two modes are supported: + +1. 
Megatron in-process (model_type=mcore): VLMEvalKit's MultiModalMCore loads + and runs the Megatron model directly. No NeMo Skills server is started. + +2. vLLM client (model_type=vllm): NeMo Skills starts a vLLM server normally, + and VLMEvalKit's VLLMLocal connects to it as a client. + +Benchmarks are referenced as eval_kit. in NeMo Skills, +e.g. --benchmarks eval_kit.MMBench_DEV_EN +""" + +import json +import logging +import os +import pickle +import threading +from dataclasses import field +from pathlib import Path + +import hydra +from omegaconf import MISSING + +try: + from nemo_skills.inference.generate import GenerationTask +except ImportError: + # On the cluster, GenerationTask may not be importable due to missing deps + # (nemo_run, litellm, etc.). The inheritance is only needed for the pipeline's + # __func__ check which runs locally. On the cluster we just need a base class. + GenerationTask = object + +from nemo_skills.utils import get_logger_name, nested_dataclass + +LOG = logging.getLogger(get_logger_name(__file__)) + +# VLMEvalKit (vlmeval) is packaged alongside Skills code by nemo_run when +# NEMO_SKILLS_VLMEVALKIT_PATH is set (see eval.py extra_package_dirs logic). +# It lands at /nemo_run/code/vlmeval/ on the cluster, importable via PYTHONPATH. +# func-timeout is installed at job start via --installation_command in the run script. +# No venv-based requirements are needed (get_generation_requirements returns None). + + +@nested_dataclass(kw_only=True) +class EvalKitConfig: + """Configuration for VLMEvalKit generation task.""" + + # VLMEvalKit dataset name (injected by pipeline from benchmark name) + vlm_dataset: str = MISSING + + # Model configuration + model_type: str = "mcore" # "mcore" or "vllm" + model_config: str | None = None # Path to YAML config for mcore + load_dir: str | None = None # Checkpoint directory for mcore + load_ckpt: str | None = None # Specific checkpoint for mcore + server_url: str | None = None # URL for vLLM server (vllm mode) + model_name: str | None = None # Model name for vLLM + + # Inference parameters + reasoning: bool = False + temperature: float = 1.0 + top_k: int = 1 + top_p: float = 0.95 + + # Video dataset parameters + nframe: int = 16 + fps: int = -1 + nframe_max: int = -1 + use_subtitle: bool = False + media_dir: str = "./" + + # Evaluation parameters + eval_mode: str = "all" # "all", "infer", or "eval" + judge: str | None = None + judge_nproc: int = 4 + judge_retry: int = 3 + + # Output configuration (populated by the pipeline) + work_dir: str = "./outputs" + output_file: str = "" + skip_filled: bool = False # Accepted from pipeline but unused (VLMEvalKit has its own resume) + + # Fields accepted from pipeline but unused by eval_kit (avoids Hydra errors from common_args) + eval_config: dict = field(default_factory=dict) + + +cs = hydra.core.config_store.ConfigStore.instance() +cs.store(name="base_eval_kit_config", node=EvalKitConfig) + + +class EvalKitGenerationTask(GenerationTask): + """Generation task using VLMEvalKit. + + Supports two modes: + - mcore: Self-contained, no external server. Pipeline sets + self_contained_task=True so no server is started. + - vllm: Pipeline starts a vLLM server normally. This task overrides + ``configure_client_overrides`` to translate the server address into + eval_kit's flat config fields (``++server_url``, ``++model_name``) + instead of the standard nested ``++server.*`` overrides. 
+ """ + + # --- Declarative pipeline attributes (read generically by pipeline/eval.py) --- + CONTAINER_KEY = "eval_kit" + USE_TORCHRUN = True + + @classmethod + def is_self_contained(cls, extra_arguments: str = "") -> bool: + """Self-contained only when user explicitly requests mcore mode. + + Note: EvalKitConfig.model_type defaults to "mcore" at runtime, but + at submission time we check explicit user intent. Without the flag + the pipeline assumes vllm (server-based) mode. + """ + return "++model_type=mcore" in extra_arguments + + @classmethod + def configure_client_overrides(cls, *, host: str, port: int, model: str, server_type: str) -> str: + """Return Hydra overrides for connecting to an already-running server. + + EvalKitConfig uses flat fields (server_url, model_name) rather than + the standard nested ``server.*`` group, so we translate here. + """ + return f"++server_url=http://{host}:{port} ++model_name={model} ++model_type=vllm " + + @classmethod + def get_env_prefix(cls) -> str: + """Shell env setup prepended before the main command (Megatron/VLMEvalKit needs).""" + return ( + 'export LMUData="${LMUData:-${LMUDATA:-}}" && ' + "export LD_LIBRARY_PATH=/opt/hpcx/ucx/lib:${LD_LIBRARY_PATH:-} && " + "export MKL_THREADING_LAYER=GNU && " + "export OMP_NUM_THREADS=1 && " + "export MKL_NUM_THREADS=1 && " + "ldconfig && " + # Create empty .env so VLMEvalKit's load_env() doesn't emit ERROR logs. + "touch /nemo_run/code/.env 2>/dev/null; " + ) + + @classmethod + def get_extra_package_dirs(cls) -> list[str]: + """Directories to package alongside nemo_run code (VLMEvalKit vlmeval/).""" + vlmevalkit_path = os.environ.get("NEMO_SKILLS_VLMEVALKIT_PATH") + if vlmevalkit_path: + pkg = os.path.join(vlmevalkit_path, "vlmeval") + if os.path.isdir(pkg): + return [pkg] + return [] + + @classmethod + def get_generation_default_args(cls): + return "" + + @classmethod + def get_generation_requirements(cls): + # VLMEvalKit is installed via --installation_command (pip install from mounted source). + # No additional venv-based requirements needed. + return None + + def __init__(self, cfg: EvalKitConfig): + self.cfg = cfg + + # Validate environment + lmu_data = os.environ.get("LMUData") + if not lmu_data: + raise ValueError( + "LMUData environment variable must be set for eval_kit benchmarks. " + "Add LMUData=/mounted/path to your cluster config env_vars." + ) + + # Build model FIRST so that initialize_megatron() sets up the + # distributed process group before we need dist.barrier() for + # rank-0-first dataset download. + if cfg.model_type == "mcore": + from vlmeval.vlm.multimodal_mcore.model import MultiModalMCore + + if not cfg.model_config: + raise ValueError("model_config is required for mcore model_type.") + self.model = MultiModalMCore( + model_config=cfg.model_config, + load_dir=cfg.load_dir, + load_ckpt=cfg.load_ckpt, + reasoning=cfg.reasoning, + ) + self.model_name = f"mcore_{Path(cfg.model_config).stem}" + elif cfg.model_type == "vllm": + from vlmeval.vlm.vllm_local import VLLMLocal + + if not cfg.server_url: + raise ValueError("server_url is required for vllm model_type.") + self.model = VLLMLocal( + vllm_url=cfg.server_url, + autospawn=False, + model_name=cfg.model_name or "vllm_local", + reasoning_mode=cfg.reasoning, + temperature=cfg.temperature, + top_k=cfg.top_k, + top_p=cfg.top_p, + ) + self.model_name = cfg.model_name or "vllm_local" + else: + raise ValueError(f"Unknown model_type: {cfg.model_type}. 
Must be 'mcore' or 'vllm'.") + + # Build dataset after model so the distributed process group is available + # for the rank-0-first download pattern (run.py:428-433). + from vlmeval.dataset import build_dataset + + dataset_kwargs = self._build_dataset_kwargs() + rank = int(os.environ.get("RANK", 0)) + world_size = int(os.environ.get("WORLD_SIZE", 1)) + + if world_size > 1: + import torch.distributed as dist + + if rank == 0: + build_dataset(cfg.vlm_dataset, **dataset_kwargs) + dist.barrier() + + self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs) + if self.dataset is None: + raise ValueError(f"VLMEvalKit dataset '{cfg.vlm_dataset}' is not valid.") + + self.work_dir = os.path.join(cfg.work_dir, "eval_kit_work", cfg.vlm_dataset) + os.makedirs(self.work_dir, exist_ok=True) + + # Async JSONL writer state + self._async_stop = threading.Event() + self._async_written_indices = set() + self._async_lock = threading.Lock() + self._async_thread = None + + # ------------------------------------------------------------------ + # Incremental JSONL writer (mirrors NeMo Skills' -async pattern) + # ------------------------------------------------------------------ + + def _build_index_to_meta(self): + """Build a lookup from dataset index -> {question, answer} for JSONL rows.""" + meta = {} + df = self.dataset.data + for _, row in df.iterrows(): + idx = row["index"] + meta[idx] = { + "question": str(row["question"]) if "question" in row.index else "", + "expected_answer": str(row["answer"]) if "answer" in row.index else "", + } + return meta + + def _pkl_to_prediction(self, value): + """Extract the prediction string from a pkl entry (str or dict).""" + if isinstance(value, dict) and "prediction" in value: + return str(value["prediction"]) + return str(value) + + def _async_writer_loop(self, pkl_path, index_meta, output_path, poll_interval=5): + """Background thread: poll the pkl file and append new entries to JSONL.""" + while not self._async_stop.is_set(): + self._flush_pkl_to_jsonl(pkl_path, index_meta, output_path) + self._async_stop.wait(timeout=poll_interval) + # Final flush after inference signals stop + self._flush_pkl_to_jsonl(pkl_path, index_meta, output_path) + + def _flush_pkl_to_jsonl(self, pkl_path, index_meta, output_path): + """Read the pkl, find new entries, append them to the JSONL file.""" + if not os.path.exists(pkl_path): + return + try: + with open(pkl_path, "rb") as f: + data = pickle.load(f) + except Exception: + # pkl may be mid-write; skip this cycle + return + if not isinstance(data, dict): + return + + new_entries = [] + with self._async_lock: + for idx, value in data.items(): + if idx not in self._async_written_indices: + self._async_written_indices.add(idx) + meta = index_meta.get(idx, {}) + new_entries.append( + { + "generation": self._pkl_to_prediction(value), + "expected_answer": meta.get("expected_answer", ""), + "question": meta.get("question", ""), + } + ) + + if new_entries: + with open(output_path, "a", encoding="utf-8") as f: + for entry in new_entries: + f.write(json.dumps(entry) + "\n") + LOG.info( + "Async JSONL: flushed %d new entries (total %d)", len(new_entries), len(self._async_written_indices) + ) + + def _start_async_writer(self): + """Start the background JSONL writer if output_file is configured.""" + if not self.cfg.output_file: + return + rank = int(os.environ.get("RANK", 0)) + if rank != 0: + return + + world_size = int(os.environ.get("WORLD_SIZE", 1)) + ds_name = self.dataset.dataset_name + pkl_path = os.path.join(self.work_dir, 
f"0{world_size}_{ds_name}.pkl") + + output_dir = Path(self.cfg.output_file).parent + output_dir.mkdir(parents=True, exist_ok=True) + + # Clear any previous async file + async_path = self.cfg.output_file + if os.path.exists(async_path): + os.remove(async_path) + + index_meta = self._build_index_to_meta() + + self._async_stop.clear() + self._async_written_indices.clear() + self._async_thread = threading.Thread( + target=self._async_writer_loop, + args=(pkl_path, index_meta, async_path), + daemon=True, + ) + self._async_thread.start() + LOG.info("Started async JSONL writer, monitoring %s", pkl_path) + + def _stop_async_writer(self): + """Stop the background writer and wait for final flush.""" + if self._async_thread is None: + return + self._async_stop.set() + self._async_thread.join(timeout=30) + self._async_thread = None + LOG.info("Async JSONL writer stopped (%d entries written)", len(self._async_written_indices)) + + def _build_dataset_kwargs(self): + """Build dataset kwargs mirroring VLMEvalKit's run.py:390-425.""" + from vlmeval.smp import listinstr + + kwargs = {} + ds = self.cfg.vlm_dataset + + if ds in ["MMLongBench_DOC", "DUDE", "DUDE_MINI", "SLIDEVQA", "SLIDEVQA_MINI"]: + kwargs["model"] = self.cfg.model_name or self.cfg.model_config or "" + + if ds in ( + "Video-MME", + "Video-MME-With-Audio", + "WorldSense-AVLM", + "MetropolisVideoDataset", + "WorldSense", + "avqa_val", + ): + kwargs["use_subtitle"] = self.cfg.use_subtitle + if ds in ( + "Video-MME", + "MetropolisVideoDataset", + "MLVU", + "LongVideoBench", + "MMBench-Video", + "MVBench", + "MLVU_MCQ", + "PAI-Bench-U", + ): + kwargs["nframe"] = self.cfg.nframe + if ds in [ + "Video-MME", + "MLVU", + "LongVideoBench", + "WorldSense", + "avqa_val", + "MMBench-Video", + "MVBench", + "MLVU_MCQ", + "PAI-Bench-U", + ]: + kwargs["fps"] = self.cfg.fps + if ds in [ + "Video-MME", + "MLVU", + "LongVideoBench", + "WorldSense", + "avqa_val", + "MLVU_MCQ", + "MMBench-Video", + "PAI-Bench-U", + ]: + kwargs["nframe_max"] = self.cfg.nframe_max + if ds in ["ANet-RTL", "Charades-STA"]: + kwargs["nframe"] = self.cfg.nframe + + if listinstr(["Video-MME-With-Audio", "DailyOmni", "WorldSense-AVLM", "JensenKeyNote"], ds): + kwargs["media_dir"] = self.cfg.media_dir + + return kwargs + + def generate(self): + """Run VLMEvalKit inference and evaluation.""" + from vlmeval.inference import infer_data_job + from vlmeval.inference_mt import infer_data_job_mt + from vlmeval.inference_video import infer_data_job_video + from vlmeval.smp import get_pred_file_format + + dataset = self.dataset + ds_name = dataset.dataset_name + pred_format = get_pred_file_format() + result_file_base = f"{self.model_name}_{ds_name}.{pred_format}" + + rank = int(os.environ.get("RANK", 0)) + + # Start incremental JSONL writer before inference begins + self._start_async_writer() + + # Dispatch to correct inference function (mirrors run.py:453-488) + try: + if self.cfg.eval_mode != "eval": + if dataset.MODALITY == "VIDEO": + self.model = infer_data_job_video( + model=self.model, + work_dir=self.work_dir, + model_name=self.model_name, + dataset=dataset, + result_file_name=result_file_base, + strip_think=not self.cfg.reasoning, + reasoning_flag=self.cfg.reasoning, + ) + elif dataset.TYPE == "MT": + self.model = infer_data_job_mt( + model=self.model, + work_dir=self.work_dir, + model_name=self.model_name, + dataset=dataset, + ) + else: + self.model = infer_data_job( + model=self.model, + work_dir=self.work_dir, + model_name=self.model_name, + dataset=dataset, + strip_think=not 
self.cfg.reasoning, + reasoning_flag=self.cfg.reasoning, + ) + finally: + self._stop_async_writer() + + # Evaluate (mirrors run.py:490-548) + eval_result = {} + if self.cfg.eval_mode != "infer" and rank == 0: + from vlmeval.smp import get_pred_file_path + + result_file = get_pred_file_path(self.work_dir, self.model_name, ds_name, use_env_format=True) + judge_kwargs = { + "nproc": self.cfg.judge_nproc, + "verbose": False, + "retry": self.cfg.judge_retry, + } + if self.cfg.judge: + judge_kwargs["model"] = self.cfg.judge + + if os.path.exists(result_file): + try: + eval_result = dataset.evaluate(result_file, **judge_kwargs) + except KeyError as e: + if e.args and e.args[0] == "model": + LOG.warning( + "Dataset %s requires a judge model for evaluation (e.g. MathVista). " + "Skipping evaluation. Set ++judge= (e.g. gpt-4o) to enable. " + "Inference output was still written.", + ds_name, + ) + eval_result = {} + else: + raise + if eval_result is None: + eval_result = {} + + # Convert to NeMo Skills format and write outputs (rank 0 only) + if rank == 0: + self._convert_to_nemo_skills_format(eval_result) + + # Write .done file for pipeline tracking + if self.cfg.output_file: + Path(f"{self.cfg.output_file}.done").touch() + + def _convert_to_nemo_skills_format(self, eval_result): + """Rewrite the final ordered JSONL output and eval_kit_metrics.json. + + The async writer has already been producing incremental JSONL during + inference. Here we overwrite with the authoritative, properly-ordered + result that VLMEvalKit merged from all ranks. + """ + if not self.cfg.output_file: + return + + from vlmeval.smp import get_pred_file_path + from vlmeval.smp import load as vlm_load + + output_dir = Path(self.cfg.output_file).parent + output_dir.mkdir(parents=True, exist_ok=True) + + # Write JSONL (required by summarize_results to find output*jsonl files) + result_file = get_pred_file_path( + self.work_dir, + self.model_name, + self.dataset.dataset_name, + use_env_format=True, + ) + if os.path.exists(result_file): + df = vlm_load(result_file) + with open(self.cfg.output_file, "w", encoding="utf-8") as f: + for _, row in df.iterrows(): + entry = { + "generation": str(row["prediction"]) if "prediction" in row.index else "", + "expected_answer": str(row["answer"]) if "answer" in row.index else "", + "question": str(row["question"]) if "question" in row.index else "", + } + f.write(json.dumps(entry) + "\n") + LOG.info("Wrote final ordered JSONL to %s (%d entries)", self.cfg.output_file, len(df)) + else: + LOG.warning("VLMEvalKit result file not found: %s", result_file) + + # Write aggregate metrics for EvalKitMetrics to read + # eval_result can be a dict or a pandas DataFrame (e.g. 
ASR); avoid "if eval_result" for DataFrame + if eval_result is not None: + metrics_data = eval_result if isinstance(eval_result, dict) else {"result": str(eval_result)} + metrics_path = output_dir / "eval_kit_metrics.json" + with open(metrics_path, "w", encoding="utf-8") as f: + json.dump(metrics_data, f, indent=2, default=str) + LOG.info("Wrote eval_kit metrics to %s", metrics_path) + + +GENERATION_TASK_CLASS = EvalKitGenerationTask + + +@hydra.main(version_base=None, config_name="base_eval_kit_config") +def main(cfg: EvalKitConfig): + cfg = EvalKitConfig(_init_nested=True, **cfg) + task = EvalKitGenerationTask(cfg) + task.generate() + + +if __name__ == "__main__": + main() diff --git a/nemo_skills/inference/factory.py b/nemo_skills/inference/factory.py index cd29bbd2c5..93f5bbe193 100644 --- a/nemo_skills/inference/factory.py +++ b/nemo_skills/inference/factory.py @@ -19,10 +19,12 @@ class GenerationType(str, Enum): generate = "generate" math_judge = "math_judge" check_contamination = "check_contamination" + mcore_skills = "mcore_skills" GENERATION_MODULE_MAP = { GenerationType.generate: "nemo_skills.inference.generate", GenerationType.math_judge: "nemo_skills.inference.llm_math_judge", GenerationType.check_contamination: "nemo_skills.inference.check_contamination", + GenerationType.mcore_skills: "nemo_skills.inference.mcore_skills", } diff --git a/nemo_skills/inference/generate.py b/nemo_skills/inference/generate.py index 3122f2ceeb..fcf0a94446 100644 --- a/nemo_skills/inference/generate.py +++ b/nemo_skills/inference/generate.py @@ -273,6 +273,38 @@ def _get_disallowed_params(self): class GenerationTask: + # --- Declarative pipeline attributes --- + # Subclasses can override to declare their runtime needs generically. + # The pipeline reads these instead of hardcoding module-name checks. + + # Container key in cluster_config["containers"]; None means use "nemo-skills" default. + CONTAINER_KEY: str | None = None + + # Whether to wrap the command with torchrun for multi-GPU data-parallel inference. + USE_TORCHRUN: bool = False + + @classmethod + def is_self_contained(cls, extra_arguments: str = "") -> bool: + """Whether this task manages its own model (no NeMo Skills server). + + Override in subclasses. *extra_arguments* is the raw CLI extra args string + so that the decision can depend on runtime flags (e.g. model_type). + """ + return False + + @classmethod + def get_env_prefix(cls) -> str: + """Shell commands prepended before the main command (e.g. env exports). + + Return an empty string if no special environment is needed. + """ + return "" + + @classmethod + def get_extra_package_dirs(cls) -> list[str]: + """Extra directories to package alongside nemo_run code.""" + return [] + @classmethod def get_generation_default_args(cls) -> str: """ diff --git a/nemo_skills/inference/mcore_skills.py b/nemo_skills/inference/mcore_skills.py new file mode 100644 index 0000000000..5db561d1b6 --- /dev/null +++ b/nemo_skills/inference/mcore_skills.py @@ -0,0 +1,547 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +"""NeMo Skills generation via VLMEvalKit MultiModalMCore in-process. + +This module implements Option A from the plan: read NeMo Skills JSONL, fill +prompts with NeMo Skills prompt config, run inference through MultiModalMCore +synchronously, write NeMo Skills-format JSONL. No HTTP server; evaluation +remains NeMo Skills metrics on the output. + +When run with torchrun (multi-GPU), all ranks participate in model.generate(); +only rank 0 performs file I/O. +""" + +import json +import logging +import os +import re +from dataclasses import field +from pathlib import Path + +import hydra +from omegaconf import MISSING +from tqdm import tqdm + +from nemo_skills.prompt.utils import get_prompt +from nemo_skills.utils import chunk_data, get_logger_name, nested_dataclass + +LOG = logging.getLogger(get_logger_name(__file__)) + +try: + from nemo_skills.inference.generate import GenerationTask +except ImportError: + GenerationTask = None + +if GenerationTask is not None: + _get_server_command_fn = GenerationTask.get_server_command_fn +else: + + @classmethod + def _get_server_command_fn(cls): + from nemo_skills.pipeline.utils import get_server_command + + return get_server_command + + +@nested_dataclass(kw_only=True) +class MegatronMCoreConfig: + """Configuration for MegatronMCore NeMo Skills generation.""" + + input_file: str = MISSING + output_file: str = MISSING + + # Prompt config for text-only data (used by fill_prompt). Not needed when the + # input JSONL already contains OpenAI-format 'messages' (e.g. asr-leaderboard). + prompt_config: str | None = None + + # Tokenizer for prompt filling (format_as_string=True). HF model name or path. + # Required when prompt_config is set; optional for messages-only data. + tokenizer: str | None = None + + # MultiModalMCore model + model_config: str = MISSING + load_dir: str | None = None + load_ckpt: str | None = None + reasoning: bool = False + + # Prompt options (mirror GenerationTaskConfig where needed) + code_tags: str | None = None + examples_type: str | None = None + system_message: str | None = None + start_assistant_response_key: str | None = None + chat_template_kwargs: dict = field(default_factory=dict) + + # Base directory to resolve relative audio/image paths (e.g. NEMO_SKILLS_DATA_DIR). + data_dir: str = "" + + # Generation limits and resume + max_samples: int = -1 + skip_filled: bool = False + num_chunks: int | None = None + chunk_id: int | None = None + + # Output + generation_key: str = "generation" + add_generation_stats: bool = True + async_position_key: str = "_async_position" + + dry_run: bool = False + + # Dataset name passed to MultiModalMCore.generate() — used by VLMEvalKit internally + # for dataset-specific logic (e.g. video tile config). Defaults to "nemo_skills". + dataset_name: str = "nemo_skills" + + # Accepted from pipeline/dataset modules but unused by mcore_skills (avoid Hydra errors). + # These come via ++key=value overrides from dataset modules (e.g. asr-leaderboard). 
+ eval_config: dict = field(default_factory=dict) + eval_type: str | None = None + prompt_format: str | None = None + enable_audio: bool = False + + +def _make_mcore_model(cfg: MegatronMCoreConfig): + from vlmeval.vlm.multimodal_mcore.model import MultiModalMCore + + return MultiModalMCore( + model_config=cfg.model_config, + load_dir=cfg.load_dir, + load_ckpt=cfg.load_ckpt, + reasoning=cfg.reasoning, + ) + + +class MegatronMCoreGenerationTask: + """Generation task using NeMo Skills data + prompts and VLMEvalKit MultiModalMCore in-process.""" + + get_server_command_fn = _get_server_command_fn + + # --- Declarative pipeline attributes (read generically by pipeline/eval.py) --- + CONTAINER_KEY = "eval_kit" + USE_TORCHRUN = True + # Metrics are computed by VLMEvalKit (asr_wer etc.) and saved as + # eval_kit_metrics.json — tell the summarize step to use EvalKitMetrics. + METRICS_TYPE_OVERRIDE = "eval_kit" + + @classmethod + def is_self_contained(cls, extra_arguments: str = "") -> bool: + """Always self-contained (in-process MultiModalMCore, no HTTP server).""" + return True + + @classmethod + def get_env_prefix(cls) -> str: + """Shell env setup prepended before the main command (Megatron/VLMEvalKit needs).""" + return ( + 'export LMUData="${LMUData:-${LMUDATA:-}}" && ' + "export LD_LIBRARY_PATH=/opt/hpcx/ucx/lib:${LD_LIBRARY_PATH:-} && " + "export MKL_THREADING_LAYER=GNU && " + "export OMP_NUM_THREADS=1 && " + "export MKL_NUM_THREADS=1 && " + "ldconfig && " + # Create empty .env so VLMEvalKit's load_env() doesn't emit ERROR logs. + "touch /nemo_run/code/.env 2>/dev/null; " + ) + + @classmethod + def get_extra_package_dirs(cls) -> list[str]: + """Directories to package alongside nemo_run code (VLMEvalKit vlmeval/).""" + vlmevalkit_path = os.environ.get("NEMO_SKILLS_VLMEVALKIT_PATH") + if vlmevalkit_path: + pkg = os.path.join(vlmevalkit_path, "vlmeval") + if os.path.isdir(pkg): + return [pkg] + return [] + + @classmethod + def get_generation_default_args(cls): + return "" + + @classmethod + def get_generation_requirements(cls): + return None + + def __init__(self, cfg: MegatronMCoreConfig): + self.cfg = cfg + # Prompt is only needed for text-only data (no 'messages' field). + # For multimodal data with OpenAI-format messages, _build_mcore_messages + # extracts content directly — no prompt template required. 
+ if cfg.prompt_config: + self.prompt = get_prompt( + prompt_config=cfg.prompt_config, + tokenizer=cfg.tokenizer, + code_tags=cfg.code_tags, + examples_type=cfg.examples_type, + system_message=cfg.system_message, + ) + else: + self.prompt = None + self.model = _make_mcore_model(cfg) + + def load_data(self): + data = [] + with open(self.cfg.input_file, "rt", encoding="utf-8") as fin: + for line in fin: + data.append(json.loads(line)) + if self.cfg.num_chunks is not None and self.cfg.chunk_id is not None: + data, self.cfg.output_file = chunk_data(data, self.cfg.output_file, self.cfg.chunk_id, self.cfg.num_chunks) + LOG.info( + "Chunking: %d chunks, processing chunk %d; samples in chunk: %d", + self.cfg.num_chunks, + self.cfg.chunk_id, + len(data), + ) + if self.cfg.max_samples > 0: + data = data[: self.cfg.max_samples] + return data + + def skip_completed_samples(self, data: list) -> list: + if not self.cfg.skip_filled or not Path(self.cfg.output_file).exists(): + return data + filled = 0 + with open(self.cfg.output_file, "rt", encoding="utf-8") as fin: + for _ in fin: + filled += 1 + if filled >= len(data): + return [] + return data[filled:] + + def fill_prompt(self, data_point: dict, data: list) -> str: + from copy import deepcopy + + data_point = deepcopy(data_point) + filled = self.prompt.fill( + data_point, + start_assistant_response_key=self.cfg.start_assistant_response_key, + chat_template_kwargs=self.cfg.chat_template_kwargs or {}, + format_as_string=True, + ) + return filled if isinstance(filled, str) else str(filled) + + def _get_data_dir(self) -> str: + """Return the effective data_dir from cfg or eval_config.""" + data_dir = getattr(self.cfg, "data_dir", None) or "" + if not data_dir and getattr(self.cfg, "eval_config", None): + data_dir = self.cfg.eval_config.get("data_dir") or "" + return data_dir + + def _resolve_path(self, path: str) -> str: + """Resolve a media file path, handling relative paths and mount mismatches. + + 1. Relative paths are joined with data_dir. + 2. Absolute paths that don't exist on disk are retried relative to data_dir + (handles mount mismatches, e.g. JSONL has /dataset/... but data is at /data/...). + """ + if not path: + return path + data_dir = self._get_data_dir() + if not os.path.isabs(path): + if data_dir: + return os.path.join(data_dir, path) + return path + # Absolute path — use as-is if it exists + if os.path.exists(path): + return path + # Absolute path doesn't exist — try stripping the first directory component + # and re-rooting under data_dir (e.g. /dataset/asr-leaderboard/... → /data/asr-leaderboard/...) + if data_dir: + # Strip leading /mount_name/ to get the relative portion + parts = path.strip("/").split("/", 1) + if len(parts) == 2: + relative = parts[1] + candidate = os.path.join(data_dir, relative) + if os.path.exists(candidate): + return candidate + return path + + def _build_mcore_messages(self, data_point: dict) -> list | None: + """Convert a NeMo Skills data point into MultiModalMCore message list. + + If the data point has a 'messages' field (OpenAI format), converts it to + list[dict] with types: "text", "image", "sound". + + Only user/assistant message text is included — system messages are skipped + because MultiModalMCore's generate_inner() builds its own prompt template + with system/user roles internally. + + If no 'messages' field, returns None (caller should use fill_prompt for text-only). 
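+
+        Illustrative example (paths are hypothetical): a user message like
+            {"role": "user", "content": "Transcribe the audio.",
+             "audio": {"path": "clips/sample.wav"}}
+        is converted (after path resolution against data_dir) to
+            [{"type": "sound", "value": "<data_dir>/clips/sample.wav", "sample_rate": 16000},
+             {"type": "text", "value": "Transcribe the audio."}]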
+ """ + messages = data_point.get("messages") + if not messages: + return None + + mcore: list[dict] = [] + text_parts: list[str] = [] + + for msg in messages: + if not isinstance(msg, dict): + continue + role = msg.get("role", "") + content = msg.get("content", "") + + # Skip system messages — generate_inner builds its own system prompt. + if role == "system": + continue + + # Audio: single or multiple + if "audio" in msg: + audio = msg["audio"] + if isinstance(audio, dict) and "path" in audio: + path = self._resolve_path(audio["path"]) + mcore.append({"type": "sound", "value": path, "sample_rate": 16000}) + if "audios" in msg: + for audio in msg["audios"]: + if isinstance(audio, dict) and "path" in audio: + path = self._resolve_path(audio["path"]) + mcore.append({"type": "sound", "value": path, "sample_rate": 16000}) + + # Content: str or list of content items (text, image_url) + if isinstance(content, str): + if content.strip(): + text_parts.append(content.strip()) + elif isinstance(content, list): + for item in content: + if isinstance(item, dict): + if item.get("type") == "text" and "text" in item: + text_parts.append(item["text"].strip()) + elif item.get("type") == "image_url": + image_url = item.get("image_url") or {} + url = image_url.get("url", "") + if url.startswith("file://"): + path = url[7:] + else: + path = url + if path: + path = self._resolve_path(path) + mcore.append({"type": "image", "value": path}) + + combined_text = "\n".join(t for t in text_parts if t) + if combined_text: + mcore.append({"type": "text", "value": combined_text}) + + if not mcore: + return None + return mcore + + def dump_outputs(self, outputs: list, fout): + for out in outputs: + fout.write(json.dumps(out) + "\n") + + @staticmethod + def _strip_thinking_tags(text: str) -> str: + """Strip ... tags (including empty ones) from model output.""" + return re.sub(r".*?", "", text, flags=re.DOTALL).strip() + + def _generate_for_sample(self, data_point: dict, data: list) -> str: + """Run model inference for a single data point. Returns generated text.""" + message_list = self._build_mcore_messages(data_point) + if message_list is not None: + raw = self.model.generate(message_list, dataset=self.cfg.dataset_name) + return self._strip_thinking_tags(raw) + if self.prompt is None: + raise ValueError( + "Data point has no 'messages' field and prompt_config is not set. " + "Either provide ++prompt_config for text-only data or ensure " + "the input JSONL contains OpenAI-format 'messages'." + ) + prompt_str = self.fill_prompt(data_point, data) + raw = self.model.generate( + [{"type": "text", "value": prompt_str}], + dataset=self.cfg.dataset_name, + ) + return self._strip_thinking_tags(raw) + + def generate(self): + import sys + + # Use Megatron DP rank/size for data sharding (matches VLMEvalKit pattern). + # With data_parallel=True in generate_and_post_process, each DP rank runs + # generation independently on its shard while TP ranks synchronise internally. + dp_rank = self.model.get_dp_rank() + dp_size = self.model.get_dp_size() + + output_dir = Path(self.cfg.output_file).absolute().parent + if dp_rank == 0: + output_dir.mkdir(parents=True, exist_ok=True) + + data = self.load_data() + data = self.skip_completed_samples(data) + if not data: + if dp_rank == 0: + LOG.info("No data to process, skipping generation") + return + if self.cfg.dry_run: + if dp_rank == 0: + LOG.info("Dry run: would process %d samples", len(data)) + return + + # Round-robin shard by dp_rank (same strategy as VLMEvalKit infer_data). 
+ my_indices = list(range(dp_rank, len(data), dp_size)) + my_data = [data[i] for i in my_indices] + + if dp_rank == 0: + LOG.info( + "Data parallelism: dp_size=%d, total=%d, this rank=%d samples", + dp_size, + len(data), + len(my_data), + ) + + # Per-rank output file — visible during the run so progress can be + # monitored (e.g. ``wc -l output_rank*.jsonl``). Contains a + # ``_dp_global_idx`` field used for ordered merging at the end. + rank_file = output_dir / f"output_rank{dp_rank}.jsonl" + + # Suppress VLMEvalKit's per-sample print() on non-primary DP ranks to + # avoid 8x duplicate output in logs. + _real_stdout = sys.stdout + if dp_rank != 0: + sys.stdout = open(os.devnull, "w") + + try: + with open(rank_file, "w", encoding="utf-8") as fout: + iterator = tqdm(my_data, desc=f"mcore_skills[dp{dp_rank}]") if dp_rank == 0 else my_data + for local_idx, data_point in enumerate(iterator): + global_idx = my_indices[local_idx] + gen = self._generate_for_sample(data_point, data) + output = { + "_dp_global_idx": global_idx, + self.cfg.generation_key: gen, + **{k: v for k, v in data_point.items() if k != self.cfg.async_position_key}, + } + fout.write(json.dumps(output) + "\n") + fout.flush() + finally: + if dp_rank != 0: + sys.stdout.close() + sys.stdout = _real_stdout + + # Barrier: wait for all DP ranks to finish writing. + import torch.distributed as dist + + if dist.is_initialized(): + dist.barrier() + + # Rank 0 merges per-rank files into the final ordered output. + if dp_rank == 0: + all_results: dict[int, str] = {} + for r in range(dp_size): + rf = output_dir / f"output_rank{r}.jsonl" + if rf.exists() and rf.stat().st_size > 0: + with open(rf, "rt", encoding="utf-8") as fin: + for line in fin: + entry = json.loads(line) + idx = entry.pop("_dp_global_idx") + all_results[idx] = json.dumps(entry) + + mode = "a" if self.cfg.skip_filled and Path(self.cfg.output_file).exists() else "w" + merged_lines = [all_results[idx] + "\n" for idx in sorted(all_results.keys())] + with open(self.cfg.output_file, mode, encoding="utf-8") as fout: + fout.writelines(merged_lines) + LOG.info( + "Merged %d results from %d DP ranks into %s", + len(all_results), + dp_size, + self.cfg.output_file, + ) + + # Clean up per-rank files after successful merge. + for r in range(dp_size): + rf = output_dir / f"output_rank{r}.jsonl" + rf.unlink(missing_ok=True) + + # Evaluate using VLMEvalKit (same as eval_kit.py does). + # Done BEFORE marking .done so failed metrics prevent false completion. + self._evaluate_results() + + Path(f"{self.cfg.output_file}.done").touch() + + def _evaluate_results(self): + """Compute metrics using VLMEvalKit's evaluation functions. + + Uses the same asr_wer() that eval_kit.py calls via dataset.evaluate(), + so metrics are identical. Saves eval_kit_metrics.json (consumed by + EvalKitMetrics in the summarize step). 
+ """ + output_file = self.cfg.output_file + if not output_file or not Path(output_file).exists(): + return + + output_path = Path(output_file) + + try: + from vlmeval.dataset.avlm.utils import asr_wer + + # Read entries and build VLMEvalKit-format results list + entries = [] + results = [] + with open(output_file, "rt", encoding="utf-8") as fin: + for line in fin: + entry = json.loads(line) + # Strip leftover tags (older runs may have them) + gen_key = self.cfg.generation_key + gen = entry.get(gen_key, "") + cleaned = self._strip_thinking_tags(gen) + if cleaned != gen: + entry[gen_key] = cleaned + entries.append(entry) + results.append( + { + "gt": entry.get("expected_answer", ""), + "pred": entry[gen_key], + } + ) + + # Re-write output.jsonl with cleaned generations + with open(output_file, "w", encoding="utf-8") as fout: + for entry in entries: + fout.write(json.dumps(entry) + "\n") + + # Compute WER using VLMEvalKit (same function as eval_kit path) + wer_score = asr_wer(results) + LOG.info("ASR WER: %.2f%%", wer_score) + + # Save as eval_kit_metrics.json (same format eval_kit.py writes) + metrics = {"wer": wer_score} + metrics_file = output_path.parent / "eval_kit_metrics.json" + with open(metrics_file, "w", encoding="utf-8") as f: + json.dump(metrics, f, indent=2) + LOG.info("Metrics saved to %s", metrics_file) + + except ImportError: + LOG.warning( + "VLMEvalKit asr_wer not available — skipping eval-kit-style metrics. " + "The summarize_results job will compute metrics separately." + ) + except Exception: + LOG.exception("Inline metrics computation failed") + + +GENERATION_TASK_CLASS = MegatronMCoreGenerationTask + +cs = hydra.core.config_store.ConfigStore.instance() +cs.store(name="base_mcore_skills_config", node=MegatronMCoreConfig) + + +@hydra.main(version_base=None, config_name="base_mcore_skills_config") +def main(cfg: MegatronMCoreConfig): + cfg = MegatronMCoreConfig(_init_nested=True, **cfg) + task = MegatronMCoreGenerationTask(cfg) + task.generate() + + +if __name__ == "__main__": + import nemo_skills.utils as utils + + utils.setup_logging() + main() diff --git a/nemo_skills/pipeline/eval.py b/nemo_skills/pipeline/eval.py index 37e9c38095..69d2e27aeb 100644 --- a/nemo_skills/pipeline/eval.py +++ b/nemo_skills/pipeline/eval.py @@ -36,6 +36,34 @@ LOG = logging.getLogger(get_logger_name(__file__)) +def _apply_task_overrides(combined_cmd, task_classes, job_num_gpus, cluster_config): + """Apply env/torchrun/container overrides declared by generation task classes. + + Returns (modified_cmd, container). 
+ """ + # Environment prefix (first non-empty wins; jobs are not mixed across task types) + for tc in task_classes: + prefix = tc.get_env_prefix() if hasattr(tc, "get_env_prefix") else "" + if prefix: + combined_cmd = f"{prefix}{combined_cmd}" + break + + # Torchrun for multi-GPU data-parallel inference + if any(getattr(tc, "USE_TORCHRUN", False) for tc in task_classes): + if job_num_gpus and int(job_num_gpus) > 1: + combined_cmd = combined_cmd.replace("python -m ", f"torchrun --nproc_per_node {job_num_gpus} -m ", 1) + + # Container selection (task class CONTAINER_KEY, falling back to nemo-skills default) + container = cluster_config["containers"]["nemo-skills"] + for tc in task_classes: + key = getattr(tc, "CONTAINER_KEY", None) + if key and key in cluster_config.get("containers", {}): + container = cluster_config["containers"][key] + break + + return combined_cmd, container + + class SingleNodeMode(str, enum.Enum): sequential = "sequential" parallel = "parallel" @@ -416,6 +444,20 @@ def eval( sbatch_kwargs = parse_kwargs(sbatch_kwargs, exclusive=exclusive, qos=qos, time_min=time_min) get_random_port = pipeline_utils.should_get_random_port(server_gpus, exclusive) + + # Build extra_package_dirs: include any dirs declared by generation task classes + extra_pkg_dirs = [] + seen_pkg_dirs = set() + for ba in benchmarks_dict.values(): + task_cls = ba.generation_task_class + if task_cls is not None and hasattr(task_cls, "get_extra_package_dirs"): + for pkg_dir in task_cls.get_extra_package_dirs(): + if pkg_dir not in seen_pkg_dirs: + seen_pkg_dirs.add(pkg_dir) + extra_pkg_dirs.append(pkg_dir) + LOG.info("Packaging extra dir from %s: %s", task_cls.__name__, pkg_dir) + extra_pkg_dirs = extra_pkg_dirs or None + has_tasks = False job_id_to_tasks = {} benchmark_to_judge_tasks = {} @@ -434,20 +476,34 @@ def eval( job_server_address, job_server_command, job_sandbox_env_overrides, + job_num_gpus, ) = job_args prev_tasks = _task_dependencies for _ in range(dependent_jobs + 1): has_tasks = True + combined_cmd = pipeline_utils.wrap_python_path(cmd=combine_cmds(cmds, single_node_mode)) + + # Apply env/torchrun/container overrides from generation task classes + job_task_classes = [ + benchmarks_dict[b].generation_task_class + for b in job_benchmarks + if benchmarks_dict[b].generation_task_class is not None + ] + combined_cmd, job_container = _apply_task_overrides( + combined_cmd, job_task_classes, job_num_gpus, cluster_config + ) + new_task = pipeline_utils.add_task( exp, - cmd=pipeline_utils.wrap_python_path(cmd=combine_cmds(cmds, single_node_mode)), + cmd=combined_cmd, task_name=f"{expname}-{'-'.join(job_benchmarks)}", log_dir=log_dir, - container=main_container or cluster_config["containers"]["nemo-skills"], + container=main_container or job_container, cluster_config=cluster_config, partition=partition, account=account, + num_gpus=job_num_gpus, server_config=job_server_config, with_sandbox=job_needs_sandbox or with_sandbox, keep_mounts_for_sandbox=job_needs_sandbox_to_keep_mounts or keep_mounts_for_sandbox, @@ -461,6 +517,7 @@ def eval( prev_tasks if cluster_config["executor"] == "slurm" else all_tasks + _task_dependencies ), get_server_command=job_server_command, + extra_package_dirs=extra_pkg_dirs, sbatch_kwargs=sbatch_kwargs, installation_command=installation_command, skip_hf_home_check=skip_hf_home_check, @@ -609,6 +666,8 @@ def eval( command += f" --wandb_group={wandb_group} " if wandb_project: command += f" --wandb_project={wandb_project} " + if data_dir: + command += f" --data_dir={data_dir} " if 
metrics_kwargs: command += f" --metrics_kwargs='{kwargs_to_string(metrics_kwargs)}' " diff --git a/nemo_skills/pipeline/utils/eval.py b/nemo_skills/pipeline/utils/eval.py index ff1ab345fe..0edc91b50f 100644 --- a/nemo_skills/pipeline/utils/eval.py +++ b/nemo_skills/pipeline/utils/eval.py @@ -30,10 +30,24 @@ LOG = logging.getLogger(get_logger_name(__file__)) +def _resolve_generation_task_class(module_name: str): + """Import a generation module and return its GENERATION_TASK_CLASS, or None.""" + try: + if module_name.endswith(".py") or os.sep in module_name: + path_suffix = ".py" if not module_name.endswith(".py") else "" + mod = import_from_path(module_name + path_suffix) + else: + mod = importlib.import_module(module_name) + return getattr(mod, "GENERATION_TASK_CLASS", None) + except (ImportError, ModuleNotFoundError): + LOG.debug("Could not resolve GENERATION_TASK_CLASS from %s", module_name) + return None + + @dataclass class BenchmarkArgs: name: str - input_file: str + input_file: str | None generation_args: str judge_args: str judge_pipeline_args: dict @@ -46,10 +60,14 @@ class BenchmarkArgs: metrics_type: str | None = None benchmark_group: str | None = None score_module: str | None = None + self_contained_task: bool = False + num_gpus: int | None = None # For self-contained tasks that need GPU allocation on the main task job_ids: list[int] = field(default_factory=list) remaining_jobs: list[dict] = field(default_factory=list) # Per-benchmark sandbox environment overrides in KEY=VALUE form sandbox_env_overrides: list[str] = field(default_factory=list) + # Resolved GENERATION_TASK_CLASS (populated by prepare_eval_commands) + generation_task_class: type | None = None @property def requires_judge(self): @@ -91,50 +109,57 @@ def get_benchmark_args_from_module( local_data_path=None, data_dir=None, ): + skip_input_file = getattr(benchmark_module, "SKIP_INPUT_FILE", False) + self_contained_task = getattr(benchmark_module, "SELF_CONTAINED_TASK", False) + if split is None: split = get_arg_from_module_or_dict(benchmark_module, "EVAL_SPLIT", "test", override_dict) - if not is_on_cluster: - if pipeline_utils.is_mounted_filepath(cluster_config, data_path) or cluster_config["executor"] == "none": - input_file = f"{data_path}/{benchmark.replace('.', '/')}/{split}.jsonl" - if local_data_path is not None: - unmounted_path = f"{local_data_path}/{benchmark.replace('.', '/')}/{split}.jsonl" + input_file = None + if not skip_input_file: + if not is_on_cluster: + if pipeline_utils.is_mounted_filepath(cluster_config, data_path) or cluster_config["executor"] == "none": + input_file = f"{data_path}/{benchmark.replace('.', '/')}/{split}.jsonl" + if local_data_path is not None: + unmounted_path = f"{local_data_path}/{benchmark.replace('.', '/')}/{split}.jsonl" + else: + unmounted_input_file = pipeline_utils.get_unmounted_path(cluster_config, input_file) + unmounted_path = str( + Path(__file__).parents[3] / unmounted_input_file.replace("/nemo_run/code/", "") + ) else: - unmounted_input_file = pipeline_utils.get_unmounted_path(cluster_config, input_file) - unmounted_path = str(Path(__file__).parents[3] / unmounted_input_file.replace("/nemo_run/code/", "")) + # will be copied over in this case as it must come from extra datasets + input_file = f"/nemo_run/code/{Path(data_path).name}/{benchmark.replace('.', '/')}/{split}.jsonl" + unmounted_path = Path(data_path) / benchmark.replace(".", "/") / f"{split}.jsonl" else: - # will be copied over in this case as it must come from extra datasets - input_file = 
f"/nemo_run/code/{Path(data_path).name}/{benchmark.replace('.', '/')}/{split}.jsonl" - unmounted_path = Path(data_path) / benchmark.replace(".", "/") / f"{split}.jsonl" - else: - # on cluster we will always use the mounted path - input_file = f"{data_path}/{benchmark.replace('.', '/')}/{split}.jsonl" - unmounted_path = pipeline_utils.get_unmounted_path(cluster_config, input_file) - - unmounted_path = str(unmounted_path) - # When data_dir is specified, use it for both input_file and the existence check - # data_dir is always assumed to be a mounted path - if data_dir: - data_dir_unmounted = pipeline_utils.get_unmounted_path(cluster_config, data_dir) - input_file = f"{data_dir}/{benchmark.replace('.', '/')}/{split}.jsonl" - check_path = f"{data_dir_unmounted}/{benchmark.replace('.', '/')}/{split}.jsonl" - else: - check_path = unmounted_path - # checking if data file exists (can check locally as well) - if is_on_cluster: - if not pipeline_utils.cluster_path_exists(cluster_config, check_path): - raise ValueError( - f"Data file {check_path} does not exist on cluster. " - "Please check the benchmark and split parameters. " - "Did you forget to run prepare data commands or add data_dir argument?" - ) - else: - if not Path(check_path).exists(): - raise ValueError( - f"Data file {check_path} does not exist locally. " - "Please check the benchmark and split parameters. " - "Did you forget to run prepare data commands or add data_dir argument?" - ) + # on cluster we will always use the mounted path + input_file = f"{data_path}/{benchmark.replace('.', '/')}/{split}.jsonl" + unmounted_path = pipeline_utils.get_unmounted_path(cluster_config, input_file) + + unmounted_path = str(unmounted_path) + # When data_dir is specified, use it for both input_file and the existence check + # data_dir is always assumed to be a mounted path + if data_dir: + data_dir_unmounted = pipeline_utils.get_unmounted_path(cluster_config, data_dir) + input_file = f"{data_dir}/{benchmark.replace('.', '/')}/{split}.jsonl" + check_path = f"{data_dir_unmounted}/{benchmark.replace('.', '/')}/{split}.jsonl" + else: + check_path = unmounted_path + # checking if data file exists (can check locally as well) + if is_on_cluster: + if not pipeline_utils.cluster_path_exists(cluster_config, check_path): + raise ValueError( + f"Data file {check_path} does not exist on cluster. " + "Please check the benchmark and split parameters. " + "Did you forget to run prepare data commands or add data_dir argument?" + ) + else: + if not Path(check_path).exists(): + raise ValueError( + f"Data file {check_path} does not exist locally. " + "Please check the benchmark and split parameters. " + "Did you forget to run prepare data commands or add data_dir argument?" + ) # this is deprecated, should remove in the future prompt_config = get_arg_from_module_or_dict(benchmark_module, "PROMPT_CONFIG", "", override_dict=override_dict) @@ -146,6 +171,11 @@ def get_benchmark_args_from_module( if eval_args: generation_args = f"{eval_args} {generation_args}" generation_args += f" ++eval_config.split={split} " + + # Let the dataset module inject extra generation args (e.g. 
++vlm_dataset=) + if hasattr(benchmark_module, "get_extra_generation_args"): + generation_args += benchmark_module.get_extra_generation_args(benchmark) + requires_sandbox = get_arg_from_module_or_dict(benchmark_module, "REQUIRES_SANDBOX", False, override_dict) keep_mounts_for_sandbox = get_arg_from_module_or_dict( benchmark_module, "KEEP_MOUNTS_FOR_SANDBOX", False, override_dict @@ -202,6 +232,7 @@ def get_benchmark_args_from_module( benchmark_group=benchmark_group, metrics_type=metrics_type, sandbox_env_overrides=sandbox_env_overrides, + self_contained_task=self_contained_task, ) @@ -366,6 +397,24 @@ def prepare_eval_commands( ): LOG.warning("Found benchmark (%s) which requires sandbox to keep mounts, enabling it.", benchmark) + # Resolve GENERATION_TASK_CLASS for each benchmark and query declarative attributes. + # Each task class declares its own is_self_contained(), get_env_prefix(), etc. + for ba in benchmarks_dict.values(): + effective_module = generation_module or ba.generation_module + task_cls = _resolve_generation_task_class(effective_module) + ba.generation_task_class = task_cls + if task_cls is not None and hasattr(task_cls, "is_self_contained"): + if task_cls.is_self_contained(extra_arguments): + ba.self_contained_task = True + if server_parameters["server_gpus"]: + ba.num_gpus = server_parameters["server_gpus"] + # Allow task class to override metrics_type (e.g. mcore_skills uses + # VLMEvalKit evaluation and writes eval_kit_metrics.json). + if task_cls is not None and hasattr(task_cls, "METRICS_TYPE_OVERRIDE"): + ba.metrics_type = task_cls.METRICS_TYPE_OVERRIDE + + has_self_contained = any(ba.self_contained_task for ba in benchmarks_dict.values()) + total_evals = 0 for benchmark, benchmark_args in benchmarks_dict.items(): if benchmark_args.num_samples == 0: @@ -396,6 +445,12 @@ def prepare_eval_commands( # if num_jobs is -1, we run all benchmarks in parallel num_jobs = total_evals + # Self-contained tasks (e.g., eval_kit mcore mode) bypass the server/client split + # and manage their own GPU allocation, so each benchmark must get its own job (no grouping). + if has_self_contained and num_jobs != total_evals: + LOG.info("Self-contained tasks detected, forcing num_jobs = total_evals (no job grouping).") + num_jobs = total_evals + if num_jobs == 0: return benchmarks_dict, [] @@ -408,6 +463,7 @@ def prepare_eval_commands( cur_job_idx = 0 get_random_port = pipeline_utils.should_get_random_port(server_parameters["server_gpus"], exclusive) + job_server_config, job_server_address, job_extra_arguments = pipeline_utils.configure_client( **server_parameters, extra_arguments=extra_arguments, @@ -461,10 +517,31 @@ def prepare_eval_commands( "which is not supported for evaluation when grouping jobs." ) + # Self-contained tasks don't use NeMo Skills server, so skip + # server-related args that configure_client adds to job_extra_arguments. + # Tasks with configure_client_overrides translate server params into + # their own config format (e.g. eval_kit uses flat ++server_url instead + # of nested ++server.* overrides). 
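+ # (Illustrative: such a task would receive something like
+ # "++server_url=http://HOST:PORT" rather than the nested "++server.*"
+ # overrides produced by configure_client.)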
+ if benchmark_args.self_contained_task: + effective_extra_args = extra_arguments + elif hasattr(generation_task, "configure_client_overrides"): + # rsplit to handle URLs like http://host:port (takes last colon) + host, port = (job_server_address or "localhost:5000").rsplit(":", 1) + model = server_parameters["model"] + server_type = server_parameters["server_type"] + task_overrides = generation_task.configure_client_overrides( + host=host, + port=int(port), + model=model, + server_type=server_type, + ) + effective_extra_args = f"{task_overrides} {extra_arguments}" + else: + effective_extra_args = job_extra_arguments full_extra_arguments = ( f"{generation_task.get_generation_default_args()} " f"{benchmark_args.generation_args} " - f"{job_extra_arguments} " + f"{effective_extra_args} " ) cmd = pipeline_utils.get_generation_cmd( @@ -504,30 +581,43 @@ def prepare_eval_commands( env_source[key] = b job_sandbox_env_overrides = [f"{k}={v}" for k, v in env_map.items()] + # For self-contained tasks, override server config and get num_gpus + job_num_gpus = None + is_self_contained_job = any(benchmarks_dict[b].self_contained_task for b in job_benchmarks) + if is_self_contained_job: + effective_server_config = None + for b in job_benchmarks: + if benchmarks_dict[b].num_gpus is not None: + job_num_gpus = benchmarks_dict[b].num_gpus + break + else: + effective_server_config = job_server_config + # TODO: move to a dataclass job_batches.append( ( job_cmds, - job_benchmarks, + sorted(job_benchmarks), job_needs_sandbox, job_needs_sandbox_to_keep_mounts, - job_server_config, + effective_server_config, job_server_address, # a check above guarantees that this is the same for all tasks in a job generation_task.get_server_command_fn(), job_sandbox_env_overrides, + job_num_gpus, ) ) - job_server_config, job_server_address, job_extra_arguments = pipeline_utils.configure_client( - **server_parameters, - extra_arguments=extra_arguments, - get_random_port=get_random_port, - ) for job_benchmark in job_benchmarks: benchmarks_dict[job_benchmark].job_ids.append(cur_job_idx) cur_job_idx += 1 job_cmds = [] job_benchmarks = set() + job_server_config, job_server_address, job_extra_arguments = pipeline_utils.configure_client( + **server_parameters, + extra_arguments=extra_arguments, + get_random_port=get_random_port, + ) cur_eval += 1 diff --git a/nemo_skills/pipeline/utils/generation.py b/nemo_skills/pipeline/utils/generation.py index 4432181cff..812120655c 100644 --- a/nemo_skills/pipeline/utils/generation.py +++ b/nemo_skills/pipeline/utils/generation.py @@ -136,9 +136,15 @@ def build_requirements_venv_cmd(requirements: list[str]) -> str: 'mkdir -p "$VENV_ROOT" && ' 'if [ ! -f "$READY_FILE" ]; then ' ' if mkdir "$LOCK_DIR" 2>/dev/null; then ' - ' if ! uv venv --system-site-packages "$VENV_DIR"; then rmdir "$LOCK_DIR"; exit 1; fi; ' - ' . "$VENV_DIR/bin/activate"; ' - ' if ! uv pip install -r "$REQS_FILE"; then rmdir "$LOCK_DIR"; exit 1; fi; ' + " if command -v uv >/dev/null 2>&1; then " + ' if ! uv venv --system-site-packages "$VENV_DIR"; then rmdir "$LOCK_DIR"; exit 1; fi; ' + ' . "$VENV_DIR/bin/activate"; ' + ' if ! uv pip install -r "$REQS_FILE"; then rmdir "$LOCK_DIR"; exit 1; fi; ' + " else " + ' if ! python3 -m venv --system-site-packages "$VENV_DIR"; then rmdir "$LOCK_DIR"; exit 1; fi; ' + ' . "$VENV_DIR/bin/activate"; ' + ' if ! 
python3 -m pip install -r "$REQS_FILE"; then rmdir "$LOCK_DIR"; exit 1; fi; ' + " fi; " ' touch "$READY_FILE"; ' ' rmdir "$LOCK_DIR"; ' " else " @@ -434,8 +440,6 @@ def get_generation_cmd( If requirements are provided, a per-requirements uv venv is prepared and activated before running the generation command. """ - if input_file is None and input_dir is None: - raise ValueError("Either input_file or input_dir must be provided.") if input_file is not None and input_dir is not None: raise ValueError("Please provide either input_file or input_dir, not both.") @@ -458,7 +462,9 @@ def get_generation_cmd( hydra_config_args, override_args = separate_hydra_args(extra_arguments) # Handle file paths vs module names - common_args = f"++skip_filled=True ++input_file={input_file} ++output_file={output_file}" + common_args = f"++skip_filled=True ++output_file={output_file}" + if input_file is not None: + common_args += f" ++input_file={input_file}" if script.endswith(".py") or os.sep in script: # It's a file path, run it directly with .py extension script_path = script if script.endswith(".py") else f"{script}.py" diff --git a/requirements/eval-kit.txt b/requirements/eval-kit.txt new file mode 100644 index 0000000000..dcfc99670b --- /dev/null +++ b/requirements/eval-kit.txt @@ -0,0 +1,3 @@ +# VLMEvalKit (vlmeval) is installed at job start via --installation_command +# in run_eval_kit.sh (pip install from mounted cluster source). +# This file is kept as a placeholder; no venv-based requirements are needed.