Add emergent_tts dataset and evaluation scripts#1249

Open
vmendelev wants to merge 3 commits into main from pr5/emergent-tts-dataset

Conversation


@vmendelev vmendelev commented Feb 18, 2026

Summary

  • Add emergent_tts benchmark dataset with preparation and scoring scripts
  • Add Hydra configs for emergent TTS evaluation
  • Add conversion script from nemo-skills outputs to emergent format
  • Add emergent_tts README documentation
  • Add dependency checking utilities

Files added

  • nemo_skills/dataset/emergent_tts/__init__.py — dataset registration
  • nemo_skills/dataset/emergent_tts/prepare.py — data preparation
  • nemo_skills/dataset/emergent_tts/README.md — documentation
  • nemo_skills/dataset/emergent_tts/emergent/ — emergent framework integration
  • nemo_skills/dataset/emergent_tts/scripts/ — scoring, evaluation, and conversion scripts
  • nemo_skills/dataset/emergent_tts/scripts/config/ — Hydra config files
  • .gitignore — minor additions for TTS artifacts

Test plan

  • Run python -m pytest tests/ to verify no regressions
  • Verify emergent_tts dataset preparation works end-to-end

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added EmergentTTS-Eval dataset integration for evaluating text-to-speech outputs.
    • Introduced evaluation pipeline with generation, scoring, and aggregation stages.
    • Added dependency checker tool and data preparation utilities.
    • Included multiple configuration templates for different evaluation scenarios.
  • Documentation

    • Added comprehensive README documenting EmergentTTS-Eval setup and workflow.

vmendelev and others added 3 commits February 18, 2026 07:09
Introduce the emergent_tts dataset package with prepare/generate/score helpers and default configs to run EmergentTTS evaluation via NeMo-Skills.

Co-authored-by: Cursor <cursoragent@cursor.com>
Install google-genai for EmergentTTS-Eval, run scoring from the dataset base dir so relative paths resolve, and avoid shipping large local caches/data. Document EmergentTTS-Eval usage in nv_tts guide.

Co-authored-by: Cursor <cursoragent@cursor.com>
Document dataset preparation (HF_TOKEN) and evaluation workflow, including cloning and patching EmergentTTS-Eval for NVIDIA Inference API judging.

Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Walkthrough

This PR adds EmergentTTS-Eval dataset integration to NeMo-Skills, including data preparation, dependency checking, configuration management, audio conversion, and pipeline orchestration for end-to-end TTS evaluation workflows.

Changes

  • Baseline Setup (.gitignore, nemo_skills/dataset/emergent_tts/__init__.py, nemo_skills/dataset/emergent_tts/README.md): Added gitignore patterns for cache and data directories; created the emergent_tts package with comprehensive documentation of the dataset integration and evaluation workflow.
  • Dataset Preparation & Dependencies (nemo_skills/dataset/emergent_tts/prepare.py, nemo_skills/dataset/emergent_tts/scripts/check_deps.py): Implemented a data preparation script with HuggingFace dataset loading, baseline audio export, JSONL generation, and checkpoint downloads; added a dependency validation utility for the prepare and scoring stages.
  • Audio Conversion & Formatting (nemo_skills/dataset/emergent_tts/scripts/convert_ns_outputs_to_emergent.py): Converts NeMo-Skills TTS outputs to the Emergent audio layout via symlink or copy operations; extracts metadata and handles overwrite semantics.
  • Pipeline Orchestration (nemo_skills/dataset/emergent_tts/scripts/run_tts_eval.py, nemo_skills/dataset/emergent_tts/scripts/score.py): Orchestrates multi-stage evaluation (generation, scoring, aggregation); implements EmergentTTS-Eval scoring with metrics collection and per-benchmark processing; handles environment setup and API configuration.
  • Configuration Templates (nemo_skills/dataset/emergent_tts/scripts/config/{default,interactive_10,local_interactive_10,local_interactive_10_base}.yaml): Added YAML configurations for distributed cluster execution, interactive smoke-testing, and local evaluation scenarios with generation and scoring parameters.
  • Module Constants (nemo_skills/dataset/emergent_tts/emergent/__init__.py, nemo_skills/dataset/emergent_tts/scripts/__init__.py): Defined the GENERATION_ARGS constant for prompt formatting and added package docstrings.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Orchestrator as Orchestrator<br/>(run_tts_eval.py)
    participant Prepare as Prepare<br/>(prepare.py)
    participant TTSServer as TTS Server<br/>(NeMo-Skills)
    participant Scorer as Scorer<br/>(score.py)
    participant Emergent as EmergentTTS-Eval

    User->>Orchestrator: run with config
    Note over Orchestrator: Load config & setup
    
    Orchestrator->>Prepare: Stage 1: run_generation()
    Prepare->>Prepare: Load HF dataset
    Prepare->>Prepare: Export baseline audio
    Prepare->>Prepare: Write JSONL & download checkpoint
    Prepare-->>Orchestrator: Ready for generation
    
    Orchestrator->>TTSServer: Stage 2: Submit TTS inference
    Note over TTSServer: Generate outputs via<br/>NeMo-Skills
    TTSServer-->>Orchestrator: outputs.jsonl
    
    Orchestrator->>Scorer: Stage 2: run_scoring()
    Scorer->>Scorer: Convert NS outputs to Emergent
    Scorer->>Emergent: Run evaluation
    Note over Emergent: Score WER, MOS, Win-rate
    Emergent-->>Scorer: metrics.json
    Scorer-->>Orchestrator: Scoring complete
    
    Orchestrator->>Scorer: Stage 3: run_aggregation()
    Scorer->>Scorer: Aggregate metrics
    Scorer-->>Orchestrator: Summary results
    
    Orchestrator-->>User: Done!

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • #1157: Updates GENERATION_ARGS across audio-related dataset packages; shares the pattern of module-level constant configuration used in this PR's emergent_tts module.

Suggested labels

enhancement

Suggested reviewers

  • melllinia
  • gwarmstrong
  • Kipok
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Add emergent_tts dataset and evaluation scripts' accurately and concisely summarizes the main changeset, clearly conveying the primary addition of a new benchmark dataset with associated tooling.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

🧹 Nitpick comments (7)
nemo_skills/dataset/emergent_tts/scripts/check_deps.py (2)

19-24: Narrow the exception to ImportError; move return None to else.

except Exception is flagged by the linter (BLE001) and violates the project guideline: "Do not catch exceptions when they are not normally expected to be raised." Python's import system documentation states: "If the loader cannot execute the module, it should raise an ImportError, although any other exception raised during exec_module() will be propagated." Catching all Exception silently swallows unexpected errors (e.g. a broken torch CUDA extension raising OSError) that should bubble up. The static analysis also flags TRY300 for the return None placement.

♻️ Proposed fix
 def _try_import(module: str) -> str | None:
     try:
         importlib.import_module(module)
-        return None
-    except Exception as e:
+    except ImportError as e:
         return f"{module} ({type(e).__name__}: {e})"
+    else:
+        return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/check_deps.py` around lines 19 - 24,
The _try_import function currently catches all Exception and returns None inside
the try; change it to only catch ImportError so unexpected errors propagate, and
move the successful-return None into an else block after the try/except.
Specifically, in _try_import use importlib.import_module(module) in the try,
catch only ImportError to return the formatted error string, and place return
None in an else clause so import failures are handled but other exceptions
(e.g., OSError from extensions) are not swallowed.

27-28: Clarify the comment on parents[4] — it describes the file location, not the variable.

The inline comment # .../nemo_skills/dataset/emergent_tts/scripts reads as though it describes the resolved value of repo_root, but parents[4] from check_deps.py walks four levels up (scripts → emergent_tts → dataset → nemo_skills → repo root). The comment actually describes the file's own location.

🔧 Suggested clarification
-    repo_root = Path(__file__).resolve().parents[4]  # .../nemo_skills/dataset/emergent_tts/scripts
+    # __file__ is at nemo_skills/dataset/emergent_tts/scripts/check_deps.py → parents[4] is the repo root
+    repo_root = Path(__file__).resolve().parents[4]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/check_deps.py` around lines 27 - 28,
The inline comment next to repo_root in _venv_install_hint is misleading: update
the comment so it clearly states that Path(__file__).resolve().parents[4]
returns the repository root (walking up from the file location), and if you want
to show the file location mention Path(__file__).resolve() separately; e.g.,
clarify that parents[4] corresponds to the repo root relative to the script and
not the script path itself and adjust the comment near repo_root accordingly.
nemo_skills/dataset/emergent_tts/scripts/score.py (3)

54-73: sys.argv manipulation to call convert_main is fragile.

Monkey-patching sys.argv to invoke convert_main() as a library call is error-prone (not thread-safe, breaks if convert_main is changed to use a global parser, etc.). Consider refactoring the converter to expose a function that accepts parameters directly, then call it here.

Sketch: refactor converter to accept params

In convert_ns_outputs_to_emergent.py, extract the core logic into a callable function:

def convert(ns_output_jsonl: str, out_dir: str, mode: str = "symlink", overwrite: bool = False):
    # ... current main() logic with these params instead of argparse ...

Then in score.py:

 def _convert(ns_output_jsonl: Path, out_dir: Path, overwrite: bool) -> None:
-    from nemo_skills.dataset.emergent_tts.scripts.convert_ns_outputs_to_emergent import main as convert_main
-
-    # Reuse converter as a library via argv.
-    import sys
-
-    argv = sys.argv
-    try:
-        sys.argv = [
-            argv[0],
-            "--ns_output_jsonl",
-            str(ns_output_jsonl),
-            "--out_dir",
-            str(out_dir),
-            "--mode",
-            "symlink",
-        ] + (["--overwrite"] if overwrite else [])
-        convert_main()
-    finally:
-        sys.argv = argv
+    from nemo_skills.dataset.emergent_tts.scripts.convert_ns_outputs_to_emergent import convert
+    convert(str(ns_output_jsonl), str(out_dir), mode="symlink", overwrite=overwrite)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/score.py` around lines 54 - 73, The
current _convert function monkey-patches sys.argv and calls convert_main(),
which is fragile and not thread-safe; refactor
convert_ns_outputs_to_emergent.convert_ns_outputs_to_emergent (the module that
currently exposes main as convert_main) to expose a direct API like
convert(ns_output_jsonl: str | Path, out_dir: str | Path, mode: str = "symlink",
overwrite: bool = False) that implements the core logic, make the existing
main() simply parse args and call that convert(...) helper, and then update
_convert to import and call convert(ns_output_jsonl, out_dir, mode="symlink",
overwrite=overwrite) instead of mutating sys.argv and calling convert_main().

176-186: Consider atomic write for metrics.json.

If json.dump fails mid-write (e.g., disk full), metrics.json will be left in a corrupted state. Since metrics are fully loaded into memory first (line 181), writing to a temp file and renaming would be safer. This is a minor concern given it's an output file, not input data.

As per coding guidelines, "perform all computations before re-opening files for writing to avoid accidental data loss if code fails during execution".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/score.py` around lines 176 - 186,
The code writes metrics.json directly which can leave a corrupt file if
json.dump fails; instead, after loading metrics (the variable metrics) write to
a temporary file in the same directory (derived from bench_dir) and then
atomically replace/rename it to bench_dir/"metrics.json" (use Path.replace or
os.replace) so that emergent_metrics_path, metrics, bench_dir and bench are
still used to locate and log the file; ensure the temp file is created in
bench_dir, closed/flushed before replace, and cleaned up on error.

93-122: Global state mutation (os.environ, os.chdir) in _run_emergent_scoring.

Setting os.environ["EMERGENT_TTS_DATA_BASE_PATH"] (line 94) and changing the working directory (line 102) are global side effects. The chdir is properly restored in a finally block, but the environment variable is never cleaned up. If run_scoring is called multiple times with different emergent_data_base_path values, the env var will be stale between calls. Consider restoring the original env var value in the finally block as well.

Proposed fix (capturing the previous value before it is overwritten)
+    prev_env = os.environ.get("EMERGENT_TTS_DATA_BASE_PATH")
     os.environ["EMERGENT_TTS_DATA_BASE_PATH"] = str(emergent_data_base_path)
 
     prev_cwd = os.getcwd()
     try:
         os.chdir(str(emergent_data_base_path.parent))
         emergent_inference.eval_api_closed_model(
             ...
         )
     finally:
         os.chdir(prev_cwd)
+        if prev_env is None:
+            os.environ.pop("EMERGENT_TTS_DATA_BASE_PATH", None)
+        else:
+            os.environ["EMERGENT_TTS_DATA_BASE_PATH"] = prev_env
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/score.py` around lines 93 - 122, The
code sets os.environ["EMERGENT_TTS_DATA_BASE_PATH"] and chdirs without restoring
the environment; modify the routine (the block around
os.environ["EMERGENT_TTS_DATA_BASE_PATH"], prev_cwd, and the finally) to first
capture the original env value (orig_env =
os.environ.get("EMERGENT_TTS_DATA_BASE_PATH")) and prev_cwd, then set the env
and chdir, and in the finally restore both (os.environ[...] = orig_env or del
os.environ[...] if orig_env is None, and os.chdir(prev_cwd)); keep the existing
call to emergent_inference.eval_api_closed_model unchanged.
nemo_skills/dataset/emergent_tts/prepare.py (1)

64-89: Download retry catches bare Exception.

Line 84 catches Exception broadly. For network downloads, the expected failures are urllib.error.URLError, OSError, and ContentTooShortError (already handled above). Narrowing the catch to (URLError, OSError) would prevent masking unexpected bugs (e.g., KeyboardInterrupt is not caught by Exception, but MemoryError could be).

Proposed fix
+from urllib.error import ContentTooShortError, URLError
 ...
         except ContentTooShortError as e:
             # Partial download: wait and retry.
             wait_s = min(5 * attempt, 30)
             print(f"Warning: partial download for wv_mos.ckpt (attempt {attempt}/{max_attempts}): {e}")
             time.sleep(wait_s)
-        except Exception as e:
+        except (URLError, OSError) as e:
             wait_s = min(5 * attempt, 30)
             print(f"Warning: failed downloading wv_mos.ckpt (attempt {attempt}/{max_attempts}): {e}")
             time.sleep(wait_s)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/prepare.py` around lines 64 - 89, The retry
loop in _download_wv_mos currently catches a broad Exception which can mask
unrelated errors; change the second except clause to only catch
network/filesystem related errors (e.g., except (urllib.error.URLError, OSError)
as e) so the retry logic still handles download failures for WV_MOS_URL and
tmp_path operations but lets other unexpected exceptions propagate; keep the
existing ContentTooShortError handling, wait_s/backoff logic, tmp_path cleanup,
and final RuntimeError if attempts exhaust.
nemo_skills/dataset/emergent_tts/scripts/config/default.yaml (1)

48-48: pip install without version pins in installation_command.

pip install editdistance whisper-normalizer json-repair tenacity installs unpinned latest versions. This can cause non-reproducible environments. Consider pinning versions (e.g., editdistance==0.8.1) or at least documenting the known-working versions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/config/default.yaml` at line 48, The
installation_command currently installs unpinned packages (editdistance,
whisper-normalizer, json-repair, tenacity, and google-genai) which makes
environments non-reproducible; update the installation_command entry in
default.yaml to pin known-good versions for each of these packages (or add a
comment documenting tested versions), e.g., specify exact version constraints
for editdistance, whisper-normalizer, json-repair, tenacity, and keep the
existing --no-deps flag for google-genai if needed so installations are
reproducible and predictable.
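A pinned installation_command could take the shape below. The pins are placeholders only, not verified known-good versions; they should be filled in from the environment where the evaluation was last verified (for example via pip freeze).

```yaml
# Sketch of a pinned installation_command for default.yaml.
# <TESTED_VERSION> is a placeholder; substitute the versions from a
# verified working environment.
installation_command: >-
  pip install editdistance==<TESTED_VERSION> whisper-normalizer==<TESTED_VERSION>
  json-repair==<TESTED_VERSION> tenacity==<TESTED_VERSION>
  && pip install --no-deps google-genai==<TESTED_VERSION>
```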
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/dataset/emergent_tts/prepare.py`:
- Around line 193-202: The conversion from float samples to int16 can wrap if
samples exceed [-1,1]; update the code that creates audio_array_int16 (working
with variables audio_array, audio_array_int16, and the AudioSegment export to
wav_path) to first clamp/clip audio_array to [-1.0, 1.0] (e.g., via np.clip),
then scale and convert to int16 (preferably round then astype) before building
AudioSegment and exporting; this ensures no int16 overflow/wrap and avoids
audible artifacts.
- Around line 46-61: In _require_deps() replace the broad "except Exception"
with "except ImportError" so only failed imports are caught, and update the
RuntimeError message to remove the hardcoded developer path—use a generic
installation instruction (e.g., "activate your virtualenv and run: pip install
datasets numpy pydub tqdm librosa soundfile") so users get actionable,
non-personalized guidance; keep the original raised-from (from e) behavior when
re-raising the RuntimeError.
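The clamp-then-convert step described in the first prepare.py item above can be sketched like this. It is a minimal illustration with a hypothetical function name; in the real script the float samples come from the HuggingFace dataset before the AudioSegment export.

```python
import numpy as np


def float_to_int16(audio_array: np.ndarray) -> np.ndarray:
    """Clip float samples to [-1, 1] before scaling so out-of-range
    values saturate instead of wrapping around in int16."""
    clipped = np.clip(audio_array, -1.0, 1.0)
    # Round before casting to avoid a systematic truncation bias.
    return np.round(clipped * 32767.0).astype(np.int16)
```

Without the clip, a sample of 1.5 would scale to 49151, overflow int16, and wrap to a large negative value, which is exactly the audible-artifact case the comment warns about.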

In `@nemo_skills/dataset/emergent_tts/README.md`:
- Line 45: The README contains duplicate section numbering: update the three
heading lines so they read "### 3) Clone + patch EmergentTTS-Eval-public for
NVIDIA Inference API judging", change the second "### 3) Run evaluation" to "###
4) Run evaluation", and increment "### 4) Smoke test" to "### 5) Smoke test"
(these exact heading strings identify the locations to edit) so the section
numbers are sequential.
- Around line 20-28: Replace all developer-specific absolute paths in the
README.md code blocks (e.g., `/home/vmendelev/...` and
`/lustre/fsw/llmservice_nemo_speechlm/users/vmendelev/code`) with clear
placeholders like `<REPO_ROOT>` and `<CLUSTER_WORKDIR>` and update the repo
reference `<repo_url>` to the actual EmergentTTS-Eval-public repository URL;
ensure every code block that currently contains hard-coded paths (including the
examples around the prepare.py invocation and the git clone block) uses these
placeholders or environment variables (e.g., export REPO_ROOT="/path/to/repo")
so contributors can substitute their own paths.
- Around line 64-71: The README has a contradiction between the prose and the
example config about judger_base_url: the prose instructs threading
judger_base_url as the base URL (e.g., https://inference-api.nvidia.com/v1)
while the config example sets scoring.judger_base_url to the full chat
completions path; fix by choosing the base-only form and updating the example
and any code that consumes it (e.g., api_clients.py) to append the endpoint path
when constructing requests; ensure references to judger_base_url and
scoring.judger_base_url consistently use the base (no /v1/chat/completions) and
that OpenAI(...) or equivalent client is created with base_url=judger_base_url
and the code appends the proper /v1/chat/completions suffix where needed, and
update the README snippet and scripts/config/default.yaml to match.
- Around line 1-124: The README for emergent_tts is missing example evaluation
results; update nemo_skills/dataset/emergent_tts/README.md to include a short
"Expected results" section showing sample metrics (WER, MOS, win-rate) for at
least one tested model and the exact config used; reference the example config
path scripts/config/default.yaml (and interactive_10.yaml for smoke tests) and
the output filenames (eval-results/.../output.jsonl,
emergent-tts-eval_*_evaluation-metrics.json, metrics.json) so readers can
reproduce the run and compare their numbers.

In `@nemo_skills/dataset/emergent_tts/scripts/check_deps.py`:
- Around line 57-80: The duplicate missing-module entries occur because modules
like "pydub", "librosa", and "soundfile" are checked in both the prepare and
scoring loops and appended to missing twice; modify the checks around args.stage
so each module is only added once by deduplicating before appending (e.g.,
maintain a local seen set or build a single ordered list of modules to check
when args.stage == "all") and continue to use _try_import(module) to produce err
and append to missing only if module not in seen; ensure you reference the
existing variables and functions (_try_import, missing, args.stage) and preserve
existing loop/error handling semantics.

In `@nemo_skills/dataset/emergent_tts/scripts/config/default.yaml`:
- Around line 9-18: default.yaml currently embeds developer-specific absolute
paths for keys like container, output_dir, nemo_code_path, data_dir,
scoring_code_path, and emergent_data_dir and also contains a repeated typo
"containters"; replace those hardcoded paths with clear placeholders (e.g.
<CONTAINER_PATH>, <OUTPUT_DIR>, <NEMO_CODE_PATH>, etc.) or
environment-variable-style tokens and add short inline comments explaining which
must be customized, and fix the spelling of the "container" key (remove
"containters") wherever it appears so consumers get a canonical, editable
default config.

In `@nemo_skills/dataset/emergent_tts/scripts/config/interactive_10.yaml`:
- Around line 3-9: The YAML uses hardcoded user-specific absolute paths and a
typo: change the `container` key value (currently containing "containters") to
`containers` and replace all user-specific absolute paths referenced by keys
`container`, `mount_paths`, `output_dir`, and `nemo_code_path` with reusable
placeholders or environment-interpolated values (e.g., use
`${oc.env:LUSTRE_BASE}` or `${env:LUSTRE_BASE}` depending on your config system)
so other developers can override them at runtime; ensure the updated `container`
key points to the correct container path expression and that `mount_paths`,
`output_dir`, and `nemo_code_path` use the same base variable to construct their
paths.

In
`@nemo_skills/dataset/emergent_tts/scripts/config/local_interactive_10_base.yaml`:
- Around line 11-14: Replace hard-coded developer paths in the YAML (keys
output_dir, nemo_code_path, and data_dir) with portable placeholders or
environment-expanded variables and/or exclude local variants via .gitignore;
specifically, change the values for output_dir, nemo_code_path, and data_dir to
use Hydra/env placeholders like ${oc.env:HOME}/path or clearly-marked tokens
such as <YOUR_WORKSPACE>/... so other contributors don't need to edit the file,
and add a gitignore rule (e.g.,
nemo_skills/dataset/emergent_tts/scripts/config/local_*.yaml) so personal
local_*.yaml files are not committed.

In `@nemo_skills/dataset/emergent_tts/scripts/config/local_interactive_10.yaml`:
- Around line 4-5: Replace the hard-coded developer paths in this config by
converting the absolute values for output_dir, nemo_code_path, and data_dir into
either placeholder variables (e.g., "<PATH_TO_OUTPUT>", "<PATH_TO_NEMO_CODE>",
"<PATH_TO_DATA>") or reference environment variables (e.g., ${ENV_VAR}) and
remove or generalize the usage comment that contains the specific
NEMO_SKILLS_CONFIG_DIR developer path; alternatively, omit this local config
from the repo and add it to .gitignore. Update the entries named output_dir,
nemo_code_path, data_dir and the usage comment so the file no longer contains
absolute developer-specific paths.

In `@nemo_skills/dataset/emergent_tts/scripts/convert_ns_outputs_to_emergent.py`:
- Around line 57-86: The loop currently silently skips records when a
destination file already exists (the check at dst.exists() and not
args.overwrite) without incrementing any counter; add a new counter variable
(e.g., existing_skipped or skipped_existing) initialized alongside
converted/skipped/missing, increment it inside the dst.exists() and not
args.overwrite branch, and include that counter in the final print summary along
with converted, skipped (no unique_id_eval), and missing_audio so the user can
see how many files were skipped because they already existed; modify references
around _link_or_copy, dst, args.overwrite, and out_dir accordingly.
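A counting version of the conversion loop described above could be structured like this. The names mirror those in the comment (converted, skipped, missing_audio, skipped_existing), but the record layout and helper are simplified stand-ins for the real converter.

```python
from pathlib import Path


def summarize_conversion(records: list[dict], out_dir: Path, overwrite: bool) -> dict:
    """Track why each record was or was not converted, including
    destinations skipped because they already exist."""
    converted = skipped = missing_audio = skipped_existing = 0
    for rec in records:
        if "unique_id_eval" not in rec:
            skipped += 1  # no eval ID: cannot place the file
            continue
        src = rec.get("audio_path")
        if not src or not Path(src).exists():
            missing_audio += 1
            continue
        dst = out_dir / f"{rec['unique_id_eval']}.wav"
        if dst.exists() and not overwrite:
            skipped_existing += 1  # previously silent: now counted
            continue
        # _link_or_copy(src, dst) would run here in the real script.
        converted += 1
    return {
        "converted": converted,
        "skipped": skipped,
        "missing_audio": missing_audio,
        "skipped_existing": skipped_existing,
    }
```

Surfacing skipped_existing in the final summary tells the user whether a stale run is shadowing fresh outputs, which the current silent skip hides.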

In `@nemo_skills/dataset/emergent_tts/scripts/run_tts_eval.py`:
- Around line 152-168: The aggregation branch currently runs only when
args.stage == "aggregation", so --stage all skips aggregation; modify the
condition in run_tts_eval.py (the block that builds agg_cmd and calls
ns_run_cmd) to also run when args.stage == "all" (e.g., if args.stage in
("aggregation","all")) or add a clear inline comment near args.stage explaining
that aggregation is intentionally separate and must be invoked with
"aggregation"; update references to agg_cmd and the ns_run_cmd call accordingly
so aggregation runs after scoring when requested.
- Around line 119-131: The command string score_cmd in run_tts_eval.py currently
inlines the secret via `JUDGER_API_KEY={judger_api_key}`; remove that prefix
from the constructed `score_cmd` and instead pass the key via the environment
when invoking `ns_run_cmd` (or ensure the Slurm job inherits the caller env),
e.g., add an env dict containing "JUDGER_API_KEY": judger_api_key to the
ns_run_cmd call or rely on the scoring script reading os.environ; leave the rest
of the command (flags built from scoring.get(...)) unchanged and ensure no other
code interpolates judger_api_key into any logged strings or job script content.

---

Nitpick comments:
In `@nemo_skills/dataset/emergent_tts/prepare.py`:
- Around line 64-89: The retry loop in _download_wv_mos currently catches a
broad Exception which can mask unrelated errors; change the second except clause
to only catch network/filesystem related errors (e.g., except
(urllib.error.URLError, OSError) as e) so the retry logic still handles download
failures for WV_MOS_URL and tmp_path operations but lets other unexpected
exceptions propagate; keep the existing ContentTooShortError handling,
wait_s/backoff logic, tmp_path cleanup, and final RuntimeError if attempts
exhaust.

In `@nemo_skills/dataset/emergent_tts/scripts/check_deps.py`:
- Around line 19-24: The _try_import function currently catches all Exception
and returns None inside the try; change it to only catch ImportError so
unexpected errors propagate, and move the successful-return None into an else
block after the try/except. Specifically, in _try_import use
importlib.import_module(module) in the try, catch only ImportError to return the
formatted error string, and place return None in an else clause so import
failures are handled but other exceptions (e.g., OSError from extensions) are
not swallowed.
- Around line 27-28: The inline comment next to repo_root in _venv_install_hint
is misleading: update the comment so it clearly states that
Path(__file__).resolve().parents[4] returns the repository root (walking up from
the file location), and if you want to show the file location mention
Path(__file__).resolve() separately; e.g., clarify that parents[4] corresponds
to the repo root relative to the script and not the script path itself and
adjust the comment near repo_root accordingly.

In `@nemo_skills/dataset/emergent_tts/scripts/config/default.yaml`:
- Line 48: The installation_command currently installs unpinned packages
(editdistance, whisper-normalizer, json-repair, tenacity, and google-genai)
which makes environments non-reproducible; update the installation_command entry
in default.yaml to pin known-good versions for each of these packages (or add a
comment documenting tested versions), e.g., specify exact version constraints
for editdistance, whisper-normalizer, json-repair, tenacity, and keep the
existing --no-deps flag for google-genai if needed so installations are
reproducible and predictable.

In `@nemo_skills/dataset/emergent_tts/scripts/score.py`:
- Around line 54-73: The current _convert function monkey-patches sys.argv and
calls convert_main(), which is fragile and not thread-safe; refactor
convert_ns_outputs_to_emergent.convert_ns_outputs_to_emergent (the module that
currently exposes main as convert_main) to expose a direct API like
convert(ns_output_jsonl: str | Path, out_dir: str | Path, mode: str = "symlink",
overwrite: bool = False) that implements the core logic, make the existing
main() simply parse args and call that convert(...) helper, and then update
_convert to import and call convert(ns_output_jsonl, out_dir, mode="symlink",
overwrite=overwrite) instead of mutating sys.argv and calling convert_main().
- Around line 176-186: The code writes metrics.json directly which can leave a
corrupt file if json.dump fails; instead, after loading metrics (the variable
metrics) write to a temporary file in the same directory (derived from
bench_dir) and then atomically replace/rename it to bench_dir/"metrics.json"
(use Path.replace or os.replace) so that emergent_metrics_path, metrics,
bench_dir and bench are still used to locate and log the file; ensure the temp
file is created in bench_dir, closed/flushed before replace, and cleaned up on
error.
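A minimal sketch of the write-temp-then-replace pattern (write_metrics_atomically is a hypothetical helper name):

```python
import json
import os
import tempfile
from pathlib import Path


def write_metrics_atomically(bench_dir: Path, metrics: dict) -> Path:
    """Dump metrics to a temp file in bench_dir, then atomically rename it
    to metrics.json, so a failed json.dump never leaves a corrupt file."""
    target = bench_dir / "metrics.json"
    # The temp file must live in bench_dir: rename is only atomic within
    # one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=bench_dir, suffix=".tmp")
    tmp = Path(tmp_name)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(metrics, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # flushed and durable before the rename
        tmp.replace(target)  # atomic replace (os.replace under the hood)
    except BaseException:
        tmp.unlink(missing_ok=True)  # clean up the partial temp file
        raise
    return target
```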
- Around line 93-122: The code sets os.environ["EMERGENT_TTS_DATA_BASE_PATH"]
and chdirs without restoring the environment; modify the routine (the block
around os.environ["EMERGENT_TTS_DATA_BASE_PATH"], prev_cwd, and the finally) to
first capture the original env value (orig_env =
os.environ.get("EMERGENT_TTS_DATA_BASE_PATH")) and prev_cwd, then set the env
and chdir, and in the finally restore both (os.environ[...] = orig_env or del
os.environ[...] if orig_env is None, and os.chdir(prev_cwd)); keep the existing
call to emergent_inference.eval_api_closed_model unchanged.
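The save/set/restore dance can be packaged as a context manager, e.g. (scoped_env_and_cwd is a hypothetical name; the emergent_inference.eval_api_closed_model call would go inside the with block):

```python
import contextlib
import os


@contextlib.contextmanager
def scoped_env_and_cwd(key, value, workdir):
    """Set one environment variable and chdir, then restore both on exit,
    whether the body succeeds or raises."""
    orig_env = os.environ.get(key)  # None if the variable was unset
    prev_cwd = os.getcwd()
    os.environ[key] = value
    os.chdir(workdir)
    try:
        yield
    finally:
        if orig_env is None:
            os.environ.pop(key, None)  # was unset before: unset it again
        else:
            os.environ[key] = orig_env
        os.chdir(prev_cwd)
```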

Comment on lines +46 to +61
def _require_deps():
    try:
        import numpy as np  # noqa: F401
        from datasets import load_dataset  # noqa: F401
        import librosa  # noqa: F401
        import soundfile  # noqa: F401
        from pydub import AudioSegment  # noqa: F401
        from tqdm import tqdm  # noqa: F401
    except Exception as e:  # pragma: no cover
        raise RuntimeError(
            "Missing dependencies for EmergentTTS-Eval preparation.\n\n"
            "Install into the repo venv:\n"
            "  cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval\n"
            "  . ./.venv/bin/activate\n"
            "  pip install datasets numpy pydub tqdm librosa soundfile\n"
        ) from e

⚠️ Potential issue | 🟠 Major

Hardcoded developer path in error message; catch ImportError instead of bare Exception.

Two issues:

  1. Line 58 contains a hardcoded developer-specific path (/home/vmendelev/workspace/...). This is meaningless to other users. Replace with a generic instruction.

  2. The except Exception on line 54 should be except ImportError — that's the only exception expected from failed imports. Catching broader exceptions can mask unrelated bugs (e.g., a library that imports successfully but fails during its own init for a different reason).

Proposed fix
 def _require_deps():
     try:
-        import numpy as np  # noqa: F401
-        from datasets import load_dataset  # noqa: F401
-        import librosa  # noqa: F401
-        import soundfile  # noqa: F401
-        from pydub import AudioSegment  # noqa: F401
-        from tqdm import tqdm  # noqa: F401
-    except Exception as e:  # pragma: no cover
+        import numpy as np
+        from datasets import load_dataset
+        import librosa
+        import soundfile
+        from pydub import AudioSegment
+        from tqdm import tqdm
+    except ImportError as e:  # pragma: no cover
         raise RuntimeError(
             "Missing dependencies for EmergentTTS-Eval preparation.\n\n"
-            "Install into the repo venv:\n"
-            "  cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval\n"
-            "  . ./.venv/bin/activate\n"
-            "  pip install datasets numpy pydub tqdm librosa soundfile\n"
+            "Install the required packages:\n"
+            "  pip install datasets numpy pydub tqdm librosa soundfile\n"
         ) from e

As per coding guidelines, "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving".

🧰 Tools
🪛 Ruff (0.15.1)

[warning] 48-53: Unused noqa directive (non-enabled: F401); remove unused noqa directive (RUF100), reported once per line.

[warning] 55-61: Avoid specifying long messages outside the exception class (TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/prepare.py` around lines 46 - 61, In
_require_deps() replace the broad "except Exception" with "except ImportError"
so only failed imports are caught, and update the RuntimeError message to remove
the hardcoded developer path—use a generic installation instruction (e.g.,
"activate your virtualenv and run: pip install datasets numpy pydub tqdm librosa
soundfile") so users get actionable, non-personalized guidance; keep the
original raised-from (from e) behavior when re-raising the RuntimeError.

Comment on lines +193 to +202
audio_array = curr["audio"]["array"]
audio_sr = int(curr["audio"]["sampling_rate"])
audio_array_int16 = np.int16(audio_array * 32767)
audio_segment = AudioSegment(
    audio_array_int16.tobytes(),
    frame_rate=audio_sr,
    sample_width=2,
    channels=1,
)
audio_segment.export(str(wav_path), format="wav")

⚠️ Potential issue | 🟡 Minor

Potential int16 overflow/wrap when audio samples exceed [-1, 1].

np.int16(audio_array * 32767) will silently wrap around if any sample exceeds the [-1.0, 1.0] range (e.g., a value of 1.001 * 32767 = 32799 wraps to -32737 as int16). This produces audible clicks/artifacts. Clip before converting.

Proposed fix
-                audio_array_int16 = np.int16(audio_array * 32767)
+                audio_array_int16 = np.int16(np.clip(audio_array, -1.0, 1.0) * 32767)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/prepare.py` around lines 193 - 202, The
conversion from float samples to int16 can wrap if samples exceed [-1,1]; update
the code that creates audio_array_int16 (working with variables audio_array,
audio_array_int16, and the AudioSegment export to wav_path) to first clamp/clip
audio_array to [-1.0, 1.0] (e.g., via np.clip), then scale and convert to int16
(preferably round then astype) before building AudioSegment and exporting; this
ensures no int16 overflow/wrap and avoids audible artifacts.
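A small numpy illustration of the clip-before-cast fix (the sample values are made up to show the out-of-range case):

```python
import numpy as np

# Samples slightly outside [-1, 1] (e.g. 1.001 from a resampler or gain
# stage) would wrap around when cast straight to int16; clipping first
# pins them to full scale instead.
audio = np.array([1.001, -1.5, 0.5, -0.25])
safe = np.int16(np.clip(audio, -1.0, 1.0) * 32767)
```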

Comment on lines +1 to +124
## EmergentTTS-Eval dataset (`emergent_tts`)

This dataset integration lets you:

- **Prepare** the EmergentTTS-Eval test set under a shared `data_dir` (download baseline audios + metadata + MOS model).
- **Generate** TTS outputs with NeMo-Skills (`ns eval` via `run_tts_eval.py`).
- **Score** the generated outputs with EmergentTTS-Eval (WER/MOS/win-rate, depending on config).

### 1) Prepare the test set (requires `HF_TOKEN`)

`prepare.py` downloads the dataset and writes all required artifacts into:

- `<DATA_DIR>/emergent_tts/emergent/test.jsonl`
- `<DATA_DIR>/emergent_tts/data/emergent_tts_eval_data.jsonl`
- `<DATA_DIR>/emergent_tts/data/baseline_audios/*.wav`
- `<DATA_DIR>/emergent_tts/data/wv_mos.ckpt`

Run it from your dev machine (or any environment with network access):

```bash
cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval
. ./.venv/bin/activate

export HF_TOKEN="<your_hf_token>"

python nemo_skills/dataset/emergent_tts/prepare.py \
    --output_dir "<DATA_DIR>/emergent_tts"
```

Optional flags:

- `--num_samples 10`: write only the first 10 samples (smoke test).
- `--overwrite`: re-download / regenerate outputs.

### 2) Configure evaluation

Use the example configs in `nemo_skills/dataset/emergent_tts/scripts/config/`.

In `scripts/config/default.yaml`, set:

- `generation.data_dir: <DATA_DIR>`
- `scoring.emergent_data_dir: <DATA_DIR>/emergent_tts/data`
- `scoring.scoring_code_path: <PATH_TO>/EmergentTTS-Eval-public` (on the cluster)

### 3) Clone + patch EmergentTTS-Eval-public for NVIDIA Inference API judging

On EOS (or wherever you run scoring), clone EmergentTTS-Eval:

```bash
cd /lustre/fsw/llmservice_nemo_speechlm/users/vmendelev/code
git clone <repo_url> EmergentTTS-Eval-public
```

Then update Emergent’s judge client selection so that **Gemini models are called via NVIDIA’s OpenAI-compatible Inference API**.

Target behavior:

- **Model name** stays as: `gcp/google/gemini-2.5-pro` (or similar).
- **Base URL** is NVIDIA Inference API: `https://inference-api.nvidia.com/v1`
- **API key** comes from: `JUDGER_API_KEY` (or `NVIDIA_API_KEY`)

Minimal patch checklist inside `EmergentTTS-Eval-public`:

- In `api_clients.py` (or wherever the client is chosen), ensure `gcp/google/*` uses an **OpenAI-compatible** client (not the Google SDK client), e.g.:
- `OpenAI(base_url=<judger_base_url>, api_key=os.getenv("JUDGER_API_KEY"))`
- Thread `judger_base_url` through so calls use `https://inference-api.nvidia.com/v1` (not the full `/v1/chat/completions` endpoint).

After patching, set these in `scripts/config/default.yaml`:

- `scoring.judge_model: gcp/google/gemini-2.5-pro`
- `scoring.judger_base_url: https://inference-api.nvidia.com/v1/chat/completions`

### 3) Run evaluation (generation + scoring)

From your dev machine, submit jobs to EOS:

```bash
cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval
. ./.venv/bin/activate
mkdir -p .nemo_run

export NEMORUN_HOME="$PWD/.nemo_run"
export NEMO_SKILLS_CONFIG_DIR=/home/vmendelev/workspace/expressiveness/src/ns_eval/cluster_configs
export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1

# Required for win-rate judging (NVIDIA Inference API key)
export JUDGER_API_KEY="<your_nvidia_api_key>"

python -m nemo_skills.dataset.emergent_tts.scripts.run_tts_eval \
    --config nemo_skills/dataset/emergent_tts/scripts/config/default.yaml \
    --stage all \
    --expname emergent_eval
```

### 4) Smoke test (10 samples, interactive)

```bash
cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval
. ./.venv/bin/activate
mkdir -p .nemo_run

export NEMORUN_HOME="$PWD/.nemo_run"
export NEMO_SKILLS_CONFIG_DIR=/home/vmendelev/workspace/expressiveness/src/ns_eval/cluster_configs
export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1

python -m nemo_skills.dataset.emergent_tts.scripts.run_tts_eval \
    --config nemo_skills/dataset/emergent_tts/scripts/config/interactive_10.yaml \
    --stage generation \
    --expname emergent_smoke10
```

### Outputs

NeMo-Skills generation writes:

- `<output_dir>/eval-results/emergent_tts.emergent/output.jsonl`
- `<output_dir>/eval-results/emergent_tts.emergent/audio/*.wav` (or equivalent)

Emergent scoring writes (in the same benchmark folder):

- `emergent-tts-eval_*_evaluation-predictions.jsonl`
- `emergent-tts-eval_*_evaluation-metrics.json`
- `metrics.json` (a NeMo-Skills-friendly copy of Emergent metrics)


⚠️ Potential issue | 🟡 Minor

Add expected evaluation results for at least one tested model.

The README documents the workflow thoroughly but does not include any sample metric output (e.g. WER, MOS, win-rate) for a reference model run. Based on learnings from CONTRIBUTING.md: "When adding new benchmarks, add documentation with example commands for how to run evaluation, expected results for tested models, and any dataset-specific details like special preparation arguments or non-standard inference arguments."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/README.md` around lines 1 - 124, The README
for emergent_tts is missing example evaluation results; update
nemo_skills/dataset/emergent_tts/README.md to include a short "Expected results"
section showing sample metrics (WER, MOS, win-rate) for at least one tested
model and the exact config used; reference the example config path
scripts/config/default.yaml (and interactive_10.yaml for smoke tests) and the
output filenames (eval-results/.../output.jsonl,
emergent-tts-eval_*_evaluation-metrics.json, metrics.json) so readers can
reproduce the run and compare their numbers.

Comment on lines +20 to +28
```bash
cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval
. ./.venv/bin/activate

export HF_TOKEN="<your_hf_token>"

python nemo_skills/dataset/emergent_tts/prepare.py \
    --output_dir "<DATA_DIR>/emergent_tts"
```

⚠️ Potential issue | 🟠 Major

Replace developer-specific absolute paths with generic placeholders.

The README hard-codes /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval (lines 21, 78, 98, and elsewhere) and /lustre/fsw/llmservice_nemo_speechlm/users/vmendelev/code (line 50) throughout every code block. These paths are specific to one developer's environment and will not work for any other contributor.

Replace them with environment variables or clearly marked placeholders, e.g. <REPO_ROOT>, <CLUSTER_WORKDIR>. The <repo_url> on line 51 also needs to be filled in with the actual EmergentTTS-Eval-public repository URL.

Also applies to: 49-52, 77-110

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/README.md` around lines 20 - 28, Replace all
developer-specific absolute paths in the README.md code blocks (e.g.,
`/home/vmendelev/...` and
`/lustre/fsw/llmservice_nemo_speechlm/users/vmendelev/code`) with clear
placeholders like `<REPO_ROOT>` and `<CLUSTER_WORKDIR>` and update the repo
reference `<repo_url>` to the actual EmergentTTS-Eval-public repository URL;
ensure every code block that currently contains hard-coded paths (including the
examples around the prepare.py invocation and the git clone block) uses these
placeholders or environment variables (e.g., export REPO_ROOT="/path/to/repo")
so contributors can substitute their own paths.

- `scoring.emergent_data_dir: <DATA_DIR>/emergent_tts/data`
- `scoring.scoring_code_path: <PATH_TO>/EmergentTTS-Eval-public` (on the cluster)

### 3) Clone + patch EmergentTTS-Eval-public for NVIDIA Inference API judging

⚠️ Potential issue | 🟡 Minor

Fix duplicate section number — two sections are labeled "3)".

Line 45 is ### 3) Clone + patch EmergentTTS-Eval-public and line 73 is ### 3) Run evaluation. The subsequent section (### 4) Smoke test) is also off by one. The correct numbering should be: 3 → Clone & patch, 4 → Run evaluation, 5 → Smoke test.

Also applies to: 73-73, 95-95

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/README.md` at line 45, The README contains
duplicate section numbering: update the three heading lines so they read "### 3)
Clone + patch EmergentTTS-Eval-public for NVIDIA Inference API judging", change
the second "### 3) Run evaluation" to "### 4) Run evaluation", and increment
"### 4) Smoke test" to "### 5) Smoke test" (these exact heading strings identify
the locations to edit) so the section numbers are sequential.

Comment on lines +11 to +14
output_dir: /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval/_local_runs/emergent_tts_smoke10_base

nemo_code_path: /home/vmendelev/workspace/expressiveness/src/NeMo


⚠️ Potential issue | 🟠 Major

Hard-coded developer paths make this config non-portable.

output_dir, nemo_code_path, and data_dir (lines 11, 13-14, 22-25) all reference /home/vmendelev/.... Any other contributor must manually edit these before using this file.

Either replace them with clearly-marked placeholders (e.g. ${oc.env:HOME}/... for Hydra, or <YOUR_WORKSPACE>/... as a comment hint) or add a .gitignore pattern such as nemo_skills/dataset/emergent_tts/scripts/config/local_*.yaml to keep local/personal variants out of the repository entirely, consistent with the existing cluster_configs/* pattern in .gitignore.

Also applies to: 22-25

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@nemo_skills/dataset/emergent_tts/scripts/config/local_interactive_10_base.yaml`
around lines 11 - 14, Replace hard-coded developer paths in the YAML (keys
output_dir, nemo_code_path, and data_dir) with portable placeholders or
environment-expanded variables and/or exclude local variants via .gitignore;
specifically, change the values for output_dir, nemo_code_path, and data_dir to
use Hydra/env placeholders like ${oc.env:HOME}/path or clearly-marked tokens
such as <YOUR_WORKSPACE>/... so other contributors don't need to edit the file,
and add a gitignore rule (e.g.,
nemo_skills/dataset/emergent_tts/scripts/config/local_*.yaml) so personal
local_*.yaml files are not committed.

Comment on lines +4 to +5
# export NEMO_SKILLS_CONFIG_DIR=/home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval/cluster_configs
# python -m nemo_skills.dataset.emergent_tts.scripts.run_tts_eval --config <this_file> --stage generation

⚠️ Potential issue | 🟠 Major

Same hard-coded developer paths as local_interactive_10_base.yaml.

output_dir, nemo_code_path, data_dir (lines 10, 14, 25), and even the usage comment on line 4 (NEMO_SKILLS_CONFIG_DIR=/home/vmendelev/...) are all absolute paths tied to a specific developer environment. Apply the same fix suggested for local_interactive_10_base.yaml: use placeholder values or exclude these local configs via .gitignore.

Also applies to: 10-10, 14-14, 25-25

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/config/local_interactive_10.yaml`
around lines 4 - 5, Replace the hard-coded developer paths in this config by
converting the absolute values for output_dir, nemo_code_path, and data_dir into
either placeholder variables (e.g., "<PATH_TO_OUTPUT>", "<PATH_TO_NEMO_CODE>",
"<PATH_TO_DATA>") or reference environment variables (e.g., ${ENV_VAR}) and
remove or generalize the usage comment that contains the specific
NEMO_SKILLS_CONFIG_DIR developer path; alternatively, omit this local config
from the repo and add it to .gitignore. Update the entries named output_dir,
nemo_code_path, data_dir and the usage comment so the file no longer contains
absolute developer-specific paths.

Comment on lines +57 to +86
converted = 0
skipped = 0
missing = 0

with open(args.ns_output_jsonl, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        user_json = _extract_user_json(record) or {}
        unique_id = user_json.get("unique_id_eval", record.get("unique_id_eval"))
        audio_path = (record.get("audio") or {}).get("path")

        if unique_id is None:
            skipped += 1
            continue
        if not audio_path or not os.path.exists(audio_path):
            missing += 1
            continue

        dst = out_dir / f"{unique_id}.wav"
        if dst.exists() and not args.overwrite:
            continue
        _link_or_copy(audio_path, str(dst), args.mode)
        converted += 1

print(
    f"Converted {converted} files into {out_dir}. "
    f"skipped(no unique_id_eval)={skipped}, missing_audio={missing}"

⚠️ Potential issue | 🟡 Minor

Silently skipped "already exists" records are not counted in the summary.

When dst.exists() and not args.overwrite (Line 79), the record is skipped without incrementing any counter. The final summary reports converted, skipped(no unique_id_eval), and missing_audio, but "already-exists" records are invisible. This makes it hard to tell whether a partial run actually processed everything.

Proposed fix
     converted = 0
     skipped = 0
     missing = 0
+    already_exists = 0

     with open(args.ns_output_jsonl, "r", encoding="utf-8") as f:
         for line in f:
@@ -79,6 +80,7 @@
             dst = out_dir / f"{unique_id}.wav"
             if dst.exists() and not args.overwrite:
+                already_exists += 1
                 continue
             _link_or_copy(audio_path, str(dst), args.mode)
             converted += 1

     print(
         f"Converted {converted} files into {out_dir}. "
-        f"skipped(no unique_id_eval)={skipped}, missing_audio={missing}"
+        f"skipped(no unique_id_eval)={skipped}, missing_audio={missing}, already_exists={already_exists}"
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/convert_ns_outputs_to_emergent.py`
around lines 57 - 86, The loop currently silently skips records when a
destination file already exists (the check at dst.exists() and not
args.overwrite) without incrementing any counter; add a new counter variable
(e.g., existing_skipped or skipped_existing) initialized alongside
converted/skipped/missing, increment it inside the dst.exists() and not
args.overwrite branch, and include that counter in the final print summary along
with converted, skipped (no unique_id_eval), and missing_audio so the user can
see how many files were skipped because they already existed; modify references
around _link_or_copy, dst, args.overwrite, and out_dir accordingly.

Comment on lines +119 to +131
score_cmd = (
    (f"cd {emergent_data_base_dir} && " if emergent_data_base_dir else "")
    + f"JUDGER_API_KEY={judger_api_key} "
    + f"PYTHONPATH={scoring_code_path}:$PYTHONPATH "
    + "python -m nemo_skills.dataset.emergent_tts.scripts.score "
    + f"--results_dir {output_dir} "
    + f"--benchmark {benchmark} "
    + f"--emergent_data_dir {emergent_data_dir} "
    + f"--judge_model {scoring.get('judge_model', 'gcp/google/gemini-2.5-pro')} "
    + f"--judger_base_url {scoring.get('judger_base_url', 'https://inference-api.nvidia.com/v1/chat/completions')} "
    + f"--num_threads {int(scoring.get('num_threads', 8))} "
    + f"--evaluate_function {scoring.get('evaluate_function', 'win_rate')}"
)

⚠️ Potential issue | 🟠 Major

API key embedded in plain text in the shell command string.

Line 121 interpolates judger_api_key directly into the command string: JUDGER_API_KEY={judger_api_key}. This command is passed to ns_run_cmd, which likely persists it in Slurm job scripts and log files, leaking the secret. Prefer passing the key via the environment rather than inlining it in the command.

Possible approach

If ns_run_cmd supports environment variable propagation, pass JUDGER_API_KEY as an env var. Alternatively, have the scoring script read the key from the environment itself (it already can via os.environ), and ensure the Slurm job inherits the caller's environment. Remove the inline JUDGER_API_KEY=... prefix from score_cmd.

             score_cmd = (
                 (f"cd {emergent_data_base_dir} && " if emergent_data_base_dir else "")
-                + f"JUDGER_API_KEY={judger_api_key} "
                 + f"PYTHONPATH={scoring_code_path}:$PYTHONPATH "
                 + "python -m nemo_skills.dataset.emergent_tts.scripts.score "

Then in score.py, read JUDGER_API_KEY from the environment at scoring time (which it already does indirectly via the Emergent client).
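A generic sketch of the pattern (run_scoring and the subprocess call are illustrative only; ns_run_cmd's real environment-propagation mechanism may differ):

```python
import os
import subprocess


def run_scoring(score_cmd: str, judger_api_key: str):
    """Hand the key to the child through its environment, so it never
    appears in the command string, job scripts, or logs."""
    env = dict(os.environ)
    env["JUDGER_API_KEY"] = judger_api_key  # not interpolated into score_cmd
    return subprocess.run(score_cmd, shell=True, env=env, check=True)
```

The command string itself then contains no secret and can be logged freely.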

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/run_tts_eval.py` around lines 119 -
131, The command string score_cmd in run_tts_eval.py currently inlines the
secret via `JUDGER_API_KEY={judger_api_key}`; remove that prefix from the
constructed `score_cmd` and instead pass the key via the environment when
invoking `ns_run_cmd` (or ensure the Slurm job inherits the caller env), e.g.,
add an env dict containing "JUDGER_API_KEY": judger_api_key to the ns_run_cmd
call or rely on the scoring script reading os.environ; leave the rest of the
command (flags built from scoring.get(...)) unchanged and ensure no other code
interpolates judger_api_key into any logged strings or job script content.

Comment on lines +152 to +168
if args.stage == "aggregation":
    print("\n" + "=" * 60)
    print("Stage 3: AGGREGATION")
    print("=" * 60)
    agg_cmd = f"python -m nemo_skills.dataset.emergent_tts.scripts.score --results_dir {output_dir} --aggregation_only"
    ns_run_cmd(
        ctx=MockContext(),
        cluster=cfg["cluster"],
        container=cfg["container"],
        partition=cfg["partition"],
        num_gpus=0,
        mount_paths=cfg["mount_paths"],
        command=agg_cmd,
        reuse_code=False,
        expname=f"{args.expname}_agg",
        log_dir=f"{output_dir}/eval-logs",
    )

⚠️ Potential issue | 🟡 Minor

--stage all does not include aggregation.

Line 152 uses == (if args.stage == "aggregation"), so running --stage all executes generation and scoring but skips aggregation. The docstring mentions "Generation -> Scoring (-> Aggregation)" which hints this is by design (aggregation requires scoring to complete first and may be a separate step), but this will surprise users who expect "all" to mean all stages. Consider adding a clarifying comment or including aggregation in "all" with a run_after dependency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/run_tts_eval.py` around lines 152 -
168, The aggregation branch currently runs only when args.stage ==
"aggregation", so --stage all skips aggregation; modify the condition in
run_tts_eval.py (the block that builds agg_cmd and calls ns_run_cmd) to also run
when args.stage == "all" (e.g., if args.stage in ("aggregation","all")) or add a
clear inline comment near args.stage explaining that aggregation is
intentionally separate and must be invoked with "aggregation"; update references
to agg_cmd and the ns_run_cmd call accordingly so aggregation runs after scoring
when requested.
