Add emergent_tts dataset and evaluation scripts#1249

Open
vmendelev wants to merge 3 commits into main from pr5/emergent-tts-dataset

Conversation


@vmendelev vmendelev commented Feb 18, 2026

Summary

  • Add emergent_tts benchmark dataset with preparation and scoring scripts
  • Add Hydra configs for emergent TTS evaluation
  • Add conversion script from nemo-skills outputs to emergent format
  • Add emergent_tts README documentation
  • Add dependency checking utilities

Files added

  • nemo_skills/dataset/emergent_tts/__init__.py — dataset registration
  • nemo_skills/dataset/emergent_tts/prepare.py — data preparation
  • nemo_skills/dataset/emergent_tts/README.md — documentation
  • nemo_skills/dataset/emergent_tts/emergent/ — emergent framework integration
  • nemo_skills/dataset/emergent_tts/scripts/ — scoring, evaluation, and conversion scripts
  • nemo_skills/dataset/emergent_tts/scripts/config/ — Hydra config files
  • .gitignore — minor additions for TTS artifacts

Test plan

  • Run python -m pytest tests/ to verify no regressions
  • Verify emergent_tts dataset preparation works end-to-end

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added EmergentTTS-Eval dataset integration for evaluating text-to-speech outputs.
    • Introduced evaluation pipeline with generation, scoring, and aggregation stages.
    • Added dependency checker tool and data preparation utilities.
    • Included multiple configuration templates for different evaluation scenarios.
  • Documentation

    • Added comprehensive README documenting EmergentTTS-Eval setup and workflow.

vmendelev and others added 3 commits February 18, 2026 07:09
Introduce the emergent_tts dataset package with prepare/generate/score helpers and default configs to run EmergentTTS evaluation via NeMo-Skills.

Co-authored-by: Cursor <cursoragent@cursor.com>
Install google-genai for EmergentTTS-Eval, run scoring from the dataset base dir so relative paths resolve, and avoid shipping large local caches/data. Document EmergentTTS-Eval usage in nv_tts guide.

Co-authored-by: Cursor <cursoragent@cursor.com>
Document dataset preparation (HF_TOKEN) and evaluation workflow, including cloning and patching EmergentTTS-Eval for NVIDIA Inference API judging.

Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Walkthrough

This PR adds EmergentTTS-Eval dataset integration to NeMo-Skills, including data preparation, dependency checking, configuration management, audio conversion, and pipeline orchestration for end-to-end TTS evaluation workflows.

Changes

  • Baseline Setup (.gitignore, nemo_skills/dataset/emergent_tts/__init__.py, nemo_skills/dataset/emergent_tts/README.md): Added gitignore patterns for cache and data directories; created the emergent_tts package with comprehensive documentation of the dataset integration and evaluation workflow.
  • Dataset Preparation & Dependencies (nemo_skills/dataset/emergent_tts/prepare.py, nemo_skills/dataset/emergent_tts/scripts/check_deps.py): Implemented a data preparation script with HuggingFace dataset loading, baseline audio export, JSONL generation, and checkpoint downloads; added a dependency validation utility for the prepare and scoring stages.
  • Audio Conversion & Formatting (nemo_skills/dataset/emergent_tts/scripts/convert_ns_outputs_to_emergent.py): Converts NeMo-Skills TTS outputs to the Emergent audio layout via symlink or copy operations; extracts metadata and handles overwrite semantics.
  • Pipeline Orchestration (nemo_skills/dataset/emergent_tts/scripts/run_tts_eval.py, nemo_skills/dataset/emergent_tts/scripts/score.py): Orchestrates multi-stage evaluation (generation, scoring, aggregation); implements EmergentTTS-Eval scoring with metrics collection and per-benchmark processing; handles environment setup and API configuration.
  • Configuration Templates (nemo_skills/dataset/emergent_tts/scripts/config/{default,interactive_10,local_interactive_10,local_interactive_10_base}.yaml): Added YAML configurations for distributed cluster execution, interactive smoke-testing, and local evaluation scenarios with generation and scoring parameters.
  • Module Constants (nemo_skills/dataset/emergent_tts/emergent/__init__.py, nemo_skills/dataset/emergent_tts/scripts/__init__.py): Defined the GENERATION_ARGS constant for prompt formatting and added package docstrings.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Orchestrator as Orchestrator<br/>(run_tts_eval.py)
    participant Prepare as Prepare<br/>(prepare.py)
    participant TTSServer as TTS Server<br/>(NeMo-Skills)
    participant Scorer as Scorer<br/>(score.py)
    participant Emergent as EmergentTTS-Eval

    User->>Orchestrator: run with config
    Note over Orchestrator: Load config & setup
    
    Orchestrator->>Prepare: Stage 1: run_generation()
    Prepare->>Prepare: Load HF dataset
    Prepare->>Prepare: Export baseline audio
    Prepare->>Prepare: Write JSONL & download checkpoint
    Prepare-->>Orchestrator: Ready for generation
    
    Orchestrator->>TTSServer: Stage 2: Submit TTS inference
    Note over TTSServer: Generate outputs via<br/>NeMo-Skills
    TTSServer-->>Orchestrator: outputs.jsonl
    
    Orchestrator->>Scorer: Stage 2: run_scoring()
    Scorer->>Scorer: Convert NS outputs to Emergent
    Scorer->>Emergent: Run evaluation
    Note over Emergent: Score WER, MOS, Win-rate
    Emergent-->>Scorer: metrics.json
    Scorer-->>Orchestrator: Scoring complete
    
    Orchestrator->>Scorer: Stage 3: run_aggregation()
    Scorer->>Scorer: Aggregate metrics
    Scorer-->>Orchestrator: Summary results
    
    Orchestrator-->>User: Done!

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • #1157: Updates GENERATION_ARGS across audio-related dataset packages; shares the pattern of module-level constant configuration used in this PR's emergent_tts module.

Suggested labels

enhancement

Suggested reviewers

  • melllinia
  • gwarmstrong
  • Kipok
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Add emergent_tts dataset and evaluation scripts' accurately and concisely summarizes the main changeset, clearly conveying the primary addition of a new benchmark dataset with associated tooling.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

🧹 Nitpick comments (7)
nemo_skills/dataset/emergent_tts/scripts/check_deps.py (2)

19-24: Narrow the exception to ImportError; move return None to else.

except Exception is flagged by the linter (BLE001) and violates the project guideline: "Do not catch exceptions when they are not normally expected to be raised." Python's import system documentation states: "If the loader cannot execute the module, it should raise an ImportError, although any other exception raised during exec_module() will be propagated." Catching all Exception silently swallows unexpected errors (e.g. a broken torch CUDA extension raising OSError) that should bubble up. The static analysis also flags TRY300 for the return None placement.

♻️ Proposed fix
 def _try_import(module: str) -> str | None:
     try:
         importlib.import_module(module)
-        return None
-    except Exception as e:
+    except ImportError as e:
         return f"{module} ({type(e).__name__}: {e})"
+    else:
+        return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/check_deps.py` around lines 19 - 24,
The _try_import function currently catches all Exception and returns None inside
the try; change it to only catch ImportError so unexpected errors propagate, and
move the successful-return None into an else block after the try/except.
Specifically, in _try_import use importlib.import_module(module) in the try,
catch only ImportError to return the formatted error string, and place return
None in an else clause so import failures are handled but other exceptions
(e.g., OSError from extensions) are not swallowed.

27-28: Clarify the comment on parents[4] — it describes the file location, not the variable.

The inline comment # .../nemo_skills/dataset/emergent_tts/scripts reads as though it describes the resolved value of repo_root, but parents[4] from check_deps.py walks four levels up (scripts → emergent_tts → dataset → nemo_skills → repo root). The comment actually describes the file's own location.

🔧 Suggested clarification
-    repo_root = Path(__file__).resolve().parents[4]  # .../nemo_skills/dataset/emergent_tts/scripts
+    # __file__ is at nemo_skills/dataset/emergent_tts/scripts/check_deps.py → parents[4] is the repo root
+    repo_root = Path(__file__).resolve().parents[4]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/check_deps.py` around lines 27 - 28,
The inline comment next to repo_root in _venv_install_hint is misleading: update
the comment so it clearly states that Path(__file__).resolve().parents[4]
returns the repository root (walking up from the file location), and if you want
to show the file location mention Path(__file__).resolve() separately; e.g.,
clarify that parents[4] corresponds to the repo root relative to the script and
not the script path itself and adjust the comment near repo_root accordingly.
nemo_skills/dataset/emergent_tts/scripts/score.py (3)

54-73: sys.argv manipulation to call convert_main is fragile.

Monkey-patching sys.argv to invoke convert_main() as a library call is error-prone (not thread-safe, breaks if convert_main is changed to use a global parser, etc.). Consider refactoring the converter to expose a function that accepts parameters directly, then call it here.

Sketch: refactor converter to accept params

In convert_ns_outputs_to_emergent.py, extract the core logic into a callable function:

def convert(ns_output_jsonl: str, out_dir: str, mode: str = "symlink", overwrite: bool = False):
    # ... current main() logic with these params instead of argparse ...

Then in score.py:

 def _convert(ns_output_jsonl: Path, out_dir: Path, overwrite: bool) -> None:
-    from nemo_skills.dataset.emergent_tts.scripts.convert_ns_outputs_to_emergent import main as convert_main
-
-    # Reuse converter as a library via argv.
-    import sys
-
-    argv = sys.argv
-    try:
-        sys.argv = [
-            argv[0],
-            "--ns_output_jsonl",
-            str(ns_output_jsonl),
-            "--out_dir",
-            str(out_dir),
-            "--mode",
-            "symlink",
-        ] + (["--overwrite"] if overwrite else [])
-        convert_main()
-    finally:
-        sys.argv = argv
+    from nemo_skills.dataset.emergent_tts.scripts.convert_ns_outputs_to_emergent import convert
+    convert(str(ns_output_jsonl), str(out_dir), mode="symlink", overwrite=overwrite)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/score.py` around lines 54 - 73, The
current _convert function monkey-patches sys.argv and calls convert_main(),
which is fragile and not thread-safe; refactor
convert_ns_outputs_to_emergent.convert_ns_outputs_to_emergent (the module that
currently exposes main as convert_main) to expose a direct API like
convert(ns_output_jsonl: str | Path, out_dir: str | Path, mode: str = "symlink",
overwrite: bool = False) that implements the core logic, make the existing
main() simply parse args and call that convert(...) helper, and then update
_convert to import and call convert(ns_output_jsonl, out_dir, mode="symlink",
overwrite=overwrite) instead of mutating sys.argv and calling convert_main().

176-186: Consider atomic write for metrics.json.

If json.dump fails mid-write (e.g., disk full), metrics.json will be left in a corrupted state. Since metrics are fully loaded into memory first (line 181), writing to a temp file and renaming would be safer. This is a minor concern given it's an output file, not input data.

As per coding guidelines, "perform all computations before re-opening files for writing to avoid accidental data loss if code fails during execution".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/score.py` around lines 176 - 186,
The code writes metrics.json directly which can leave a corrupt file if
json.dump fails; instead, after loading metrics (the variable metrics) write to
a temporary file in the same directory (derived from bench_dir) and then
atomically replace/rename it to bench_dir/"metrics.json" (use Path.replace or
os.replace) so that emergent_metrics_path, metrics, bench_dir and bench are
still used to locate and log the file; ensure the temp file is created in
bench_dir, closed/flushed before replace, and cleaned up on error.

93-122: Global state mutation (os.environ, os.chdir) in _run_emergent_scoring.

Setting os.environ["EMERGENT_TTS_DATA_BASE_PATH"] (line 94) and changing the working directory (line 102) are global side effects. The chdir is properly restored in a finally block, but the environment variable is never cleaned up. If run_scoring is called multiple times with different emergent_data_base_path values, the env var will be stale between calls. Consider restoring the original env var value in the finally block as well.

Proposed fix (capturing the previous value before it is overwritten)
+    prev_env = os.environ.get("EMERGENT_TTS_DATA_BASE_PATH")
     os.environ["EMERGENT_TTS_DATA_BASE_PATH"] = str(emergent_data_base_path)
 
     prev_cwd = os.getcwd()
     try:
         os.chdir(str(emergent_data_base_path.parent))
         emergent_inference.eval_api_closed_model(
             ...
         )
     finally:
         os.chdir(prev_cwd)
+        if prev_env is None:
+            os.environ.pop("EMERGENT_TTS_DATA_BASE_PATH", None)
+        else:
+            os.environ["EMERGENT_TTS_DATA_BASE_PATH"] = prev_env
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/score.py` around lines 93 - 122, The
code sets os.environ["EMERGENT_TTS_DATA_BASE_PATH"] and chdirs without restoring
the environment; modify the routine (the block around
os.environ["EMERGENT_TTS_DATA_BASE_PATH"], prev_cwd, and the finally) to first
capture the original env value (orig_env =
os.environ.get("EMERGENT_TTS_DATA_BASE_PATH")) and prev_cwd, then set the env
and chdir, and in the finally restore both (os.environ[...] = orig_env or del
os.environ[...] if orig_env is None, and os.chdir(prev_cwd)); keep the existing
call to emergent_inference.eval_api_closed_model unchanged.
nemo_skills/dataset/emergent_tts/prepare.py (1)

64-89: Download retry catches bare Exception.

Line 84 catches Exception broadly. For network downloads, the expected failures are urllib.error.URLError, OSError, and ContentTooShortError (already handled above). Narrowing the catch to (URLError, OSError) would prevent masking unexpected bugs (e.g., KeyboardInterrupt is not caught by Exception, but MemoryError could be).

Proposed fix
+from urllib.error import ContentTooShortError, URLError
 ...
         except ContentTooShortError as e:
             # Partial download: wait and retry.
             wait_s = min(5 * attempt, 30)
             print(f"Warning: partial download for wv_mos.ckpt (attempt {attempt}/{max_attempts}): {e}")
             time.sleep(wait_s)
-        except Exception as e:
+        except (URLError, OSError) as e:
             wait_s = min(5 * attempt, 30)
             print(f"Warning: failed downloading wv_mos.ckpt (attempt {attempt}/{max_attempts}): {e}")
             time.sleep(wait_s)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/prepare.py` around lines 64 - 89, The retry
loop in _download_wv_mos currently catches a broad Exception which can mask
unrelated errors; change the second except clause to only catch
network/filesystem related errors (e.g., except (urllib.error.URLError, OSError)
as e) so the retry logic still handles download failures for WV_MOS_URL and
tmp_path operations but lets other unexpected exceptions propagate; keep the
existing ContentTooShortError handling, wait_s/backoff logic, tmp_path cleanup,
and final RuntimeError if attempts exhaust.
nemo_skills/dataset/emergent_tts/scripts/config/default.yaml (1)

48-48: pip install without version pins in installation_command.

pip install editdistance whisper-normalizer json-repair tenacity installs unpinned latest versions. This can cause non-reproducible environments. Consider pinning versions (e.g., editdistance==0.8.1) or at least documenting the known-working versions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/config/default.yaml` at line 48, The
installation_command currently installs unpinned packages (editdistance,
whisper-normalizer, json-repair, tenacity, and google-genai) which makes
environments non-reproducible; update the installation_command entry in
default.yaml to pin known-good versions for each of these packages (or add a
comment documenting tested versions), e.g., specify exact version constraints
for editdistance, whisper-normalizer, json-repair, tenacity, and keep the
existing --no-deps flag for google-genai if needed so installations are
reproducible and predictable.
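A pinned installation_command could take the shape below. The pins are placeholders only, not verified known-good versions; they should be filled in from the environment where the evaluation was last verified (for example via pip freeze).

```yaml
# Sketch of a pinned installation_command for default.yaml.
# <TESTED_VERSION> is a placeholder; substitute the versions from a
# verified working environment.
installation_command: >-
  pip install editdistance==<TESTED_VERSION> whisper-normalizer==<TESTED_VERSION>
  json-repair==<TESTED_VERSION> tenacity==<TESTED_VERSION>
  && pip install --no-deps google-genai==<TESTED_VERSION>
```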
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/dataset/emergent_tts/prepare.py`:
- Around line 193-202: The conversion from float samples to int16 can wrap if
samples exceed [-1,1]; update the code that creates audio_array_int16 (working
with variables audio_array, audio_array_int16, and the AudioSegment export to
wav_path) to first clamp/clip audio_array to [-1.0, 1.0] (e.g., via np.clip),
then scale and convert to int16 (preferably round then astype) before building
AudioSegment and exporting; this ensures no int16 overflow/wrap and avoids
audible artifacts.
- Around line 46-61: In _require_deps() replace the broad "except Exception"
with "except ImportError" so only failed imports are caught, and update the
RuntimeError message to remove the hardcoded developer path—use a generic
installation instruction (e.g., "activate your virtualenv and run: pip install
datasets numpy pydub tqdm librosa soundfile") so users get actionable,
non-personalized guidance; keep the original raised-from (from e) behavior when
re-raising the RuntimeError.
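The clamp-then-convert step described in the first prepare.py item above can be sketched like this. It is a minimal illustration with a hypothetical function name; in the real script the float samples come from the HuggingFace dataset before the AudioSegment export.

```python
import numpy as np


def float_to_int16(audio_array: np.ndarray) -> np.ndarray:
    """Clip float samples to [-1, 1] before scaling so out-of-range
    values saturate instead of wrapping around in int16."""
    clipped = np.clip(audio_array, -1.0, 1.0)
    # Round before casting to avoid a systematic truncation bias.
    return np.round(clipped * 32767.0).astype(np.int16)
```

Without the clip, a sample of 1.5 would scale to 49151, overflow int16, and wrap to a large negative value, which is exactly the audible-artifact case the comment warns about.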

In `@nemo_skills/dataset/emergent_tts/README.md`:
- Line 45: The README contains duplicate section numbering: update the three
heading lines so they read "### 3) Clone + patch EmergentTTS-Eval-public for
NVIDIA Inference API judging", change the second "### 3) Run evaluation" to "###
4) Run evaluation", and increment "### 4) Smoke test" to "### 5) Smoke test"
(these exact heading strings identify the locations to edit) so the section
numbers are sequential.
- Around line 20-28: Replace all developer-specific absolute paths in the
README.md code blocks (e.g., `/home/vmendelev/...` and
`/lustre/fsw/llmservice_nemo_speechlm/users/vmendelev/code`) with clear
placeholders like `<REPO_ROOT>` and `<CLUSTER_WORKDIR>` and update the repo
reference `<repo_url>` to the actual EmergentTTS-Eval-public repository URL;
ensure every code block that currently contains hard-coded paths (including the
examples around the prepare.py invocation and the git clone block) uses these
placeholders or environment variables (e.g., export REPO_ROOT="/path/to/repo")
so contributors can substitute their own paths.
- Around line 64-71: The README has a contradiction between the prose and the
example config about judger_base_url: the prose instructs threading
judger_base_url as the base URL (e.g., https://inference-api.nvidia.com/v1)
while the config example sets scoring.judger_base_url to the full chat
completions path; fix by choosing the base-only form and updating the example
and any code that consumes it (e.g., api_clients.py) to append the endpoint path
when constructing requests; ensure references to judger_base_url and
scoring.judger_base_url consistently use the base (no /v1/chat/completions) and
that OpenAI(...) or equivalent client is created with base_url=judger_base_url
and the code appends the proper /v1/chat/completions suffix where needed, and
update the README snippet and scripts/config/default.yaml to match.
- Around line 1-124: The README for emergent_tts is missing example evaluation
results; update nemo_skills/dataset/emergent_tts/README.md to include a short
"Expected results" section showing sample metrics (WER, MOS, win-rate) for at
least one tested model and the exact config used; reference the example config
path scripts/config/default.yaml (and interactive_10.yaml for smoke tests) and
the output filenames (eval-results/.../output.jsonl,
emergent-tts-eval_*_evaluation-metrics.json, metrics.json) so readers can
reproduce the run and compare their numbers.

In `@nemo_skills/dataset/emergent_tts/scripts/check_deps.py`:
- Around line 57-80: The duplicate missing-module entries occur because modules
like "pydub", "librosa", and "soundfile" are checked in both the prepare and
scoring loops and appended to missing twice; modify the checks around args.stage
so each module is only added once by deduplicating before appending (e.g.,
maintain a local seen set or build a single ordered list of modules to check
when args.stage == "all") and continue to use _try_import(module) to produce err
and append to missing only if module not in seen; ensure you reference the
existing variables and functions (_try_import, missing, args.stage) and preserve
existing loop/error handling semantics.

In `@nemo_skills/dataset/emergent_tts/scripts/config/default.yaml`:
- Around line 9-18: default.yaml currently embeds developer-specific absolute
paths for keys like container, output_dir, nemo_code_path, data_dir,
scoring_code_path, and emergent_data_dir and also contains a repeated typo
"containters"; replace those hardcoded paths with clear placeholders (e.g.
<CONTAINER_PATH>, <OUTPUT_DIR>, <NEMO_CODE_PATH>, etc.) or
environment-variable-style tokens and add short inline comments explaining which
must be customized, and fix the spelling of the "container" key (remove
"containters") wherever it appears so consumers get a canonical, editable
default config.

In `@nemo_skills/dataset/emergent_tts/scripts/config/interactive_10.yaml`:
- Around line 3-9: The YAML uses hardcoded user-specific absolute paths and a
typo: change the `container` key value (currently containing "containters") to
`containers` and replace all user-specific absolute paths referenced by keys
`container`, `mount_paths`, `output_dir`, and `nemo_code_path` with reusable
placeholders or environment-interpolated values (e.g., use
`${oc.env:LUSTRE_BASE}` or `${env:LUSTRE_BASE}` depending on your config system)
so other developers can override them at runtime; ensure the updated `container`
key points to the correct container path expression and that `mount_paths`,
`output_dir`, and `nemo_code_path` use the same base variable to construct their
paths.

In
`@nemo_skills/dataset/emergent_tts/scripts/config/local_interactive_10_base.yaml`:
- Around line 11-14: Replace hard-coded developer paths in the YAML (keys
output_dir, nemo_code_path, and data_dir) with portable placeholders or
environment-expanded variables and/or exclude local variants via .gitignore;
specifically, change the values for output_dir, nemo_code_path, and data_dir to
use Hydra/env placeholders like ${oc.env:HOME}/path or clearly-marked tokens
such as <YOUR_WORKSPACE>/... so other contributors don't need to edit the file,
and add a gitignore rule (e.g.,
nemo_skills/dataset/emergent_tts/scripts/config/local_*.yaml) so personal
local_*.yaml files are not committed.

In `@nemo_skills/dataset/emergent_tts/scripts/config/local_interactive_10.yaml`:
- Around line 4-5: Replace the hard-coded developer paths in this config by
converting the absolute values for output_dir, nemo_code_path, and data_dir into
either placeholder variables (e.g., "<PATH_TO_OUTPUT>", "<PATH_TO_NEMO_CODE>",
"<PATH_TO_DATA>") or reference environment variables (e.g., ${ENV_VAR}) and
remove or generalize the usage comment that contains the specific
NEMO_SKILLS_CONFIG_DIR developer path; alternatively, omit this local config
from the repo and add it to .gitignore. Update the entries named output_dir,
nemo_code_path, data_dir and the usage comment so the file no longer contains
absolute developer-specific paths.

In `@nemo_skills/dataset/emergent_tts/scripts/convert_ns_outputs_to_emergent.py`:
- Around line 57-86: The loop currently silently skips records when a
destination file already exists (the check at dst.exists() and not
args.overwrite) without incrementing any counter; add a new counter variable
(e.g., existing_skipped or skipped_existing) initialized alongside
converted/skipped/missing, increment it inside the dst.exists() and not
args.overwrite branch, and include that counter in the final print summary along
with converted, skipped (no unique_id_eval), and missing_audio so the user can
see how many files were skipped because they already existed; modify references
around _link_or_copy, dst, args.overwrite, and out_dir accordingly.
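A counting version of the conversion loop described above could be structured like this. The names mirror those in the comment (converted, skipped, missing_audio, skipped_existing), but the record layout and helper are simplified stand-ins for the real converter.

```python
from pathlib import Path


def summarize_conversion(records: list[dict], out_dir: Path, overwrite: bool) -> dict:
    """Track why each record was or was not converted, including
    destinations skipped because they already exist."""
    converted = skipped = missing_audio = skipped_existing = 0
    for rec in records:
        if "unique_id_eval" not in rec:
            skipped += 1  # no eval ID: cannot place the file
            continue
        src = rec.get("audio_path")
        if not src or not Path(src).exists():
            missing_audio += 1
            continue
        dst = out_dir / f"{rec['unique_id_eval']}.wav"
        if dst.exists() and not overwrite:
            skipped_existing += 1  # previously silent: now counted
            continue
        # _link_or_copy(src, dst) would run here in the real script.
        converted += 1
    return {
        "converted": converted,
        "skipped": skipped,
        "missing_audio": missing_audio,
        "skipped_existing": skipped_existing,
    }
```

Surfacing skipped_existing in the final summary tells the user whether a stale run is shadowing fresh outputs, which the current silent skip hides.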

In `@nemo_skills/dataset/emergent_tts/scripts/run_tts_eval.py`:
- Around line 152-168: The aggregation branch currently runs only when
args.stage == "aggregation", so --stage all skips aggregation; modify the
condition in run_tts_eval.py (the block that builds agg_cmd and calls
ns_run_cmd) to also run when args.stage == "all" (e.g., if args.stage in
("aggregation","all")) or add a clear inline comment near args.stage explaining
that aggregation is intentionally separate and must be invoked with
"aggregation"; update references to agg_cmd and the ns_run_cmd call accordingly
so aggregation runs after scoring when requested.
- Around line 119-131: The command string score_cmd in run_tts_eval.py currently
inlines the secret via `JUDGER_API_KEY={judger_api_key}`; remove that prefix
from the constructed `score_cmd` and instead pass the key via the environment
when invoking `ns_run_cmd` (or ensure the Slurm job inherits the caller env),
e.g., add an env dict containing "JUDGER_API_KEY": judger_api_key to the
ns_run_cmd call or rely on the scoring script reading os.environ; leave the rest
of the command (flags built from scoring.get(...)) unchanged and ensure no other
code interpolates judger_api_key into any logged strings or job script content.

---

Nitpick comments:
In `@nemo_skills/dataset/emergent_tts/prepare.py`:
- Around line 64-89: The retry loop in _download_wv_mos currently catches a
broad Exception which can mask unrelated errors; change the second except clause
to only catch network/filesystem related errors (e.g., except
(urllib.error.URLError, OSError) as e) so the retry logic still handles download
failures for WV_MOS_URL and tmp_path operations but lets other unexpected
exceptions propagate; keep the existing ContentTooShortError handling,
wait_s/backoff logic, tmp_path cleanup, and final RuntimeError if attempts
exhaust.

In `@nemo_skills/dataset/emergent_tts/scripts/check_deps.py`:
- Around line 19-24: The _try_import function currently catches all Exception
and returns None inside the try; change it to only catch ImportError so
unexpected errors propagate, and move the successful-return None into an else
block after the try/except. Specifically, in _try_import use
importlib.import_module(module) in the try, catch only ImportError to return the
formatted error string, and place return None in an else clause so import
failures are handled but other exceptions (e.g., OSError from extensions) are
not swallowed.
- Around line 27-28: The inline comment next to repo_root in _venv_install_hint
is misleading: update the comment so it clearly states that
Path(__file__).resolve().parents[4] returns the repository root (walking up from
the file location), and if you want to show the file location mention
Path(__file__).resolve() separately; e.g., clarify that parents[4] corresponds
to the repo root relative to the script and not the script path itself and
adjust the comment near repo_root accordingly.

In `@nemo_skills/dataset/emergent_tts/scripts/config/default.yaml`:
- Line 48: The installation_command currently installs unpinned packages
(editdistance, whisper-normalizer, json-repair, tenacity, and google-genai)
which makes environments non-reproducible; update the installation_command entry
in default.yaml to pin known-good versions for each of these packages (or add a
comment documenting tested versions), e.g., specify exact version constraints
for editdistance, whisper-normalizer, json-repair, tenacity, and keep the
existing --no-deps flag for google-genai if needed so installations are
reproducible and predictable.

In `@nemo_skills/dataset/emergent_tts/scripts/score.py`:
- Around line 54-73: The current _convert function monkey-patches sys.argv and
calls convert_main(), which is fragile and not thread-safe; refactor
convert_ns_outputs_to_emergent.convert_ns_outputs_to_emergent (the module that
currently exposes main as convert_main) to expose a direct API like
convert(ns_output_jsonl: str | Path, out_dir: str | Path, mode: str = "symlink",
overwrite: bool = False) that implements the core logic, make the existing
main() simply parse args and call that convert(...) helper, and then update
_convert to import and call convert(ns_output_jsonl, out_dir, mode="symlink",
overwrite=overwrite) instead of mutating sys.argv and calling convert_main().
- Around line 176-186: The code writes metrics.json directly which can leave a
corrupt file if json.dump fails; instead, after loading metrics (the variable
metrics) write to a temporary file in the same directory (derived from
bench_dir) and then atomically replace/rename it to bench_dir/"metrics.json"
(use Path.replace or os.replace) so that emergent_metrics_path, metrics,
bench_dir and bench are still used to locate and log the file; ensure the temp
file is created in bench_dir, closed/flushed before replace, and cleaned up on
error.
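A minimal sketch of the write-temp-then-replace pattern (write_metrics_atomically is a hypothetical helper name):

```python
import json
import os
import tempfile
from pathlib import Path


def write_metrics_atomically(bench_dir: Path, metrics: dict) -> Path:
    """Dump metrics to a temp file in bench_dir, then atomically rename it
    to metrics.json, so a failed json.dump never leaves a corrupt file."""
    target = bench_dir / "metrics.json"
    # The temp file must live in bench_dir: rename is only atomic within
    # one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=bench_dir, suffix=".tmp")
    tmp = Path(tmp_name)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(metrics, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # flushed and durable before the rename
        tmp.replace(target)  # atomic replace (os.replace under the hood)
    except BaseException:
        tmp.unlink(missing_ok=True)  # clean up the partial temp file
        raise
    return target
```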
- Around line 93-122: The code sets os.environ["EMERGENT_TTS_DATA_BASE_PATH"]
and chdirs without restoring the environment; modify the routine (the block
around os.environ["EMERGENT_TTS_DATA_BASE_PATH"], prev_cwd, and the finally) to
first capture the original env value (orig_env =
os.environ.get("EMERGENT_TTS_DATA_BASE_PATH")) and prev_cwd, then set the env
and chdir, and in the finally restore both (os.environ[...] = orig_env or del
os.environ[...] if orig_env is None, and os.chdir(prev_cwd)); keep the existing
call to emergent_inference.eval_api_closed_model unchanged.
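The save/set/restore dance can be packaged as a context manager, e.g. (scoped_env_and_cwd is a hypothetical name; the emergent_inference.eval_api_closed_model call would go inside the with block):

```python
import contextlib
import os


@contextlib.contextmanager
def scoped_env_and_cwd(key, value, workdir):
    """Set one environment variable and chdir, then restore both on exit,
    whether the body succeeds or raises."""
    orig_env = os.environ.get(key)  # None if the variable was unset
    prev_cwd = os.getcwd()
    os.environ[key] = value
    os.chdir(workdir)
    try:
        yield
    finally:
        if orig_env is None:
            os.environ.pop(key, None)  # was unset before: unset it again
        else:
            os.environ[key] = orig_env
        os.chdir(prev_cwd)
```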

Comment on lines +46 to +61
def _require_deps():
    try:
        import numpy as np  # noqa: F401
        from datasets import load_dataset  # noqa: F401
        import librosa  # noqa: F401
        import soundfile  # noqa: F401
        from pydub import AudioSegment  # noqa: F401
        from tqdm import tqdm  # noqa: F401
    except Exception as e:  # pragma: no cover
        raise RuntimeError(
            "Missing dependencies for EmergentTTS-Eval preparation.\n\n"
            "Install into the repo venv:\n"
            "  cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval\n"
            "  . ./.venv/bin/activate\n"
            "  pip install datasets numpy pydub tqdm librosa soundfile\n"
        ) from e

⚠️ Potential issue | 🟠 Major

Hardcoded developer path in error message; catch ImportError instead of bare Exception.

Two issues:

  1. Line 58 contains a hardcoded developer-specific path (/home/vmendelev/workspace/...). This is meaningless to other users. Replace with a generic instruction.

  2. The except Exception on line 54 should be except ImportError — that's the only exception expected from failed imports. Catching broader exceptions can mask unrelated bugs (e.g., a library that imports successfully but fails during its own init for a different reason).

Proposed fix
 def _require_deps():
     try:
-        import numpy as np  # noqa: F401
-        from datasets import load_dataset  # noqa: F401
-        import librosa  # noqa: F401
-        import soundfile  # noqa: F401
-        from pydub import AudioSegment  # noqa: F401
-        from tqdm import tqdm  # noqa: F401
-    except Exception as e:  # pragma: no cover
+        import numpy as np
+        from datasets import load_dataset
+        import librosa
+        import soundfile
+        from pydub import AudioSegment
+        from tqdm import tqdm
+    except ImportError as e:  # pragma: no cover
         raise RuntimeError(
             "Missing dependencies for EmergentTTS-Eval preparation.\n\n"
-            "Install into the repo venv:\n"
-            "  cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval\n"
-            "  . ./.venv/bin/activate\n"
-            "  pip install datasets numpy pydub tqdm librosa soundfile\n"
+            "Install the required packages:\n"
+            "  pip install datasets numpy pydub tqdm librosa soundfile\n"
         ) from e

As per coding guidelines, "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving".

🧰 Tools
🪛 Ruff (0.15.1)

[warning] 48-53: Unused noqa directive (non-enabled: F401); remove unused noqa directive (RUF100), reported once per line.

[warning] 55-61: Avoid specifying long messages outside the exception class (TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/prepare.py` around lines 46 - 61, In
_require_deps() replace the broad "except Exception" with "except ImportError"
so only failed imports are caught, and update the RuntimeError message to remove
the hardcoded developer path—use a generic installation instruction (e.g.,
"activate your virtualenv and run: pip install datasets numpy pydub tqdm librosa
soundfile") so users get actionable, non-personalized guidance; keep the
original raised-from (from e) behavior when re-raising the RuntimeError.

Comment on lines +193 to +202
audio_array = curr["audio"]["array"]
audio_sr = int(curr["audio"]["sampling_rate"])
audio_array_int16 = np.int16(audio_array * 32767)
audio_segment = AudioSegment(
    audio_array_int16.tobytes(),
    frame_rate=audio_sr,
    sample_width=2,
    channels=1,
)
audio_segment.export(str(wav_path), format="wav")

⚠️ Potential issue | 🟡 Minor

Potential int16 overflow/wrap when audio samples exceed [-1, 1].

np.int16(audio_array * 32767) will silently wrap around if any sample exceeds the [-1.0, 1.0] range (e.g., a value of 1.001 * 32767 = 32799 wraps to -32737 as int16). This produces audible clicks/artifacts. Clip before converting.

Proposed fix
-                audio_array_int16 = np.int16(audio_array * 32767)
+                audio_array_int16 = np.int16(np.clip(audio_array, -1.0, 1.0) * 32767)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/prepare.py` around lines 193 - 202, The
conversion from float samples to int16 can wrap if samples exceed [-1,1]; update
the code that creates audio_array_int16 (working with variables audio_array,
audio_array_int16, and the AudioSegment export to wav_path) to first clamp/clip
audio_array to [-1.0, 1.0] (e.g., via np.clip), then scale and convert to int16
(preferably round then astype) before building AudioSegment and exporting; this
ensures no int16 overflow/wrap and avoids audible artifacts.
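A small numpy illustration of the clip-before-cast fix (the sample values are made up to show the out-of-range case):

```python
import numpy as np

# Samples slightly outside [-1, 1] (e.g. 1.001 from a resampler or gain
# stage) would wrap around when cast straight to int16; clipping first
# pins them to full scale instead.
audio = np.array([1.001, -1.5, 0.5, -0.25])
safe = np.int16(np.clip(audio, -1.0, 1.0) * 32767)
```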

Comment on lines +1 to +124
## EmergentTTS-Eval dataset (`emergent_tts`)

This dataset integration lets you:

- **Prepare** the EmergentTTS-Eval test set under a shared `data_dir` (download baseline audios + metadata + MOS model).
- **Generate** TTS outputs with NeMo-Skills (`ns eval` via `run_tts_eval.py`).
- **Score** the generated outputs with EmergentTTS-Eval (WER/MOS/win-rate, depending on config).

### 1) Prepare the test set (requires `HF_TOKEN`)

`prepare.py` downloads the dataset and writes all required artifacts into:

- `<DATA_DIR>/emergent_tts/emergent/test.jsonl`
- `<DATA_DIR>/emergent_tts/data/emergent_tts_eval_data.jsonl`
- `<DATA_DIR>/emergent_tts/data/baseline_audios/*.wav`
- `<DATA_DIR>/emergent_tts/data/wv_mos.ckpt`

Run it from your dev machine (or any environment with network access):

```bash
cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval
. ./.venv/bin/activate

export HF_TOKEN="<your_hf_token>"

python nemo_skills/dataset/emergent_tts/prepare.py \
    --output_dir "<DATA_DIR>/emergent_tts"
```

Optional flags:

- `--num_samples 10`: write only the first 10 samples (smoke test).
- `--overwrite`: re-download / regenerate outputs.

### 2) Configure evaluation

Use the example configs in `nemo_skills/dataset/emergent_tts/scripts/config/`.

In `scripts/config/default.yaml`, set:

- `generation.data_dir: <DATA_DIR>`
- `scoring.emergent_data_dir: <DATA_DIR>/emergent_tts/data`
- `scoring.scoring_code_path: <PATH_TO>/EmergentTTS-Eval-public` (on the cluster)

### 3) Clone + patch EmergentTTS-Eval-public for NVIDIA Inference API judging

On EOS (or wherever you run scoring), clone EmergentTTS-Eval:

```bash
cd /lustre/fsw/llmservice_nemo_speechlm/users/vmendelev/code
git clone <repo_url> EmergentTTS-Eval-public
```

Then update Emergent’s judge client selection so that **Gemini models are called via NVIDIA’s OpenAI-compatible Inference API**.

Target behavior:

- **Model name** stays as: `gcp/google/gemini-2.5-pro` (or similar).
- **Base URL** is NVIDIA Inference API: `https://inference-api.nvidia.com/v1`
- **API key** comes from: `JUDGER_API_KEY` (or `NVIDIA_API_KEY`)

Minimal patch checklist inside `EmergentTTS-Eval-public`:

- In `api_clients.py` (or wherever the client is chosen), ensure `gcp/google/*` uses an **OpenAI-compatible** client (not the Google SDK client), e.g.:
- `OpenAI(base_url=<judger_base_url>, api_key=os.getenv("JUDGER_API_KEY"))`
- Thread `judger_base_url` through so calls use `https://inference-api.nvidia.com/v1` (not the full `/v1/chat/completions` endpoint).

After patching, set these in `scripts/config/default.yaml`:

- `scoring.judge_model: gcp/google/gemini-2.5-pro`
- `scoring.judger_base_url: https://inference-api.nvidia.com/v1/chat/completions`

### 3) Run evaluation (generation + scoring)

From your dev machine, submit jobs to EOS:

```bash
cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval
. ./.venv/bin/activate
mkdir -p .nemo_run

export NEMORUN_HOME="$PWD/.nemo_run"
export NEMO_SKILLS_CONFIG_DIR=/home/vmendelev/workspace/expressiveness/src/ns_eval/cluster_configs
export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1

# Required for win-rate judging (NVIDIA Inference API key)
export JUDGER_API_KEY="<your_nvidia_api_key>"

python -m nemo_skills.dataset.emergent_tts.scripts.run_tts_eval \
    --config nemo_skills/dataset/emergent_tts/scripts/config/default.yaml \
    --stage all \
    --expname emergent_eval
```

### 4) Smoke test (10 samples, interactive)

```bash
cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval
. ./.venv/bin/activate
mkdir -p .nemo_run

export NEMORUN_HOME="$PWD/.nemo_run"
export NEMO_SKILLS_CONFIG_DIR=/home/vmendelev/workspace/expressiveness/src/ns_eval/cluster_configs
export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1

python -m nemo_skills.dataset.emergent_tts.scripts.run_tts_eval \
    --config nemo_skills/dataset/emergent_tts/scripts/config/interactive_10.yaml \
    --stage generation \
    --expname emergent_smoke10
```

### Outputs

NeMo-Skills generation writes:

- `<output_dir>/eval-results/emergent_tts.emergent/output.jsonl`
- `<output_dir>/eval-results/emergent_tts.emergent/audio/*.wav` (or equivalent)

Emergent scoring writes (in the same benchmark folder):

- `emergent-tts-eval_*_evaluation-predictions.jsonl`
- `emergent-tts-eval_*_evaluation-metrics.json`
- `metrics.json` (a NeMo-Skills-friendly copy of Emergent metrics)


⚠️ Potential issue | 🟡 Minor

Add expected evaluation results for at least one tested model.

The README documents the workflow thoroughly but does not include any sample metric output (e.g. WER, MOS, win-rate) for a reference model run. Based on learnings from CONTRIBUTING.md: "When adding new benchmarks, add documentation with example commands for how to run evaluation, expected results for tested models, and any dataset-specific details like special preparation arguments or non-standard inference arguments."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/README.md` around lines 1 - 124, The README
for emergent_tts is missing example evaluation results; update
nemo_skills/dataset/emergent_tts/README.md to include a short "Expected results"
section showing sample metrics (WER, MOS, win-rate) for at least one tested
model and the exact config used; reference the example config path
scripts/config/default.yaml (and interactive_10.yaml for smoke tests) and the
output filenames (eval-results/.../output.jsonl,
emergent-tts-eval_*_evaluation-metrics.json, metrics.json) so readers can
reproduce the run and compare their numbers.

Comment on lines +20 to +28
```bash
cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval
. ./.venv/bin/activate

export HF_TOKEN="<your_hf_token>"

python nemo_skills/dataset/emergent_tts/prepare.py \
    --output_dir "<DATA_DIR>/emergent_tts"
```

⚠️ Potential issue | 🟠 Major

Replace developer-specific absolute paths with generic placeholders.

The README hard-codes /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval (lines 21, 78, 98, and elsewhere) and /lustre/fsw/llmservice_nemo_speechlm/users/vmendelev/code (line 50) throughout every code block. These paths are specific to one developer's environment and will not work for any other contributor.

Replace them with environment variables or clearly marked placeholders, e.g. <REPO_ROOT>, <CLUSTER_WORKDIR>. The <repo_url> on line 51 also needs to be filled in with the actual EmergentTTS-Eval-public repository URL.

Also applies to: 49-52, 77-110

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/README.md` around lines 20 - 28, Replace all
developer-specific absolute paths in the README.md code blocks (e.g.,
`/home/vmendelev/...` and
`/lustre/fsw/llmservice_nemo_speechlm/users/vmendelev/code`) with clear
placeholders like `<REPO_ROOT>` and `<CLUSTER_WORKDIR>` and update the repo
reference `<repo_url>` to the actual EmergentTTS-Eval-public repository URL;
ensure every code block that currently contains hard-coded paths (including the
examples around the prepare.py invocation and the git clone block) uses these
placeholders or environment variables (e.g., export REPO_ROOT="/path/to/repo")
so contributors can substitute their own paths.

- `scoring.emergent_data_dir: <DATA_DIR>/emergent_tts/data`
- `scoring.scoring_code_path: <PATH_TO>/EmergentTTS-Eval-public` (on the cluster)

### 3) Clone + patch EmergentTTS-Eval-public for NVIDIA Inference API judging

⚠️ Potential issue | 🟡 Minor

Fix duplicate section number — two sections are labeled "3)".

Line 45 is ### 3) Clone + patch EmergentTTS-Eval-public and line 73 is ### 3) Run evaluation. The subsequent section (### 4) Smoke test) is also off by one. The correct numbering should be: 3 → Clone & patch, 4 → Run evaluation, 5 → Smoke test.

Also applies to: 73-73, 95-95

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/README.md` at line 45, The README contains
duplicate section numbering: update the three heading lines so they read "### 3)
Clone + patch EmergentTTS-Eval-public for NVIDIA Inference API judging", change
the second "### 3) Run evaluation" to "### 4) Run evaluation", and increment
"### 4) Smoke test" to "### 5) Smoke test" (these exact heading strings identify
the locations to edit) so the section numbers are sequential.

Comment on lines +11 to +14
output_dir: /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval/_local_runs/emergent_tts_smoke10_base

nemo_code_path: /home/vmendelev/workspace/expressiveness/src/NeMo


⚠️ Potential issue | 🟠 Major

Hard-coded developer paths make this config non-portable.

output_dir, nemo_code_path, and data_dir (lines 11, 13-14, 22-25) all reference /home/vmendelev/.... Any other contributor must manually edit these before using this file.

Either replace them with clearly-marked placeholders (e.g. ${oc.env:HOME}/... for Hydra, or <YOUR_WORKSPACE>/... as a comment hint) or add a .gitignore pattern such as nemo_skills/dataset/emergent_tts/scripts/config/local_*.yaml to keep local/personal variants out of the repository entirely, consistent with the existing cluster_configs/* pattern in .gitignore.

Also applies to: 22-25

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@nemo_skills/dataset/emergent_tts/scripts/config/local_interactive_10_base.yaml`
around lines 11 - 14, Replace hard-coded developer paths in the YAML (keys
output_dir, nemo_code_path, and data_dir) with portable placeholders or
environment-expanded variables and/or exclude local variants via .gitignore;
specifically, change the values for output_dir, nemo_code_path, and data_dir to
use Hydra/env placeholders like ${oc.env:HOME}/path or clearly-marked tokens
such as <YOUR_WORKSPACE>/... so other contributors don't need to edit the file,
and add a gitignore rule (e.g.,
nemo_skills/dataset/emergent_tts/scripts/config/local_*.yaml) so personal
local_*.yaml files are not committed.

Comment on lines +4 to +5
# export NEMO_SKILLS_CONFIG_DIR=/home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval/cluster_configs
# python -m nemo_skills.dataset.emergent_tts.scripts.run_tts_eval --config <this_file> --stage generation

⚠️ Potential issue | 🟠 Major

Same hard-coded developer paths as local_interactive_10_base.yaml.

output_dir, nemo_code_path, data_dir (lines 10, 14, 25), and even the usage comment on line 4 (NEMO_SKILLS_CONFIG_DIR=/home/vmendelev/...) are all absolute paths tied to a specific developer environment. Apply the same fix suggested for local_interactive_10_base.yaml: use placeholder values or exclude these local configs via .gitignore.

Also applies to: 10-10, 14-14, 25-25

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/config/local_interactive_10.yaml`
around lines 4 - 5, Replace the hard-coded developer paths in this config by
converting the absolute values for output_dir, nemo_code_path, and data_dir into
either placeholder variables (e.g., "<PATH_TO_OUTPUT>", "<PATH_TO_NEMO_CODE>",
"<PATH_TO_DATA>") or reference environment variables (e.g., ${ENV_VAR}) and
remove or generalize the usage comment that contains the specific
NEMO_SKILLS_CONFIG_DIR developer path; alternatively, omit this local config
from the repo and add it to .gitignore. Update the entries named output_dir,
nemo_code_path, data_dir and the usage comment so the file no longer contains
absolute developer-specific paths.

Comment on lines +57 to +86
converted = 0
skipped = 0
missing = 0

with open(args.ns_output_jsonl, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        user_json = _extract_user_json(record) or {}
        unique_id = user_json.get("unique_id_eval", record.get("unique_id_eval"))
        audio_path = (record.get("audio") or {}).get("path")

        if unique_id is None:
            skipped += 1
            continue
        if not audio_path or not os.path.exists(audio_path):
            missing += 1
            continue

        dst = out_dir / f"{unique_id}.wav"
        if dst.exists() and not args.overwrite:
            continue
        _link_or_copy(audio_path, str(dst), args.mode)
        converted += 1

print(
    f"Converted {converted} files into {out_dir}. "
    f"skipped(no unique_id_eval)={skipped}, missing_audio={missing}"

⚠️ Potential issue | 🟡 Minor

Silently skipped "already exists" records are not counted in the summary.

When dst.exists() and not args.overwrite (Line 79), the record is skipped without incrementing any counter. The final summary reports converted, skipped(no unique_id_eval), and missing_audio, but "already-exists" records are invisible. This makes it hard to tell whether a partial run actually processed everything.

Proposed fix
     converted = 0
     skipped = 0
     missing = 0
+    already_exists = 0

     with open(args.ns_output_jsonl, "r", encoding="utf-8") as f:
         for line in f:
@@ -79,6 +80,7 @@
             dst = out_dir / f"{unique_id}.wav"
             if dst.exists() and not args.overwrite:
+                already_exists += 1
                 continue
             _link_or_copy(audio_path, str(dst), args.mode)
             converted += 1

     print(
         f"Converted {converted} files into {out_dir}. "
-        f"skipped(no unique_id_eval)={skipped}, missing_audio={missing}"
+        f"skipped(no unique_id_eval)={skipped}, missing_audio={missing}, already_exists={already_exists}"
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/convert_ns_outputs_to_emergent.py`
around lines 57 - 86, The loop currently silently skips records when a
destination file already exists (the check at dst.exists() and not
args.overwrite) without incrementing any counter; add a new counter variable
(e.g., existing_skipped or skipped_existing) initialized alongside
converted/skipped/missing, increment it inside the dst.exists() and not
args.overwrite branch, and include that counter in the final print summary along
with converted, skipped (no unique_id_eval), and missing_audio so the user can
see how many files were skipped because they already existed; modify references
around _link_or_copy, dst, args.overwrite, and out_dir accordingly.

Comment on lines +119 to +131
score_cmd = (
    (f"cd {emergent_data_base_dir} && " if emergent_data_base_dir else "")
    + f"JUDGER_API_KEY={judger_api_key} "
    + f"PYTHONPATH={scoring_code_path}:$PYTHONPATH "
    + "python -m nemo_skills.dataset.emergent_tts.scripts.score "
    + f"--results_dir {output_dir} "
    + f"--benchmark {benchmark} "
    + f"--emergent_data_dir {emergent_data_dir} "
    + f"--judge_model {scoring.get('judge_model', 'gcp/google/gemini-2.5-pro')} "
    + f"--judger_base_url {scoring.get('judger_base_url', 'https://inference-api.nvidia.com/v1/chat/completions')} "
    + f"--num_threads {int(scoring.get('num_threads', 8))} "
    + f"--evaluate_function {scoring.get('evaluate_function', 'win_rate')}"
)

⚠️ Potential issue | 🟠 Major

API key embedded in plain text in the shell command string.

Line 121 interpolates judger_api_key directly into the command string: JUDGER_API_KEY={judger_api_key}. This command is passed to ns_run_cmd, which likely persists it in Slurm job scripts and log files, leaking the secret. Prefer passing the key via the environment rather than inlining it in the command.

Possible approach

If ns_run_cmd supports environment variable propagation, pass JUDGER_API_KEY as an env var. Alternatively, have the scoring script read the key from the environment itself (it already can via os.environ), and ensure the Slurm job inherits the caller's environment. Remove the inline JUDGER_API_KEY=... prefix from score_cmd.

             score_cmd = (
                 (f"cd {emergent_data_base_dir} && " if emergent_data_base_dir else "")
-                + f"JUDGER_API_KEY={judger_api_key} "
                 + f"PYTHONPATH={scoring_code_path}:$PYTHONPATH "
                 + "python -m nemo_skills.dataset.emergent_tts.scripts.score "

Then in score.py, read JUDGER_API_KEY from the environment at scoring time (which it already does indirectly via the Emergent client).
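A generic sketch of the pattern (run_scoring and the subprocess call are illustrative only; ns_run_cmd's real environment-propagation mechanism may differ):

```python
import os
import subprocess


def run_scoring(score_cmd: str, judger_api_key: str):
    """Hand the key to the child through its environment, so it never
    appears in the command string, job scripts, or logs."""
    env = dict(os.environ)
    env["JUDGER_API_KEY"] = judger_api_key  # not interpolated into score_cmd
    return subprocess.run(score_cmd, shell=True, env=env, check=True)
```

The command string itself then contains no secret and can be logged freely.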

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/run_tts_eval.py` around lines 119 -
131, The command string score_cmd in run_tts_eval.py currently inlines the
secret via `JUDGER_API_KEY={judger_api_key}`; remove that prefix from the
constructed `score_cmd` and instead pass the key via the environment when
invoking `ns_run_cmd` (or ensure the Slurm job inherits the caller env), e.g.,
add an env dict containing "JUDGER_API_KEY": judger_api_key to the ns_run_cmd
call or rely on the scoring script reading os.environ; leave the rest of the
command (flags built from scoring.get(...)) unchanged and ensure no other code
interpolates judger_api_key into any logged strings or job script content.

Comment on lines +152 to +168
if args.stage == "aggregation":
    print("\n" + "=" * 60)
    print("Stage 3: AGGREGATION")
    print("=" * 60)
    agg_cmd = f"python -m nemo_skills.dataset.emergent_tts.scripts.score --results_dir {output_dir} --aggregation_only"
    ns_run_cmd(
        ctx=MockContext(),
        cluster=cfg["cluster"],
        container=cfg["container"],
        partition=cfg["partition"],
        num_gpus=0,
        mount_paths=cfg["mount_paths"],
        command=agg_cmd,
        reuse_code=False,
        expname=f"{args.expname}_agg",
        log_dir=f"{output_dir}/eval-logs",
    )

⚠️ Potential issue | 🟡 Minor

--stage all does not include aggregation.

Line 152 uses == (if args.stage == "aggregation"), so running --stage all executes generation and scoring but skips aggregation. The docstring mentions "Generation -> Scoring (-> Aggregation)" which hints this is by design (aggregation requires scoring to complete first and may be a separate step), but this will surprise users who expect "all" to mean all stages. Consider adding a clarifying comment or including aggregation in "all" with a run_after dependency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/emergent_tts/scripts/run_tts_eval.py` around lines 152 -
168, The aggregation branch currently runs only when args.stage ==
"aggregation", so --stage all skips aggregation; modify the condition in
run_tts_eval.py (the block that builds agg_cmd and calls ns_run_cmd) to also run
when args.stage == "all" (e.g., if args.stage in ("aggregation","all")) or add a
clear inline comment near args.stage explaining that aggregation is
intentionally separate and must be invoked with "aggregation"; update references
to agg_cmd and the ns_run_cmd call accordingly so aggregation runs after scoring
when requested.
