Add emergent_tts dataset and evaluation scripts #1249
Conversation
Introduce the emergent_tts dataset package with prepare/generate/score helpers and default configs to run EmergentTTS evaluation via NeMo-Skills. Co-authored-by: Cursor <cursoragent@cursor.com>
Install google-genai for EmergentTTS-Eval, run scoring from the dataset base dir so relative paths resolve, and avoid shipping large local caches/data. Document EmergentTTS-Eval usage in nv_tts guide. Co-authored-by: Cursor <cursoragent@cursor.com>
Document dataset preparation (HF_TOKEN) and evaluation workflow, including cloning and patching EmergentTTS-Eval for NVIDIA Inference API judging. Co-authored-by: Cursor <cursoragent@cursor.com>
📝 Walkthrough

This PR adds EmergentTTS-Eval dataset integration to NeMo-Skills, including data preparation, dependency checking, configuration management, audio conversion, and pipeline orchestration for end-to-end TTS evaluation workflows.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Orchestrator as Orchestrator<br/>(run_tts_eval.py)
    participant Prepare as Prepare<br/>(prepare.py)
    participant TTSServer as TTS Server<br/>(NeMo-Skills)
    participant Scorer as Scorer<br/>(score.py)
    participant Emergent as EmergentTTS-Eval
    User->>Orchestrator: run with config
    Note over Orchestrator: Load config & setup
    Orchestrator->>Prepare: Stage 1: run_generation()
    Prepare->>Prepare: Load HF dataset
    Prepare->>Prepare: Export baseline audio
    Prepare->>Prepare: Write JSONL & download checkpoint
    Prepare-->>Orchestrator: Ready for generation
    Orchestrator->>TTSServer: Stage 2: Submit TTS inference
    Note over TTSServer: Generate outputs via<br/>NeMo-Skills
    TTSServer-->>Orchestrator: outputs.jsonl
    Orchestrator->>Scorer: Stage 2: run_scoring()
    Scorer->>Scorer: Convert NS outputs to Emergent
    Scorer->>Emergent: Run evaluation
    Note over Emergent: Score WER, MOS, Win-rate
    Emergent-->>Scorer: metrics.json
    Scorer-->>Orchestrator: Scoring complete
    Orchestrator->>Scorer: Stage 3: run_aggregation()
    Scorer->>Scorer: Aggregate metrics
    Scorer-->>Orchestrator: Summary results
    Orchestrator-->>User: Done!
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 14
🧹 Nitpick comments (7)
nemo_skills/dataset/emergent_tts/scripts/check_deps.py (2)
19-24: Narrow the exception to `ImportError`; move `return None` to `else`.

`except Exception` is flagged by the linter (BLE001) and violates the project guideline: "Do not catch exceptions when they are not normally expected to be raised." Python's import system documentation states: "If the loader cannot execute the module, it should raise an `ImportError`, although any other exception raised during `exec_module()` will be propagated." Catching all `Exception` silently swallows unexpected errors (e.g. a broken `torch` CUDA extension raising `OSError`) that should bubble up. The static analysis also flags TRY300 for the `return None` placement.

♻️ Proposed fix

```diff
 def _try_import(module: str) -> str | None:
     try:
         importlib.import_module(module)
-        return None
-    except Exception as e:
+    except ImportError as e:
         return f"{module} ({type(e).__name__}: {e})"
+    else:
+        return None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/emergent_tts/scripts/check_deps.py` around lines 19 - 24, The _try_import function currently catches all Exception and returns None inside the try; change it to only catch ImportError so unexpected errors propagate, and move the successful-return None into an else block after the try/except. Specifically, in _try_import use importlib.import_module(module) in the try, catch only ImportError to return the formatted error string, and place return None in an else clause so import failures are handled but other exceptions (e.g., OSError from extensions) are not swallowed.
27-28: Clarify the comment on `parents[4]`: it describes the file location, not the variable.

The inline comment `# .../nemo_skills/dataset/emergent_tts/scripts` reads as though it describes the resolved value of `repo_root`, but `parents[4]` from `check_deps.py` walks up from scripts through emergent_tts, dataset, and nemo_skills to the repo root. The comment actually describes the file's own location.

🔧 Suggested clarification

```diff
-    repo_root = Path(__file__).resolve().parents[4]  # .../nemo_skills/dataset/emergent_tts/scripts
+    # __file__ is at nemo_skills/dataset/emergent_tts/scripts/check_deps.py → parents[4] is the repo root
+    repo_root = Path(__file__).resolve().parents[4]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/emergent_tts/scripts/check_deps.py` around lines 27 - 28, The inline comment next to repo_root in _venv_install_hint is misleading: update the comment so it clearly states that Path(__file__).resolve().parents[4] returns the repository root (walking up from the file location), and if you want to show the file location mention Path(__file__).resolve() separately; e.g., clarify that parents[4] corresponds to the repo root relative to the script and not the script path itself and adjust the comment near repo_root accordingly.

nemo_skills/dataset/emergent_tts/scripts/score.py (3)
54-73: `sys.argv` manipulation to call `convert_main` is fragile.

Monkey-patching `sys.argv` to invoke `convert_main()` as a library call is error-prone (not thread-safe, breaks if `convert_main` is changed to use a global parser, etc.). Consider refactoring the converter to expose a function that accepts parameters directly, then call it here.

Sketch: refactor converter to accept params

In `convert_ns_outputs_to_emergent.py`, extract the core logic into a callable function:

```python
def convert(ns_output_jsonl: str, out_dir: str, mode: str = "symlink", overwrite: bool = False):
    # ... current main() logic with these params instead of argparse ...
```

Then in `score.py`:

```diff
 def _convert(ns_output_jsonl: Path, out_dir: Path, overwrite: bool) -> None:
-    from nemo_skills.dataset.emergent_tts.scripts.convert_ns_outputs_to_emergent import main as convert_main
-
-    # Reuse converter as a library via argv.
-    import sys
-
-    argv = sys.argv
-    try:
-        sys.argv = [
-            argv[0],
-            "--ns_output_jsonl",
-            str(ns_output_jsonl),
-            "--out_dir",
-            str(out_dir),
-            "--mode",
-            "symlink",
-        ] + (["--overwrite"] if overwrite else [])
-        convert_main()
-    finally:
-        sys.argv = argv
+    from nemo_skills.dataset.emergent_tts.scripts.convert_ns_outputs_to_emergent import convert
+    convert(str(ns_output_jsonl), str(out_dir), mode="symlink", overwrite=overwrite)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/emergent_tts/scripts/score.py` around lines 54 - 73, The current _convert function monkey-patches sys.argv and calls convert_main(), which is fragile and not thread-safe; refactor convert_ns_outputs_to_emergent.convert_ns_outputs_to_emergent (the module that currently exposes main as convert_main) to expose a direct API like convert(ns_output_jsonl: str | Path, out_dir: str | Path, mode: str = "symlink", overwrite: bool = False) that implements the core logic, make the existing main() simply parse args and call that convert(...) helper, and then update _convert to import and call convert(ns_output_jsonl, out_dir, mode="symlink", overwrite=overwrite) instead of mutating sys.argv and calling convert_main().
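A minimal sketch of the thin-CLI pattern this refactor produces. The `convert` body here is a placeholder that only echoes its parameters so the wiring is testable; the real logic would move in from `main()`:

```python
import argparse


def convert(ns_output_jsonl, out_dir, mode="symlink", overwrite=False):
    # Placeholder for the core conversion logic: echo the parameters so the
    # CLI-to-library delegation can be verified without touching the filesystem.
    return {
        "ns_output_jsonl": ns_output_jsonl,
        "out_dir": out_dir,
        "mode": mode,
        "overwrite": overwrite,
    }


def main(argv=None):
    # Thin CLI wrapper: parse args, then delegate to the library function.
    # Accepting argv makes the wrapper testable without sys.argv patching.
    parser = argparse.ArgumentParser()
    parser.add_argument("--ns_output_jsonl", required=True)
    parser.add_argument("--out_dir", required=True)
    parser.add_argument("--mode", default="symlink")
    parser.add_argument("--overwrite", action="store_true")
    args = parser.parse_args(argv)
    return convert(args.ns_output_jsonl, args.out_dir, mode=args.mode, overwrite=args.overwrite)
```

Callers like `score.py` then import and call `convert(...)` directly, while the CLI entry point keeps working unchanged.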
176-186: Consider an atomic write for `metrics.json`.

If `json.dump` fails mid-write (e.g., disk full), `metrics.json` will be left in a corrupted state. Since metrics are fully loaded into memory first (line 181), writing to a temp file and renaming would be safer. This is a minor concern given it's an output file, not input data.

As per coding guidelines, "perform all computations before re-opening files for writing to avoid accidental data loss if code fails during execution".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/emergent_tts/scripts/score.py` around lines 176 - 186, The code writes metrics.json directly which can leave a corrupt file if json.dump fails; instead, after loading metrics (the variable metrics) write to a temporary file in the same directory (derived from bench_dir) and then atomically replace/rename it to bench_dir/"metrics.json" (use Path.replace or os.replace) so that emergent_metrics_path, metrics, bench_dir and bench are still used to locate and log the file; ensure the temp file is created in bench_dir, closed/flushed before replace, and cleaned up on error.
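A sketch of the temp-file-plus-rename pattern suggested above (`write_metrics_atomically` is a hypothetical helper name, not part of the PR):

```python
import json
import os
import tempfile
from pathlib import Path


def write_metrics_atomically(bench_dir: Path, metrics: dict) -> Path:
    # Dump to a temp file in the same directory, then atomically swap it in,
    # so a failure mid-dump never leaves a truncated metrics.json behind.
    target = bench_dir / "metrics.json"
    fd, tmp_name = tempfile.mkstemp(dir=str(bench_dir), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(metrics, f, indent=2)
        os.replace(tmp_name, target)  # atomic on POSIX within one filesystem
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)
        raise
    return target
```

Creating the temp file in `bench_dir` (not the system temp dir) keeps the rename on the same filesystem, which is what makes `os.replace` atomic.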
93-122: Global state mutation (`os.environ`, `os.chdir`) in `_run_emergent_scoring`.

Setting `os.environ["EMERGENT_TTS_DATA_BASE_PATH"]` (line 94) and changing the working directory (line 102) are global side effects. The `chdir` is properly restored in a `finally` block, but the environment variable is never cleaned up. If `run_scoring` is called multiple times with different `emergent_data_base_path` values, the env var will be stale between calls. Consider restoring the original env var value in the `finally` block as well. Note that the env var is set on line 94 before `prev_cwd` is captured, so the save/restore must wrap both.

Proposed fix

```diff
+    prev_env = os.environ.get("EMERGENT_TTS_DATA_BASE_PATH")
     os.environ["EMERGENT_TTS_DATA_BASE_PATH"] = str(emergent_data_base_path)
     prev_cwd = os.getcwd()
     try:
         os.chdir(str(emergent_data_base_path.parent))
         emergent_inference.eval_api_closed_model(
             ...
         )
     finally:
         os.chdir(prev_cwd)
+        if prev_env is None:
+            os.environ.pop("EMERGENT_TTS_DATA_BASE_PATH", None)
+        else:
+            os.environ["EMERGENT_TTS_DATA_BASE_PATH"] = prev_env
```
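The same save/restore pair can be packaged as a context manager so it cannot be forgotten at other call sites (a sketch; `scoped_env_and_cwd` is a hypothetical name):

```python
import os
from contextlib import contextmanager


@contextmanager
def scoped_env_and_cwd(name, value, workdir):
    # Capture the prior env value and cwd first, set both, and restore both
    # in finally so repeated scoring calls never see stale state.
    prev_env = os.environ.get(name)
    prev_cwd = os.getcwd()
    os.environ[name] = value
    os.chdir(workdir)
    try:
        yield
    finally:
        os.chdir(prev_cwd)
        if prev_env is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = prev_env
```

The scoring call then becomes `with scoped_env_and_cwd("EMERGENT_TTS_DATA_BASE_PATH", str(emergent_data_base_path), str(emergent_data_base_path.parent)): ...`, and both globals are restored even if evaluation raises.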
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/emergent_tts/scripts/score.py` around lines 93 - 122, The code sets os.environ["EMERGENT_TTS_DATA_BASE_PATH"] and chdirs without restoring the environment; modify the routine (the block around os.environ["EMERGENT_TTS_DATA_BASE_PATH"], prev_cwd, and the finally) to first capture the original env value (orig_env = os.environ.get("EMERGENT_TTS_DATA_BASE_PATH")) and prev_cwd, then set the env and chdir, and in the finally restore both (os.environ[...] = orig_env or del os.environ[...] if orig_env is None, and os.chdir(prev_cwd)); keep the existing call to emergent_inference.eval_api_closed_model unchanged.

nemo_skills/dataset/emergent_tts/prepare.py (1)
64-89: Download retry catches bare `Exception`.

Line 84 catches `Exception` broadly. For network downloads, the expected failures are `urllib.error.URLError`, `OSError`, and `ContentTooShortError` (already handled above). Narrowing the catch to `(URLError, OSError)` would prevent masking unexpected bugs (e.g., `KeyboardInterrupt` is not caught by `Exception`, but `MemoryError` could be).

Proposed fix

```diff
+from urllib.error import ContentTooShortError, URLError
 ...
         except ContentTooShortError as e:
             # Partial download: wait and retry.
             wait_s = min(5 * attempt, 30)
             print(f"Warning: partial download for wv_mos.ckpt (attempt {attempt}/{max_attempts}): {e}")
             time.sleep(wait_s)
-        except Exception as e:
+        except (URLError, OSError) as e:
             wait_s = min(5 * attempt, 30)
             print(f"Warning: failed downloading wv_mos.ckpt (attempt {attempt}/{max_attempts}): {e}")
             time.sleep(wait_s)
```

🤖 Prompt for AI Agents
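The narrowed retry loop can be sketched generically. `download_with_retry` and its `wait` hook are hypothetical names; the `wait` parameter exists only to make the backoff testable:

```python
import time
from urllib.error import URLError


def download_with_retry(fetch, max_attempts=3, wait=lambda attempt: min(5 * attempt, 30)):
    # Retry only expected network/filesystem failures (URLError is itself an
    # OSError subclass); anything else propagates immediately with a clear error.
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except (URLError, OSError) as e:
            if attempt == max_attempts:
                raise RuntimeError(f"download failed after {max_attempts} attempts") from e
            time.sleep(wait(attempt))
```

A `ValueError` or other programming bug inside `fetch` escapes on the first attempt instead of being silently retried three times.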
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/emergent_tts/prepare.py` around lines 64 - 89, The retry loop in _download_wv_mos currently catches a broad Exception which can mask unrelated errors; change the second except clause to only catch network/filesystem related errors (e.g., except (urllib.error.URLError, OSError) as e) so the retry logic still handles download failures for WV_MOS_URL and tmp_path operations but lets other unexpected exceptions propagate; keep the existing ContentTooShortError handling, wait_s/backoff logic, tmp_path cleanup, and final RuntimeError if attempts exhaust.

nemo_skills/dataset/emergent_tts/scripts/config/default.yaml (1)
48-48: `pip install` without version pins in `installation_command`.

`pip install editdistance whisper-normalizer json-repair tenacity` installs unpinned latest versions. This can cause non-reproducible environments. Consider pinning versions (e.g., `editdistance==0.8.1`) or at least documenting the known-working versions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/emergent_tts/scripts/config/default.yaml` at line 48, The installation_command currently installs unpinned packages (editdistance, whisper-normalizer, json-repair, tenacity, and google-genai) which makes environments non-reproducible; update the installation_command entry in default.yaml to pin known-good versions for each of these packages (or add a comment documenting tested versions), e.g., specify exact version constraints for editdistance, whisper-normalizer, json-repair, tenacity, and keep the existing --no-deps flag for google-genai if needed so installations are reproducible and predictable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/emergent_tts/prepare.py`:
- Around line 193-202: The conversion from float samples to int16 can wrap if
samples exceed [-1,1]; update the code that creates audio_array_int16 (working
with variables audio_array, audio_array_int16, and the AudioSegment export to
wav_path) to first clamp/clip audio_array to [-1.0, 1.0] (e.g., via np.clip),
then scale and convert to int16 (preferably round then astype) before building
AudioSegment and exporting; this ensures no int16 overflow/wrap and avoids
audible artifacts.
- Around line 46-61: In _require_deps() replace the broad "except Exception"
with "except ImportError" so only failed imports are caught, and update the
RuntimeError message to remove the hardcoded developer path—use a generic
installation instruction (e.g., "activate your virtualenv and run: pip install
datasets numpy pydub tqdm librosa soundfile") so users get actionable,
non-personalized guidance; keep the original raised-from (from e) behavior when
re-raising the RuntimeError.
In `@nemo_skills/dataset/emergent_tts/README.md`:
- Line 45: The README contains duplicate section numbering: update the three
heading lines so they read "### 3) Clone + patch EmergentTTS-Eval-public for
NVIDIA Inference API judging", change the second "### 3) Run evaluation" to "###
4) Run evaluation", and increment "### 4) Smoke test" to "### 5) Smoke test"
(these exact heading strings identify the locations to edit) so the section
numbers are sequential.
- Around line 20-28: Replace all developer-specific absolute paths in the
README.md code blocks (e.g., `/home/vmendelev/...` and
`/lustre/fsw/llmservice_nemo_speechlm/users/vmendelev/code`) with clear
placeholders like `<REPO_ROOT>` and `<CLUSTER_WORKDIR>` and update the repo
reference `<repo_url>` to the actual EmergentTTS-Eval-public repository URL;
ensure every code block that currently contains hard-coded paths (including the
examples around the prepare.py invocation and the git clone block) uses these
placeholders or environment variables (e.g., export REPO_ROOT="/path/to/repo")
so contributors can substitute their own paths.
- Around line 64-71: The README has a contradiction between the prose and the
example config about judger_base_url: the prose instructs threading
judger_base_url as the base URL (e.g., https://inference-api.nvidia.com/v1)
while the config example sets scoring.judger_base_url to the full chat
completions path; fix by choosing the base-only form and updating the example
and any code that consumes it (e.g., api_clients.py) to append the endpoint path
when constructing requests; ensure references to judger_base_url and
scoring.judger_base_url consistently use the base (no /v1/chat/completions) and
that OpenAI(...) or equivalent client is created with base_url=judger_base_url
and the code appends the proper /v1/chat/completions suffix where needed, and
update the README snippet and scripts/config/default.yaml to match.
- Around line 1-124: The README for emergent_tts is missing example evaluation
results; update nemo_skills/dataset/emergent_tts/README.md to include a short
"Expected results" section showing sample metrics (WER, MOS, win-rate) for at
least one tested model and the exact config used; reference the example config
path scripts/config/default.yaml (and interactive_10.yaml for smoke tests) and
the output filenames (eval-results/.../output.jsonl,
emergent-tts-eval_*_evaluation-metrics.json, metrics.json) so readers can
reproduce the run and compare their numbers.
In `@nemo_skills/dataset/emergent_tts/scripts/check_deps.py`:
- Around line 57-80: The duplicate missing-module entries occur because modules
like "pydub", "librosa", and "soundfile" are checked in both the prepare and
scoring loops and appended to missing twice; modify the checks around args.stage
so each module is only added once by deduplicating before appending (e.g.,
maintain a local seen set or build a single ordered list of modules to check
when args.stage == "all") and continue to use _try_import(module) to produce err
and append to missing only if module not in seen; ensure you reference the
existing variables and functions (_try_import, missing, args.stage) and preserve
existing loop/error handling semantics.
In `@nemo_skills/dataset/emergent_tts/scripts/config/default.yaml`:
- Around line 9-18: default.yaml currently embeds developer-specific absolute
paths for keys like container, output_dir, nemo_code_path, data_dir,
scoring_code_path, and emergent_data_dir and also contains a repeated typo
"containters"; replace those hardcoded paths with clear placeholders (e.g.
<CONTAINER_PATH>, <OUTPUT_DIR>, <NEMO_CODE_PATH>, etc.) or
environment-variable-style tokens and add short inline comments explaining which
must be customized, and fix the spelling of the "container" key (remove
"containters") wherever it appears so consumers get a canonical, editable
default config.
In `@nemo_skills/dataset/emergent_tts/scripts/config/interactive_10.yaml`:
- Around line 3-9: The YAML uses hardcoded user-specific absolute paths and a
typo: change the `container` key value (currently containing "containters") to
`containers` and replace all user-specific absolute paths referenced by keys
`container`, `mount_paths`, `output_dir`, and `nemo_code_path` with reusable
placeholders or environment-interpolated values (e.g., use
`${oc.env:LUSTRE_BASE}` or `${env:LUSTRE_BASE}` depending on your config system)
so other developers can override them at runtime; ensure the updated `container`
key points to the correct container path expression and that `mount_paths`,
`output_dir`, and `nemo_code_path` use the same base variable to construct their
paths.
In
`@nemo_skills/dataset/emergent_tts/scripts/config/local_interactive_10_base.yaml`:
- Around line 11-14: Replace hard-coded developer paths in the YAML (keys
output_dir, nemo_code_path, and data_dir) with portable placeholders or
environment-expanded variables and/or exclude local variants via .gitignore;
specifically, change the values for output_dir, nemo_code_path, and data_dir to
use Hydra/env placeholders like ${oc.env:HOME}/path or clearly-marked tokens
such as <YOUR_WORKSPACE>/... so other contributors don't need to edit the file,
and add a gitignore rule (e.g.,
nemo_skills/dataset/emergent_tts/scripts/config/local_*.yaml) so personal
local_*.yaml files are not committed.
In `@nemo_skills/dataset/emergent_tts/scripts/config/local_interactive_10.yaml`:
- Around line 4-5: Replace the hard-coded developer paths in this config by
converting the absolute values for output_dir, nemo_code_path, and data_dir into
either placeholder variables (e.g., "<PATH_TO_OUTPUT>", "<PATH_TO_NEMO_CODE>",
"<PATH_TO_DATA>") or reference environment variables (e.g., ${ENV_VAR}) and
remove or generalize the usage comment that contains the specific
NEMO_SKILLS_CONFIG_DIR developer path; alternatively, omit this local config
from the repo and add it to .gitignore. Update the entries named output_dir,
nemo_code_path, data_dir and the usage comment so the file no longer contains
absolute developer-specific paths.
In `@nemo_skills/dataset/emergent_tts/scripts/convert_ns_outputs_to_emergent.py`:
- Around line 57-86: The loop currently silently skips records when a
destination file already exists (the check at dst.exists() and not
args.overwrite) without incrementing any counter; add a new counter variable
(e.g., existing_skipped or skipped_existing) initialized alongside
converted/skipped/missing, increment it inside the dst.exists() and not
args.overwrite branch, and include that counter in the final print summary along
with converted, skipped (no unique_id_eval), and missing_audio so the user can
see how many files were skipped because they already existed; modify references
around _link_or_copy, dst, args.overwrite, and out_dir accordingly.
In `@nemo_skills/dataset/emergent_tts/scripts/run_tts_eval.py`:
- Around line 152-168: The aggregation branch currently runs only when
args.stage == "aggregation", so --stage all skips aggregation; modify the
condition in run_tts_eval.py (the block that builds agg_cmd and calls
ns_run_cmd) to also run when args.stage == "all" (e.g., if args.stage in
("aggregation","all")) or add a clear inline comment near args.stage explaining
that aggregation is intentionally separate and must be invoked with
"aggregation"; update references to agg_cmd and the ns_run_cmd call accordingly
so aggregation runs after scoring when requested.
- Around line 119-131: The command string score_cmd in run_tts_eval.py currently
inlines the secret via `JUDGER_API_KEY={judger_api_key}`; remove that prefix
from the constructed `score_cmd` and instead pass the key via the environment
when invoking `ns_run_cmd` (or ensure the Slurm job inherits the caller env),
e.g., add an env dict containing "JUDGER_API_KEY": judger_api_key to the
ns_run_cmd call or rely on the scoring script reading os.environ; leave the rest
of the command (flags built from scoring.get(...)) unchanged and ensure no other
code interpolates judger_api_key into any logged strings or job script content.
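For the `JUDGER_API_KEY` item above, a sketch of passing the secret through the child-process environment instead of the command string (`run_scoring_cmd` is a hypothetical wrapper; the real pipeline submits jobs via `ns_run_cmd`):

```python
import os
import subprocess


def run_scoring_cmd(score_cmd, judger_api_key):
    # Inject the secret via the child environment rather than interpolating
    # it into the command string, so it never appears in logs, shell history,
    # or generated job scripts.
    env = dict(os.environ, JUDGER_API_KEY=judger_api_key)
    return subprocess.run(score_cmd, env=env, check=True, capture_output=True, text=True)
```

The scoring script then reads `os.environ["JUDGER_API_KEY"]` on its side; nothing secret ends up in the rendered command line.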
---
Nitpick comments:
In `@nemo_skills/dataset/emergent_tts/prepare.py`:
- Around line 64-89: The retry loop in _download_wv_mos currently catches a
broad Exception which can mask unrelated errors; change the second except clause
to only catch network/filesystem related errors (e.g., except
(urllib.error.URLError, OSError) as e) so the retry logic still handles download
failures for WV_MOS_URL and tmp_path operations but lets other unexpected
exceptions propagate; keep the existing ContentTooShortError handling,
wait_s/backoff logic, tmp_path cleanup, and final RuntimeError if attempts
exhaust.
In `@nemo_skills/dataset/emergent_tts/scripts/check_deps.py`:
- Around line 19-24: The _try_import function currently catches all Exception
and returns None inside the try; change it to only catch ImportError so
unexpected errors propagate, and move the successful-return None into an else
block after the try/except. Specifically, in _try_import use
importlib.import_module(module) in the try, catch only ImportError to return the
formatted error string, and place return None in an else clause so import
failures are handled but other exceptions (e.g., OSError from extensions) are
not swallowed.
- Around line 27-28: The inline comment next to repo_root in _venv_install_hint
is misleading: update the comment so it clearly states that
Path(__file__).resolve().parents[4] returns the repository root (walking up from
the file location), and if you want to show the file location mention
Path(__file__).resolve() separately; e.g., clarify that parents[4] corresponds
to the repo root relative to the script and not the script path itself and
adjust the comment near repo_root accordingly.
In `@nemo_skills/dataset/emergent_tts/scripts/config/default.yaml`:
- Line 48: The installation_command currently installs unpinned packages
(editdistance, whisper-normalizer, json-repair, tenacity, and google-genai)
which makes environments non-reproducible; update the installation_command entry
in default.yaml to pin known-good versions for each of these packages (or add a
comment documenting tested versions), e.g., specify exact version constraints
for editdistance, whisper-normalizer, json-repair, tenacity, and keep the
existing --no-deps flag for google-genai if needed so installations are
reproducible and predictable.
In `@nemo_skills/dataset/emergent_tts/scripts/score.py`:
- Around line 54-73: The current _convert function monkey-patches sys.argv and
calls convert_main(), which is fragile and not thread-safe; refactor
convert_ns_outputs_to_emergent.convert_ns_outputs_to_emergent (the module that
currently exposes main as convert_main) to expose a direct API like
convert(ns_output_jsonl: str | Path, out_dir: str | Path, mode: str = "symlink",
overwrite: bool = False) that implements the core logic, make the existing
main() simply parse args and call that convert(...) helper, and then update
_convert to import and call convert(ns_output_jsonl, out_dir, mode="symlink",
overwrite=overwrite) instead of mutating sys.argv and calling convert_main().
- Around line 176-186: The code writes metrics.json directly which can leave a
corrupt file if json.dump fails; instead, after loading metrics (the variable
metrics) write to a temporary file in the same directory (derived from
bench_dir) and then atomically replace/rename it to bench_dir/"metrics.json"
(use Path.replace or os.replace) so that emergent_metrics_path, metrics,
bench_dir and bench are still used to locate and log the file; ensure the temp
file is created in bench_dir, closed/flushed before replace, and cleaned up on
error.
- Around line 93-122: The code sets os.environ["EMERGENT_TTS_DATA_BASE_PATH"]
and chdirs without restoring the environment; modify the routine (the block
around os.environ["EMERGENT_TTS_DATA_BASE_PATH"], prev_cwd, and the finally) to
first capture the original env value (orig_env =
os.environ.get("EMERGENT_TTS_DATA_BASE_PATH")) and prev_cwd, then set the env
and chdir, and in the finally restore both (os.environ[...] = orig_env or del
os.environ[...] if orig_env is None, and os.chdir(prev_cwd)); keep the existing
call to emergent_inference.eval_api_closed_model unchanged.
```python
def _require_deps():
    try:
        import numpy as np  # noqa: F401
        from datasets import load_dataset  # noqa: F401
        import librosa  # noqa: F401
        import soundfile  # noqa: F401
        from pydub import AudioSegment  # noqa: F401
        from tqdm import tqdm  # noqa: F401
    except Exception as e:  # pragma: no cover
        raise RuntimeError(
            "Missing dependencies for EmergentTTS-Eval preparation.\n\n"
            "Install into the repo venv:\n"
            "  cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval\n"
            "  . ./.venv/bin/activate\n"
            "  pip install datasets numpy pydub tqdm librosa soundfile\n"
        ) from e
```
Hardcoded developer path in error message; catch ImportError instead of bare Exception.
Two issues:

- Line 58 contains a hardcoded developer-specific path (`/home/vmendelev/workspace/...`). This is meaningless to other users. Replace with a generic instruction.
- The `except Exception` on line 54 should be `except ImportError`: that's the only exception expected from failed imports. Catching broader exceptions can mask unrelated bugs (e.g., a library that imports successfully but fails during its own init for a different reason).
Proposed fix

```diff
 def _require_deps():
     try:
-        import numpy as np  # noqa: F401
-        from datasets import load_dataset  # noqa: F401
-        import librosa  # noqa: F401
-        import soundfile  # noqa: F401
-        from pydub import AudioSegment  # noqa: F401
-        from tqdm import tqdm  # noqa: F401
-    except Exception as e:  # pragma: no cover
+        import numpy as np
+        from datasets import load_dataset
+        import librosa
+        import soundfile
+        from pydub import AudioSegment
+        from tqdm import tqdm
+    except ImportError as e:  # pragma: no cover
         raise RuntimeError(
             "Missing dependencies for EmergentTTS-Eval preparation.\n\n"
-            "Install into the repo venv:\n"
-            "  cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval\n"
-            "  . ./.venv/bin/activate\n"
-            "  pip install datasets numpy pydub tqdm librosa soundfile\n"
+            "Install the required packages:\n"
+            "  pip install datasets numpy pydub tqdm librosa soundfile\n"
         ) from e
```

As per coding guidelines, "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving".
🧰 Tools
🪛 Ruff (0.15.1)
[warning] 48-48: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
[warning] 49-49: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
[warning] 50-50: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
[warning] 51-51: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
[warning] 52-52: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
[warning] 53-53: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
[warning] 55-61: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/emergent_tts/prepare.py` around lines 46 - 61, In
_require_deps() replace the broad "except Exception" with "except ImportError"
so only failed imports are caught, and update the RuntimeError message to remove
the hardcoded developer path—use a generic installation instruction (e.g.,
"activate your virtualenv and run: pip install datasets numpy pydub tqdm librosa
soundfile") so users get actionable, non-personalized guidance; keep the
original raised-from (from e) behavior when re-raising the RuntimeError.
```python
        audio_array = curr["audio"]["array"]
        audio_sr = int(curr["audio"]["sampling_rate"])
        audio_array_int16 = np.int16(audio_array * 32767)
        audio_segment = AudioSegment(
            audio_array_int16.tobytes(),
            frame_rate=audio_sr,
            sample_width=2,
            channels=1,
        )
        audio_segment.export(str(wav_path), format="wav")
```
Potential int16 overflow/wrap when audio samples exceed [-1, 1].
`np.int16(audio_array * 32767)` will silently wrap around if any sample exceeds the [-1.0, 1.0] range (e.g., a value of 1.001 * 32767 = 32799 wraps to -32737 as int16). This produces audible clicks/artifacts. Clip before converting.
Proposed fix
```diff
-        audio_array_int16 = np.int16(audio_array * 32767)
+        audio_array_int16 = np.int16(np.clip(audio_array, -1.0, 1.0) * 32767)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/emergent_tts/prepare.py` around lines 193 - 202, The
conversion from float samples to int16 can wrap if samples exceed [-1,1]; update
the code that creates audio_array_int16 (working with variables audio_array,
audio_array_int16, and the AudioSegment export to wav_path) to first clamp/clip
audio_array to [-1.0, 1.0] (e.g., via np.clip), then scale and convert to int16
(preferably round then astype) before building AudioSegment and exporting; this
ensures no int16 overflow/wrap and avoids audible artifacts.
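A quick, self-contained sketch of the clip-then-convert pattern (only `numpy` is assumed; the sample values and variable names are illustrative, not taken from the PR):

```python
import numpy as np

# Samples slightly outside [-1, 1], as can happen after resampling or gain.
audio_array = np.array([0.5, 1.001, -1.002])

# Clipping first keeps the subsequent int16 conversion within range,
# avoiding the silent wrap-around described above.
clipped = np.clip(audio_array, -1.0, 1.0)
audio_array_int16 = np.round(clipped * 32767).astype(np.int16)

assert audio_array_int16.max() <= 32767
assert audio_array_int16.min() >= -32767  # clipped -1.0 maps to -32767, no wrap
```

Rounding before `astype` (rather than truncating) is a minor extra refinement the review prompt also suggests.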
## EmergentTTS-Eval dataset (`emergent_tts`)

This dataset integration lets you:

- **Prepare** the EmergentTTS-Eval test set under a shared `data_dir` (download baseline audios + metadata + MOS model).
- **Generate** TTS outputs with NeMo-Skills (`ns eval` via `run_tts_eval.py`).
- **Score** the generated outputs with EmergentTTS-Eval (WER/MOS/win-rate, depending on config).

### 1) Prepare the test set (requires `HF_TOKEN`)

`prepare.py` downloads the dataset and writes all required artifacts into:

- `<DATA_DIR>/emergent_tts/emergent/test.jsonl`
- `<DATA_DIR>/emergent_tts/data/emergent_tts_eval_data.jsonl`
- `<DATA_DIR>/emergent_tts/data/baseline_audios/*.wav`
- `<DATA_DIR>/emergent_tts/data/wv_mos.ckpt`

Run it from your dev machine (or any environment with network access):

```bash
cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval
. ./.venv/bin/activate

export HF_TOKEN="<your_hf_token>"

python nemo_skills/dataset/emergent_tts/prepare.py \
    --output_dir "<DATA_DIR>/emergent_tts"
```

Optional flags:

- `--num_samples 10`: write only the first 10 samples (smoke test).
- `--overwrite`: re-download / regenerate outputs.

### 2) Configure evaluation

Use the example configs in `nemo_skills/dataset/emergent_tts/scripts/config/`.

In `scripts/config/default.yaml`, set:

- `generation.data_dir: <DATA_DIR>`
- `scoring.emergent_data_dir: <DATA_DIR>/emergent_tts/data`
- `scoring.scoring_code_path: <PATH_TO>/EmergentTTS-Eval-public` (on the cluster)

### 3) Clone + patch EmergentTTS-Eval-public for NVIDIA Inference API judging

On EOS (or wherever you run scoring), clone EmergentTTS-Eval:

```bash
cd /lustre/fsw/llmservice_nemo_speechlm/users/vmendelev/code
git clone <repo_url> EmergentTTS-Eval-public
```

Then update Emergent’s judge client selection so that **Gemini models are called via NVIDIA’s OpenAI-compatible Inference API**.

Target behavior:

- **Model name** stays as: `gcp/google/gemini-2.5-pro` (or similar).
- **Base URL** is NVIDIA Inference API: `https://inference-api.nvidia.com/v1`
- **API key** comes from: `JUDGER_API_KEY` (or `NVIDIA_API_KEY`)

Minimal patch checklist inside `EmergentTTS-Eval-public`:

- In `api_clients.py` (or wherever the client is chosen), ensure `gcp/google/*` uses an **OpenAI-compatible** client (not the Google SDK client), e.g.:
  - `OpenAI(base_url=<judger_base_url>, api_key=os.getenv("JUDGER_API_KEY"))`
- Thread `judger_base_url` through so calls use `https://inference-api.nvidia.com/v1` (not the full `/v1/chat/completions` endpoint).

After patching, set these in `scripts/config/default.yaml`:

- `scoring.judge_model: gcp/google/gemini-2.5-pro`
- `scoring.judger_base_url: https://inference-api.nvidia.com/v1/chat/completions`

### 3) Run evaluation (generation + scoring)

From your dev machine, submit jobs to EOS:

```bash
cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval
. ./.venv/bin/activate
mkdir -p .nemo_run

export NEMORUN_HOME="$PWD/.nemo_run"
export NEMO_SKILLS_CONFIG_DIR=/home/vmendelev/workspace/expressiveness/src/ns_eval/cluster_configs
export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1

# Required for win-rate judging (NVIDIA Inference API key)
export JUDGER_API_KEY="<your_nvidia_api_key>"

python -m nemo_skills.dataset.emergent_tts.scripts.run_tts_eval \
    --config nemo_skills/dataset/emergent_tts/scripts/config/default.yaml \
    --stage all \
    --expname emergent_eval
```

### 4) Smoke test (10 samples, interactive)

```bash
cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval
. ./.venv/bin/activate
mkdir -p .nemo_run

export NEMORUN_HOME="$PWD/.nemo_run"
export NEMO_SKILLS_CONFIG_DIR=/home/vmendelev/workspace/expressiveness/src/ns_eval/cluster_configs
export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1

python -m nemo_skills.dataset.emergent_tts.scripts.run_tts_eval \
    --config nemo_skills/dataset/emergent_tts/scripts/config/interactive_10.yaml \
    --stage generation \
    --expname emergent_smoke10
```

### Outputs

NeMo-Skills generation writes:

- `<output_dir>/eval-results/emergent_tts.emergent/output.jsonl`
- `<output_dir>/eval-results/emergent_tts.emergent/audio/*.wav` (or equivalent)

Emergent scoring writes (in the same benchmark folder):

- `emergent-tts-eval_*_evaluation-predictions.jsonl`
- `emergent-tts-eval_*_evaluation-metrics.json`
- `metrics.json` (a NeMo-Skills-friendly copy of Emergent metrics)
**Add expected evaluation results for at least one tested model.**

The README documents the workflow thoroughly but does not include any sample metric output (e.g. WER, MOS, win-rate) for a reference model run. Based on learnings from CONTRIBUTING.md: "When adding new benchmarks, add documentation with example commands for how to run evaluation, expected results for tested models, and any dataset-specific details like special preparation arguments or non-standard inference arguments."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/emergent_tts/README.md` around lines 1 - 124, The README
for emergent_tts is missing example evaluation results; update
nemo_skills/dataset/emergent_tts/README.md to include a short "Expected results"
section showing sample metrics (WER, MOS, win-rate) for at least one tested
model and the exact config used; reference the example config path
scripts/config/default.yaml (and interactive_10.yaml for smoke tests) and the
output filenames (eval-results/.../output.jsonl,
emergent-tts-eval_*_evaluation-metrics.json, metrics.json) so readers can
reproduce the run and compare their numbers.
```bash
cd /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval
. ./.venv/bin/activate

export HF_TOKEN="<your_hf_token>"

python nemo_skills/dataset/emergent_tts/prepare.py \
    --output_dir "<DATA_DIR>/emergent_tts"
```
**Replace developer-specific absolute paths with generic placeholders.**

The README hard-codes `/home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval` (lines 21, 78, 98, and elsewhere) and `/lustre/fsw/llmservice_nemo_speechlm/users/vmendelev/code` (line 50) throughout every code block. These paths are specific to one developer's environment and will not work for any other contributor.

Replace them with environment variables or clearly marked placeholders, e.g. `<REPO_ROOT>`, `<CLUSTER_WORKDIR>`. The `<repo_url>` on line 51 also needs to be filled in with the actual EmergentTTS-Eval-public repository URL.
Also applies to: 49-52, 77-110
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/emergent_tts/README.md` around lines 20 - 28, Replace all
developer-specific absolute paths in the README.md code blocks (e.g.,
`/home/vmendelev/...` and
`/lustre/fsw/llmservice_nemo_speechlm/users/vmendelev/code`) with clear
placeholders like `<REPO_ROOT>` and `<CLUSTER_WORKDIR>` and update the repo
reference `<repo_url>` to the actual EmergentTTS-Eval-public repository URL;
ensure every code block that currently contains hard-coded paths (including the
examples around the prepare.py invocation and the git clone block) uses these
placeholders or environment variables (e.g., export REPO_ROOT="/path/to/repo")
so contributors can substitute their own paths.
- `scoring.emergent_data_dir: <DATA_DIR>/emergent_tts/data`
- `scoring.scoring_code_path: <PATH_TO>/EmergentTTS-Eval-public` (on the cluster)

### 3) Clone + patch EmergentTTS-Eval-public for NVIDIA Inference API judging
**Fix duplicate section number — two sections are labeled "3)".**

Line 45 is `### 3) Clone + patch EmergentTTS-Eval-public` and line 73 is `### 3) Run evaluation`. The subsequent section (`### 4) Smoke test`) is also off by one. The correct numbering should be: 3 → Clone & patch, 4 → Run evaluation, 5 → Smoke test.
Also applies to: 73-73, 95-95
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/emergent_tts/README.md` at line 45, The README contains
duplicate section numbering: update the three heading lines so they read "### 3)
Clone + patch EmergentTTS-Eval-public for NVIDIA Inference API judging", change
the second "### 3) Run evaluation" to "### 4) Run evaluation", and increment
"### 4) Smoke test" to "### 5) Smoke test" (these exact heading strings identify
the locations to edit) so the section numbers are sequential.
```yaml
output_dir: /home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval/_local_runs/emergent_tts_smoke10_base

nemo_code_path: /home/vmendelev/workspace/expressiveness/src/NeMo
```
**Hard-coded developer paths make this config non-portable.**

`output_dir`, `nemo_code_path`, and `data_dir` (lines 11, 13-14, 22-25) all reference `/home/vmendelev/...`. Any other contributor must manually edit these before using this file.

Either replace them with clearly-marked placeholders (e.g. `${oc.env:HOME}/...` for Hydra, or `<YOUR_WORKSPACE>/...` as a comment hint) or add a `.gitignore` pattern such as `nemo_skills/dataset/emergent_tts/scripts/config/local_*.yaml` to keep local/personal variants out of the repository entirely, consistent with the existing `cluster_configs/*` pattern in `.gitignore`.
Also applies to: 22-25
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@nemo_skills/dataset/emergent_tts/scripts/config/local_interactive_10_base.yaml`
around lines 11 - 14, Replace hard-coded developer paths in the YAML (keys
output_dir, nemo_code_path, and data_dir) with portable placeholders or
environment-expanded variables and/or exclude local variants via .gitignore;
specifically, change the values for output_dir, nemo_code_path, and data_dir to
use Hydra/env placeholders like ${oc.env:HOME}/path or clearly-marked tokens
such as <YOUR_WORKSPACE>/... so other contributors don't need to edit the file,
and add a gitignore rule (e.g.,
nemo_skills/dataset/emergent_tts/scripts/config/local_*.yaml) so personal
local_*.yaml files are not committed.
```yaml
# export NEMO_SKILLS_CONFIG_DIR=/home/vmendelev/workspace/expressiveness/src/nemo-skills-tts-eval/cluster_configs
# python -m nemo_skills.dataset.emergent_tts.scripts.run_tts_eval --config <this_file> --stage generation
```
**Same hard-coded developer paths as `local_interactive_10_base.yaml`.**

`output_dir`, `nemo_code_path`, `data_dir` (lines 10, 14, 25), and even the usage comment on line 4 (`NEMO_SKILLS_CONFIG_DIR=/home/vmendelev/...`) are all absolute paths tied to a specific developer environment. Apply the same fix suggested for `local_interactive_10_base.yaml`: use placeholder values or exclude these local configs via `.gitignore`.
Also applies to: 10-10, 14-14, 25-25
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/emergent_tts/scripts/config/local_interactive_10.yaml`
around lines 4 - 5, Replace the hard-coded developer paths in this config by
converting the absolute values for output_dir, nemo_code_path, and data_dir into
either placeholder variables (e.g., "<PATH_TO_OUTPUT>", "<PATH_TO_NEMO_CODE>",
"<PATH_TO_DATA>") or reference environment variables (e.g., ${ENV_VAR}) and
remove or generalize the usage comment that contains the specific
NEMO_SKILLS_CONFIG_DIR developer path; alternatively, omit this local config
from the repo and add it to .gitignore. Update the entries named output_dir,
nemo_code_path, data_dir and the usage comment so the file no longer contains
absolute developer-specific paths.
```python
converted = 0
skipped = 0
missing = 0

with open(args.ns_output_jsonl, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        user_json = _extract_user_json(record) or {}
        unique_id = user_json.get("unique_id_eval", record.get("unique_id_eval"))
        audio_path = (record.get("audio") or {}).get("path")

        if unique_id is None:
            skipped += 1
            continue
        if not audio_path or not os.path.exists(audio_path):
            missing += 1
            continue

        dst = out_dir / f"{unique_id}.wav"
        if dst.exists() and not args.overwrite:
            continue
        _link_or_copy(audio_path, str(dst), args.mode)
        converted += 1

print(
    f"Converted {converted} files into {out_dir}. "
    f"skipped(no unique_id_eval)={skipped}, missing_audio={missing}"
)
```
**Silently skipped "already exists" records are not counted in the summary.**

When `dst.exists() and not args.overwrite` (line 79), the record is skipped without incrementing any counter. The final summary reports `converted`, `skipped(no unique_id_eval)`, and `missing_audio`, but "already-exists" records are invisible. This makes it hard to tell whether a partial run actually processed everything.

Proposed fix:

```diff
 converted = 0
 skipped = 0
 missing = 0
+already_exists = 0
 with open(args.ns_output_jsonl, "r", encoding="utf-8") as f:
     for line in f:
@@ -79,6 +80,7 @@
         dst = out_dir / f"{unique_id}.wav"
         if dst.exists() and not args.overwrite:
+            already_exists += 1
             continue
         _link_or_copy(audio_path, str(dst), args.mode)
         converted += 1
 print(
     f"Converted {converted} files into {out_dir}. "
-    f"skipped(no unique_id_eval)={skipped}, missing_audio={missing}"
+    f"skipped(no unique_id_eval)={skipped}, missing_audio={missing}, already_exists={already_exists}"
 )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
converted = 0
skipped = 0
missing = 0
already_exists = 0

with open(args.ns_output_jsonl, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        user_json = _extract_user_json(record) or {}
        unique_id = user_json.get("unique_id_eval", record.get("unique_id_eval"))
        audio_path = (record.get("audio") or {}).get("path")

        if unique_id is None:
            skipped += 1
            continue
        if not audio_path or not os.path.exists(audio_path):
            missing += 1
            continue

        dst = out_dir / f"{unique_id}.wav"
        if dst.exists() and not args.overwrite:
            already_exists += 1
            continue
        _link_or_copy(audio_path, str(dst), args.mode)
        converted += 1

print(
    f"Converted {converted} files into {out_dir}. "
    f"skipped(no unique_id_eval)={skipped}, missing_audio={missing}, already_exists={already_exists}"
)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/emergent_tts/scripts/convert_ns_outputs_to_emergent.py`
around lines 57 - 86, The loop currently silently skips records when a
destination file already exists (the check at dst.exists() and not
args.overwrite) without incrementing any counter; add a new counter variable
(e.g., existing_skipped or skipped_existing) initialized alongside
converted/skipped/missing, increment it inside the dst.exists() and not
args.overwrite branch, and include that counter in the final print summary along
with converted, skipped (no unique_id_eval), and missing_audio so the user can
see how many files were skipped because they already existed; modify references
around _link_or_copy, dst, args.overwrite, and out_dir accordingly.
```python
score_cmd = (
    (f"cd {emergent_data_base_dir} && " if emergent_data_base_dir else "")
    + f"JUDGER_API_KEY={judger_api_key} "
    + f"PYTHONPATH={scoring_code_path}:$PYTHONPATH "
    + "python -m nemo_skills.dataset.emergent_tts.scripts.score "
    + f"--results_dir {output_dir} "
    + f"--benchmark {benchmark} "
    + f"--emergent_data_dir {emergent_data_dir} "
    + f"--judge_model {scoring.get('judge_model', 'gcp/google/gemini-2.5-pro')} "
    + f"--judger_base_url {scoring.get('judger_base_url', 'https://inference-api.nvidia.com/v1/chat/completions')} "
    + f"--num_threads {int(scoring.get('num_threads', 8))} "
    + f"--evaluate_function {scoring.get('evaluate_function', 'win_rate')}"
)
```
**API key embedded in plain text in the shell command string.**

Line 121 interpolates `judger_api_key` directly into the command string: `JUDGER_API_KEY={judger_api_key}`. This command is passed to `ns_run_cmd`, which likely persists it in Slurm job scripts and log files, leaking the secret. Prefer passing the key via the environment rather than inlining it in the command.

Possible approach: if `ns_run_cmd` supports environment variable propagation, pass `JUDGER_API_KEY` as an env var. Alternatively, have the scoring script read the key from the environment itself (it already can via `os.environ`), and ensure the Slurm job inherits the caller's environment. Remove the inline `JUDGER_API_KEY=...` prefix from `score_cmd`:

```diff
 score_cmd = (
     (f"cd {emergent_data_base_dir} && " if emergent_data_base_dir else "")
-    + f"JUDGER_API_KEY={judger_api_key} "
     + f"PYTHONPATH={scoring_code_path}:$PYTHONPATH "
     + "python -m nemo_skills.dataset.emergent_tts.scripts.score "
```

Then in `score.py`, read `JUDGER_API_KEY` from the environment at scoring time (which it already does indirectly via the Emergent client).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/emergent_tts/scripts/run_tts_eval.py` around lines 119 -
131, The command string score_cmd in run_tts_eval.py currently inlines the
secret via `JUDGER_API_KEY={judger_api_key}`; remove that prefix from the
constructed `score_cmd` and instead pass the key via the environment when
invoking `ns_run_cmd` (or ensure the Slurm job inherits the caller env), e.g.,
add an env dict containing "JUDGER_API_KEY": judger_api_key to the ns_run_cmd
call or rely on the scoring script reading os.environ; leave the rest of the
command (flags built from scoring.get(...)) unchanged and ensure no other code
interpolates judger_api_key into any logged strings or job script content.
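A minimal illustration of the env-passing pattern, using plain `subprocess` (the actual `ns_run_cmd` interface may differ, and the function and variable names here are hypothetical):

```python
import os
import subprocess

def run_scoring(command: str, api_key: str) -> str:
    # Pass the secret through the child environment instead of
    # interpolating it into the command string, so it never appears
    # in job scripts or logs that record the command line.
    env = dict(os.environ)
    env["JUDGER_API_KEY"] = api_key
    result = subprocess.run(
        ["sh", "-c", command], env=env, capture_output=True, text=True, check=True
    )
    return result.stdout

# The child process reads the key from its own environment.
out = run_scoring("printf '%s' \"$JUDGER_API_KEY\"", "dummy-key")
assert out == "dummy-key"
```

The command string itself (`printf ...`) never contains the key, so anything that logs the submitted command stays secret-free.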
```python
if args.stage == "aggregation":
    print("\n" + "=" * 60)
    print("Stage 3: AGGREGATION")
    print("=" * 60)
    agg_cmd = f"python -m nemo_skills.dataset.emergent_tts.scripts.score --results_dir {output_dir} --aggregation_only"
    ns_run_cmd(
        ctx=MockContext(),
        cluster=cfg["cluster"],
        container=cfg["container"],
        partition=cfg["partition"],
        num_gpus=0,
        mount_paths=cfg["mount_paths"],
        command=agg_cmd,
        reuse_code=False,
        expname=f"{args.expname}_agg",
        log_dir=f"{output_dir}/eval-logs",
    )
```
**`--stage all` does not include aggregation.**

Line 152 uses `==` (`if args.stage == "aggregation"`), so running `--stage all` executes generation and scoring but skips aggregation. The docstring mentions "Generation -> Scoring (-> Aggregation)", which hints this is by design (aggregation requires scoring to complete first and may be a separate step), but this will surprise users who expect "all" to mean all stages. Consider adding a clarifying comment or including aggregation in "all" with a `run_after` dependency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/emergent_tts/scripts/run_tts_eval.py` around lines 152 -
168, The aggregation branch currently runs only when args.stage ==
"aggregation", so --stage all skips aggregation; modify the condition in
run_tts_eval.py (the block that builds agg_cmd and calls ns_run_cmd) to also run
when args.stage == "all" (e.g., if args.stage in ("aggregation","all")) or add a
clear inline comment near args.stage explaining that aggregation is
intentionally separate and must be invoked with "aggregation"; update references
to agg_cmd and the ns_run_cmd call accordingly so aggregation runs after scoring
when requested.
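A tiny sketch of the suggested membership check (the helper name is illustrative, not the PR's actual code):

```python
def should_run_aggregation(stage: str) -> bool:
    # Run aggregation both when requested explicitly and as the
    # final step of an end-to-end "all" run.
    return stage in ("aggregation", "all")

assert should_run_aggregation("aggregation")
assert should_run_aggregation("all")
assert not should_run_aggregation("generation")
```

If aggregation must strictly follow scoring, the `ns_run_cmd` call for the "all" path would also need a dependency on the scoring job's `expname`.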
Summary

Files added:

- `nemo_skills/dataset/emergent_tts/__init__.py` — dataset registration
- `nemo_skills/dataset/emergent_tts/prepare.py` — data preparation
- `nemo_skills/dataset/emergent_tts/README.md` — documentation
- `nemo_skills/dataset/emergent_tts/emergent/` — emergent framework integration
- `nemo_skills/dataset/emergent_tts/scripts/` — scoring, evaluation, and conversion scripts
- `nemo_skills/dataset/emergent_tts/scripts/config/` — Hydra config files
- `.gitignore` — minor additions for TTS artifacts

Test plan

- `python -m pytest tests/` to verify no regressions

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation