
Dataclass help message #4

Merged
Kipok merged 9 commits into NVIDIA-NeMo:main from i-vainn:dataclass_docstring
Feb 24, 2024

Conversation

@i-vainn (Collaborator) commented Feb 20, 2024

This PR adds support for showing a customized help message for dataclasses defined in .py files.

This is done by parsing comments and arguments from the dataclass source code.
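The comment-parsing idea can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the PR's actual implementation; the helper name, the regex, and the example class are all assumptions:

```python
import re

# Hypothetical dataclass source; in the PR, the real source of the target
# class would be read instead.
DATACLASS_SOURCE = '''
@dataclass
class ExampleConfig:
    batch_size: int = 16  # number of samples per batch
    dataset: Optional[str] = None  # HF dataset name to load
'''


def extract_field_comments(source: str) -> dict:
    """Map each annotated field name to the inline comment on its line."""
    comments = {}
    for line in source.splitlines():
        # Match "<name>: <annotation ...>  # <comment>"
        match = re.match(r"\s*(\w+)\s*:[^#]*#\s*(.+)$", line)
        if match:
            comments[match.group(1)] = match.group(2).strip()
    return comments


print(extract_field_comments(DATACLASS_SOURCE))
# {'batch_size': 'number of samples per batch', 'dataset': 'HF dataset name to load'}
```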

Usage example:
python3 nemo_skills/finetuning/prepare_sft_data.py --help

prediction_jsonl_files: can specify multiple patterns separated by space
Type: Optional[str], Default: None

preprocessed_dataset_files: can specify datasets from HF instead of prediction_jsonl_files
Type: Optional[str], Default: None

output_path: 
Type: str, Default: ???

metadata: can provide additional metadata to store (e.g. dataset or generation_type)
Type: dict[], Default Factory: dict

skip_first: useful for skipping validation set from train_full generation (it's always first)
Type: int, Default: 0

add_correct: can set to False if only want to export incorrect solutions
Type: bool, Default: True

add_incorrect: if True, saves only incorrect solutions instead of correct
Type: bool, Default: False

downsampling_method: random or fair
Type: Optional[str], Default: None

num_output_samples: 
Type: int, Default: -1

prompt: 
Type: PromptConfig, Default Factory: get_default_prompt_config

  prompt.delimiter: 
  Type: str, Default: ???

  prompt.prefix: 
  Type: str, Default: ???

  prompt.template: 
  Type: str, Default: ???

  prompt.context_type: 
  Type: str, Default: empty

  prompt.examples_type: 
  Type: Optional[str], Default: None

  prompt.num_few_shots: 
  Type: int, Default: 5

filters: 
Type: list[str], Default Factory: <lambda>

text_filter_type: 
Type: Optional[str], Default: None

trim_solutions: 
Type: bool, Default: False

@Kipok (Collaborator) commented Feb 20, 2024

Nice! Thanks @i-vainn! A few high-level comments.

  1. Can we preserve the newlines in the comments? Right now everything seems to be replaced with spaces, but for longer docs it might be better to show the text the same way as in code, with the newlines from the original comments.
  2. Can you please put the type/default on the same line and use a format similar to the docs in code? E.g. instead of

     batch_size:
     Type: int, Default: 16

     you'd have

     batch_size (int, default: 16) - <description>
  3. Let's not show the default factory, at least for dataclasses. They are displayed recursively, so the default factory is just confusing. Maybe try to instantiate it, and if the result is a dataclass, ignore it; otherwise, show it as a normal default?
  4. Can you please modify the code so that it's easy to get this help message as a string? We want to include it directly in the pipeline scripts, alongside their own help, so people can easily see all available parameters for pass-through scripts.
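The default-factory suggestion above (try to instantiate the factory, and skip showing it if the result is a dataclass) could look roughly like this; all class and function names are illustrative, not the PR's code:

```python
import dataclasses
from dataclasses import dataclass, field


# Hypothetical configs standing in for the real PromptConfig etc.
@dataclass
class Prompt:
    num_few_shots: int = 5


@dataclass
class Config:
    metadata: dict = field(default_factory=dict)
    prompt: Prompt = field(default_factory=Prompt)


def default_repr(f: dataclasses.Field) -> str:
    """Render a field's default; hide factories that build nested dataclasses."""
    if f.default is not dataclasses.MISSING:
        return f"default: {f.default}"
    if f.default_factory is not dataclasses.MISSING:
        value = f.default_factory()
        # Nested dataclasses are expanded field-by-field anyway, so showing
        # their factory would only add noise.
        if dataclasses.is_dataclass(value):
            return ""
        return f"default: {value}"
    return "required"


for f in dataclasses.fields(Config):
    rep = default_repr(f)
    print(f"{f.name}: {rep}" if rep else f"{f.name}: (nested dataclass, shown recursively)")
```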

@Kipok (Collaborator) commented Feb 20, 2024

Also, default: ??? might be a bit confusing; let's instead just say "required" or something like that. And it's best to retain some reference to Hydra in the help message, to let people know they can use whatever command-line magic Hydra supports directly.
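For context, the ??? placeholder is OmegaConf's MISSING sentinel, which Hydra-based configs use for values the user must supply. A stdlib-only sketch of rendering the analogous case (a field with no default) as "required" instead; the class and function names here are hypothetical:

```python
import dataclasses
from dataclasses import dataclass


@dataclass
class PromptConfig:
    delimiter: str  # no default: the user must supply a value
    num_few_shots: int = 5


def render_help(cls) -> str:
    """Render one 'name: type = default' line per field, marking required ones."""
    lines = []
    for f in dataclasses.fields(cls):
        type_name = getattr(f.type, "__name__", str(f.type))
        if f.default is dataclasses.MISSING and f.default_factory is dataclasses.MISSING:
            lines.append(f"{f.name}: {type_name} = required")
        else:
            lines.append(f"{f.name}: {type_name} = {f.default}")
    return "\n".join(lines)


print(render_help(PromptConfig))
# delimiter: str = required
# num_few_shots: int = 5
```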

@i-vainn (Collaborator, Author) commented Feb 21, 2024

Thanks for your comments, here is a refined version.

Here is what the python nemo_skills/inference/generate_solutions.py --help output looks like now:

output_file (str, default: None) 

server (dict, default: None) - will be directly passed to model.get_client function

sandbox (dict, default: None) - will be directly passed to sandbox.get_sandbox function

prompt (PromptConfig) 

  prompt.delimiter (str, required) 

  prompt.prefix (str, required) 

  prompt.template (str, required) 

  prompt.context_type (str, default: empty) 

  prompt.examples_type (Optional[str], default: None) 

  prompt.num_few_shots (int, default: 5) 

inference (InferenceConfig) 

  inference.temperature (float, default: 0.0) - greedy

  inference.top_k (int, default: 0) 

  inference.top_p (float, default: 0.95) 

  inference.random_seed (int, default: 0) 

  inference.tokens_to_generate (int, default: 512) 

  inference.repetition_penalty (float, default: 1.0) 

dataset (Optional[str], default: None) 

split_name (Optional[str], default: None) 

data_file (Optional[str], default: None) 

batch_size (int, default: 16) 

max_samples (int, default: -1) 

skip_filled (bool, default: False) 

offset (int, default: 0)

@Kipok (Collaborator) commented Feb 24, 2024

Thanks Ivan, looks great! For future reference, the final format is like this:

python nemo_skills/inference/generate_solutions.py --help

This script uses Hydra (https://hydra.cc/) for dynamic configuration management.
You can apply Hydra's command-line syntax for overriding configuration values directly.
Below are the available configuration options and their default values:
---------------------------------------------------------------------------
output_file: str = None
server: dict = None - will be directly passed to model.get_client function
sandbox: dict = None - will be directly passed to sandbox.get_sandbox function
prompt: PromptConfig = PromptConfig()
  prompt.delimiter: str = MISSING
  prompt.prefix: str = MISSING
  prompt.template: str = MISSING
  prompt.context_type: str = empty
  prompt.examples_type: Optional[str] = None
  prompt.num_few_shots: int = 5
inference: InferenceConfig = InferenceConfig()
  inference.temperature: float = 0.0 - greedy
  inference.top_k: int = 0
  inference.top_p: float = 0.95
  inference.random_seed: int = 0
  inference.tokens_to_generate: int = 512
  inference.repetition_penalty: float = 1.0
dataset: Optional[str] = None
split_name: Optional[str] = None
data_file: Optional[str] = None
batch_size: int = 16
max_samples: int = -1
skip_filled: bool = False
offset: int = 0

@Kipok Kipok self-requested a review February 24, 2024 00:02
@Kipok Kipok merged commit 351c8f0 into NVIDIA-NeMo:main Feb 24, 2024
dgtm777 pushed a commit that referenced this pull request Apr 2, 2025
wasiahmad pushed a commit that referenced this pull request Oct 1, 2025
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@greptile-apps greptile-apps bot mentioned this pull request Jan 27, 2026
gwarmstrong added a commit to gwarmstrong/NeMo-Skills that referenced this pull request Feb 14, 2026
Per review feedback: all benchmark-specific packages should go to core
for now since JIT install is not yet implemented. Previously only
PythonTool-specific deps were in core while benchmark deps like datasets,
sacrebleu, faiss-cpu, etc. were only in main.txt. This led to an
inconsistent boundary where math grader deps were in core but BFCL deps
were not, despite both being benchmark-specific.

Addresses review comments #1, NVIDIA-NeMo#4, NVIDIA-NeMo#6 on PR NVIDIA-NeMo#1229.

Signed-off-by: George Armstrong <georgea@nvidia.com>
gwarmstrong added a commit to gwarmstrong/NeMo-Skills that referenced this pull request Feb 14, 2026
Rewrite the dependency boundary section to:
- Define core as "everything needed for inference + evaluation" (not
  just PythonTool-specific deps)
- Remove references to deleted requirements/main.txt
- Clarify that all benchmark evaluator deps go to core until JIT
  install is implemented
- Improve dataset module separation guidance (pipeline = cluster I/O
  only, core = all local logic)
- Add note about summarize-results refactor (issue NVIDIA-NeMo#779)

Addresses review comments NVIDIA-NeMo#3, NVIDIA-NeMo#4, NVIDIA-NeMo#6, NVIDIA-NeMo#7 on PR NVIDIA-NeMo#1229.

Signed-off-by: George Armstrong <georgea@nvidia.com>