
Dataclass help message #4

Merged
Kipok merged 9 commits into NVIDIA-NeMo:main from i-vainn:dataclass_docstring
Feb 24, 2024

Conversation

@i-vainn (Collaborator) commented Feb 20, 2024

This PR adds support for showing a customized help message for dataclasses defined in .py files.

This is done by parsing comments and arguments from the dataclass source code.
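The comment-parsing idea can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the PR's actual implementation; the helper name, the regex, and the example class are all assumptions:

```python
import re

# Hypothetical dataclass source; in the PR, the real source of the target
# class would be read instead.
DATACLASS_SOURCE = '''
@dataclass
class ExampleConfig:
    batch_size: int = 16  # number of samples per batch
    dataset: Optional[str] = None  # HF dataset name to load
'''


def extract_field_comments(source: str) -> dict:
    """Map each annotated field name to the inline comment on its line."""
    comments = {}
    for line in source.splitlines():
        # Match "<name>: <annotation ...>  # <comment>"
        match = re.match(r"\s*(\w+)\s*:[^#]*#\s*(.+)$", line)
        if match:
            comments[match.group(1)] = match.group(2).strip()
    return comments


print(extract_field_comments(DATACLASS_SOURCE))
# {'batch_size': 'number of samples per batch', 'dataset': 'HF dataset name to load'}
```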

Usage example:
python3 nemo_skills/finetuning/prepare_sft_data.py --help

prediction_jsonl_files: can specify multiple patterns separated by space
Type: Optional[str], Default: None

preprocessed_dataset_files: can specify datasets from HF instead of prediction_jsonl_files
Type: Optional[str], Default: None

output_path: 
Type: str, Default: ???

metadata: can provide additional metadata to store (e.g. dataset or generation_type)
Type: dict[], Default Factory: dict

skip_first: useful for skipping validation set from train_full generation (it's always first)
Type: int, Default: 0

add_correct: can set to False if only want to export incorrect solutions
Type: bool, Default: True

add_incorrect: if True, saves only incorrect solutions instead of correct
Type: bool, Default: False

downsampling_method: random or fair
Type: Optional[str], Default: None

num_output_samples: 
Type: int, Default: -1

prompt: 
Type: PromptConfig, Default Factory: get_default_prompt_config

  prompt.delimiter: 
  Type: str, Default: ???

  prompt.prefix: 
  Type: str, Default: ???

  prompt.template: 
  Type: str, Default: ???

  prompt.context_type: 
  Type: str, Default: empty

  prompt.examples_type: 
  Type: Optional[str], Default: None

  prompt.num_few_shots: 
  Type: int, Default: 5

filters: 
Type: list[str], Default Factory: <lambda>

text_filter_type: 
Type: Optional[str], Default: None

trim_solutions: 
Type: bool, Default: False

@Kipok (Collaborator) commented Feb 20, 2024

Nice! Thanks @i-vainn! A few high-level comments.

  1. Can we preserve the newlines in the comments? Right now everything seems to be replaced with spaces, but for longer docs it might be better to show the text the same way as in code, with the newlines from the original comments.
  2. Can you please put the type/default on the same line and use a format similar to the docs in code? E.g. instead of

     batch_size:
     Type: int, Default: 16

     you'd have

     batch_size (int, default: 16) - <description>
  3. Let's not show the default factory, at least for dataclasses. They are displayed recursively, so the default factory is just confusing. Maybe try to instantiate it, and if the result is a dataclass, ignore it; otherwise, show it as a normal default?
  4. Can you please modify the code so that it's easy to get this help message as a string? We want to include it directly in the pipeline scripts, alongside their own help, so people can easily see all available parameters for pass-through scripts.
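The default-factory suggestion above (try to instantiate the factory, and skip showing it if the result is a dataclass) could look roughly like this; all class and function names are illustrative, not the PR's code:

```python
import dataclasses
from dataclasses import dataclass, field


# Hypothetical configs standing in for the real PromptConfig etc.
@dataclass
class Prompt:
    num_few_shots: int = 5


@dataclass
class Config:
    metadata: dict = field(default_factory=dict)
    prompt: Prompt = field(default_factory=Prompt)


def default_repr(f: dataclasses.Field) -> str:
    """Render a field's default; hide factories that build nested dataclasses."""
    if f.default is not dataclasses.MISSING:
        return f"default: {f.default}"
    if f.default_factory is not dataclasses.MISSING:
        value = f.default_factory()
        # Nested dataclasses are expanded field-by-field anyway, so showing
        # their factory would only add noise.
        if dataclasses.is_dataclass(value):
            return ""
        return f"default: {value}"
    return "required"


for f in dataclasses.fields(Config):
    rep = default_repr(f)
    print(f"{f.name}: {rep}" if rep else f"{f.name}: (nested dataclass, shown recursively)")
```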

@Kipok (Collaborator) commented Feb 20, 2024

Also, default: ??? might be a bit confusing; let's instead just say "required" or something like that. And it's best to retain some reference to Hydra in the help message, to let people know they can use whatever command-line magic Hydra supports directly.
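For context, the ??? placeholder is OmegaConf's MISSING sentinel, which Hydra-based configs use for values the user must supply. A stdlib-only sketch of rendering the analogous case (a field with no default) as "required" instead; the class and function names here are hypothetical:

```python
import dataclasses
from dataclasses import dataclass


@dataclass
class PromptConfig:
    delimiter: str  # no default: the user must supply a value
    num_few_shots: int = 5


def render_help(cls) -> str:
    """Render one 'name: type = default' line per field, marking required ones."""
    lines = []
    for f in dataclasses.fields(cls):
        type_name = getattr(f.type, "__name__", str(f.type))
        if f.default is dataclasses.MISSING and f.default_factory is dataclasses.MISSING:
            lines.append(f"{f.name}: {type_name} = required")
        else:
            lines.append(f"{f.name}: {type_name} = {f.default}")
    return "\n".join(lines)


print(render_help(PromptConfig))
# delimiter: str = required
# num_few_shots: int = 5
```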

@i-vainn (Collaborator, Author) commented Feb 21, 2024

Thanks for your comments, here is a refined version.

Here is what the python nemo_skills/inference/generate_solutions.py --help output looks like now:

output_file (str, default: None) 

server (dict, default: None) - will be directly passed to model.get_client function

sandbox (dict, default: None) - will be directly passed to sandbox.get_sandbox function

prompt (PromptConfig) 

  prompt.delimiter (str, required) 

  prompt.prefix (str, required) 

  prompt.template (str, required) 

  prompt.context_type (str, default: empty) 

  prompt.examples_type (Optional[str], default: None) 

  prompt.num_few_shots (int, default: 5) 

inference (InferenceConfig) 

  inference.temperature (float, default: 0.0) - greedy

  inference.top_k (int, default: 0) 

  inference.top_p (float, default: 0.95) 

  inference.random_seed (int, default: 0) 

  inference.tokens_to_generate (int, default: 512) 

  inference.repetition_penalty (float, default: 1.0) 

dataset (Optional[str], default: None) 

split_name (Optional[str], default: None) 

data_file (Optional[str], default: None) 

batch_size (int, default: 16) 

max_samples (int, default: -1) 

skip_filled (bool, default: False) 

offset (int, default: 0)

@Kipok (Collaborator) commented Feb 24, 2024

Thanks Ivan, looks great! For future reference, the final format is like this:

python nemo_skills/inference/generate_solutions.py --help

This script uses Hydra (https://hydra.cc/) for dynamic configuration management.
You can apply Hydra's command-line syntax for overriding configuration values directly.
Below are the available configuration options and their default values:
---------------------------------------------------------------------------
output_file: str = None
server: dict = None - will be directly passed to model.get_client function
sandbox: dict = None - will be directly passed to sandbox.get_sandbox function
prompt: PromptConfig = PromptConfig()
  prompt.delimiter: str = MISSING
  prompt.prefix: str = MISSING
  prompt.template: str = MISSING
  prompt.context_type: str = empty
  prompt.examples_type: Optional[str] = None
  prompt.num_few_shots: int = 5
inference: InferenceConfig = InferenceConfig()
  inference.temperature: float = 0.0 - greedy
  inference.top_k: int = 0
  inference.top_p: float = 0.95
  inference.random_seed: int = 0
  inference.tokens_to_generate: int = 512
  inference.repetition_penalty: float = 1.0
dataset: Optional[str] = None
split_name: Optional[str] = None
data_file: Optional[str] = None
batch_size: int = 16
max_samples: int = -1
skip_filled: bool = False
offset: int = 0

@Kipok Kipok self-requested a review February 24, 2024 00:02
@Kipok Kipok merged commit 351c8f0 into NVIDIA-NeMo:main Feb 24, 2024
dgtm777 pushed a commit that referenced this pull request Apr 2, 2025
wasiahmad pushed a commit that referenced this pull request Oct 1, 2025
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@greptile-apps greptile-apps bot mentioned this pull request Jan 27, 2026
gwarmstrong added a commit to gwarmstrong/NeMo-Skills that referenced this pull request Feb 14, 2026
Per review feedback: all benchmark-specific packages should go to core
for now since JIT install is not yet implemented. Previously only
PythonTool-specific deps were in core while benchmark deps like datasets,
sacrebleu, faiss-cpu, etc. were only in main.txt. This led to an
inconsistent boundary where math grader deps were in core but BFCL deps
were not, despite both being benchmark-specific.

Addresses review comments #1, NVIDIA-NeMo#4, NVIDIA-NeMo#6 on PR NVIDIA-NeMo#1229.

Signed-off-by: George Armstrong <georgea@nvidia.com>
gwarmstrong added a commit to gwarmstrong/NeMo-Skills that referenced this pull request Feb 14, 2026
Rewrite the dependency boundary section to:
- Define core as "everything needed for inference + evaluation" (not
  just PythonTool-specific deps)
- Remove references to deleted requirements/main.txt
- Clarify that all benchmark evaluator deps go to core until JIT
  install is implemented
- Improve dataset module separation guidance (pipeline = cluster I/O
  only, core = all local logic)
- Add note about summarize-results refactor (issue NVIDIA-NeMo#779)

Addresses review comments NVIDIA-NeMo#3, NVIDIA-NeMo#4, NVIDIA-NeMo#6, NVIDIA-NeMo#7 on PR NVIDIA-NeMo#1229.

Signed-off-by: George Armstrong <georgea@nvidia.com>