Conversation
Signed-off-by: fayejf <fayejf07@gmail.com>
Kipok left a comment:

Thanks! Just a few small changes are needed.
nemo_skills/dataset/mrcr/prepare.py (outdated):

    from tqdm import tqdm
    import tempfile

    subprocess.run(["pip install tiktoken"], check=True, shell=True)
Please move it inside the function where it's needed; otherwise it's going to run on every import, even when the script isn't called.
Good point! Changed.
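The fix discussed above could be sketched like this — a hypothetical `ensure_tiktoken` helper (the name and exact shape are assumptions, not the actual prepare.py code) that defers the install until the function actually runs:

```python
import subprocess
import sys


def ensure_tiktoken():
    # Hypothetical helper (not the actual prepare.py code): install
    # tiktoken only when data preparation runs, not at import time.
    try:
        import tiktoken  # noqa: F401
    except ImportError:
        # Avoid shell=True; invoke pip through the current interpreter.
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "tiktoken"],
            check=True,
        )
```

Because the `subprocess.run` call sits inside the function body, merely importing the module no longer triggers a pip install.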
nemo_skills/dataset/mrcr/prepare.py (outdated):

    output_file = data_dir / f"{setup}.jsonl"

    with open(data_dir / "__init__.py", "w", encoding="utf-8") as init_file:
        init_file.write(f"EVAL_SPLIT = '{setup}'\n")
Best not to override __init__.py dynamically here. Users can always provide a --split argument to change this, so there's no need to change defaults.
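The reviewer's alternative — choosing the split per invocation instead of rewriting __init__.py — could look like this minimal sketch (the flag name mirrors the comment above, but everything else is an assumption, not the actual nemo_skills CLI):

```python
import argparse


def parse_args(argv=None):
    # Hypothetical CLI sketch: the split is selected at run time,
    # so the packaged __init__.py default never needs to be rewritten.
    parser = argparse.ArgumentParser()
    parser.add_argument("--split", default="all", help="eval split to prepare")
    return parser.parse_args(argv)
```

With this shape, the default stays fixed in one place and every run can override it without touching files on disk.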
revert test
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Or is it supposed to use that messages list directly (so complete the last turn of a large multi-turn generation)? In that case you should put it as a list in a "messages" key and then set

Wait, do we support that? I wanted it but I didn't know.

Yes, that should be supported with the parameters I shared.
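A sketch of what such a data point could look like as a JSON line — the "messages" key follows the discussion above, while the turn contents and everything else here are invented for illustration:

```python
import json

# Illustrative data point only: a multi-turn conversation carried in a
# "messages" key, so generation completes the last turn. Field contents
# are placeholders, not real MRCR data.
data_point = {
    "messages": [
        {"role": "user", "content": "First user turn of the long conversation."},
        {"role": "assistant", "content": "First assistant reply."},
        {"role": "user", "content": "Final user turn the model must answer."},
    ],
}
line = json.dumps(data_point)
```

Each such line would be one record in the prepared .jsonl file, with the conversation history kept as a structured list rather than flattened into a single prompt string.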
Signed-off-by: fayejf <fayejf07@gmail.com>
Signed-off-by: fayejf <fayejf07@gmail.com> Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: fayejf <fayejf07@gmail.com> Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
OpenAI MRCR (multi-round co-reference resolution): a long-context, multiple-needle-in-a-haystack benchmark.

By default it prepares all 2400 samples of up to 1M tokens, or you can prepare a subset. Specify an eval split, or use the one saved in __init__.py (the default is all).
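Preparing a subset bounded by context length could be sketched as below. The real script counts tokens exactly (it installs tiktoken for that); this stdlib-only version substitutes a crude characters-per-token estimate and invented names, purely for illustration:

```python
MAX_TOKENS = 1_000_000  # MRCR samples go up to 1M tokens


def within_budget(sample, max_tokens=MAX_TOKENS, chars_per_token=4):
    # Crude estimate: roughly 4 characters per token. The actual
    # prepare.py would count tokens exactly with tiktoken instead.
    n_chars = sum(len(turn["content"]) for turn in sample["messages"])
    return n_chars / chars_per_token <= max_tokens


# Toy data: one short sample and one far over a 128k-token budget.
samples = [
    {"messages": [{"content": "short"}]},
    {"messages": [{"content": "x" * 600_000}]},
]
subset = [s for s in samples if within_budget(s, max_tokens=128_000)]
```

Filtering this way lets a smaller context-window model be evaluated on just the samples it can actually fit.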