Can I split the JSON file into train and dev splits on the fly? Or what would be the suggested way to do that if I didn't do it while creating the JSON?
Answered by
pzelasko
Apr 15, 2023
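from lhotse import load_manifest_lazy
# len() may not be available on a lazily opened manifest; to_eager() reads it into memory
cs = load_manifest_lazy("<path-to-file.jsonl.gz>").to_eager()
N = len(cs)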
# if you have a pre-defined train/dev split
train_ids = [...]  # placeholder: list of train cut IDs
dev_ids = [...]  # placeholder: list of dev cut IDs
cs_train = cs.subset(cut_ids=train_ids)
cs_dev = cs.subset(cut_ids=dev_ids)
cs_train.to_file("<out-train.jsonl.gz>")
cs_dev.to_file("<out-dev.jsonl.gz>")
# if you don't have pre-defined splits
N_dev = 100
N_train = N - N_dev
cs = cs.shuffle()
cs_train = cs.subset(first=N_train)
cs_dev = cs.subset(last=N_dev)
# save to file as before

This should probably do it. You could also just do it using bash tools (such as awk) on the jsonl file.
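The line-based route mentioned above (bash tools on the jsonl file) can also be sketched in plain Python, since each line of a JSONL manifest is one record. This is only a minimal sketch, not part of Lhotse: the function name `split_jsonl_gz`, its parameters, and the seeded shuffle are assumptions made for illustration.

```python
import gzip
import random


def split_jsonl_gz(src, train_out, dev_out, n_dev=100, seed=0):
    """Shuffle the records of a gzipped JSONL file and write train/dev files.

    Hypothetical helper: it treats every non-empty line as one record and
    never parses the JSON, so it works for any line-oriented manifest.
    """
    with gzip.open(src, "rt", encoding="utf-8") as f:
        # Normalize line endings so the last record cannot merge with another.
        lines = [line.rstrip("\n") + "\n" for line in f if line.strip()]

    if not 0 < n_dev < len(lines):
        raise ValueError("n_dev must be at least 1 and smaller than the record count")

    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(lines)
    dev, train = lines[:n_dev], lines[n_dev:]

    for path, chunk in ((train_out, train), (dev_out, dev)):
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.writelines(chunk)
    return len(train), len(dev)
```

The whole file is read into memory here, which is fine for manifests of moderate size; for very large files a streaming approach would be a better fit.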
As the error says, you can call .to_eager() and you'll be able to use len() again.
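Why a lazy object may refuse len() can be illustrated with a plain-Python analogy. This is not Lhotse's implementation, only a sketch of the general idea: a generator yields items one at a time and has no known length until it is materialized. The name `iter_records` is made up for the example.

```python
def iter_records():
    # Stand-in for a lazily read manifest: records are produced one at a time.
    for i in range(5):
        yield {"id": f"cut-{i}"}


lazy = iter_records()
try:
    len(lazy)
    lazy_has_len = True
except TypeError:
    # Generators do not define __len__, so len() raises TypeError.
    lazy_has_len = False

# Materializing everything in memory (the "eager" form) makes len() available.
eager = list(iter_records())
n = len(eager)
```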