Is there a way to split `jsonl` into train, dev on the fly #1018

erogol · 2023-04-06T09:32:37Z

Can I split the JSONL file into train and dev splits on the fly? If not, what would be the suggested way to do that if I didn't split it while creating the JSONL?

desh2608 · 2023-04-06T13:01:46Z
Collaborator

from lhotse import load_manifest_lazy

cs = load_manifest_lazy("<path-to-file.jsonl.gz>")
N = len(cs)

# if you have a pre-defined train/dev split
train_ids = [...]  # placeholder: replace with your list of train cut IDs
dev_ids = [...]  # placeholder: replace with your list of dev cut IDs
cs_train = cs.subset(cut_ids=train_ids)
cs_dev = cs.subset(cut_ids=dev_ids)
cs_train.to_file("<out-train.jsonl.gz>")
cs_dev.to_file("<out-dev.jsonl.gz>")

# if you don't have pre-defined splits
N_dev = 100
N_train = N - N_dev
cs = cs.shuffle()
cs_train = cs.subset(first=N_train)
cs_dev = cs.subset(last=N_dev)
# save to file as before

This should probably do it. You could also just do it using bash tools (such as awk) on the jsonl file.
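For reference, the file-level route can also be sketched in plain Python instead of awk, assuming each line of the manifest is one JSON object. The helper name `split_jsonl` and its parameters are hypothetical and not part of Lhotse. It reads all lines into memory, so it only suits manifests that fit in RAM.

```python
import gzip
import random


def split_jsonl(src, train_out, dev_out, n_dev=100, seed=0):
    """Shuffle the lines of a gzipped JSONL file and write train/dev files.

    Hypothetical helper (not a Lhotse API). Returns (n_train, n_dev).
    """
    with gzip.open(src, "rt", encoding="utf-8") as f:
        # Skip blank lines and make sure every kept line ends with a newline,
        # so lines cannot run together after shuffling.
        lines = [line.rstrip("\n") + "\n" for line in f if line.strip()]

    if not 0 < n_dev < len(lines):
        raise ValueError("n_dev must be at least 1 and smaller than the line count")

    # A seeded RNG makes the split reproducible across runs.
    random.Random(seed).shuffle(lines)
    dev, train = lines[:n_dev], lines[n_dev:]

    for path, chunk in ((train_out, train), (dev_out, dev)):
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.writelines(chunk)

    return len(train), len(dev)
```

Calling `split_jsonl("<path-to-file.jsonl.gz>", "<out-train.jsonl.gz>", "<out-dev.jsonl.gz>")` writes two files whose lines are copied unchanged from the input, so they should load with `load_manifest_lazy` like the original.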

8 replies

erogol Apr 15, 2023
Author

I think even after the fix N = len(cs) gets you

NotImplementedError: LazyFilter does not support __len__ because it would require iterating over the whole iterator, which is not possible in a lazy fashion. If you really need to know the length, convert to eager mode first using `.to_eager()`. Note that this will require loading the whole iterator into memory.

desh2608 Apr 15, 2023
Collaborator

> I think even after the fix N = len(cs) gets you
>
> NotImplementedError: LazyFilter does not support __len__ because it would require iterating over the whole iterator, which is not possible in a lazy fashion. If you really need to know the length, convert to eager mode first using `.to_eager()`. Note that this will require loading the whole iterator into memory.

Can you show the code you are trying to run?

pzelasko Apr 15, 2023
Maintainer

You seem to have used `.filter` somewhere. We don’t support `len` on filtered manifests: if they represented something like 10k hours of speech, counting would take ages, since you don’t know how many elements will be filtered out until you iterate.

pzelasko Apr 15, 2023
Maintainer

As the error says, you can call `.to_eager()` and you’ll be able to use `len` again.
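To illustrate why a lazily filtered manifest has no length until it is made eager, here is a toy sketch in plain Python. `ToyLazyFilter` is a hypothetical stand-in written for this example, not Lhotse's actual `LazyFilter`; it only mimics the behavior described in the error message above, and the durations are made-up values.

```python
class ToyLazyFilter:
    """Hypothetical stand-in for a lazily filtered manifest (not Lhotse code)."""

    def __init__(self, items, predicate):
        self.items = items          # any re-iterable source, e.g. a list
        self.predicate = predicate  # an item is kept when this returns True

    def __iter__(self):
        # Items are tested one at a time, only when someone iterates.
        return (x for x in self.items if self.predicate(x))

    def __len__(self):
        # The count is unknown until every item has been tested.
        raise NotImplementedError("length unknown; call .to_eager() first")

    def to_eager(self):
        # Materialize the surviving items in memory. A comprehension is used
        # because list(self) would probe __len__ for a size hint and re-raise.
        return [x for x in self]


durations = [1.5, 12.0, 3.2, 30.0, 7.8]  # made-up cut durations in seconds
short = ToyLazyFilter(durations, lambda d: d < 10.0)

try:
    len(short)
except NotImplementedError as err:
    print("lazy:", err)

eager = short.to_eager()
print("eager length:", len(eager))
```

The trade-off is the one the error message names: the eager copy knows its length but holds every surviving item in memory.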

Answer selected by erogol
