Add RecordingChunkIterableDataset #985

pzelasko · 2023-02-21T14:57:33Z

CC @danpovey this is what I suggest to efficiently load chunks of long recordings. I just need to add support for chunk_shift to fit your requirements.

Resolves #983
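
For readers unfamiliar with the terms, here is a minimal sketch of how a long recording might be cut into fixed-size chunks with an optional `chunk_shift`. It is not the code added in this PR: the function name `iter_chunk_offsets` and its signature are hypothetical, and it assumes that `chunk_shift` defaults to `chunk_size` (no overlap) and that the final chunk is truncated at the end of the recording.

```python
from typing import Iterator, Optional, Tuple


def iter_chunk_offsets(
    total_duration: float,
    chunk_size: float,
    chunk_shift: Optional[float] = None,
) -> Iterator[Tuple[float, float]]:
    """Yield (start, duration) pairs covering a recording of `total_duration` seconds.

    Hypothetical helper for illustration only; not part of any library API.
    """
    if chunk_shift is None:
        # Assumption: without an explicit shift, chunks do not overlap.
        chunk_shift = chunk_size
    if chunk_size <= 0 or chunk_shift <= 0:
        raise ValueError("chunk_size and chunk_shift must be positive")

    index = 0
    while True:
        # Multiply instead of accumulating to avoid floating-point drift.
        start = index * chunk_shift
        if start >= total_duration:
            # Stopping here avoids emitting an empty trailing chunk.
            break
        duration = min(chunk_size, total_duration - start)
        yield start, duration
        index += 1


# A 12 s recording, 5 s chunks, 4 s shift: consecutive chunks overlap by 1 s.
chunks = list(iter_chunk_offsets(12.0, chunk_size=5.0, chunk_shift=4.0))
```

A shift smaller than the chunk size produces overlapping chunks, so the same audio is read more than once; a shift equal to the chunk size tiles the recording without overlap.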

pzelasko · 2023-02-21T20:28:35Z

It scales pretty much linearly. On a local SSD with 4 workers I'm loading about 3h of audio per second of wall-clock time (batch_size=32 * chunk_size=5s * 91 it/s / 3600s ~= 3h 14min, some of it overlapped). Not a bad result...
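
The throughput arithmetic in the comment above can be sketched as follows. The helper name is hypothetical. Counting every 5 s chunk in full, the quoted figures give roughly 4 h of audio per second; the 4 s shift used for the second figure is an assumption (the comment only says some audio overlapped), chosen because it reproduces a value close to the quoted 3h 14min of distinct audio.

```python
def audio_hours_per_second(
    batch_size: int,
    seconds_per_item: float,
    iterations_per_second: float,
) -> float:
    """Hours of audio delivered per wall-clock second (hypothetical helper)."""
    return batch_size * seconds_per_item * iterations_per_second / 3600.0


# Raw throughput: every 5 s chunk counted in full, overlap included.
raw = audio_hours_per_second(32, 5.0, 91.0)

# Distinct audio only, ASSUMING a 4 s chunk_shift (not stated in the comment):
# each new chunk then contributes 4 s of previously unseen audio.
unique = audio_hours_per_second(32, 4.0, 91.0)

print(f"raw: {raw:.2f} h/s, unique (assumed 4 s shift): {unique:.2f} h/s")
```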

pzelasko added 4 commits February 21, 2023 09:54

Add RecordingChunkIterableDataset

146db16

Merge branch 'master' into feature/mixed-cut-move-to-memory

8347f8c

Add support for chunk_shift

2326057

Merge remote-tracking branch 'origin/feature/mixed-cut-move-to-memory…

dac0fe2

…' into feature/mixed-cut-move-to-memory

pzelasko marked this pull request as ready for review February 21, 2023 18:44

pzelasko added 4 commits February 21, 2023 13:45

Fix unit tests for earlier PyTorch versions

9f29f50

Remove unused code

45c3ae6

Handle an edge case that would result in an empty chunk

d9f3cc1

Disable the unit test for earlier PyTorch versions

2b02721

pzelasko merged commit 9fb25d6 into master Feb 21, 2023

pzelasko deleted the feature/mixed-cut-move-to-memory branch February 21, 2023 20:31

pzelasko added this to the v1.13 milestone Feb 21, 2023

pzelasko mentioned this pull request Feb 21, 2023

Large audio files processing #983

Closed
