Releases · lhotse-speech/lhotse

08 Dec 18:26

pzelasko

v1.11

2332b7e

v1.11 - Llama Llama Red Pajama

This release adds three new recipes; the remaining changes are mostly bug fixes.

What's Changed

[recipe] DiPCo -- dinner party corpus from Amazon by @desh2608 in #893
[recipe] CHiME-6 dinner party corpus by @desh2608 in #895
[recipe] Add xbmu amdo31 by @sendream in #902
Shar: allow per node+worker randomization of shards order by @pzelasko in #905
Shar: fix shuffling/splitting when cut_map_fn is provided to CutSet.from_shar by @pzelasko in #907
Shar: tracking epochs in shard iterator with option for shard re-shuffling each epoch by @pzelasko in #894
Shar: missing param in CutSet.from_shar + better error msg by @pzelasko in #901
Fix an edge case with BucketingSampler and a small amount of data/buckets by @pzelasko in #898
Remove some deprecated methods by @desh2608 in #900
More details in cuts.describe() + fix for trim_to_unsupervised_segments() by @desh2608 in #899
Fix save_audios by @pkufool in #896
Fix audio save for parallel workers by @pkufool in #903
Fix bug in load audio (multi-channel) by @desh2608 in #906
Fix SNR sampling error in CutSet.mix by @pzelasko in #915
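
The Shar entries above mention per node+worker randomization of shard order and re-shuffling shards each epoch. The general idea can be sketched with the standard library alone. This is a minimal illustration, not the code from #905 or #894: the function name `shard_order` and its parameters (`seed`, `epoch`, `rank`, `worker_id`, `per_worker`) are hypothetical and chosen only for this example.

```python
import random


def shard_order(shards, seed, epoch, rank=0, worker_id=0, per_worker=False):
    """Return a deterministic permutation of `shards` (hypothetical helper).

    The permutation changes with `epoch`, so every epoch sees a new order.
    With per_worker=True it also depends on the node rank and the data-loading
    worker id, so each worker walks the shards in its own order.
    """
    key = f"{seed}-{epoch}"
    if per_worker:
        key += f"-{rank}-{worker_id}"
    rng = random.Random(key)  # string seeds are hashed deterministically
    order = list(shards)      # copy, so the caller's list is left untouched
    rng.shuffle(order)
    return order


shards = [f"shard-{i:03d}.tar" for i in range(8)]
# Same seed and epoch: the same order on every call and in every process.
epoch0 = shard_order(shards, seed=42, epoch=0)
# Per-worker order for node rank 1, data-loading worker 3, in epoch 1.
worker_view = shard_order(shards, seed=42, epoch=1, rank=1, worker_id=3, per_worker=True)
```

Deriving the random state from the epoch, rather than keeping one long-lived generator, means a resumed job can reproduce the order of any epoch without replaying the earlier ones.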

New Contributors

@sendream made their first contribution in #902

Full Changelog: v1.10...v1.11

Contributors

desh2608, pkufool, and 2 other contributors


16 Nov 20:24

pzelasko

v1.10

91cb71b

v1.10 - Lhotse Shar

[experimental] Lhotse Shar -- a modular, sharded, sequential I/O data storage format

This release introduces a major (experimental) feature called Lhotse Shar. It is a data format inspired by WebDataset tar files and is intended to be very fast for sequential reading of data stored in tarfile shards. It extends the ideas of WebDataset by storing multiple types of features and metadata in separate tar archives that are iterated and loaded together with cuts. This makes it possible to extend existing data with new fields (think different feature extractors, alignments, embeddings, etc.) without triggering a hard copy, as would be the case with the sequential formats previously supported by Lhotse. Preliminary benchmarking indicated it is as fast as WebDataset both with local disks and cloud storage.
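
The idea of keeping each field in its own tar archive and reading the archives in lockstep can be illustrated with a small standard-library sketch. It does not use the Lhotse API and does not reproduce the actual Shar on-disk layout: the file names (`cuts.000000.tar`, `audio.000000.tar`, `embedding.000000.tar`) and the helpers `write_shard`, `read_shard`, `iterate_fields`, and `build_and_read_demo` are assumptions made only for this example.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_shard(path, items):
    """Write (name, bytes) pairs into one uncompressed tar shard, in order."""
    with tarfile.open(path, "w") as tar:
        for name, payload in items:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def read_shard(path):
    """Yield (name, bytes) pairs sequentially from a tar shard."""
    with tarfile.open(path, "r") as tar:
        for member in tar:
            yield member.name, tar.extractfile(member).read()


def iterate_fields(shard_dir, fields):
    """Read shard 000000 of every field in lockstep and merge items by position."""
    streams = [read_shard(Path(shard_dir) / f"{field}.000000.tar") for field in fields]
    for records in zip(*streams):
        ids = {Path(name).stem for name, _ in records}
        assert len(ids) == 1, f"shards out of sync: {ids}"
        yield ids.pop(), {field: data for field, (_, data) in zip(fields, records)}


def build_and_read_demo():
    """Write three single-field shards, then read them back together."""
    ids = ["utt1", "utt2"]
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_shard(root / "cuts.000000.tar",
                    [(f"{i}.json", json.dumps({"id": i}).encode()) for i in ids])
        write_shard(root / "audio.000000.tar",
                    [(f"{i}.wav", b"fake-audio-" + i.encode()) for i in ids])
        # A field added later is just one more tar file; the shards written
        # above are neither rewritten nor copied.
        write_shard(root / "embedding.000000.tar",
                    [(f"{i}.emb", bytes([len(i)])) for i in ids])
        return list(iterate_fields(root, ["cuts", "audio", "embedding"]))


merged = build_and_read_demo()
```

Because every archive is read front to back exactly once, the access pattern stays sequential no matter how many fields are attached, and the only requirement is that all field archives list their items in the same order.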

A tutorial notebook about Lhotse Shar is planned to be released later this year.

What's Changed

Sharded tar writers for Lhotse Shar format by @pzelasko in #850
load ark directly in KaldiReader by @csukuangfj in #862
Add a concrete example showing how to import a Kaldi data directory by @csukuangfj in #864
Fixing shuffling of CutSet with a single cut by @Tomiinek in #869
Fixed an erroneous assertion by @JinZr in #874
Small changes to make channel attribute hashable by @desh2608 in #875
Safe extract tarballs by @desh2608 in #876
Shar: tarfiles now also contain metadata by @pzelasko in #870
Shar: support dynamically attaching custom non-data attributes by @pzelasko in #877
Option not to save cuts in SharWriter by @pzelasko in #878
Minor changes in some recipes by @desh2608 in #880
add ssl feature extractor by @DongjiGao in #881
Shar: a way to attach shard-specific metadata to cuts from each shard by @pzelasko in #884
Always return integer sampling rate when reading audio by @pzelasko in #885
Add option to split AMI segments similar to Kaldi by @desh2608 in #889
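
The "Safe extract tarballs" entry above concerns extracting downloaded archives safely. A common way to do this is to reject archive members whose paths would land outside the destination directory. The sketch below shows that generic technique under the assumption that path traversal is the concern. It is not the code from #876, and `is_within_directory`, `safe_extract`, `make_tar`, and `try_extract` are names chosen for this example.

```python
import io
import os
import tarfile
import tempfile


def is_within_directory(directory, target):
    """Return True if `target` resolves to a path inside `directory`."""
    abs_dir = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_dir, abs_target]) == abs_dir


def safe_extract(tar, path="."):
    """Extract `tar` into `path`, refusing members that would escape it.

    Only member names are checked here; symlink and hardlink targets are
    not validated in this sketch.
    """
    for member in tar.getmembers():
        if not is_within_directory(path, os.path.join(path, member.name)):
            raise ValueError(f"Blocked path traversal attempt: {member.name}")
    tar.extractall(path)


def make_tar(names):
    """Build an in-memory tar archive with one small file per name (demo helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            data = b"hello"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return tarfile.open(fileobj=buf, mode="r")


def try_extract(names):
    """Return True if the archive was extracted, False if it was rejected."""
    with tempfile.TemporaryDirectory() as dest, make_tar(names) as tar:
        try:
            safe_extract(tar, dest)
        except ValueError:
            return False
        return sorted(os.listdir(dest)) == sorted(names)


ok = try_extract(["a.txt", "b.txt"])      # ordinary members are extracted
blocked = try_extract(["../evil.txt"])    # a member escaping the target is rejected
```

Recent Python versions also provide extraction filters in `tarfile` that perform similar checks; the explicit loop is kept here so the rule being enforced is visible.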

New Contributors

@JinZr made their first contribution in #874
@DongjiGao made their first contribution in #881

Full Changelog: v1.9...v1.10

Contributors

csukuangfj, desh2608, and 4 other contributors


20 Oct 18:32

pzelasko

v1.9

7d9fd0d

v1.9 Neighboring Peaks

Major features

MultiCut data type: simplifies working with multi-channel data (contribution from @desh2608)
CSJ recipe (contribution from @teowenshen)
lots of bug fixes

What's Changed

create proper wav_id in the segments file for multichannel recording by @jtrmal in #831
kaldi: add a switch/option to read the durations from kaldi utt2dur … by @jtrmal in #832
Update test packages by @pzelasko in #837
MultiCut to store multi-channel recordings with shared supervision by @desh2608 in #822
Use CutSet for whisper annotation workflow by @desh2608 in #834
use spawn() as the strategy to prevent heisenbug by @jtrmal in #841
Compatibility for reading alignments saved before Lhotse v1.8 by @pzelasko in #842
make regexp string raw by @jtrmal in #836
Use absolute recording paths in yesno recipe by @pzelasko in #845
Fix CutSet.compute_and_store_features support for lazy CutSets by @pzelasko in #844
Fixing some QA functions for lazy manifests by @desh2608 in #848
Fix timestamps in Whisper annotation workflow by @pzelasko in #847
Update supervisions channels in multi-channel recipes by @desh2608 in #838
Allow retaining or trimming channels in trim_to_supervisions by @desh2608 in #852
Match cut_id to utt_id if there is exactly one supervision per cut by @wgb14 in #853
forced alignment: use num2words to get word timestamps for numbers by @eschmidbauer in #849
Prepare CSJ by @teowenshen in #851
Small changes in trim_to_supervisions() by @desh2608 in #855
Fix checkpoints of samplers that were iterated over more than once within the same epoch by @pzelasko in #854
Update fisher_english.py by @maxlvov in #858

New Contributors

@eschmidbauer made their first contribution in #849
@teowenshen made their first contribution in #851
@maxlvov made their first contribution in #858

Full Changelog: v1.8...v1.9

Contributors

desh2608, eschmidbauer, and 5 other contributors


30 Sep 13:18

pzelasko

v1.8

8db6a02

v1.8 Sudden Avalanche

Breaking changes

Python 3.6 is no longer supported as of Lhotse v1.8. If you need Python 3.6, please use Lhotse 1.7 or earlier.

Highlights

Lhotse has a new experimental module, workflows, which integrates optional third-party packages that assist corpus creators with automated data curation. As of release 1.8, it supports OpenAI Whisper for automatic transcription and segmentation, and torchaudio Wav2Vec2/HuBERT ASR bundles for forced alignment.

What's Changed

Fix read and write in piped CLI by @desh2608 in #807
Default behavior of CutSet.mix by @ZuoyunZheng in #809
Adding more info about resampling options by @RuABraun in #815
Add pad_silence option to extend_by by @desh2608 in #816
Message when calling len() on LazyFilter by @desh2608 in #817
Refactor cut and retain git blame history by @desh2608 in #820
Audio backend refactoring and a workaround for FLAC reading from/writing to in-memory buffers by @pzelasko in #814
Experimental Lhotse feature: corpus creation tools (workflows), starting with OpenAI Whisper support by @pzelasko in #824
Drop support for Python 3.6 by @pzelasko in #829
[workflow] Word-level forced alignment with pretrained models from Torchaudio by @pzelasko in #827

New Contributors

@ZuoyunZheng made their first contribution in #809

Full Changelog: v1.7...v1.8

Contributors

desh2608, RuABraun, and 2 other contributors


12 Sep 21:38

pzelasko

v1.7

695abb6

v1.7 - Rejuvenation Potion

What's Changed

add test data to bvcc by @oplatek in #797
Add reverb with fast RIR generator by @desh2608 in #799
Support snip_edges=True in online_inference of Kaldi feature extractors by @pzelasko in #802
Remove warning about Lhotse not being stable from README.md by @pzelasko in #804
Update the documentation related to optional packages by @pzelasko in #805

Full Changelog: v1.6...v1.7

Contributors

oplatek, desh2608, and pzelasko


27 Aug 21:13

pzelasko

v1.6

5e734e5

1.6 - Frozen Palm Tree

What's Changed

Feature/fix 754 voxceleb download by @mikuchar in #776
Support Kaldi data directories without segments file. by @MartinKocour in #789
Add normalization for text of multi_cn recipe: thchs_30, tal_csasr, tal_asr, aishell, aishell2, etc. by @shanguanma in #760
Improve support for custom Recordings by @pzelasko in #791
Add Cut.has(field) method to query Cuts for custom attributes by @pzelasko in #792
Add normalization for aishell2 recipe by @shanguanma in #790

New Contributors

@mikuchar made their first contribution in #776
@MartinKocour made their first contribution in #789

Full Changelog: 1.5...v1.6

Contributors

MartinKocour, pzelasko, and 2 other contributors


09 Aug 01:04

pzelasko

1.5

445cf01

1.5 - Little Leaf

What's Changed

Describe more information about cuts by @pzelasko in #772
Change vctk.py to adapt the vctk dataset downloaded from edinburgh url by @luomingshuang in #775
Fix restoring sampler state with world_size>1 by @pzelasko in #773
Revert #738 to use aidatatang as the prefix for aidatatang_200zh. by @csukuangfj in #782
use tolerance when checking duration mismatch by @shaynemei in #781

New Contributors

@shaynemei made their first contribution in #781

Full Changelog: v1.4...v1.5

Contributors

csukuangfj, pzelasko, and 2 other contributors


07 Jul 01:08

pzelasko

1.4

609af97

v1.4 - Candescent Crust

What's Changed

Fix lambda warnings from lazy manifests + leverage dill if installed for pickling lambdas by @pzelasko in #748
multi_cn recipes: aishell2, magicdata, primewords, stcmds, tal_asr, tal_csasr, thchs_30 by @shanguanma in #738
Deprecate strict, proportional_sampling, and bucket_method arguments by @pzelasko in #756
Fix lhotse cut simple CLI by @pzelasko in #759
Fix issues with eager CutSet creation from lazy manifests by @pzelasko in #763
DailyTalk recipe by @pzelasko in #767
add aishell2 dev test by @yuekaizhang in #766
Enable GlobalMVN computation with on-the-fly feature extraction by @pzelasko in #769
Add support for Python 3.10 and PyTorch 1.12 by @pzelasko in #764

New Contributors

@yuekaizhang made their first contribution in #766

Full Changelog: v1.3...1.4

Contributors

pzelasko, yuekaizhang, and shanguanma


11 Jun 03:35

pzelasko

v1.3

4d22c32

v1.3 - Curiously Inviting Icicles

What's Changed

Fix plotting MixedCut audio tracks by @pzelasko in #723
[continued] Fixes Bucketing sampler equal duration method that drops cuts by @m-wiesner in #724
feature extraction will read RecordingSet from a file, not just json. by @RuABraun in #728
Use lilcom_chunky as default in CLI by @pzelasko in #729
Set CLI torch number of threads to 1 by @pzelasko in #732
Update wenet_speech.py by @fanlu in #731
Fix heroico regex strings by @jtrmal in #734
Update mgb2.py by @AmirHussein96 in #725
Remove file handle caching from LilcomChunkyReader by @pzelasko in #737
Make h5py an optional dependency by @pzelasko in #741
Assert CutSet.mix() argument cuts is not a lazy manifest by @pzelasko in #742
CutSet: more methods are lazy + two simplified common use-cases attach_tensor and load_audio by @pzelasko in #744
Collections: support reading from/writing to "-" (including webdataset) by @pzelasko in #745
fix CommonVoice prepare by @mohsen-goodarzi in #743

New Contributors

@RuABraun made their first contribution in #728
@mohsen-goodarzi made their first contribution in #743

Full Changelog: v1.2...v1.3

Contributors

fanlu, jtrmal, and 5 other contributors


19 May 17:09

pzelasko

v1.2

024890f

v1.2 - Winter in the South

New Recipes

Adding lhotse recipe to prepare eval2000 data by @GoVivace in #679
adding Earnings-21 dataset from rev-dot-com by @jtrmal in #709
Adding the second revdotcom's earnings corpus by @jtrmal in #713
MGB2 recipe by @AmirHussein96 in #718

What's Changed

Fix import namespaces by @pzelasko in #698
.repeat(..., preserve_id=...) option for repeating manifests by @pzelasko in #699
Kaldi impex: remove invalid test by @jtrmal in #700
Minor fix in base url for AliMeeting download by @desh2608 in #702
[aidatatang_200zh] Avoid being converted to ASCII when preparing manifest by @luomingshuang in #703
[ali_meeting] Fix some path errors for ali_meeting.py by @luomingshuang in #705
[ali_meeting] Avoid being converted to ASCII by @luomingshuang in #704
Test for webdataset data de-duplication across ranks by @pzelasko in #706
Fixing data duplication with WebDataset in multi-node multi-worker training by @pzelasko in #707
Fix epoch setting for WebDataset shard shuffling by @pzelasko in #708
Full shard shuffling with webdataset by @pzelasko in #711
Raise an error when BucketingSampler is used with a lazy CutSet by @pzelasko in #710
Normalize output path names for recipes by @desh2608 in #712
[webdataset] Add shard of origin to Cut.shard_origin custom field by @pzelasko in #714
Update examples of combining datasets with RoundRobinSampler and add stop_early option. by @pzelasko in #716
pre-commit, isort + CI checks + running it on all code by @pzelasko in #720

New Contributors

@GoVivace made their first contribution in #679
@AmirHussein96 made their first contribution in #718

Full Changelog: v1.1...v1.2

Contributors

desh2608, jtrmal, and 4 other contributors
