Releases: lhotse-speech/lhotse
v1.11 - Llama Llama Red Pajama
This release has three new recipes and mostly bug fixes.
What's Changed
- [recipe] DiPCo -- dinner party corpus from Amazon by @desh2608 in #893
- [recipe] CHiME-6 dinner party corpus by @desh2608 in #895
- [recipe] Add xbmu amdo31 by @sendream in #902
- Shar: allow per node+worker randomization of shards order by @pzelasko in #905
- Shar: fix shuffling/splitting when cut_map_fn is provided to CutSet.from_shar by @pzelasko in #907
- Shar: tracking epochs in shard iterator with option for shard re-shuffling each epoch by @pzelasko in #894
- Shar: missing param in CutSet.from_shar + better error msg by @pzelasko in #901
- Fix an edge case with BucketingSampler and a small amount data/buckets by @pzelasko in #898
- Remove some deprecated methods by @desh2608 in #900
- More details in cuts.describe() + fix for trim_to_unsupervised_segments() by @desh2608 in #899
- Fix save_audios by @pkufool in #896
- Fix audio save for parallel workers by @pkufool in #903
- Fix bug in load audio (multi-channel) by @desh2608 in #906
- Fix SNR sampling error in CutSet.mix by @pzelasko in #915
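The Shar shard-ordering changes above (#905, #894) can be pictured with a small standard-library sketch. This is not Lhotse's implementation: the function name shard_order, its parameters, and the way the seed is derived are assumptions made for illustration. It shows one way to give every (node, worker) pair its own reproducible shard order that changes from epoch to epoch.

```python
import random


def shard_order(shards, base_seed, rank, worker_id, epoch):
    """Return a reproducible shuffle of shards for one (rank, worker, epoch) triple.

    Hypothetical helper: the seed derivation below is an assumption,
    not the scheme used inside Lhotse.
    """
    # Fold all identifiers into one string seed so each triple gets its own RNG stream.
    rng = random.Random(f"{base_seed}-{rank}-{worker_id}-{epoch}")
    order = list(shards)  # copy, so the caller's list is left untouched
    rng.shuffle(order)
    return order


shards = [f"shard-{i:03d}.tar" for i in range(8)]
first_epoch = shard_order(shards, base_seed=0, rank=0, worker_id=0, epoch=0)
second_epoch = shard_order(shards, base_seed=0, rank=0, worker_id=0, epoch=1)
```

Because the order depends only on the seed and the identifiers, a resumed job can rebuild the same order for a given epoch without storing it.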
New Contributors
Full Changelog: v1.10...v1.11
v1.10 - Lhotse Shar
[experimental] Lhotse Shar -- a modular, sharded, sequential I/O data storage format
This release has a major (experimental) feature called Lhotse Shar. It is a data format inspired by WebDataset tar files, intended to be very fast for sequential reading of data stored in tarfile shards. It extends the ideas of WebDataset by storing multiple types of features and metadata in separate tar archives that are iterated and loaded together with cuts. This makes it possible to extend existing data with new fields (think different feature extractors, alignments, embeddings, etc.) without triggering a hard copy, as would be the case with previous sequential formats supported by Lhotse. Preliminary benchmarking indicated it is as fast as WebDataset both with local disks and cloud storage.
A tutorial notebook about Lhotse Shar is planned to be released later this year.
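To make the idea of separate, parallel tar archives concrete, here is a minimal standard-library sketch. It is not Lhotse's API or on-disk layout: the field names (cuts, features, embeddings), the helper functions, and the in-memory archives are all assumptions made for illustration. It only demonstrates the two properties described above: archives that are read sequentially in lockstep, and a new field added as one more archive without rewriting the existing ones.

```python
import io
import tarfile


def write_shard(items):
    """Pack (name, bytes) pairs into an in-memory tar archive, preserving order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in items:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_shard(data):
    """Yield (name, bytes) pairs from a tar archive in the order they were stored."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as tar:
        for member in tar:
            yield member.name, tar.extractfile(member).read()


def iterate_together(shards):
    """Read several per-field archives in lockstep.

    Every archive is assumed to hold the same item names in the same order.
    """
    readers = {field: read_shard(data) for field, data in shards.items()}
    for rows in zip(*readers.values()):
        names = {name for name, _ in rows}
        if len(names) != 1:
            raise ValueError(f"shards out of sync: {sorted(names)}")
        fields = {field: payload for field, (_, payload) in zip(readers, rows)}
        yield names.pop(), fields


shards = {
    "cuts": write_shard([("utt1", b'{"duration": 1.5}'), ("utt2", b'{"duration": 2.0}')]),
    "features": write_shard([("utt1", b"\x00\x01"), ("utt2", b"\x02\x03")]),
}
# Adding a field later only means writing one more archive; the others stay as they are.
shards["embeddings"] = write_shard([("utt1", b"e1"), ("utt2", b"e2")])

examples = list(iterate_together(shards))
```

The lockstep check is the price of the design: because fields live in different files, the reader has to verify that they stay aligned.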
What's Changed
- Sharded tar writers for Lhotse Shar format by @pzelasko in #850
- load ark directly in KaldiReader by @csukuangfj in #862
- Add a concrete example showing how to import a Kaldi data directory by @csukuangfj in #864
- Fixing shuffling of CutSet with a single cut by @Tomiinek in #869
- Fixed an erroneous assertion by @JinZr in #874
- Small changes to make channel attribute hashable by @desh2608 in #875
- Safe extract tarballs by @desh2608 in #876
- Shar: tarfiles now also contain metadata by @pzelasko in #870
- Shar: support dynamically attaching custom non-data attributes by @pzelasko in #877
- Option not to save cuts in SharWriter by @pzelasko in #878
- Minor changes in some recipes by @desh2608 in #880
- add ssl feature extractor by @DongjiGao in #881
- Shar: a way to attach shard-specific metadata to cuts from each shard by @pzelasko in #884
- Always return integer sampling rate when reading audio by @pzelasko in #885
- Add option to split AMI segments similar to Kaldi by @desh2608 in #889
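For readers unfamiliar with why tarball extraction needs to be "safe" (#876): an archive member named with ../ components can be written outside the target directory. The following is a minimal sketch of one common guard, not the code from that PR; the function name safe_extract and the demo helper make_tar are hypothetical. It checks member names only, so it does not by itself cover symlink tricks.

```python
import io
import os
import tarfile
import tempfile


def safe_extract(tar, dest):
    """Extract tar into dest, refusing members whose path would land outside of it."""
    dest_root = os.path.realpath(dest)
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(dest_root, member.name))
        if os.path.commonpath([dest_root, target]) != dest_root:
            raise ValueError(f"unsafe path in archive: {member.name}")
    # Newer Python versions ship extraction filters; use the "data" filter when available.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    tar.extractall(dest_root, **kwargs)


def make_tar(names):
    """Demo helper: build an in-memory archive with a two-byte file per name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            info = tarfile.TarInfo(name=name)
            info.size = 2
            tar.addfile(info, io.BytesIO(b"hi"))
    buf.seek(0)
    return tarfile.open(fileobj=buf, mode="r")


with tempfile.TemporaryDirectory() as tmp:
    safe_extract(make_tar(["ok.txt"]), tmp)
    extracted = os.listdir(tmp)
    try:
        safe_extract(make_tar(["../evil.txt"]), tmp)
        rejected = False
    except ValueError:
        rejected = True
```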
New Contributors
- @JinZr made their first contribution in #874
- @DongjiGao made their first contribution in #881
Full Changelog: v1.9...v1.10
v1.9 Neighboring Peaks
Major features
- MultiCut data type: simplifies working with multi-channel data (contribution from @desh2608)
- CSJ recipe (contribution from @teowenshen)
- lots of bug fixes
What's Changed
- create proper wav_id in the segments file for multichannel recording by @jtrmal in #831
- kaldi: add a switch/option to read the durations from kaldi utt2dur … by @jtrmal in #832
- Update test packages by @pzelasko in #837
- MultiCut to store multi-channel recordings with shared supervision by @desh2608 in #822
- Use CutSet for whisper annotation workflow by @desh2608 in #834
- use spawn() as the strategy to prevent heisenbug by @jtrmal in #841
- Compatibility for reading alignments saved before Lhotse v1.8 by @pzelasko in #842
- make regexp string raw by @jtrmal in #836
- Use absolute recording paths in yesno recipe by @pzelasko in #845
- Fix CutSet.compute_and_store_features support for lazy CutSets by @pzelasko in #844
- Fixing some QA functions for lazy manifests by @desh2608 in #848
- Fix timestamps in Whisper annotation workflow by @pzelasko in #847
- Update supervisions channels in multi-channel recipes by @desh2608 in #838
- Allow retaining or trimming channels in trim_to_supervisions by @desh2608 in #852
- Match cut_id to utt_id if there is exactly one supervision per cut by @wgb14 in #853
- forced alignment: use num2words to get word timestamps for numbers by @eschmidbauer in #849
- Prepare CSJ by @teowenshen in #851
- Small changes in trim_to_supervisions() by @desh2608 in #855
- Fix checkpoints of samplers that were iterated over more than once within the same epoch by @pzelasko in #854
- Update fisher_english.py by @maxlvov in #858
New Contributors
- @eschmidbauer made their first contribution in #849
- @teowenshen made their first contribution in #851
- @maxlvov made their first contribution in #858
Full Changelog: v1.8...v1.9
v1.8 Sudden Avalanche
Breaking changes
- Python 3.6 is no longer supported as of Lhotse v1.8. If you need to use Python 3.6, please revert to Lhotse 1.7 and earlier.
Highlights
- A new experimental Lhotse module, workflows, integrates optional third-party packages that assist corpus creators in automated data curation. With release 1.8, we support OpenAI Whisper for automatic transcription and segmentation, and torchaudio Wav2Vec2/Hubert ASR bundles for forced alignment.
What's Changed
- Fix read and write in piped CLI by @desh2608 in #807
- Default behavior of CutSet.mix by @ZuoyunZheng in #809
- Adding more info about resampling options by @RuABraun in #815
- Add pad_silence option to extend_by by @desh2608 in #816
- Message when calling len() on LazyFilter by @desh2608 in #817
- Refactor cut and retain git blame history by @desh2608 in #820
- Audio backend refactoring and a workaround for FLAC reading from/writing to in-memory buffers by @pzelasko in #814
- Experimental Lhotse feature: corpus creation tools (workflows), starting with OpenAI Whisper support by @pzelasko in #824
- Drop support for Python 3.6 by @pzelasko in #829
- [workflow] Word-level forced alignment with pretrained models from Torchaudio by @pzelasko in #827
New Contributors
- @ZuoyunZheng made their first contribution in #809
Full Changelog: v1.7...v1.8
v1.7 - Rejuvenation Potion
What's Changed
- add test data to bvcc by @oplatek in #797
- Add reverb with fast RIR generator by @desh2608 in #799
- Support snip_edges=True in online_inference of Kaldi feature extractors by @pzelasko in #802
- Remove warning about Lhotse not being stable from README.md by @pzelasko in #804
- Update the documentation related to optional packages by @pzelasko in #805
Full Changelog: v1.6...v1.7
1.6 - Frozen Palm Tree
What's Changed
- Feature/fix 754 voxceleb download by @mikuchar in #776
- Support Kaldi data directories without segments file. by @MartinKocour in #789
- Add normalization for text of multi_cn recipe: thchs_30, tal_csasr, tal_asr, aishell, aishell2, etc. by @shanguanma in #760
- Improve support for custom Recordings by @pzelasko in #791
- Add Cut.has(field) method to query Cuts for custom attributes by @pzelasko in #792
- Add normalization for aishell2 recipe by @shanguanma in #790
New Contributors
- @mikuchar made their first contribution in #776
- @MartinKocour made their first contribution in #789
Full Changelog: 1.5...v1.6
1.5 - Little Leaf
What's Changed
- Describe more information about cuts by @pzelasko in #772
- Change vctk.py to adapt the vctk dataset downloaded from edinburgh url by @luomingshuang in #775
- Fix restoring sampler state with world_size>1 by @pzelasko in #773
- Revert #738 to use aidatatang as the prefix for aidatatang_200zh. by @csukuangfj in #782
- use tolerance when checking duration mismatch by @shaynemei in #781
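The last item (#781) concerns comparing durations with a tolerance instead of exact equality, which matters because durations derived from sample counts rarely match a stored floating-point value bit for bit. A minimal sketch of the idea follows; the function name durations_match and the default tolerance of 0.025 seconds are assumptions for illustration, not values taken from Lhotse.

```python
import math


def durations_match(expected, actual, tolerance=0.025):
    """Return True if two durations (in seconds) differ by at most `tolerance`.

    Hypothetical helper; the default tolerance is an illustrative assumption.
    """
    # Use an absolute tolerance only: a relative one would loosen the check for long recordings.
    return math.isclose(expected, actual, rel_tol=0.0, abs_tol=tolerance)


# A manifest says 10.0 s, while the audio file holds one extra sample at 16 kHz.
one_sample_off = durations_match(10.0, 160001 / 16000)
# Half a second of difference is a real mismatch.
half_second_off = durations_match(10.0, 10.5)
```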
New Contributors
- @shaynemei made their first contribution in #781
Full Changelog: v1.4...v1.5
v1.4 - Candescent Crust
What's Changed
- Fix lambda warnings from lazy manifests + leverage dill if installed for pickling lambdas by @pzelasko in #748
- multi_cn recipes: aishell2, magicdata, primewords, stcmds, tal_asr, tal_csasr, thchs_30 by @shanguanma in #738
- Deprecate strict, proportional_sampling, and bucket_method arguments by @pzelasko in #756
- Fix lhotse cut simple CLI by @pzelasko in #759
- Fix issues with eager CutSet creation from lazy manifests by @pzelasko in #763
- DailyTalk recipe by @pzelasko in #767
- add aishell2 dev test by @yuekaizhang in #766
- Enable GlobalMVN computation with on-the-fly feature extraction by @pzelasko in #769
- Add support for Python 3.10 and PyTorch 1.12 by @pzelasko in #764
New Contributors
- @yuekaizhang made their first contribution in #766
Full Changelog: v1.3...1.4
v1.3 - Curiously Inviting Icicles
What's Changed
- Fix plotting MixedCut audio tracks by @pzelasko in #723
- [continued] Fixes Bucketing sampler equal duration method that drops cuts by @m-wiesner in #724
- feature extraction will read RecordingSet from a file, not just json. by @RuABraun in #728
- Use lilcom_chunky as default in CLI by @pzelasko in #729
- Set CLI torch number of threads to 1 by @pzelasko in #732
- Update wenet_speech.py by @fanlu in #731
- Fix heroico regex strings by @jtrmal in #734
- Update mgb2.py by @AmirHussein96 in #725
- Remove file handle caching from LilcomChunkyReader by @pzelasko in #737
- Make h5py an optional dependency by @pzelasko in #741
- Assert CutSet.mix() argument cuts is not a lazy manifest by @pzelasko in #742
- CutSet: more methods are lazy + two simplified common use-cases attach_tensor and load_audio by @pzelasko in #744
- Collections: support reading from/writing to "-" (including webdataset) by @pzelasko in #745
- fix CommonVoice prepare by @mohsen-goodarzi in #743
New Contributors
- @RuABraun made their first contribution in #728
- @mohsen-goodarzi made their first contribution in #743
Full Changelog: v1.2...v1.3
v1.2 - Winter in the South
New Recipes
- Adding lhotse recipe to prepare eval2000 data by @GoVivace in #679
- adding Earnings-21 dataset from rev-dot-com by @jtrmal in #709
- Adding the second revdotcom's earnings corpus by @jtrmal in #713
- MGB2 recipe by @AmirHussein96 in #718
What's Changed
- Fix import namespaces by @pzelasko in #698
- .repeat(..., preserve_id=...) option for repeating manifests by @pzelasko in #699
- Kaldi impex: remove invalid test by @jtrmal in #700
- Minor fix in base url for AliMeeting download by @desh2608 in #702
- [aidatatang_200zh] Avoid being converted to ASCII when preparing manifest by @luomingshuang in #703
- [ali_meeting] Fix some path errors for ali_meeting.py by @luomingshuang in #705
- [ali_meeting] Avoid being converted to ASCII by @luomingshuang in #704
- Test for webdataset data de-duplication across ranks by @pzelasko in #706
- Fixing data duplication with WebDataset in multi-node multi-worker training by @pzelasko in #707
- Fix epoch setting for WebDataset shard shuffling by @pzelasko in #708
- Full shard shuffling with webdataset by @pzelasko in #711
- Raise an error when BucketingSampler is used with a lazy CutSet by @pzelasko in #710
- Normalize output path names for recipes by @desh2608 in #712
- [webdataset] Add shard of origin to Cut.shard_origin custom field by @pzelasko in #714
- Update examples of combining datasets with RoundRobinSampler and add stop_early option. by @pzelasko in #716
- pre-commit, isort + CI checks + running it on all code by @pzelasko in #720
New Contributors
- @GoVivace made their first contribution in #679
- @AmirHussein96 made their first contribution in #718
Full Changelog: v1.1...v1.2