Releases: lhotse-speech/lhotse

v1.11 - Llama Llama Red Pajama

08 Dec 18:26
2332b7e

This release adds three new recipes; the remaining changes are mostly bug fixes.

What's Changed

  • [recipe] DiPCo -- dinner party corpus from Amazon by @desh2608 in #893
  • [recipe] CHiME-6 dinner party corpus by @desh2608 in #895
  • [recipe] Add xbmu amdo31 by @sendream in #902
  • Shar: allow per node+worker randomization of shards order by @pzelasko in #905
  • Shar: fix shuffling/splitting when cut_map_fn is provided to CutSet.from_shar by @pzelasko in #907
  • Shar: tracking epochs in shard iterator with option for shard re-shuffling each epoch by @pzelasko in #894
  • Shar: missing param in CutSet.from_shar + better error msg by @pzelasko in #901
  • Fix an edge case with BucketingSampler and a small amount of data/buckets by @pzelasko in #898
  • Remove some deprecated methods by @desh2608 in #900
  • More details in cuts.describe() + fix for trim_to_unsupervised_segments() by @desh2608 in #899
  • Fix save_audios by @pkufool in #896
  • Fix audio save for parallel workers by @pkufool in #903
  • Fix bug in load audio (multi-channel) by @desh2608 in #906
  • Fix SNR sampling error in CutSet.mix by @pzelasko in #915
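
The Shar entries above refer to randomizing shard order per node and worker, and to re-shuffling shards on each epoch. As a rough sketch of that idea (this is not Lhotse's implementation; the function name, the shard file names, and the way the seed is combined are assumptions for illustration), the order can be derived deterministically from a base seed together with the epoch, node rank, and worker id:

```python
import random


def shard_order(shards, seed, epoch, rank, worker_id):
    """Return a shuffled copy of `shards` that is reproducible for a given
    (seed, epoch, rank, worker_id) combination."""
    # Hypothetical seed mixing: a string seed keeps the combination unambiguous.
    rng = random.Random(f"{seed}-{epoch}-{rank}-{worker_id}")
    order = list(shards)
    rng.shuffle(order)
    return order


# Illustrative shard names only.
shards = [f"cuts.{i:06d}.jsonl.gz" for i in range(8)]
epoch0 = shard_order(shards, seed=42, epoch=0, rank=0, worker_id=0)
epoch1 = shard_order(shards, seed=42, epoch=1, rank=0, worker_id=0)
```

Because the generator is seeded from all four values, restarting a job at a given epoch reproduces the same order, while a different epoch, rank, or worker draws its order from a different seed.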

Full Changelog: v1.10...v1.11

v1.10 - Lhotse Shar

16 Nov 20:24
91cb71b

[experimental] Lhotse Shar -- a modular, sharded, sequential I/O data storage format

This release has a major (experimental) feature called Lhotse Shar. It is a data format inspired by WebDataset tar files, intended to be very fast for sequential reading of data stored in tarfile shards. It extends the ideas of WebDataset by storing multiple types of features and metadata in separate tar archives that are iterated and loaded together with cuts. Existing data can therefore be extended with new fields (think different feature extractors, alignments, embeddings, etc.) without triggering a hard copy, as would be the case with the sequential formats previously supported by Lhotse. Preliminary benchmarking indicated that it is as fast as WebDataset with both local disks and cloud storage.
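
To make the layout concrete, here is a minimal sketch of the idea using only the Python standard library. It does not use Lhotse's actual API or on-disk format; the field names (`audio`, `embedding`), the file names, and the helper functions are assumptions made up for illustration. Each field lives in its own tar archive, the archives list items in the same order, and a reader walks them in lockstep.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def write_field_shard(path, items):
    """Write (key, payload_bytes) pairs to a tar file, preserving their order."""
    with tarfile.open(path, "w") as tar:
        for key, payload in items:
            info = tarfile.TarInfo(name=key)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def read_fields_together(paths_by_field):
    """Iterate several field tars in lockstep, yielding one dict per item."""
    tars = {field: tarfile.open(p, "r") for field, p in paths_by_field.items()}
    try:
        for members in zip(*tars.values()):
            keys = {m.name for m in members}
            # All field archives must list the same item at the same position.
            assert len(keys) == 1, f"misaligned shards: {keys}"
            item = {"id": members[0].name}
            for field, member in zip(tars, members):
                item[field] = tars[field].extractfile(member).read()
            yield item
    finally:
        for tar in tars.values():
            tar.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        ids = ["cut-0", "cut-1"]
        audio = tmp / "audio.000000.tar"
        write_field_shard(audio, [(i, b"pcm:" + i.encode()) for i in ids])
        fields = {"audio": audio}
        # Later, a new field is added as a separate archive;
        # the existing audio tar is not rewritten or copied.
        emb = tmp / "embedding.000000.tar"
        write_field_shard(emb, [(i, b"emb:" + i.encode()) for i in ids])
        fields["embedding"] = emb
        return list(read_fields_together(fields))


items = demo()
```

Each element of `items` is a dict holding the item id plus one entry per field, which mirrors the "iterated and loaded together" behavior described above.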

A tutorial notebook about Lhotse Shar is planned to be released later this year.

Full Changelog: v1.9...v1.10

v1.9 - Neighboring Peaks

20 Oct 18:32

Major features

  • MultiCut data type: simplifies working with multi-channel data (contribution from @desh2608)
  • CSJ recipe (contribution from @teowenshen)
  • lots of bug fixes

What's Changed

  • create proper wav_id in the segments file for multichannel recording by @jtrmal in #831
  • kaldi: add a switch/option to read the durations from kaldi utt2dur … by @jtrmal in #832
  • Update test packages by @pzelasko in #837
  • MultiCut to store multi-channel recordings with shared supervision by @desh2608 in #822
  • Use CutSet for whisper annotation workflow by @desh2608 in #834
  • use spawn() as the strategy to prevent heisenbug by @jtrmal in #841
  • Compatibility for reading alignments saved before Lhotse v1.8 by @pzelasko in #842
  • make regexp string raw by @jtrmal in #836
  • Use absolute recording paths in yesno recipe by @pzelasko in #845
  • Fix CutSet.compute_and_store_features support for lazy CutSets by @pzelasko in #844
  • Fixing some QA functions for lazy manifests by @desh2608 in #848
  • Fix timestamps in Whisper annotation workflow by @pzelasko in #847
  • Update supervisions channels in multi-channel recipes by @desh2608 in #838
  • Allow retaining or trimming channels in trim_to_supervisions by @desh2608 in #852
  • Match cut_id to utt_id if there is exactly one supervision per cut by @wgb14 in #853
  • forced alignment: use num2words to get word timestamps for numbers by @eschmidbauer in #849
  • Prepare CSJ by @teowenshen in #851
  • Small changes in trim_to_supervisions() by @desh2608 in #855
  • Fix checkpoints of samplers that were iterated over more than once within the same epoch by @pzelasko in #854
  • Update fisher_english.py by @maxlvov in #858

Full Changelog: v1.8...v1.9

v1.8 - Sudden Avalanche

30 Sep 13:18

Breaking changes

  • Python 3.6 is no longer supported as of Lhotse v1.8. If you need Python 3.6, use Lhotse 1.7 or earlier.

Highlights

  • A new experimental Lhotse module, workflows, integrates optional third-party packages that assist corpus creators in automated data curation. As of release 1.8, it supports OpenAI Whisper for automatic transcription and segmentation, and torchaudio Wav2Vec2/HuBERT ASR bundles for forced alignment.

What's Changed

  • Fix read and write in piped CLI by @desh2608 in #807
  • Default behavior of CutSet.mix by @ZuoyunZheng in #809
  • Adding more info about resampling options by @RuABraun in #815
  • Add pad_silence option to extend_by by @desh2608 in #816
  • Message when calling len() on LazyFilter by @desh2608 in #817
  • Refactor cut and retain git blame history by @desh2608 in #820
  • Audio backend refactoring and a workaround for FLAC reading from/writing to in-memory buffers by @pzelasko in #814
  • Experimental Lhotse feature: corpus creation tools (workflows), starting with OpenAI Whisper support by @pzelasko in #824
  • Drop support for Python 3.6 by @pzelasko in #829
  • [workflow] Word-level forced alignment with pretrained models from Torchaudio by @pzelasko in #827

Full Changelog: v1.7...v1.8

v1.7 - Rejuvenation Potion

12 Sep 21:38

What's Changed

  • add test data to bvcc by @oplatek in #797
  • Add reverb with fast RIR generator by @desh2608 in #799
  • Support snip_edges=True in online_inference of Kaldi feature extractors by @pzelasko in #802
  • Remove warning about Lhotse not being stable from README.md by @pzelasko in #804
  • Update the documentation related to optional packages by @pzelasko in #805

Full Changelog: v1.6...v1.7

1.6 - Frozen Palm Tree

27 Aug 21:13

What's Changed

  • Feature/fix 754 voxceleb download by @mikuchar in #776
  • Support Kaldi data directories without segments file by @MartinKocour in #789
  • Add normalization for text of multi_cn recipes: thchs_30, tal_csasr, tal_asr, aishell, aishell2, etc. by @shanguanma in #760
  • Improve support for custom Recordings by @pzelasko in #791
  • Add Cut.has(field) method to query Cuts for custom attributes by @pzelasko in #792
  • Add normalization for aishell2 recipe by @shanguanma in #790

Full Changelog: 1.5...v1.6

1.5 - Little Leaf

09 Aug 01:04

Full Changelog: v1.4...v1.5

v1.4 - Candescent Crust

07 Jul 01:08

What's Changed

  • Fix lambda warnings from lazy manifests + leverage dill if installed for pickling lambdas by @pzelasko in #748
  • multi_cn recipes: aishell2, magicdata, primewords, stcmds, tal_asr, tal_csasr, thchs_30 by @shanguanma in #738
  • Deprecate strict, proportional_sampling, and bucket_method arguments by @pzelasko in #756
  • Fix lhotse cut simple CLI by @pzelasko in #759
  • Fix issues with eager CutSet creation from lazy manifests by @pzelasko in #763
  • DailyTalk recipe by @pzelasko in #767
  • add aishell2 dev test by @yuekaizhang in #766
  • Enable GlobalMVN computation with on-the-fly feature extraction by @pzelasko in #769
  • Add support for Python 3.10 and PyTorch 1.12 by @pzelasko in #764
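
For the GlobalMVN item above, the underlying technique is global mean and variance normalization: statistics are accumulated over many feature matrices and then applied to each one. A minimal, dependency-free sketch follows. It is not Lhotse's `GlobalMVN` class; the `RunningStats` name and its methods are hypothetical. Accumulating sums incrementally is what makes it possible to compute the statistics while features are being extracted on the fly, rather than only from features already stored on disk.

```python
import math


class RunningStats:
    """Accumulate per-dimension sums so mean/std can be computed in one pass."""

    def __init__(self, num_features):
        self.count = 0
        self.total = [0.0] * num_features
        self.total_sq = [0.0] * num_features

    def update(self, frames):
        """`frames` is a list of feature vectors (one per frame)."""
        for frame in frames:
            self.count += 1
            for d, value in enumerate(frame):
                self.total[d] += value
                self.total_sq[d] += value * value

    def mean_std(self):
        mean = [s / self.count for s in self.total]
        # Population variance: E[x^2] - E[x]^2, clamped at 0 against rounding.
        std = [
            math.sqrt(max(sq / self.count - m * m, 0.0))
            for sq, m in zip(self.total_sq, mean)
        ]
        return mean, std

    def normalize(self, frames, eps=1e-10):
        """Apply the accumulated statistics to one feature matrix."""
        mean, std = self.mean_std()
        return [
            [(v - m) / (s + eps) for v, m, s in zip(frame, mean, std)]
            for frame in frames
        ]


stats = RunningStats(num_features=2)
stats.update([[1.0, 10.0], [3.0, 10.0]])  # features computed for one cut
stats.update([[5.0, 10.0], [7.0, 10.0]])  # features computed for another cut
mean, std = stats.mean_std()
```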

Full Changelog: v1.3...1.4

v1.3 - Curiously Inviting Icicles

11 Jun 03:35

Full Changelog: v1.2...v1.3

v1.2 - Winter in the South

19 May 17:09

What's Changed

  • Fix import namespaces by @pzelasko in #698
  • .repeat(..., preserve_id=...) option for repeating manifests by @pzelasko in #699
  • Kaldi impex: remove invalid test by @jtrmal in #700
  • Minor fix in base url for AliMeeting download by @desh2608 in #702
  • [aidatatang_200zh] Avoid being converted to ASCII when preparing manifest by @luomingshuang in #703
  • [ali_meeting] Fix some path errors for ali_meeting.py by @luomingshuang in #705
  • [ali_meeting] Avoid being converted to ASCII by @luomingshuang in #704
  • Test for webdataset data de-duplication across ranks by @pzelasko in #706
  • Fixing data duplication with WebDataset in multi-node multi-worker training by @pzelasko in #707
  • Fix epoch setting for WebDataset shard shuffling by @pzelasko in #708
  • Full shard shuffling with webdataset by @pzelasko in #711
  • Raise an error when BucketingSampler is used with a lazy CutSet by @pzelasko in #710
  • Normalize output path names for recipes by @desh2608 in #712
  • [webdataset] Add shard of origin to Cut.shard_origin custom field by @pzelasko in #714
  • Update examples of combining datasets with RoundRobinSampler and add stop_early option. by @pzelasko in #716
  • pre-commit, isort + CI checks + running it on all code by @pzelasko in #720
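
Several WebDataset items above concern data duplication in multi-node, multi-worker training. A common way to avoid it is to give every (rank, worker) pair a disjoint subset of shards. The sketch below assumes a simple strided split; the function name and arguments are hypothetical, and this is not necessarily how Lhotse or WebDataset implement it.

```python
def shards_for(shards, rank, world_size, worker_id, num_workers):
    """Select the shards owned by one dataloader worker on one node/rank."""
    # First split across ranks, then split each rank's share across its workers.
    per_rank = shards[rank::world_size]
    return per_rank[worker_id::num_workers]


# Illustrative shard names only.
shards = [f"shard-{i:06d}.tar" for i in range(10)]
assigned = [
    shards_for(shards, rank, 2, worker, 2)
    for rank in range(2)
    for worker in range(2)
]
```

With 10 shards, 2 ranks, and 2 workers per rank, every shard is read by exactly one worker. The split can be uneven (here some workers get three shards and others two), which a distributed training loop may need to account for.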

Full Changelog: v1.1...v1.2