optimize save_audios() #1131
2753f9a to 534b626
Hi @pzelasko @csukuangfj, did you have a chance to "peek" into this PR?
It looks good to me 👍 💯
Thanks Karel! Somehow this slipped through. I left a couple of comments.
lhotse/cut/set.py
@@ -2374,6 +2387,11 @@ def save_audios(
)
executor = None

logging.info(
We generally don't emit logs in lhotse methods (for better or worse), can we remove it here?
OK, actually, by default a `logging.info()` call does not print anything.
The first "active" verbosity level by default is WARNING, i.e. `logging.warning()`.
So, technically, it does not print anything unless the lhotse user lowers the log level (e.g. starts a debug mode)...
Piotr, do you still want to remove it?
If lhotse is used in icefall and if icefall uses a logging level >= info, I think this message would still be printed out.
Yes, that is also true, and `info` is the default log level in `icefall`.
https://github.com/k2-fsa/icefall/blob/master/icefall/utils.py#L113
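To make the point of this thread concrete, here is a minimal sketch using only the standard `logging` module. The logger name `mylib.cut_set` and the helper `run()` are hypothetical and stand in for a library and a calling application; this is not lhotse or icefall code. It shows that an `info` message from a library is dropped while the root logger sits at its default WARNING level, and is emitted once the application lowers the level to INFO.

```python
import io
import logging
from typing import Optional

# Hypothetical library logger name, used for illustration only.
lib_logger = logging.getLogger("mylib.cut_set")


def library_call() -> None:
    # Stands in for a library method that emits an info-level message.
    lib_logger.info("caching is active")


def run(app_level: Optional[int] = None) -> str:
    """Return what a handler attached to the root logger would print."""
    root = logging.getLogger()
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    old_level = root.level
    root.addHandler(handler)
    # WARNING is the root logger's default level; it is set explicitly here
    # only so that the demo is reproducible in any environment.
    root.setLevel(app_level if app_level is not None else logging.WARNING)
    try:
        library_call()
    finally:
        root.removeHandler(handler)
        root.setLevel(old_level)
    return stream.getvalue()


default_output = run()          # root logger at WARNING: message is dropped
app_output = run(logging.INFO)  # application lowers the threshold to INFO
```

In this sketch `default_output` is empty, while `app_output` contains the message, which is the situation described for a caller that configures an `info` log level.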
@@ -1413,6 +1414,17 @@ def combine_same_recording_channels(self) -> "CutSet":
groups = groupby(lambda cut: (cut.recording.id, cut.start, cut.end), self)
return CutSet.from_cuts(MultiCut.from_mono(*cuts) for cuts in groups.values())

def sort_by_recording_id(self, ascending: bool = True) -> "CutSet":
Could you add a unit test to cover this method?
Related to that: I found a similar unit test, `test_cut_set_sort_by_duration()`.
Could you point me to the place where this test is called?
I did not find any code calling `test_cut_set_sort_by_duration()` in the `/test` folder, and it has some arguments that need to be filled in...
Thx, Karel
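As a rough illustration of what such a test could check, here is a self-contained sketch. `FakeCut` and the free function `sort_by_recording_id()` are hypothetical stand-ins and not lhotse's `CutSet` API. It also assumes the test suite is run with pytest, which collects functions named `test_*` on its own and fills their arguments from fixtures or parametrization, so no explicit calling code is needed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeCut:
    """Hypothetical stand-in for a cut; only the fields the sort needs."""

    id: str
    recording_id: str


def sort_by_recording_id(cuts: List[FakeCut], ascending: bool = True) -> List[FakeCut]:
    # sorted() is stable, so cuts from the same recording keep their relative order.
    return sorted(cuts, key=lambda c: c.recording_id, reverse=not ascending)


def test_sort_by_recording_id() -> None:
    cuts = [FakeCut("c1", "rec-b"), FakeCut("c2", "rec-a"), FakeCut("c3", "rec-b")]

    asc = sort_by_recording_id(cuts)
    assert [c.id for c in asc] == ["c2", "c1", "c3"]

    desc = sort_by_recording_id(cuts, ascending=False)
    assert [c.recording_id for c in desc] == ["rec-b", "rec-b", "rec-a"]
```

A test against the real method would build a small `CutSet` instead of a list of `FakeCut` objects, but the assertions on the resulting order would look much the same.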
@@ -2326,6 +2338,7 @@ def save_audios(
executor: Optional[Executor] = None,
augment_fn: Optional[AugmentFn] = None,
progress_bar: bool = True,
shuffle_on_split: bool = True,
This option is undocumented
- added sort method: `CutSet::sort_by_recording_id()`
- allow disabling the shuffling of the CutSet inside `CutSet::save_audios()`
- both changes improve the cache hit ratio
- `CutSet::save_audios()`: show in log if caching was active
- caching.py: replace Union[] by Optional[]
534b626 to bf1e663
The code is updated, but the unit test is not yet complete (it is missing the calling code).
We're using |
- adding unit-test, removing logging.info(), documenting `shuffle_on_split`
bf1e663 to 4035844
Hi @pzelasko @csukuangfj,
Cool, thanks!
Hi Piotr and Fangjun,
I also have some improvements for the `CutSet::save_audios()` function. This is mainly to improve the cache hit ratio for long audio files with many utterances in each audio file.
I also have a calling script for that; would you be interested in having that too?
Cheers,
Karel
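A toy illustration of the cache-hit argument, under stated assumptions: the cache is modeled as a small LRU keyed by recording id, and the recording names, counts, and the helper `hit_ratio()` are all hypothetical. This is not how lhotse's audio cache is implemented. It only shows why processing cuts grouped by recording gives more hits than processing them in shuffled order.

```python
import random
from collections import OrderedDict
from typing import Sequence


def hit_ratio(recording_ids: Sequence[str], cache_size: int = 1) -> float:
    """Fraction of accesses served from a small LRU cache of recordings."""
    cache: "OrderedDict[str, None]" = OrderedDict()
    hits = 0
    for rec_id in recording_ids:
        if rec_id in cache:
            hits += 1
            cache.move_to_end(rec_id)  # mark as most recently used
        else:
            cache[rec_id] = None
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(recording_ids)


# 20 hypothetical long recordings with 50 utterances each.
ordered = [f"rec-{r:02d}" for r in range(20) for _ in range(50)]
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)

sorted_ratio = hit_ratio(sorted(shuffled))  # one miss per recording
shuffled_ratio = hit_ratio(shuffled)        # mostly misses with a tiny cache
```

With the sorted order each recording is loaded once and then reused for all of its utterances, while the shuffled order keeps evicting recordings before they can be reused.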