-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fix] Relative Audio Paths #4470
Conversation
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. Minor comments v
manifest_dir = Path(manifest_file).parent | ||
audio_file = Path(item['audio_file']) | ||
if not audio_file.is_file() and not audio_file.is_absolute() and audio_file.parent != Path("."): | ||
# assume the wavs/ dir and manifest are under the same parent dir | ||
if not audio_file.is_file() and not audio_file.is_absolute(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To note, I think even this is_file() check will fail for too long file names. Any idea ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can try to avoid too long file names. When creating tarred datasets, we can trim the common path of all audio files and use the trimmed paths as the new file names.
For example,
/a/b/c/d/e/xxxxxx.wav
/a/b/c/d/f/yyyyyy.wav
can be trimmed as
e/xxxxxx.wav
f/yyyyyy.wav
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No let's not do that, there are many datasets of ASR in mcc and MLS where there is a lot of common folders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about changing the code to the following? We just keep the original path if the input file is 255 chars or longer.
if (len(str(audio_file) < 255) and not audio_file.is_file() and not audio_file.is_absolute():
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this will not work with ASR data loaders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They need exact match to filename which is globally unique
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be some confusion here (or maybe I am the confused one!). The 255 character limit refers to individual "nodes" in the file system tree. You can certainly have a path to a file with a length greater than 255, assuming there are directories leading to it. It's just that each file name and directory name must individually not exceed 255 characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and ASR requires exact full path match between all files in the manifest, and they must be globally unique. The issue with slicing off means there are cases where things are no longer globally unique
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (len(str(audio_file) < 255) and not audio_file.is_file() and not audio_file.is_absolute():
audio_file = manifest_dir / audio_file
item['audio_file'] = str(audio_file.absolute())
if audio_file.is_file():
item['audio_file'] = str(audio_file.absolute())
else:
item['audio_file'] = expanduser(item['audio_file'])
else:
item['audio_file'] = expanduser(item['audio_file'])
No, this will not work with ASR data loaders.
@titu1994 Could you please help me understand why this won't work? If a filename is >= 255 chars, we just use the given filename and don't add the prefix, would this break something?
* update Signed-off-by: stevehuang52 <[email protected]> * update Signed-off-by: stevehuang52 <[email protected]> * fix typo Signed-off-by: stevehuang52 <[email protected]> Signed-off-by: arendu <[email protected]>
* update Signed-off-by: stevehuang52 <[email protected]> * update Signed-off-by: stevehuang52 <[email protected]> * fix typo Signed-off-by: stevehuang52 <[email protected]> Signed-off-by: David Mosallanezhad <[email protected]>
* update Signed-off-by: stevehuang52 <[email protected]> * update Signed-off-by: stevehuang52 <[email protected]> * fix typo Signed-off-by: stevehuang52 <[email protected]> Signed-off-by: Hainan Xu <[email protected]>
Signed-off-by: stevehuang52 [email protected]
This PR fixes the second problem in this issue 4455.
Previously, the code will fail when there are "/" in the filepaths of tarred datasets. Now we only add the manifest directory to the audio filepath if the resulting file exists, thus fixing the problem.