We will always need a way to refer embeddings back to the original audio.
Currently, embeddings are saved as tables with an offset column, either as parquet or csv.
We will also have an audio_recording_id column where it's available.
Where it is not available, i.e. the input is a file on disk only, not from the workbench, we need to save the path to that file.
One option is to store the path in the audio_recording_id column, or in a separate "path" column. The advantage is that it's simple. The disadvantage is file size: audio filenames and folder names tend to be quite long, and within each output file the path to the original is the same for every row, so the same long string would be repeated unnecessarily on every row.
Another option is to add metadata to each parquet or csv file, and include the path to the original there.
Parquet has this built in as file-level key-value metadata. CSV has no standard equivalent; comment lines can carry it, but only if the library reading the file knows to skip or parse them.
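To make the CSV side of this concrete, here is a minimal sketch using only the Python standard library. The `# source_path: ` prefix, the function names, and the column names `e0`/`e1` are all made up for illustration; nothing here is an existing convention in this project. It writes the path once as a leading comment line and reads it back without letting the comment reach the CSV parser.

```python
import csv
import io

# Hypothetical convention: one leading comment line carries the source path.
META_PREFIX = "# source_path: "


def write_embeddings_csv(handle, source_path, rows):
    """Write the source path once as a comment line, then the table."""
    handle.write(f"{META_PREFIX}{source_path}\n")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["offset", "e0", "e1"])  # illustrative column names
    writer.writerows(rows)


def read_embeddings_csv(handle):
    """Return (source_path, rows); comment lines never reach the CSV parser."""
    source_path = None
    data_lines = []
    for line in handle:
        if line.startswith(META_PREFIX):
            source_path = line[len(META_PREFIX):].rstrip("\n")
        elif not line.startswith("#"):
            data_lines.append(line)
    rows = list(csv.DictReader(data_lines))
    return source_path, rows


# Round trip through an in-memory file.
buffer = io.StringIO()
write_embeddings_csv(
    buffer,
    "/data/site_a/20230101_060000.wav",
    [[0.0, 0.12, -0.4], [5.0, 0.08, 0.31]],
)
buffer.seek(0)
source_path, rows = read_embeddings_csv(buffer)
```

The catch is the one already noted: any reader that does not know about the comment line will treat it as the header row, so this only works if every consumer agrees on the convention.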
If files are ever combined, this information will need to be reshaped into a column anyway.
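A rough sketch of that reshaping step, assuming each file has already been loaded as a pair of (source path, list of row dicts). The function name `combine_tables` and the `path` column name are placeholders, not anything that exists in the codebase.

```python
def combine_tables(tables):
    """Merge per-file tables into one, turning file-level metadata into a column.

    `tables` is an iterable of (source_path, rows), where rows is a list of
    dicts. Each output row gains a "path" key holding its file's source path.
    """
    combined = []
    for source_path, rows in tables:
        for row in rows:
            combined.append({"path": source_path, **row})
    return combined


# Two per-file tables, each with its path stored once as metadata.
table_a = ("site_a/rec1.wav", [{"offset": 0.0}, {"offset": 5.0}])
table_b = ("site_b/rec2.wav", [{"offset": 0.0}])
combined = combine_tables([table_a, table_b])
```

So the size saving from metadata only applies to the per-file outputs; a combined table pays the per-row cost again.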
The current layout uses the trailing components of each file path, to a user-selected depth, as the file_id (for example, 'my_species/xc12345.mp3').
Part of the idea was to allow moving the whole set of audio files between different storage locations seamlessly, by joining the file_id to the new storage root using a formulation like: file_location = os.path.join(root, file_id).
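As a sketch of how that could look, assuming POSIX-style paths: the helper names `make_file_id` and `resolve_file_location` and the example roots are hypothetical, and `posixpath.join` stands in for `os.path.join` only so the example behaves the same on every platform.

```python
import posixpath
from pathlib import PurePosixPath


def make_file_id(path, depth):
    """Keep the last `depth` components of `path` as a relative file_id."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Drop the root anchor so the file_id is always relative.
    parts = [p for p in PurePosixPath(path).parts if p != "/"]
    return "/".join(parts[-depth:])


def resolve_file_location(root, file_id):
    """Join a file_id onto whichever storage root the audio lives under now."""
    return posixpath.join(root, file_id)


file_id = make_file_id("/old_root/audio/my_species/xc12345.mp3", depth=2)
file_location = resolve_file_location("/mnt/new_storage", file_id)
```

Here `file_id` is 'my_species/xc12345.mp3' and `file_location` is '/mnt/new_storage/my_species/xc12345.mp3'. The depth has to be large enough that file_ids stay unique across the dataset, otherwise two recordings with the same trailing path would collide.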