embedding metadata #5

peichins · 2023-12-19T00:54:41Z

We will always need a way to refer embeddings back to the original audio.

Currently, embeddings are saved as tables with an offset column, either as parquet or csv.

We will also have an audio_recording_id column where it's available.
Where it is not available, i.e. the input is a file on disk only, not from the workbench, we need to save the path to that file.

One option is to store the path in the audio_recording_id column, or in a separate "path" column. The advantage is that it's simple. The disadvantage is file size: audio filenames and folder names tend to be quite long, and within a single output file the path to the original is the same for every row, so it adds a lot of unnecessary repetition.

Another option is to add metadata to each parquet or csv file, and include the path to the original there.
Parquet has this built in. CSV can kind of do it with comments depending on the library that is reading it.

If files are ever combined, this information will need to be reshaped into a column anyway.
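
A minimal sketch of the file-level metadata option for CSV, using only the Python standard library. The `# source: ` comment convention, the function names and the `e0`/`e1` embedding columns are assumptions made for illustration, not an existing format. The reader also copies the source into a column on every row, which is the reshaping that combining files would require.

```python
import csv
import io

# Hypothetical convention: the first line of the file is a comment naming the source audio.
SOURCE_PREFIX = "# source: "


def write_embeddings_csv(fh, source, rows, header=("offset", "e0", "e1")):
    """Write one comment line with the source path, then a normal CSV table."""
    fh.write(f"{SOURCE_PREFIX}{source}\n")
    writer = csv.writer(fh, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


def read_embeddings_csv(fh):
    """Read the source comment, then expand it into a 'source' column on each row."""
    first = fh.readline().rstrip("\n")
    if not first.startswith(SOURCE_PREFIX):
        raise ValueError("missing source comment line")
    source = first[len(SOURCE_PREFIX):]
    rows = [dict(row, source=source) for row in csv.DictReader(fh)]
    return source, rows


buf = io.StringIO()
write_embeddings_csv(buf, "/data/site_a/rec.flac", [(0, 0.1, 0.2), (5, 0.3, 0.4)])
buf.seek(0)
source, rows = read_embeddings_csv(buf)
```

The path is stored once per file, and the per-row cost is only paid after files are combined. Readers that do not know the convention would need to be told to skip the comment line.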

atruskie · 2023-12-19T01:04:33Z

Another idea: perhaps we should generalize this concept?

A source column:

ecosounds: https://api.ecosounds.org/audio_recordings/1234
a2o: https://api.acousticobservatory.org/audio_recordings/5678
random file: file://20220527T220000+1000_Yourka-Dry-A_1101188.flac
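
A rough sketch of how such a source column could be built and interpreted, assuming every value is a URI whose scheme says where the audio lives. The helper names are hypothetical. The sketch also assumes local files are given as absolute POSIX paths, so it produces `file:///...` URIs rather than the shorter `file://name` form shown above.

```python
from urllib.parse import quote, urlparse


def source_for_recording(api_base, recording_id):
    """Build a workbench source URI, e.g. <api_base>/audio_recordings/<id>."""
    return f"{api_base.rstrip('/')}/audio_recordings/{recording_id}"


def source_for_file(path):
    """Build a file URI from an absolute POSIX path (assumption: no Windows paths)."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    # quote() percent-encodes characters that are not safe in a URI path.
    return "file://" + quote(path)


def is_local(source):
    """True when the source column value points at a file on disk."""
    return urlparse(source).scheme == "file"


remote = source_for_recording("https://api.ecosounds.org", 1234)
local = source_for_file("/data/rec.flac")
```

A consumer can then branch on `is_local(source)` to decide whether to open a file or call the API.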

sdenton4 · 2024-03-01T16:41:24Z

The current layout is using the 'end' of filenames with some user-selected depth as the file_id. (for example, 'my_species/xc12345.mp3'.)

Part of the idea was to allow moving the whole set of audio files between different storage locations seamlessly, by joining the file_id to the new storage root using a formulation like: file_location = os.path.join(root, file_id).
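
As an illustration only (not the actual perch code), the file_id scheme could be sketched like this. `file_id_from_path` and its `depth` parameter are hypothetical names, and `posixpath` is used instead of `os.path` so the join always uses forward slashes, whatever the platform.

```python
import posixpath


def file_id_from_path(path, depth=2):
    """Keep the last `depth` components of a path or URL as the file_id."""
    parts = [p for p in path.replace("\\", "/").split("/") if p]
    return "/".join(parts[-depth:])


def file_location(root, file_id):
    """Re-attach a file_id to whichever storage root currently holds the audio."""
    return posixpath.join(root, file_id)


file_id = file_id_from_path("/data/audio/my_species/xc12345.mp3", depth=2)
on_disk = file_location("/mnt/new_disk/audio", file_id)
in_bucket = file_location("gs://bucket/audio", file_id)
```

Because only the file_id is stored with the embeddings, moving the audio means changing `root` and nothing else. The `gs://bucket/audio` root is a made-up example.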

This scheme is actually pretty agnostic to file vs URL (so long as we aren't using Windows, I guess - I hope that WSL is sane enough to use proper slashes instead of backslashes for directories). Our usual audio loading functions are also quite happy to take a URL:
https://github.com/google-research/perch/blob/main/chirp/audio_utils.py#L49
