embedding metadata #5

Open
peichins opened this issue Dec 19, 2023 · 2 comments

@peichins
Member

We will always need a way to trace embeddings back to the original audio.

Currently, embeddings are saved as tables with an offset column, either as parquet or csv.

We will also have an audio_recording_id column where it's available.
Where it is not available, i.e. the input is a file on disk only, not from the workbench, we need to save the path to that file.

One option is to store the path in the audio_recording_id column, or in a separate "path" column. The advantage is that it's simple. The disadvantage is that it will add a fair bit of size to the files: audio filenames and folder names tend to be quite long, and within each output file the path to the original is the same for every row, so it adds a lot of data unnecessarily.

Another option is to add metadata to each parquet or csv file, and include the path to the original there.
Parquet has this built in. CSV can approximate it with comment lines, depending on the library that is reading the file.
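
To make the CSV variant concrete, here is a minimal sketch using only the Python standard library. The `# key=value` comment convention, the `source_path` key, the `e0` column and both function names are assumptions for illustration, not something the project has agreed on; only the `offset` column comes from the current layout.

```python
import csv
import io


def write_csv_with_source(rows, fieldnames, source_path):
    """Return CSV text with the source path stored once in a leading comment line."""
    buf = io.StringIO()
    # Hypothetical convention: "# key=value" lines before the header row.
    buf.write(f"# source_path={source_path}\n")
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def read_csv_with_source(text):
    """Split leading '#' comment lines into a metadata dict and parse the rest as CSV."""
    metadata = {}
    data_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            key, _, value = line[1:].strip().partition("=")
            metadata[key] = value
        else:
            data_lines.append(line)
    rows = list(csv.DictReader(data_lines))
    return metadata, rows
```

A reader that does not know the convention will treat the comment line as the header row, which is the "depending on the library" caveat in practice.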

If files are ever combined, this information will need to be reshaped into a column anyway.
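
As a rough illustration of that reshaping, the sketch below takes per-file metadata and copies it onto every row when files are merged. The function name and the `path` column name are hypothetical, and rows are plain dicts rather than any particular table library.

```python
def combine_with_path(files):
    """Merge rows from several files, turning each file's path into a column.

    `files` is an iterable of (source_path, rows) pairs, where `rows` is a
    list of dicts that share the same columns.
    """
    combined = []
    for source_path, rows in files:
        for row in rows:
            # The per-file value becomes a per-row value once files are mixed.
            combined.append({"path": source_path, **row})
    return combined
```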

@atruskie
Member

Another idea: perhaps we should generalize this concept?

A source column:

  • ecosounds: https://api.ecosounds.org/audio_recordings/1234
  • a2o: https://api.acousticobservatory.org/audio_recordings/5678
  • random file: file://20220527T220000+1000_Yourka-Dry-A_1101188.flac
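
A consumer of such a source column could split the values apart with the standard library. This is only a sketch: the function name and the returned keys are made up for illustration, and it assumes workbench URLs end in the recording id, as in the examples above.

```python
from urllib.parse import urlparse


def parse_source(source):
    """Classify a source value as a workbench recording URL or a local file."""
    parts = urlparse(source)
    if parts.scheme in ("http", "https"):
        # Assumption: the last path segment is the audio recording id.
        recording_id = parts.path.rstrip("/").rsplit("/", 1)[-1]
        return {
            "kind": "workbench",
            "host": parts.netloc,
            "audio_recording_id": recording_id,
        }
    if parts.scheme == "file":
        # With file://name.flac the name is parsed as the netloc, and with
        # file:///dir/name.flac it is the path, so join both to accept either.
        return {"kind": "file", "path": parts.netloc + parts.path}
    raise ValueError(f"unsupported source: {source!r}")
```

Note that a bare filename directly after `file://` is read by `urlparse` as a host name, so `file:///...` with three slashes may be the safer form for absolute paths.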

@sdenton4
Collaborator

sdenton4 commented Mar 1, 2024

The current layout uses the 'end' of each filename, to some user-selected depth, as the file_id (for example, 'my_species/xc12345.mp3').

Part of the idea was to allow moving the whole set of audio files between different storage locations seamlessly, by joining the file_id to the new storage root using a formulation like: file_location = os.path.join(root, file_id).
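
Roughly, the idea looks like the sketch below. The helper names `make_file_id` and `locate` and the example roots are hypothetical, not the actual Perch code, and it uses `posixpath` instead of `os.path` so that the result has forward slashes on every platform.

```python
import posixpath
from pathlib import PurePosixPath


def make_file_id(path, depth):
    """Keep the last `depth` components of a path as the file_id."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    parts = PurePosixPath(path).parts
    return "/".join(parts[-depth:])


def locate(root, file_id):
    """Join a storage root (directory or URL prefix) to a file_id."""
    return posixpath.join(root, file_id)


file_id = make_file_id("/data/audio/my_species/xc12345.mp3", depth=2)
# The same file_id resolves against whichever root currently holds the audio.
local_copy = locate("/mnt/new_storage", file_id)
remote_copy = locate("https://example.org/audio", file_id)
```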

This scheme is actually pretty agnostic to file vs. URL (so long as we aren't using Windows, I guess; I hope that WSL is sane enough to use proper slashes instead of backslashes for directories). Our usual audio loading functions are also quite happy to take a URL:
https://github.com/google-research/perch/blob/main/chirp/audio_utils.py#L49
