Lightweight ocean passive acoustic data query #32
-
Thoughts
Cataloging Strategy
To work around shortcomings of individual data sources (e.g. some may not support intelligent queries such as discovering data by datetime or location), and given the number of datasets, we will likely want to split up the datasets and create an ingestor for each dataset, responsible for:
Catalog Format
It seems appropriate to store metadata for all relevant data in a harmonized format, and a STAC catalog might be a good fit. By adhering to an open standard, we can lean on pre-existing tooling that already works with that format. For example:
How will this operate?
I assumed this would be a web service running in a cloud environment that polls the source data at a regular interval. Does this seem sufficient, or does it have shortcomings?
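To make the per-dataset ingestor and STAC ideas above a bit more concrete, here is a minimal sketch using only the standard library. Everything named here is an assumption for illustration: the `Ingestor` interface, the `ExampleIngestor` with its hard-coded record, and the `example.org` URL are hypothetical, and the dict only mimics the shape of a STAC Item rather than being validated against the spec.

```python
from abc import ABC, abstractmethod
from datetime import datetime, timezone
from typing import Iterator, List


def make_stac_item(item_id: str, lon: float, lat: float,
                   when: datetime, href: str) -> dict:
    """Build a minimal STAC-Item-shaped dict describing one audio file."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "bbox": [lon, lat, lon, lat],
        "properties": {"datetime": when.astimezone(timezone.utc).isoformat()},
        "links": [],
        "assets": {"audio": {"href": href}},
    }


class Ingestor(ABC):
    """Hypothetical interface: one subclass per data source."""

    @abstractmethod
    def fetch_records(self) -> Iterator[dict]:
        """Yield raw metadata records in the source's own layout."""

    @abstractmethod
    def to_item(self, record: dict) -> dict:
        """Convert one raw record into the harmonized item format."""

    def harvest(self) -> List[dict]:
        """Fetch every record and return the harmonized items."""
        return [self.to_item(r) for r in self.fetch_records()]


class ExampleIngestor(Ingestor):
    """Stand-in source with one hard-coded record, for illustration only."""

    def fetch_records(self) -> Iterator[dict]:
        yield {
            "name": "hydrophone-a-20230101T000000",
            "lon": -123.0,
            "lat": 48.5,
            "start": "2023-01-01T00:00:00+00:00",
            "url": "https://example.org/audio/a.wav",  # placeholder URL
        }

    def to_item(self, record: dict) -> dict:
        return make_stac_item(
            record["name"], record["lon"], record["lat"],
            datetime.fromisoformat(record["start"]), record["url"],
        )


items = ExampleIngestor().harvest()
```

A scheduled job (a cloud service or a CI workflow) could call `harvest()` on each ingestor and write the resulting items to wherever the catalog lives; that storage step is left out here.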
-
Maybe look at Ocean Networks Canada data.
For Orcasound: https://github.com/orgs/orcasound/projects/2/views/1
-
I was actually thinking that it would be nice for this not to be a web service, but just a package that people can install and use to run the query. We could probably have a repo that uses GitHub Actions to automatically poll the metadata from different sources and store it somewhere (this part would have to be worked out...). One reason is that there are already so many data portals out there, and it would be nice not to create yet another one, just as it's usually best practice to revise an existing data standard/convention instead of creating a new one. I didn't include above the large number of data portals for terrestrial passive acoustic monitoring, but the sprawl problem is the same there.
-
I also wonder if an intake catalogue would be a good choice for just the file existence and metadata -- the audio files themselves are too large. A very minimal set of metadata to build out and test the mechanism: lat-lon, time, and data source (which database, portal, or bucket). Just because we're in GVE at the moment: what @ocefpaf is covering in the Data Access tutorials is also what inspired me for the project idea, in addition to the colocate package.
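As a rough sketch of what that minimal metadata set and a query over it might look like, assuming "time" is stored as a start/end interval (my assumption, not something settled above). The `AudioRecord` type, the `query` function, and the `portal-a` / `bucket-b` source names are all hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class AudioRecord:
    """Minimal metadata for one recording: where, when, and which source."""
    lat: float
    lon: float
    start: datetime
    end: datetime
    source: str  # which database, portal, or bucket holds the audio


def query(records: Iterable[AudioRecord],
          bbox: Tuple[float, float, float, float],
          t0: datetime, t1: datetime) -> List[AudioRecord]:
    """Return records inside bbox whose time span overlaps [t0, t1].

    bbox is (min_lon, min_lat, max_lon, max_lat). Boxes that cross the
    antimeridian are not handled in this sketch.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        r for r in records
        if min_lon <= r.lon <= max_lon
        and min_lat <= r.lat <= max_lat
        and r.start <= t1 and r.end >= t0
    ]


UTC = timezone.utc
CATALOG = [
    AudioRecord(48.5, -123.0, datetime(2023, 1, 1, 0, 0, tzinfo=UTC),
                datetime(2023, 1, 1, 1, 0, tzinfo=UTC), "portal-a"),
    AudioRecord(44.6, -124.3, datetime(2023, 6, 1, 0, 0, tzinfo=UTC),
                datetime(2023, 6, 1, 0, 10, tzinfo=UTC), "bucket-b"),
]

hits = query(CATALOG, (-125.0, 48.0, -122.0, 49.0),
             datetime(2023, 1, 1, 0, 30, tzinfo=UTC),
             datetime(2023, 1, 2, 0, 0, tzinfo=UTC))
```

The query only ever touches the metadata; the `source` field tells the user where to go for the audio itself.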
-
@leewujung I love this concept. Have you discussed it with @valentina-s? In anticipation of revisiting your idea in 2023, I wanted to call your attention to the efforts of Karan last summer (via Google Summer of Code with Orcasound) -- https://www.orcasound.net/2022/08/04/making-hydrophone-data-accessible/ ... In terms of developing a catalog for OOI hydrophones, one might be able to build upon his code, which aimed to transcode the mseed audio data for playback in near-real-time via the Orcasound web app. I think this approach might also help improve access to and analysis of Canadian open data. Despite their best efforts, the Data Viewer for hydrophone audio provided by Ocean Networks Canada makes it difficult to determine which deployments within a region hold data for a particular period.
-
Hey all, I'm a 4th-year Ph.D. candidate in Applied Math and stumbled here from an eScience Institute email, looking to propose a project. I saw that there's a deadline of Jun 2nd, so I'd like to revive this thread, if that's ok. My original intent was slightly different from @leewujung's proposal. Rather than a lightweight catalogue of data, I was motivated by the problem of knowing there was a lot of data at ONC but finding it difficult to work with. I had been trying to build an ML model to detect and classify shipping, but the ONC python library/API downloads .wav and .mp3 files. So I built a package to manage and track downloads, stitch together acoustic files, and return all the data as numpy arrays. This made it easy to build up a lot of training data programmatically. I originally did this for an employer, so I was trying to rebuild an open-source version at tehom. While the closed-source version led to a conference poster on our model, the open-source one is only about 50% done, since it's unrelated to my PhD and a perennial side project. I'd be happy to lead a mentored sprint on it for OceanHackWeek 2023. I also have maintainer access to ONC's python library and have worked with it pretty extensively. Is this a useful project idea? Is this the kind of thing that would garner interest at an OHW? @leewujung @scottveirs @emiliom
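For readers unfamiliar with the "stitching" step, here is a minimal standard-library sketch of the general idea. It is not the package described above: the `stitch_wav` helper is hypothetical, it assumes 16-bit mono WAV files passed in chronological order, and it does not handle .mp3 files or gaps between recordings.

```python
import sys
import wave
from array import array
from typing import Iterable, Tuple


def stitch_wav(paths: Iterable[str]) -> Tuple[int, array]:
    """Concatenate 16-bit mono WAV files, in the order given.

    Returns (sample_rate, samples), where samples is an array of int16.
    """
    samples = array("h")
    rate = None
    for path in paths:
        with wave.open(path, "rb") as wf:
            if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
                raise ValueError(f"{path}: expected 16-bit mono audio")
            if rate is None:
                rate = wf.getframerate()
            elif wf.getframerate() != rate:
                raise ValueError(f"{path}: sample rate differs from earlier files")
            samples.frombytes(wf.readframes(wf.getnframes()))
    if rate is None:
        raise ValueError("no input files")
    if sys.byteorder == "big":
        # WAV samples are little-endian; convert to native order.
        samples.byteswap()
    return rate, samples
```

If numpy is available, `numpy.frombuffer(samples, dtype=numpy.int16)` gives a numpy view of the stitched samples without copying.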
-
Title
Lightweight ocean passive acoustic data query
Summary
Create a lightweight data query service to help people find publicly available passive acoustic data from the ocean. The idea could be extended to include other publicly available datasets, such as terrestrial ones.
Personnel
Wu-Jung Lee + anyone interested
Data sets and infrastructure support
Datasets to potentially get started on this:
Would be good to follow an existing convention for metadata.
Some more good references:
The problem
Ocean audio datasets are available in many places -- another realization of the "there are so many portals!" problem in the subdomain of passive acoustic monitoring.
It would be nice to have a data query service that checks existing databases and data portals to grab at least the metadata (the minimum set still needs to be defined) and lets users know what is available on the web or in some organization's buckets. Getting the actual data is much more "expensive" in terms of resources and logistics, so this proposal is aimed only at knowing that the data exist.
Proposed methods/tools
The colocate package may be a good reference: https://github.com/ioos/colocate
That was also a past OHW project.