refactor to latest Legolas version + onda.signal v2 (#133)

Co-authored-by: Eric Hanson <[email protected]>
beacon-biosignals · jrevels · Nov 2, 2022 · 36a7be9
1 parent aa21c24
commit 36a7be9

Showing 20 changed files with 623 additions and 666 deletions.
diff --git a/.github/workflows/CI.yml b/.github/workflows/CI.yml
@@ -9,15 +9,11 @@ on:
   pull_request:
 jobs:
   test:
-    name: Julia ${{ matrix.version }} - Legolas ${{ matrix.legolas-version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }}
+    name: Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }}
     runs-on: ${{ matrix.os }}
     strategy:
       fail-fast: false
       matrix:
-        legolas-version:
-          - '0.2'
-          - '0.3'
-          - '0.4'
         version:
           - '1.6'
           - '1'
@@ -33,11 +29,6 @@ jobs:
         with:
           version: ${{ matrix.version }}
           arch: ${{ matrix.arch }}
-      - name: "Install Legolas version"
-        shell: julia --color=yes --project=. {0}
-        run: |
-          using Pkg
-          Pkg.add(Pkg.PackageSpec(; name="Legolas", version="${{ matrix.legolas-version }}"))
       - uses: actions/cache@v2
         with:
           path: ~/.julia/artifacts

diff --git a/Project.toml b/Project.toml
@@ -1,13 +1,13 @@
 name = "Onda"
 uuid = "e853f5be-6863-11e9-128d-476edb89bfb5"
 authors = ["Beacon Biosignals, Inc."]
-version = "0.14.10"
+version = "0.15.0"
+
 
 [deps]
 Arrow = "69666777-d1a9-59fb-9406-91d4454c9d45"
 CodecZstd = "6b39b394-51ab-5f42-8807-6242bab2b4c2"
 Compat = "34da2185-b29b-5c13-b0c7-acf172513d20"
-ConstructionBase = "187b0558-2788-49d3-abe0-74a17ed4e7c9"
 Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
 Legolas = "741b9549-f6ed-4911-9fbf-4a1c0c97f0cd"
 Mmap = "a63ad114-7e13-5084-954f-fe012c677804"
@@ -21,12 +21,11 @@ UUIDs = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"
 Arrow = "1.6.2, 2"
 CodecZstd = "0.6, 0.7"
 Compat = "3.32, 4"
-ConstructionBase = "1.3"
 DataFrames = "1.2"
 FLAC_jll = "1.3.3"
-Legolas = "0.2.2, 0.3, 0.4"
+Legolas = "0.5"
 Tables = "1.4"
-TimeSpans = "0.2.2"
+TimeSpans = "0.3.4"
 TranscodingStreams = "0.9"
 julia = "1.6"
 

diff --git a/README.md b/README.md
@@ -24,7 +24,7 @@ This document uses the term...
 
 - ...**"LPCM"** to refer to [linear pulse code modulation](https://en.wikipedia.org/wiki/Pulse-code_modulation), a form of signal encoding where multivariate waveforms are digitized as a series of samples uniformly spaced over time and quantized to a uniformly spaced grid.
 
-- ...**"signal"** to refer to the digitized output of a process. A signal is comprised of metadata (e.g. LPCM encoding, channel information, sample data path/format information, etc.) and associated multi-channel sample data.
+- ...**"signal"** to refer to the digitized output of a process. We refer to the "devices" (physical, or virtual) that sample processes to generate signals as *sensors*. A signal is comprised of metadata (e.g. LPCM encoding, sensor information, channel information, sample data path/format information, etc.) and associated multi-channel sample data.
 
 - ...**"recording"** to refer to a collection of one or more signals recorded simultaneously over some time period.
 
@@ -72,7 +72,7 @@ These schemas are largely orthogonal to one another - here's nothing inherent to
 
 The following sections provide [the version integer](https://beacon-biosignals.github.io/Legolas.jl/stable/schema/), per-column documentation, and examples for each of the above Legolas schemas. In accordance with the Legolas framework, a table is considered to comply with a given schema as long as the specified required columns for that schema are present in any order. While per-column documentation refers to the [logical types defined by the Arrow specification](https://github.com/apache/arrow/blob/master/format/Schema.fbs), Onda reader/writer implementations may additionally employ Arrow extension types that directly alias a column's specified logical type in order to support application-level features (first-class UUID support, custom `file_path` type support, etc.).
 
-#### `onda.signal@1`
+#### `onda.signal@2`
 
 ##### Columns
 
@@ -84,14 +84,15 @@ The following sections provide [the version integer](https://beacon-biosignals.g
 - `span` (`Struct`): The signal's time span within the recording. This structure has two fields:
     - `start` (`Duration` w/ `NANOSECOND` unit): The start offset in nanoseconds from the beginning of the recording. The minimum possible value is `0`.
     - `stop` (`Duration` w/ `NANOSECOND` unit): The stop offset in nanoseconds (exclusive) from the beginning of the recording. This value must be greater than `start`.
-- `kind` (`Utf8`): A string identifying the kind of signal that the row represents. Valid `kind` values are alphanumeric, nonempty, lowercase, `snake_case`, and contain no whitespace, punctuation, or leading/trailing underscores.
+- `sensor_type` (`Utf8`): A string identifying the "type" of the multichannel sensor that generated the signal. The notion of sensor type is somewhat application-specific; it may refer to a kind of device, a particular data modality, etc. Valid `sensor_type` values are alphanumeric, nonempty, lowercase, `snake_case`, and contain no whitespace, punctuation, or leading/trailing underscores.
+- `sensor_label` (`Utf8`): A string that, within the context of a given recording, uniquely identifies the multichannel sensor that generated the signal. This field is often equivalent to `sensor_type` in applications where each sensor in a given recording is of a different type, but is especially useful in contexts where a single recording contains multiple signals with the same `sensor_type`. Valid values of `sensor_label` follow the same format as valid `sensor_type` values.
 - `channels` (`List` of `Utf8`): A list of strings where the `i`th element is the name of the signal's `i`th channel. A valid channel name...
     - ...is alphanumeric, nonempty, lowercase, `snake_case`, and contain no whitespace, punctuation, or leading/trailing underscores. Additional allowed characters include: `-`, `+`, `(`, `)`, `/`, `.`. If parentheses are used, they must be balanced, i.e. a matching closing parenthesis must be included for every open parenthesis.
-    -  To allow arbitrary cross-signal referencing, a channel name may reference channel names from other signals contained in the recording. Any such reference should take the form `signal_name.channel_name`. For example, an `eog` signal might have a channel named `left-eeg.m1` (the left eye electrode referenced to the mastoid electrode from a 10-20 EEG signal).
+    -  To allow arbitrary cross-signal referencing, a channel name may reference channel names from other signals contained in the recording. Any such reference should take the form `sensor_label.channel_name`. For example, an `eog` signal might have a channel named `left-eeg.m1` (the left eye electrode referenced to the mastoid electrode from a 10-20 EEG signal).
     - ...is unique amongst the other channel names in the signal. In other words, duplicate channel names within the same signal are disallowed.
 - `sample_unit` (`Utf8`): The name of the signal's canonical unit as a string. This string should conform to the same format as `kind` (alphanumeric, nonempty, lowercase, `snake_case`, and contain no whitespace, punctuation, or leading/trailing underscores), should be singular and not contain abbreviations (e.g. `"uV"` is bad, `"microvolt"` is good; `"l/m"` is bad, `"liter_per_minute"` is good).
-- `sample_resolution_in_unit` (`Int` or `FloatingPoint`): The signal's resolution in its canonical unit. This value, along with the signal's `sample_type` and `sample_offset_in_unit` fields, determines the signal's LPCM quantization scheme.
-- `sample_offset_in_unit`  (`Int` or `FloatingPoint`): The signal's zero-offset in its canonical unit (thus allowing LPCM encodings that are centered around non-zero values).
+- `sample_resolution_in_unit` (`FloatingPoint{DOUBLE}`): The signal's resolution in its canonical unit. This value, along with the signal's `sample_type` and `sample_offset_in_unit` fields, determines the signal's LPCM quantization scheme.
+- `sample_offset_in_unit`  (`FloatingPoint{DOUBLE}`): The signal's zero-offset in its canonical unit (thus allowing LPCM encodings that are centered around non-zero values).
 - `sample_type` (`Utf8`): The primitive scalar type used to encode each sample in the signal. Valid values are:
     - `"int8"`: signed little-endian 1-byte integer
     - `"int16"`: signed little-endian 2-byte integer
@@ -103,19 +104,21 @@ The following sections provide [the version integer](https://beacon-biosignals.g
     - `"uint64"`: unsigned little-endian 8-byte integer
     - `"float32"`: 32-bit floating point number
     - `"float64"`: 64-bit floating point number
-- `sample_rate` (`Int` or `FloatingPoint`): The signal's sample rate.
+- `sample_rate` (`FloatingPoint{DOUBLE}`): The signal's sample rate in samples per second.
 
-Note that this schema allows for the existence of multiple `onda.signal` instances with the same `kind` and `recording`. In this instance, these `onda.signal` instances should be interpreted as digitized outputs of the same underlying process at their respective `span`s, thus enabling the representation/storage of discontiguous/overlapping sample data. Beyond this definition, further specification for the resolution of sample data discontinuities and/or overlaps for specific `kind`s/`recording`s/etc. is left to downstream, use-case-specific extensions of the `onda.signal` schema. For example, there may exist an `onda.signal` with `kind="eeg"` and `span=(start=Nanosecond(0), stop=Nanosecond(1e9))`, and another with the same `recording`/`kind` but with `span=(start=Nanosecond(2e9), stop=Nanosecond(3e9))`; downstream consumers may interpret this as a single EEG signal that is sampled for 1 second starting at the beginning of the recording, followed by a 1 second gap, followed by another second of sampling.
+Note that the `onda.signal@2` specification allows for the simultaneous existence of multiple `onda.signal@2` instances with the same `sensor_label` and `recording`, even though `sensor_label` is defined to act as a unique (within the context of the recording) identifier of the signal's sensor. This is because a recording may contain multiple *discontiguous* signals generated from the same underlying sensor at different time points, as specified by each signal's `span`. Thus, signals that share a common sensor within the same recording in this manner should have non-overlapping `span`s, but note that this property might not be holistically enforceable by Onda reader/writer implementations in all cases. Beyond this definition, further specification for the interpretation of discontinuous sample data for specific `sensor_type`s/`recording`s/etc. is left to downstream, use-case-specific extensions of `onda.signal@2`.
 
-When feasible in practice, it is recommended that data producers manually concatenate discontiguous sample data into a single `onda.signal` and use `NaN` values to represent unsampled regions, rather than represent discontiguous segments via separate `onda.signal`s, as the former approach is often more convenient than the latter for downstream consumers.
+For example, there may exist an `onda.signal@2` with `sensor_label="eeg"` and `span=(start=Nanosecond(0), stop=Nanosecond(1e9))`, and another with the same `recording`/`sensor_label` but with `span=(start=Nanosecond(2e9), stop=Nanosecond(3e9))`. Downstream consumers may interpret this as two EEG signals from the same sensor, sampled at different time points: the sensor generated the first 1-second signal at the beginning of the recording, followed by a 1 second gap, followed by another second of sampling.
+
+When feasible in practice, it is recommended that data producers manually concatenate discontiguous sample data into a single signal and use `NaN` values to represent unsampled regions, rather than represent discontiguous segments via separate signals, as the former approach is often more convenient than the latter for downstream consumers.
 
 ##### Examples
 
 | `recording`                          | `file_path`                                        | `file_format`                                            | `span`                       | `kind`     | `channels`                              | `sample_unit` | `sample_resolution_in_unit` | `sample_offset_in_unit` | `sample_type` | `sample_rate` | `my_custom_value`             |
 |--------------------------------------|----------------------------------------------------|----------------------------------------------------------|------------------------------|------------|-----------------------------------------|---------------|-----------------------------|-------------------------|---------------|---------------|-------------------------------|
-| `0xb14d2c6d8d844e46824f5c5d857215b4` | `"./relative/path/to/samples.lpcm"`                | `"lpcm"`                                                 | `(start=10e9, stop=10900e9)` | `"eeg"`    | `["fp1", "f3", "f7", "fz", "f4", "f8"]` | `"microvolt"` | `0.25`                      | `3.6`                   | `"int16"`     | `256`         | `"this is a value"`           |
+| `0xb14d2c6d8d844e46824f5c5d857215b4` | `"./relative/path/to/samples.lpcm"`                | `"lpcm"`                                                 | `(start=10e9, stop=10900e9)` | `"eeg"`    | `["fp1", "f3", "f7", "fz", "f4", "f8"]` | `"microvolt"` | `0.25`                      | `3.6`                   | `"int16"`     | `256.0`       | `"this is a value"`           |
 | `0xb14d2c6d8d844e46824f5c5d857215b4` | `"s3://bucket/prefix/obj.lpcm.zst"`                | `"lpcm.zst"`                                             | `(start=0, stop=10800e9)`    | `"ecg"`    | `["avl", "avr"]`                        | `"microvolt"` | `0.5`                       | `1.0`                   | `"int16"`     | `128.3`       | `"this is a different value"` |
-| `0x625fa5eadfb24252b58d1eb350fa7df6` | `"s3://other-bucket/prefix/obj_with_no_extension"` | `"flac"`                                                 | `(start=100e9, stop=500e9)`  | `"audio"`  | `["left", "right"]`                     | `"scalar"`    | `1.0`                       | `0.0`                   | `"float32"`   | `44100`       | `"this is another value"`     |
+| `0x625fa5eadfb24252b58d1eb350fa7df6` | `"s3://other-bucket/prefix/obj_with_no_extension"` | `"flac"`                                                 | `(start=100e9, stop=500e9)`  | `"audio"`  | `["left", "right"]`                     | `"scalar"`    | `1.0`                       | `0.0`                   | `"float32"`   | `44100.0`     | `"this is another value"`     |
 | `0xa5c01f0e50fe4acba065fcf474e263f5` | `"./another-relative/path/to/samples"`             | `"custom_price_format:{\"parseable_json_parameter\":3}"` | `(start=0, stop=3600e9)`     | `"price"`  | `["price"]`                             | `"dollar"`    | `0.01`                      | `0.0`                   | `"uint32"`    | `50.75`       | `"wow what a great value"`    |
 
 ##### Sample Data Files
@@ -158,6 +161,10 @@ where the division is followed/preceded by whatever quantization strategy is cho
 decoded_value = (encoded_value * sample_resolution_in_unit) + sample_offset_in_unit
 ```
 
+##### Previous Versions
+
+- [`onda.signal@1`](https://github.com/beacon-biosignals/Onda.jl/tree/v0.14.10#ondasignal1)
+
 #### `onda.annotation@1`
 
 ##### Columns

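The decoding formula kept as context in the README hunk above (`decoded_value = (encoded_value * sample_resolution_in_unit) + sample_offset_in_unit`) can be illustrated with a minimal, Base-only Julia sketch. The helper names `decode_sample` and `encode_sample` are hypothetical and are not part of Onda.jl, and rounding to the nearest integer is only one assumed quantization strategy, since the spec leaves that choice open. The numeric values come from the first row of the README example table.

```julia
# Hypothetical helpers for illustration only; these are not Onda.jl functions.

# Decode a stored LPCM sample into the signal's canonical unit,
# following the formula given in the README.
decode_sample(encoded, resolution, offset) = encoded * resolution + offset

# Inverse direction. Rounding to the nearest integer is an assumption here;
# the spec allows any quantization strategy around the division.
function encode_sample(::Type{T}, decoded, resolution, offset) where {T<:Integer}
    return round(T, (decoded - offset) / resolution)
end

# First row of the example table: "int16" samples with a resolution of
# 0.25 microvolt and an offset of 3.6 microvolt.
resolution, offset = 0.25, 3.6
encoded = Int16[-4, 0, 4, 100]

decoded = decode_sample.(encoded, resolution, offset)
roundtrip = encode_sample.(Int16, decoded, resolution, offset)
```

Because `sample_resolution_in_unit` and `sample_offset_in_unit` are `FloatingPoint{DOUBLE}` columns in `onda.signal@2`, the decoded values in this sketch are `Float64` regardless of the integer `sample_type`.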
diff --git a/docs/src/index.md b/docs/src/index.md
@@ -22,19 +22,18 @@ Onda.read_byte_range
 ## `onda.annotation`
 
 ```@docs
-Annotation
+AnnotationV1
 write_annotations
-validate_annotations
+MergedAnnotationV1
 merge_overlapping_annotations
 ```
 
 ## `onda.signal`
 
 ```@docs
-Signal
-SamplesInfo
+SignalV2
+SamplesInfoV2
 write_signals
-validate_signals
 channel(x, name)
 channel(x, i::Integer)
 channel_count(x)
@@ -88,6 +87,7 @@ Onda.file_format_string
 
 ```@docs
 VALIDATE_SAMPLES_DEFAULT
+Onda.upgrade
 ```
 
 ## Developer Installation

diff --git a/docs/src/upgrading.md b/docs/src/upgrading.md
@@ -1,5 +1,19 @@
 # Upgrading From Older Versions Of Onda
 
+## To v0.15 From v0.14
+
+First, ensure your codebase is fully upgraded to Onda v0.14, including resolving all deprecation warnings.
+
+From there, breaking changes include:
+
+- Functionality that was "soft-deprecated" in Onda v0.14 (i.e. code would still work, but warnings would be raised) is now "hard-deprecated" as of Onda v0.15.
+
+- Onda is now built atop Legolas v0.5, instead of Legolas v0.4. Users of Onda v0.15 must therefore upgrade their usage of Legolas to Legolas v0.5 (see [here for details](https://github.com/beacon-biosignals/Legolas.jl/pull/54)). One of the most significant implications of this breaking change is that `Annotation` and `Signal` (formerly, aliases of the old `Legolas.Row` type) have been replaced with `AnnotationV1` and `SignalV2` (subtypes of the new `Legolas.AbstractRecord` type). The latter types have slightly different semantics than the former, especially in that the new types do not propagate non-required fields in the same manner as the old types. Even though a comprehensive deprecation path isn't provided, invocations of `Annotation(...)`/`Signal(...)` will now throw a descriptive error with suggestions on how to upgrade.
+
+- `onda.samples-info@2` and `onda.signal@2` have been introduced, which replace `onda.samples-info@1` and `onda.signal@1` respectively. Onda still declares/provides these first-generation schema versions, so that corresponding `Legolas.read` invocations may still work as expected, but all other Onda v0.15 API structures/functions utilize (and/or expect) the second-generation schema versions. Onda v0.15 provides [`Onda.upgrade`](@ref) to conveniently upgrade first-generation data to the new generation.
+
+- The functions `Onda.write_signals`, `Onda.write_annotations`, and `Onda.validate` have been soft-deprecated, and will thus continue to work - but will raise deprecation warnings - until the next breaking version update.
+
 ## To v0.14 From v0.13
 
 Potentially breaking changes include:

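To make the field-level differences between `onda.signal@1` and `onda.signal@2` concrete, here is a rough sketch on plain `NamedTuple`s using only Base Julia. It is not the implementation of `Onda.upgrade`, and `upgrade_signal_row` is a hypothetical name. The sketch assumes that the old `kind` value is reused for both `sensor_type` and `sensor_label`, and that the numeric fields are converted to `Float64` to match the `FloatingPoint{DOUBLE}` columns of the new schema version.

```julia
# Hypothetical sketch; `upgrade_signal_row` is not part of Onda.jl and does
# not claim to reproduce what `Onda.upgrade` does internally.
function upgrade_signal_row(row::NamedTuple)
    # Keep every field except the v1-only `kind` column.
    rest = (; (k => v for (k, v) in pairs(row) if k !== :kind)...)
    # Assumption: `kind` maps onto both new sensor columns, and numeric
    # fields are widened to Float64 as required by `onda.signal@2`.
    return merge(rest,
                 (sensor_type=row.kind,
                  sensor_label=row.kind,
                  sample_resolution_in_unit=Float64(row.sample_resolution_in_unit),
                  sample_offset_in_unit=Float64(row.sample_offset_in_unit),
                  sample_rate=Float64(row.sample_rate)))
end

# A v1-style row with illustrative values (integers were allowed in v1).
old_row = (recording="example", kind="eeg", channels=["fp1", "f3"],
           sample_unit="microvolt", sample_resolution_in_unit=1,
           sample_offset_in_unit=0, sample_type="int16", sample_rate=256)

new_row = upgrade_signal_row(old_row)
```

For real data, the documented `Onda.upgrade` function is the supported path; this sketch only shows which columns change between the two schema versions.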
diff --git a/examples/flac.jl b/examples/flac.jl
@@ -34,8 +34,7 @@ struct FLACFormat{S} <: Onda.AbstractLPCMFormat
     end
 end
 
-FLACFormat(info::SamplesInfo; kwargs...) = FLACFormat(LPCMFormat(info); sample_rate=info.sample_rate,
-                                                      kwargs...)
+FLACFormat(info; kwargs...) = FLACFormat(LPCMFormat(info); sample_rate=info.sample_rate, kwargs...)
 
 Onda.register_lpcm_format!(file_format -> file_format == "flac" ? FLACFormat : nothing)
 
@@ -104,12 +103,12 @@ saws(info, duration) = [(j + i) % 100 * info.sample_resolution_in_unit for
 
 if VERSION >= v"1.1.0"
     @testset "FLAC example" begin
-        info = SamplesInfo(; kind="test", channels=["a", "b", "c"],
-                           sample_unit="unit",
-                           sample_resolution_in_unit=0.25,
-                           sample_offset_in_unit=-0.5,
-                           sample_type=Int16,
-                           sample_rate=50)
+        info = SamplesInfoV2(; sensor_type="test", channels=["a", "b", "c"],
+                             sample_unit="unit",
+                             sample_resolution_in_unit=0.25,
+                             sample_offset_in_unit=-0.5,
+                             sample_type=Int16,
+                             sample_rate=50)
         data = saws(info, Minute(3))
         samples = encode(Samples(data, info, false))
         fmt = FLACFormat(info)