
More than likely if you're stuck we can help 😊.

## Why do we analyze data in 1-minute chunks?

There are a couple of reasons, but the main one is that when we first started doing
this, computers were far less efficient than they are now. Computers are fundamentally
limited by the amount of data they can work with at one time; in technical terms,
the amount of data that can fit into main memory (RAM). By breaking the data into
smaller chunks, we could _stream_ the data through the analysis, which greatly
improved our overall analysis speed.

We could have chosen any size from 30 seconds to 5 minutes, but
one-minute blocks also have nice temporal properties (they compose well with data
at different resolutions) and are still detailed enough to answer questions in
our multi-day analyses.

Today it seems to be the de facto standard to analyze data in one-minute blocks.
We suggest that it is still a good default for most use cases:

- Computers don't have the same limitations as they did when we started, but small
  blocks of data allow for parallel analysis that effectively utilizes all CPU cores.
- While computers are getting better, we're also doing more complex analyses. By
  processing chunks in parallel we can use a large amount of RAM and most of the
  computer's CPUs for the quickest analysis.
- One-minute blocks still retain the nice temporally composable attributes detailed
  above.
- And since one-minute blocks seem to be a de facto standard, they (by happenstance)
  provide common ground for comparing data.

## What effect does chunk size have on data?

For acoustic event recognition, boundary effects are typically the only consequence
of the chunk-size choice. That is, if an acoustic event is clipped by either the
start or end of a chunk, leaving only a partial vocalization, a typical event
recognizer may not detect it.

For acoustic indices, from a theoretical point of view, chunk size raises the same
kinds of issues as the choice of FFT frame length in speech processing. Because an
FFT assumes signal stationarity, one chooses a frame length over which the spectral
content of the signal of interest is approximately constant. In the case of
acoustic indices, one chooses an index calculation duration that captures
a sufficient amount of data for the acoustic feature you are interested in.
The longer the analysis duration, the more you blur or average out features of
interest. However, if you choose too short an interval, the calculated
index may be dominated by "noise"---that is, features that are not of interest.
We find this is particularly the case with the ACI index; one minute seems to be
an appropriate duration for the indices that are typically calculated.

## Can I change the chunk size?

The [config files](https://github.com/QutEcoacoustics/audio-analysis/tree/master/src/AnalysisConfigFiles) for most of our analyses contain these common settings:

### Chunk-size:

```yaml
# SegmentDuration: units=seconds;
# Long duration recordings are cut into short segments for more
# efficient processing. Default segment length = 60 seconds.
SegmentDuration: 60
```

Here, `SegmentDuration` is our name for the chunk size. If, for example, you wanted
to process data in 5-minute chunks, you could change the configuration to
`SegmentDuration: 300`.

### Chunk-overlap:

```yaml
# SegmentOverlap: units=seconds;
SegmentOverlap: 0
```

If you're doing event recognition and you're concerned about boundary effects,
you could change the overlap to `SegmentOverlap: 10`, which would ensure every
`SegmentDuration`-sized chunk (typically one minute long) is cut with an extra
trailing 10-second buffer.

Note: we rarely change this setting, and too much overlap may produce
duplicate events.
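
To make the boundary behaviour concrete, here is a minimal Python sketch (not part
of the tool; the function name and the exact cutting logic are our assumptions) of
how a recording might be divided into chunks with `SegmentDuration: 60` and
`SegmentOverlap: 10`:

```python
# A minimal sketch (not part of the audio-analysis tool) of how a recording
# could be cut into overlapping chunks under the settings discussed above.
def segment_bounds(total_seconds, segment_duration=60, segment_overlap=10):
    """Yield (start, end) times: chunks start every `segment_duration` seconds
    and extend by an extra trailing `segment_overlap`-second buffer."""
    start = 0
    while start < total_seconds:
        end = min(start + segment_duration + segment_overlap, total_seconds)
        yield (start, end)
        start += segment_duration

# e.g. a 150-second recording -> [(0, 70), (60, 130), (120, 150)]
print(list(segment_bounds(150)))
```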

### Index Calculation Duration (for the indices analysis only):

For acoustic indices in particular, the calculation resolution depends on a
second setting, which is limited by the chunk size (`SegmentDuration`):

```yaml
# IndexCalculationDuration: units=seconds (default=60 seconds)
# The Timespan (in seconds) over which summary and spectral indices are calculated
# This value MUST not exceed value of SegmentDuration.
# Default value = 60 seconds, however can be reduced down to 0.1 seconds for higher resolution.
# IndexCalculationDuration should divide SegmentDuration with MODULO zero
IndexCalculationDuration: 60.0
```
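
As the config comment notes, `IndexCalculationDuration` must not exceed
`SegmentDuration` and must divide it evenly. A hypothetical sanity check (not part
of the tool) for these two constraints might look like:

```python
# Hypothetical sanity check (not part of the audio-analysis tool): the config
# comment says IndexCalculationDuration must not exceed SegmentDuration and
# must divide it with modulo zero.
segment_duration = 60.0
index_calculation_duration = 60.0

assert index_calculation_duration <= segment_duration
ratio = segment_duration / index_calculation_duration
assert abs(ratio - round(ratio)) < 1e-9, \
    "IndexCalculationDuration must divide SegmentDuration evenly"

# Each segment then yields this many rows of summary/spectral indices.
print(int(round(ratio)))  # 1 for the defaults above
```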

If you wanted indices calculated over a duration longer than one minute, you
could change both `SegmentDuration` and `IndexCalculationDuration` to higher
values:

```yaml
SegmentDuration: 300
IndexCalculationDuration: 300
```

However, we suggest that there are better methods for calculating low-resolution
indices. A method we often use is to calculate indices at a 60-second resolution
and aggregate the values into lower-resolution blocks (see the sketch after this
list). The aggregation method can provide some interesting choices:

- We've seen the maximum, median, or minimum value for a block of indices
  chosen (and sometimes all three).
  - Be cautious when using a mean, though: it can skew the value of
    logarithmic indices.
- And we've seen a block of indices flattened into a larger feature vector and
  fed to a machine learning or clustering algorithm.
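
As a concrete illustration of the block-aggregation approach, here is a minimal
Python sketch (the array shapes and the use of NumPy are our assumptions, not part
of the tool) that aggregates 60-second index values into 5-minute blocks:

```python
import numpy as np

# A minimal sketch of aggregating one-minute index values into 5-minute blocks.
# `minute_values` stands in for one summary index (one value per minute).
minute_values = np.random.rand(120)  # two hours of hypothetical index values

block_size = 5  # minutes per aggregated block
blocks = minute_values[: len(minute_values) // block_size * block_size]
blocks = blocks.reshape(-1, block_size)

# The aggregation function is a choice: maximum, median, or minimum
# (a mean can skew logarithmic indices, as noted above).
block_max = blocks.max(axis=1)
block_median = np.median(blocks, axis=1)
block_min = blocks.min(axis=1)

# Alternatively, keep each block whole as a larger feature vector for
# clustering or machine learning.
feature_vectors = blocks  # shape: (24, 5) for the two hours above
```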


## We collect metrics/statistics; what information is collected and how is it used?

**NOTE: this is an upcoming feature and has not been released yet**