Merged
32 changes: 23 additions & 9 deletions doc/BufHPSS.rst
@@ -5,9 +5,23 @@
:see-also: HPSS, BufSines, BufTransients
:description: FluidBufHPSS performs Harmonic-Percussive Source Separation (HPSS) on the contents of a Buffer.
:discussion:
HPSS works by using median filters on the spectral magnitudes of a sound. It hinges on a simple modelling assumption that tonal components will tend to yield concentrations of energy across time, spread out in frequency, and percussive components will manifest as concentrations of energy across frequency, spread out in time. By using median filters across time and frequency respectively, we get initial esitmates of the tonal-ness / transient-ness of a point in time and frequency. These are then combined into 'masks' that are applied to the orginal spectral data in order to produce a separation.

The maskingMode parameter provides different approaches to combinging estimates and producing masks. Some settings (especially in modes 1 & 2) will provide better separation but with more artefacts. These can, in principle, be ameliorated by applying smoothing filters to the masks before transforming back to the time-domain (not yet implemented).
HPSS takes in audio and divides it into two or three outputs, depending on the ``maskingMode``:
* a harmonic component
* a percussive component
* a residual of the previous two if ``maskingMode`` is set to 2 (inter-dependent thresholds). See below.

HPSS works by using median filters on the magnitudes of a spectrogram. It makes certain assumptions about what it is looking for in a sound: that in a spectrogram “percussive” elements tend to form vertical “ridges” (tall in frequency band, narrow in time), while stable “harmonic” elements tend to form horizontal “ridges” (narrow in frequency band, long in time). By using median filters across time and frequency respectively, we get initial estimates of the "harmonic-ness" and "percussive-ness" for every spectral bin of every spectral frame in the spectrogram. These are then combined into 'masks' that are applied to the original spectrogram in order to produce a harmonic and percussive output (and residual if ``maskingMode`` = 2).

The maskingMode parameter provides different approaches to combining estimates and producing masks. Some settings (especially in modes 1 & 2) will provide better separation but with more artefacts.
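
As a rough illustration of the median-filter idea described above, here is a minimal Python sketch on a toy magnitude grid. It is not the FluCoMa implementation: the function names, the squared-magnitude (Wiener-style) ratio, the edge padding, and the even split where both estimates are zero are all assumptions made for this example.

```python
from statistics import median


def median_filter_1d(values, size):
    """Median-filter a sequence with an odd window, repeating the edge values."""
    half = size // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [median(padded[i:i + size]) for i in range(len(values))]


def soft_masks(mag, harm_size=3, perc_size=3):
    """Return (harmonic_mask, percussive_mask) for mag[bin][frame]."""
    n_bins, n_frames = len(mag), len(mag[0])
    # Harmonic estimate: median across time, one frequency bin (row) at a time.
    harm = [median_filter_1d(row, harm_size) for row in mag]
    # Percussive estimate: median across frequency, one frame (column) at a time.
    perc_cols = [median_filter_1d([mag[b][t] for b in range(n_bins)], perc_size)
                 for t in range(n_frames)]
    h_mask = [[0.0] * n_frames for _ in range(n_bins)]
    p_mask = [[0.0] * n_frames for _ in range(n_bins)]
    for b in range(n_bins):
        for t in range(n_frames):
            h2 = harm[b][t] ** 2
            p2 = perc_cols[t][b] ** 2
            total = h2 + p2
            # Assumption: split evenly where both estimates are zero.
            h_mask[b][t] = h2 / total if total > 0 else 0.5
            # Complementary mask, so the two always sum to 1.
            p_mask[b][t] = 1.0 - h_mask[b][t]
    return h_mask, p_mask


# Toy spectrogram: a horizontal ridge at bin 2 and a vertical ridge at frame 2.
mag = [[1.0 if (b == 2 or t == 2) else 0.0 for t in range(5)] for b in range(5)]
h_mask, p_mask = soft_masks(mag)
print(h_mask[2][0], p_mask[0][2], h_mask[2][2])  # 1.0 1.0 0.5
```

The horizontal ridge is assigned entirely to the harmonic mask, the vertical ridge entirely to the percussive mask, and the point where they cross is shared equally. Because the two masks are complementary, applying them to the same spectrogram gives two parts that sum back to the original.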

Driedger et al. (2014) suggest that the size of the median filters doesn't affect the outcome as much as the ``fftSize``. With large FFT sizes, short percussive sounds have less representation, so the harmonic component is more strongly represented and many of the percussive sounds leak into the harmonic component. Small FFT sizes have less resolution in the frequency domain and often blur horizontal structures, so harmonic sounds tend to leak into the percussive component. As with all FFT-based processes, finding an FFT size that balances spectral and temporal resolution for a given source sound will benefit the use of this object.

For more details visit https://learn.flucoma.org/reference/hpss

Fitzgerald, Derry. 2010. ‘Harmonic/Percussive Separation Using Median Filtering’. (In Proceedings DaFx 10. https://arrow.dit.ie/argcon/67.)

Driedger, Jonathan, Meinard Müller, and Sascha Disch. 2014. ‘Extending Harmonic-Percussive Separation of Audio Signals’. (In Proc. ISMIR. http://www.terasoft.com.tw/conf/ismir2014/proceedings/T110_127_Paper.pdf.)

:process: This is the method that calls for the HPSS to be calculated on a given source buffer.
:output: Nothing, as the various destination buffers are declared in the function call.
@@ -59,18 +73,18 @@

:enum:

:0:
The traditional soft mask used in Fitzgerald's original method of 'Wiener-inspired' filtering. Complimentary, soft masks are made for the harmonic and percussive parts by allocating some fraction of a point in time-frequency to each. This provides the fewest artefacts, but the weakest separation. The two resulting buffers will sum to exactly the original material.
:0:
Soft masks provide the fewest artefacts, but the weakest separation. Complementary soft masks are made for the harmonic and percussive parts by allocating some fraction of every magnitude in the spectrogram to each mask. The two resulting buffers will sum to exactly the original material. This mode uses the soft masks from Fitzgerald's (2010) original method of 'Wiener-inspired' filtering.

:1:
Relative mode - Better separation, with more artefacts. The harmonic mask is constructed using a binary decision, based on whether a threshold is exceeded at a given time-frequency point (these are set using harmThreshFreq1, harmThreshAmp1, harmThreshFreq2, harmThreshAmp2, see below). The percussive mask is then formed as the inverse of the harmonic one, meaning that as above, the two components will sum to the original sound.
:1:
Binary masks provide better separation, but with more artefacts. The harmonic mask is constructed using a binary decision, based on whether a threshold is exceeded for every magnitude in the spectrogram (these are set using ``harmThreshFreq1``, ``harmThreshAmp1``, ``harmThreshFreq2``, ``harmThreshAmp2``, see below). The percussive mask is then formed as the inverse of the harmonic one, meaning that as above, the two components will sum to the original sound.

:2:
Inter-dependent mode - Thresholds can be varied independently, but are coupled in effect. Binary masks are made for each of the harmonic and percussive components, and the masks are converted to soft at the end so that everything null sums even if the params are independent, that is what makes it harder to control. These aren't guranteed to cover the whole sound; in this case the 'leftovers' will placed into a third buffer.
:2:
Soft masks (with a third stream containing a residual component). First, binary masks are made separately for the harmonic and percussive components using different thresholds (set with the respective ``harmThresh-`` and ``percThresh-`` parameters below). Because these masks aren't guaranteed to represent the entire spectrogram, any residual energy is considered as a third output. The independently created binary masks are converted to soft masks at the end of the process so that everything null sums.

:control harmThresh:

When maskingmode is 1 or 2, set the threshold curve for classifying an FFT bin as harmonic. Takes a list of two frequency-amplitude pairs as coordinates: between these coordinates the threshold is linearly interpolated, and is kept constant between DC and coordinate 1, and coordinate 2 and Nyquist.
When ``maskingMode`` is 1 or 2, set the threshold curve for classifying an FFT bin as harmonic. Takes a list of two frequency-amplitude pairs as coordinates: between these coordinates the threshold is linearly interpolated, and is kept constant between DC and coordinate 1, and coordinate 2 and Nyquist.
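
The threshold curve described above can be sketched in a few lines of Python. This is only an illustration of the interpolation rule: the function name is hypothetical, it assumes the first frequency is lower than the second, and it says nothing about how the object compares the threshold against its estimates. The coordinate values are borrowed from the mode 2 example in this pull request.

```python
def threshold_at(freq, freq1, amp1, freq2, amp2):
    """Threshold at `freq` for a curve through (freq1, amp1) and (freq2, amp2).

    Assumes freq1 < freq2 and that `freq` uses the same units as the coordinates.
    """
    if freq <= freq1:
        return amp1  # constant between DC and coordinate 1
    if freq >= freq2:
        return amp2  # constant between coordinate 2 and Nyquist
    # Linear interpolation between the two coordinates.
    frac = (freq - freq1) / (freq2 - freq1)
    return amp1 + frac * (amp2 - amp1)


# Coordinates taken from the mode 2 example: (0.005, 3) and (0.2, 6).
low = threshold_at(0.0, 0.005, 3, 0.2, 6)
mid = threshold_at(0.1025, 0.005, 3, 0.2, 6)
high = threshold_at(0.5, 0.005, 3, 0.2, 6)
print(low, high)
```

Setting both amplitudes to the same value, as several of the examples do, gives a flat threshold across the whole spectrum.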

:control percThresh:

32 changes: 16 additions & 16 deletions doc/HPSS.rst
@@ -5,24 +5,25 @@
:see-also: BufHPSS, Sines, Transients
:description: Harmonic-Percussive Source Separation (HPSS) on an audio input.
:discussion:
HPSS takes in audio and divides it into two or three outputs, depending on the mode:
* an harmonic component
* a percussive component
* a residual of the previous two if the flag is set to inter-dependant thresholds. See the maskingMode below.
HPSS takes in audio and divides it into two or three outputs, depending on the ``maskingMode``:
* a harmonic component
* a percussive component
* a residual of the previous two if ``maskingMode`` is set to 2.

HPSS works by using median filters on the magnitudes of a spectrogram. It makes certain assumptions about what it is looking for in a sound: it is based on the observation that in a spectrogram “percussive” elements tend to form vertical “ridges” (tall in frequency band, narrow in time), while stable “harmonic” elements tend to form horizontal “ridges” (narrow in frequency band, long in time). By using median filters across time and frequency respectively, we get initial esitmates of the harominc-ness and percussive-ness of a point in time and frequency. These are then combined into 'masks' that are applied to the orginal spectrogram in order to produce a harmonic and percussive output (and residual if maskingMode = 2).
HPSS works by using median filters on the magnitudes of a spectrogram. It makes certain assumptions about what it is looking for in a sound: that in a spectrogram “percussive” elements tend to form vertical “ridges” (tall in frequency band, narrow in time), while stable “harmonic” elements tend to form horizontal “ridges” (narrow in frequency band, long in time). By using median filters across time and frequency respectively, we get initial estimates of the "harmonic-ness" and "percussive-ness" for every spectral bin of every spectral frame in the spectrogram. These are then combined into 'masks' that are applied to the original spectrogram in order to produce a harmonic and percussive output (and residual if ``maskingMode`` = 2).

The maskingMode parameter provides different approaches to combining estimates and producing masks. Some settings (especially in modes 1 & 2) will provide better separation but with more artefacts.

Driedger et al. (2014) suggest that the size of the median filters doesn't affect the outcome as much as the ``fftSize``. With large FFT sizes, short percussive sounds have less representation, so the harmonic component is more strongly represented and many of the percussive sounds leak into the harmonic component. Small FFT sizes have less resolution in the frequency domain and often blur horizontal structures, so harmonic sounds tend to leak into the percussive component. As with all FFT-based processes, finding an FFT size that balances spectral and temporal resolution for a given source sound will benefit the use of this object.

For more details visit https://learn.flucoma.org/reference/hpss

These processes are described in:

Fitzgerald, Derry. 2010. ‘Harmonic/Percussive Separation Using Median Filtering’. (In Proceedings DaFx 10. https://arrow.dit.ie/argcon/67.)
It also provides the variation detailed in Driedger, Jonathan, Meinard Müller, and Sascha Disch. 2014. ‘Extending Harmonic-Percussive Separation of Audio Signals’. (In Proc. ISMIR. http://www.terasoft.com.tw/conf/ismir2014/proceedings/T110_127_Paper.pdf.)

Driedger, Jonathan, Meinard Müller, and Sascha Disch. 2014. ‘Extending Harmonic-Percussive Separation of Audio Signals’. (In Proc. ISMIR. http://www.terasoft.com.tw/conf/ismir2014/proceedings/T110_127_Paper.pdf.)

:process: The audio rate version of the object.
:output: An array of three audio streams: [0] is the harmonic part extracted, [1] is the percussive part extracted, [2] is the rest. The latency between the input and the output is ((harmFilterSize - 1) * hopSize) + windowSize) samples.

:output: An array of three audio streams: [0] is the harmonic part extracted, [1] is the percussive part extracted, [2] is the residual. This object will always have a residual output stream, but when using ``maskingMode`` 0 or 1 this stream will be silent. The latency between the input and the output is ((harmFilterSize - 1) * hopSize) + windowSize samples.
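
To get a feel for the latency figure, the formula can be evaluated directly. The sketch below is plain arithmetic in Python; the function name is hypothetical and the parameter values (a harmonic filter of 17 frames, a hop of 512 samples, a window of 1024 samples, a 44100 Hz sample rate) are illustrative assumptions, not values read from the object.

```python
def hpss_latency_samples(harm_filter_size, hop_size, window_size):
    """Latency in samples: ((harmFilterSize - 1) * hopSize) + windowSize."""
    return (harm_filter_size - 1) * hop_size + window_size


# Illustrative values only.
latency = hpss_latency_samples(harm_filter_size=17, hop_size=512, window_size=1024)
print(latency)           # 9216
print(latency / 44100)   # latency in seconds at an assumed 44100 Hz sample rate
```

Since the harmonic filter size multiplies the hop size, a longer harmonic filter increases the delay much faster than a larger window does.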

:control in:

@@ -43,21 +44,21 @@
:enum:

:0:
The traditional soft mask used in Fitzgerald's original method of 'Wiener-inspired' filtering. Complimentary, soft masks are made for the harmonic and percussive parts by allocating some fraction of a point in time-frequency to each. This provides the fewest artefacts, but the weakest separation. The two resulting buffers will sum to exactly the original material.
Soft masks provide the fewest artefacts, but the weakest separation. Complementary soft masks are made for the harmonic and percussive parts by allocating some fraction of every magnitude in the spectrogram to each mask. The two resulting outputs will sum to exactly the original material. This mode uses the soft masks from Fitzgerald's (2010) original method of 'Wiener-inspired' filtering.

:1:
Relative mode - Better separation, with more artefacts. The harmonic mask is constructed using a binary decision, based on whether a threshold is exceeded at a given time-frequency point (these are set using harmThreshFreq1, harmThreshAmp1, harmThreshFreq2, harmThreshAmp2, see below). The percussive mask is then formed as the inverse of the harmonic one, meaning that as above, the two components will sum to the original sound.
Binary masks provide better separation, but with more artefacts. The harmonic mask is constructed using a binary decision, based on whether a threshold is exceeded for every magnitude in the spectrogram (these are set using ``harmThreshFreq1``, ``harmThreshAmp1``, ``harmThreshFreq2``, ``harmThreshAmp2``, see below). The percussive mask is then formed as the inverse of the harmonic one, meaning that as above, the two components will sum to the original sound.

:2:
Inter-dependent mode - Thresholds can be varied independently, but are coupled in effect. Binary masks are made for each of the harmonic and percussive components, and the masks are converted to soft at the end so that everything null sums even if the params are independent, that is what makes it harder to control. These aren't guranteed to cover the whole sound; in this case the 'leftovers' will placed into a third buffer.
Soft masks (with a third stream containing a residual component). First, binary masks are made separately for the harmonic and percussive components using different thresholds (set with the respective ``harmThresh-`` and ``percThresh-`` parameters below). Because these masks aren't guaranteed to represent the entire spectrogram, any residual energy is considered as a third output. The independently created binary masks are converted to soft masks at the end of the process so that everything null sums.

:control harmThresh:

When maskingmode is 1 or 2, set the threshold curve for classifying an FFT bin as harmonic. Takes a list of two frequency-amplitude pairs as coordinates: between these coordinates the threshold is linearly interpolated, and is kept constant between DC and coordinate 1, and coordinate 2 and Nyquist.
When ``maskingMode`` is 1 or 2, set the threshold curve for classifying an FFT bin as harmonic. Takes a list of two frequency-amplitude pairs as coordinates: between these coordinates the threshold is linearly interpolated, and is kept constant between DC and coordinate 1, and coordinate 2 and Nyquist.

:control percThresh:

In maskingmode 2, an independant pair of frequency-amplitude pairs defining the threshold for the percussive part. Its format is the same as above.
In ``maskingMode`` 2, an independent pair of frequency-amplitude pairs defining the threshold for the percussive part. Its format is the same as above.

:control windowSize:

@@ -82,4 +83,3 @@
:control maxPercFilterSize:

How large can the percussive filter be modulated to (percFilterSize), by allocating memory at instantiation time. This cannot be modulated.

119 changes: 64 additions & 55 deletions example-code/sc/BufHPSS.scd
@@ -1,81 +1,90 @@

code::
STRONG::Mode 0::

CODE::

//load buffers
(
b = Buffer.read(s,FluidFilesPath("Tremblay-AaS-SynthTwoVoices-M.wav"));
c = Buffer.new(s);
d = Buffer.new(s);
e = Buffer.new(s);
~src = Buffer.read(s,FluidFilesPath("Nicol-LoopE-M.wav"));
~harmonic = Buffer.new(s);
~percussive = Buffer.new(s);
)

// run with basic parameters
(
Routine{
t = Main.elapsedTime;
FluidBufHPSS.process(s, b, harmonic: c, percussive: d).wait;
(Main.elapsedTime - t).postln;
}.play
)
c.query
d.query
FluidBufHPSS.processBlocking(s,~src,harmonic:~harmonic,percussive:~percussive,action:{"done".postln;});

//play the harmonic
c.play;
//play the percussive
d.play;
~harmonic.play;

//nullsumming tests
{(PlayBuf.ar(1,c))+(PlayBuf.ar(1,d))+(-1*PlayBuf.ar(1,b,doneAction:2))}.play
//play the percussive
~percussive.play;

//more daring parameters, in mode 2
(
Routine{
t = Main.elapsedTime;
FluidBufHPSS.process(s, b, harmonic: c, percussive: d, residual:e, harmFilterSize:31, maskingMode:2, harmThreshFreq1: 0.005, harmThreshAmp1: 7.5, harmThreshFreq2: 0.168, harmThreshAmp2: 7.5, percThreshFreq1: 0.004, percThreshAmp1: 26.5, percThreshFreq2: 0.152, percThreshAmp2: 26.5,windowSize:4096,hopSize:512)
.wait;
(Main.elapsedTime - t).postln;
}.play
// See which parts of the Waveform are in which component
// blue = harmonic, orange = percussive
~fw = FluidWaveform(bounds:Rect(0,0,1600,400));
~fw.addAudioLayer(~harmonic,FluidViewer.categoryColors[0].alpha_(0.5));
~fw.addAudioLayer(~percussive,FluidViewer.categoryColors[1].alpha_(0.5));
~fw.front;
)

//play the harmonic
c.play;
//play the percussive
d.play;
//play the residual
e.play;

//still nullsumming
{PlayBuf.ar(1,c) + PlayBuf.ar(1,d) + PlayBuf.ar(1,e) - PlayBuf.ar(1,b,doneAction:2)}.play;
::

STRONG::A stereo buffer example.::
STRONG::Separating Components before Analysis (using Mode 1)::

CODE::

// load two very different files
~src = Buffer.read(s,FluidFilesPath("Tremblay-AaS-SynthTwoVoices-M.wav"));

// hear it
~src.play;

// let's look at some pitch analysis first
(
~pitch_analysis = Buffer(s);
FluidBufPitch.processBlocking(s,~src,features:~pitch_analysis,minFreq:40,maxFreq:500,windowSize:4096);
FluidWaveform(~src,featuresBuffer:~pitch_analysis,bounds:Rect(0,400,1600,400),stackFeatures:true);
)
// it's getting the "pitch" of all the clicks (the peaky spikes in the blue pitch plot),
// but perhaps I just want the pitch analysis of the bass line

// now let's do the pitch analysis using just the harmonic component
// because we're interested in strong separation and don't need to care about artefacts, maskingMode = 1
(
b = Buffer.read(s,FluidFilesPath("Tremblay-SA-UprightPianoPedalWide.wav"));
c = Buffer.read(s,FluidFilesPath("Tremblay-AaS-AcousticStrums-M.wav"));
~harmonic = Buffer(s);
FluidBufHPSS.processBlocking(s,~src,harmonic:~harmonic,harmFilterSize:17,percFilterSize:31,maskingMode:1);
FluidBufPitch.processBlocking(s,~harmonic,features:~pitch_analysis,minFreq:40,maxFreq:500,windowSize:4096);
FluidWaveform(~harmonic,featuresBuffer:~pitch_analysis,bounds:Rect(0,0,1600,400),stackFeatures:true);
)
// except for a few spikes at the end, this is much more usable for extracting the pitch of the bass notes.

// take a listen to what it's analyzing
~harmonic.play;

::

STRONG::Mode 2::

CODE::

// composite one on left one on right as test signals
(
Routine{
FluidBufCompose.process(s, c, numFrames:b.numFrames, startFrame:555000,destStartChan:1, destination:b).wait;
b.play
}.play
~src = Buffer.read(s,FluidFilesPath("Tremblay-CF-ChurchBells.wav"));
~residual = Buffer.new(s);
~harmonic = Buffer.new(s);
~percussive = Buffer.new(s);
)
// create 2 new buffers as destinations
d = Buffer.new(s); e = Buffer.new(s);

//run the process on them
// listen to the original;
~src.play

(
Routine{
t = Main.elapsedTime;
FluidBufHPSS.process(s, b, harmonic: d, percussive:e).wait;
(Main.elapsedTime - t).postln;
}.play
// this will take a few seconds (wait for "done" to post):
FluidBufHPSS.processBlocking(s,~src,harmonic:~harmonic,percussive:~percussive,residual:~residual,harmFilterSize:71,percFilterSize:31,maskingMode:2,harmThreshFreq1:0.005,harmThreshAmp1:3,harmThreshFreq2:0.2,harmThreshAmp2:6,percThreshFreq1:0.004,percThreshAmp1:3,percThreshFreq2:0.152,percThreshAmp2:3,windowSize:4096,hopSize:512,action:{"done".postln;})
)

//listen: stereo preserved!
d.play
e.play
// listen to the different parts
~harmonic.play; // some artefacts, but mostly the "harmonic" partials of the bells
~percussive.play; // has most of the "noisy" content of the bell resonance
~residual.play; // a bit of both, more of the attack

::