Constant Q-Transform #588

Open · 2 tasks
vincentqb opened this issue Apr 26, 2020 · 9 comments

@vincentqb (Contributor) commented Apr 26, 2020

We would like to have in torchaudio:

  • constant Q-transform, as in librosa
  • inverse constant Q-transform, as in librosa
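For reference, a minimal sketch of the librosa API this request points at, assuming librosa ≥ 0.8; the example clip, hop length, and bin count are arbitrary choices for illustration:

import librosa

# Load a short example clip bundled with librosa.
y, sr = librosa.load(librosa.ex("trumpet"))

# Forward constant-Q transform: complex values, shape (n_bins, n_frames).
C = librosa.cqt(y, sr=sr, hop_length=512, fmin=librosa.note_to_hz("C1"),
                n_bins=84, bins_per_octave=12)

# Inverse CQT: reconstructs an approximate time-domain signal from C.
y_hat = librosa.icqt(C, sr=sr, hop_length=512, fmin=librosa.note_to_hz("C1"))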
@dhgrs commented Sep 20, 2020

Hi, I'm interested in implementing CQT and I have questions about it.

  • As of librosa 0.8, the links in your post are outdated. The right references are probably cqt and griffinlim_cqt.
  • librosa has several variants of CQT. If we test the code by comparing results with librosa, there is a difficulty: librosa.cqt and librosa.griffinlim_cqt perform kaiser_* resampling internally, but torchaudio doesn't provide those resampling modes.
  • How about focusing on librosa.pseudo_cqt? It doesn't involve resampling, but it also doesn't support inverse conversion. (A minimal sketch follows this list.)
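For concreteness, a hedged sketch of the pseudo-CQT variant mentioned above; the parameters are illustrative, and as far as I know pseudo_cqt returns magnitudes only, which is why no inverse is available:

import librosa

y, sr = librosa.load(librosa.ex("trumpet"))

# Pseudo-CQT: a single STFT resolution, no internal resampling,
# magnitude-only output of shape (n_bins, n_frames).
C_mag = librosa.pseudo_cqt(y, sr=sr, hop_length=512,
                           fmin=librosa.note_to_hz("C1"),
                           n_bins=84, bins_per_octave=12)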

@KinWaiCheuk

I have implemented librosa's CQT in my project; I hope it's useful for you guys.
https://github.com/KinWaiCheuk/nnAudio/blob/master/Installation/nnAudio/Spectrogram.py#L990

Several improvements can be made:

  1. The for loop at line 1223 iterates over octaves and is currently the bottleneck.
  2. The low-pass filter at line 1032 is not as good as the librosa version.
  3. In librosa, a sparse matrix stores the frequency-domain CQT kernels. I did not use any sparse matrix in my implementation. Instead, I realized that obtaining frequency-domain CQT kernels is not necessary: time-domain CQT kernels work just as well and are even faster. This improved version is implemented as another PyTorch class, CQT2010v2, in my project. (A toy sketch of the time-domain idea follows below.)

Regarding pseudo_cqt: I know it does not use any downsampling, but I am not sure how it differs from the CQT algorithm proposed in 1992. If they are the same, then I also have that version, named CQT1992, in my project.

I hope this is useful for you guys; I am also curious about how to implement the inverse CQT.
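To illustrate the time-domain-kernel idea from point 3, here is a toy sketch, not the nnAudio code: it builds one Hann-windowed complex sinusoid per bin at the full sample rate (the direct, 1992-style approach, with no octave downsampling) and applies it with conv1d. Normalization, padding/centering, and kernel lengths are all simplified relative to librosa and nnAudio:

import numpy as np
import torch
import torch.nn.functional as F

def make_cqt_kernels(sr, fmin, n_bins, bins_per_octave=12):
    # Constant Q factor; each kernel spans about Q periods of its center frequency.
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    lengths = np.ceil(Q * sr / freqs).astype(int)
    kernels = np.zeros((n_bins, int(lengths.max())), dtype=np.complex64)
    for k, (f, l) in enumerate(zip(freqs, lengths)):
        t = np.arange(l)
        # Hann-windowed complex sinusoid, zero-padded to the longest kernel.
        kernels[k, :l] = (np.hanning(l) / l) * np.exp(2j * np.pi * f * t / sr)
    return torch.from_numpy(kernels)

def cqt_conv1d(y, kernels, hop_length=512):
    # Correlate the signal with the real and imaginary kernel banks.
    x = y.reshape(1, 1, -1)
    real = F.conv1d(x, kernels.real.contiguous().unsqueeze(1), stride=hop_length)
    imag = F.conv1d(x, kernels.imag.contiguous().unsqueeze(1), stride=hop_length)
    return torch.sqrt(real ** 2 + imag ** 2).squeeze(0)  # (n_bins, n_frames)

y = torch.randn(22050)  # one second of noise at 22.05 kHz, just for shape-checking
C = cqt_conv1d(y, make_cqt_kernels(sr=22050, fmin=32.703, n_bins=72))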

mthrok pushed a commit to mthrok/audio that referenced this issue Feb 26, 2021
@ktatar commented Mar 23, 2021

Hi all,
It seems like there is an implementation provided by @KinWaiCheuk. Do you plan to add this to the library?

@vincentqb (Contributor · Author)

There is no plan currently, but I'd welcome a pull request from the community that implements CQT and its inverse. If you are interested in working on such a pull request, please feel free to do so :)

  • The pull request does need to test against librosa as a reference.
  • For performance, we should compare the librosa and torchaudio implementations using timeit and see what the speed difference is (a minimal benchmark sketch follows this list).
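As a concrete, hedged example of the kind of benchmark meant here; proposed_cqt is a hypothetical stand-in for whatever the pull request implements:

import timeit

import librosa

y, sr = librosa.load(librosa.ex("trumpet"))

n = 10
t_ref = timeit.timeit(lambda: librosa.cqt(y, sr=sr), number=n) / n
print(f"librosa.cqt: {t_ref * 1e3:.1f} ms/call")

# `proposed_cqt` is hypothetical -- swap in the torchaudio implementation under review:
# t_new = timeit.timeit(lambda: proposed_cqt(y_tensor, sample_rate=sr), number=n) / n
# print(f"torchaudio CQT: {t_new * 1e3:.1f} ms/call")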

@d-dawg78 commented Jun 26, 2024

Hey everyone,

I am currently wrapping up torchaudio implementations of the VQT, CQT, and iCQT that test against librosa. (torchaudio resampling changes the signal too much compared to librosa after a few iterations, but the first few octaves have the same or similar values; the proposed version is also much, much quicker than librosa; all details in a PR to come.) Do I have the green light to PR? Just wrapping up the last batch of tests 🧪 Let's get these wonderful transforms to torchaudio!

Edit: link to the forked repo with changes is here

@d-dawg78 commented Jun 27, 2024

Hey everyone,

A quick follow-up on the above. The librosa cqt call (and likewise vqt and icqt) being matched in my fork is the following:

librosa_vqt = cqt(
    y=y,
    sr=<SAMPLE_RATE>,
    hop_length=<HOP_LENGTH>,
    fmin=<F_MIN>,
    n_bins=<N_BINS>,
    bins_per_octave=<BINS_PER_OCTAVE>,
    sparsity=0.,
    res_type="sinc_best",
    scale=False,
)

Here's a sample figure comparing the proposed and librosa versions using the audio snippet from here, with:

SAMPLE_RATE = 44100
HOP_LENGTH = 512
F_MIN = 32.703
N_BINS = 108
BINS_PER_OCTAVE = 12

[figure: CQT spectrograms, proposed torchaudio implementation vs. librosa]

The results are pretty much identical 😃 Opening a draft PR for now.
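For anyone reproducing the comparison: since the resampler is the main source of divergence mentioned above, here is a small runnable sketch that quantifies per-octave deviation between two librosa res_type settings with otherwise identical parameters (librosa's bundled trumpet clip stands in for the unlinked audio snippet):

import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"), sr=44100)

kwargs = dict(sr=44100, hop_length=512, fmin=32.703, n_bins=108,
              bins_per_octave=12, sparsity=0.0, scale=False)
# "sinc_best" and "kaiser_fast" need the optional samplerate / resampy backends.
C_a = np.abs(librosa.vqt(y, res_type="sinc_best", **kwargs))
C_b = np.abs(librosa.vqt(y, res_type="kaiser_fast", **kwargs))

# Max relative difference per octave: shows how the resampler choice propagates.
for o in range(108 // 12):
    sl = slice(o * 12, (o + 1) * 12)
    err = np.max(np.abs(C_a[sl] - C_b[sl]) / (np.abs(C_a[sl]) + 1e-8))
    print(f"octave {o}: max relative difference {err:.3e}")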

@mthrok (Collaborator) commented Jun 28, 2024

Hi

I no longer maintain this library, so I'm in a bit of an awkward position, but with the unit testing and such, this looks like a low-risk, low-maintenance-cost addition.

@nateanl thoughts?

@nateanl (Member) commented Jun 28, 2024

I'm down to add this feature to TorchAudio. Although librosa already has an implementation of it, enabling the feature with GPU computation can boost training speed.

@d-dawg78

Cool, thanks for the quick answers! I'll finish up the last few details and request your review in the coming days.
