Implement comparison method of Pfeifenberger et al 2017 #22
I'm confused about "kernelized DNN". For each point in the spectrogram, there is a feature vector. But the kernels for different frequency bins are different. Does this mean I need to build 257 different autoencoder layers and merge the outputs together to feed into the regression layer?
The slide numbered 14 (actually page 29) in the presentation shows a flowchart of the network structure. Does that answer your question?
But yes, it looks like there is a separate small DNN for each frequency channel and their outputs are combined by the final regression layer.
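For anyone else reading along, here is a minimal sketch of that layout: one small network per frequency bin, with all per-bin outputs concatenated and fed to a single regression layer. This is written in PyTorch for illustration; the layer sizes, activations, and the sigmoid output are placeholders, not values taken from Pfeifenberger et al.

```python
import torch
import torch.nn as nn

class PerFrequencyMaskNet(nn.Module):
    """One small DNN per frequency bin; outputs are concatenated and passed
    through a final regression layer that predicts a mask value per bin.
    All dimensions here are illustrative placeholders."""
    def __init__(self, n_freq=257, feat_dim=6, hidden=16):
        super().__init__()
        # A separate tiny network for each of the n_freq bins.
        self.per_bin = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh())
            for _ in range(n_freq)
        ])
        # Final regression layer combines all per-bin outputs.
        self.regression = nn.Linear(n_freq * hidden, n_freq)

    def forward(self, x):
        # x: (batch, n_freq, feat_dim) -- one feature vector per spectrogram bin
        outs = [net(x[:, k, :]) for k, net in enumerate(self.per_bin)]
        merged = torch.cat(outs, dim=-1)          # (batch, n_freq * hidden)
        return torch.sigmoid(self.regression(merged))  # mask values in [0, 1]
```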
Ok, I see.
One question: I tried to compute the PSD matrix of clean speech. According to the CHiME3 documentation, the reference for the simulated set is in tr05_ORG. However, those are only single-channel recordings.
If the power spectral density is supposed to be a 6x6 matrix per frequency, then you need to use the spatial image of the clean speech, not the original clean speech source signal. The spatial image of the clean speech is in the "reverberated" directory. If you need one power per frequency, then you can just average the speech power in the original clean speech source signals across time.
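A rough NumPy sketch of both options, assuming 16 kHz audio and a 512-point STFT (257 bins); the function names and parameters are placeholders for illustration:

```python
import numpy as np
from scipy.signal import stft

def psd_matrix(multichannel_wav, fs=16000, nperseg=512):
    """Estimate a (n_freq, 6, 6) PSD matrix from a 6-channel spatial image.
    multichannel_wav: array of shape (6, n_samples)."""
    # STFT of every channel: X has shape (6, n_freq, n_frames)
    _, _, X = stft(multichannel_wav, fs=fs, nperseg=nperseg)
    n_frames = X.shape[-1]
    # Average the outer product X X^H over time for each frequency bin.
    Phi = np.einsum('cft,dft->fcd', X, np.conj(X)) / n_frames
    return Phi  # (n_freq, 6, 6)

def per_frequency_power(single_channel_wav, fs=16000, nperseg=512):
    """One power value per frequency from a single-channel clean source,
    averaged across time frames."""
    _, _, X = stft(single_channel_wav, fs=fs, nperseg=nperseg)
    return np.mean(np.abs(X) ** 2, axis=-1)  # (n_freq,)
```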
Huh, that's strange, but I can confirm it is gone. It looks like Felix Las modified the
Tr means the trace of the matrix: the sum of the diagonal entries, which is also the sum of the eigenvalues.
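Quick sanity check in NumPy:

```python
import numpy as np

Phi = np.array([[4.0, 1.0],
                [1.0, 3.0]])                    # any square (here Hermitian) matrix
print(np.trace(Phi))                            # sum of the diagonal: 7.0
print(np.sum(np.linalg.eigvalsh(Phi)))          # sum of the eigenvalues: 7.0
```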
Got it.
Wait, there is no reverberated directory for CHiME3; that's just CHiME2. CHiME3 has channel 0 as the reference.
The PSD matrix of noise can be computed in this way. What about the PSD of speech? Does CHiME3 have 6 channels of speech audio?
Yes, it is available for (some of?) the simulated mixtures. They are different between training, dev, and eval, so check each one. Also read the CHiME3 paper. Equation (15) in the Pfeifenberger paper is just to show that it works in that visualization (figure 1); you don't actually need it for the deployable version of the algorithm. When the spatial images (6-channel recordings) of the speech and noise are available separately, you can use those directly to compute the PSD of the speech and noise. For an observed mixture, there are several ways to estimate them.
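A rough NumPy sketch of the two cases: an oracle PSD computed directly from a separately available spatial image, and a mask-weighted estimate from an observed mixture (one common estimator in the neural beamforming literature, not necessarily the one used in the paper). All names and parameters are placeholders:

```python
import numpy as np
from scipy.signal import stft

def stft_multichannel(wav, fs=16000, nperseg=512):
    """wav: (n_ch, n_samples) -> complex STFT of shape (n_ch, n_freq, n_frames)."""
    _, _, X = stft(wav, fs=fs, nperseg=nperseg)
    return X

def psd_from_spatial_image(img):
    """Oracle PSD from a separately available multichannel spatial image
    (e.g. clean speech or noise): returns (n_freq, n_ch, n_ch)."""
    X = stft_multichannel(img)
    return np.einsum('cft,dft->fcd', X, np.conj(X)) / X.shape[-1]

def psd_from_mixture(mixture, mask):
    """Mask-weighted PSD estimate from an observed mixture.
    mask: (n_freq, n_frames) with values in [0, 1], e.g. a DNN speech mask."""
    Y = stft_multichannel(mixture)
    weighted = np.einsum('ft,cft,dft->fcd', mask, Y, np.conj(Y))
    return weighted / np.maximum(mask.sum(axis=-1), 1e-8)[:, None, None]
```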
Lukas Pfeifenberger, Matthias Zöhrer, and Franz Pernkopf, "DNN-based speech mask estimation for eigenvector beamforming," in Proc. ICASSP 2017. PDF
Slides from their talk at ICASSP