
How to use dynamic phonation feature with mfcc/fbank feature as the input to feed a DNN. #32

ASR2020Guru opened this issue Apr 8, 2021 · 2 comments

@ASR2020Guru

Hi @jcvasquezc ,

I would like to use the dynamic phonation features together with mfcc/fbank features as the input to a DNN.

The related code is shown below:

phonafeature = phonation.extract_features_file(filename, static=False, plots=False, fmt="npy")
fbankfeature, energies = python_speech_features.fbank(filename, samplerate=16000, nfilt=40, nfft=768, winlen=0.04, winstep=0.02, winfunc=np.hamming)

Because I noticed that the dynamic phonation features use winlen=0.04, winstep=0.02, I set the same parameter values in the fbank function.
However, len(phonafeature) and len(fbankfeature) are not the same for one input file.
For example, filename=demo.wav is 15 s long with a 16000 Hz sample rate.
The shape of phonafeature for this demo.wav is (430, 7), while the shape of fbankfeature is (749, 40).
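As a quick sanity check on the fbank length (a sketch assuming the standard framing formula n_frames = 1 + floor((N - frame_len) / step); the exact rounding inside python_speech_features may differ by a frame):

```python
import math

sr = 16000             # sample rate of demo.wav
duration = 15.0        # length in seconds
winlen, winstep = 0.04, 0.02

n_samples = int(sr * duration)   # 240000 samples
frame_len = int(sr * winlen)     # 640 samples per frame
step_len = int(sr * winstep)     # 320 samples per hop

n_frames = 1 + math.floor((n_samples - frame_len) / step_len)
print(n_frames)  # 749, matching the fbank output length
```

The fbank length matches the total number of frames, so the shorter phonation output is not explained by the framing parameters themselves.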

For concatenation purposes, I had to pad phonafeature with the constant value 0 to match len(fbankfeature), i.e., from (430, 7) to (749, 7). Then I can get the concatenated phonation plus fbank features of shape (749, 47) for demo.wav.

But I don't think that is the correct way to use the dynamic phonation features with mfcc/fbank features as the input to a DNN.

Could you help me with this issue?
Also, why do the phonation features and fbank features have different lengths under the same winlen and winstep?

Many thanks

@jcvasquezc
Owner

Hi @ASR2020Guru

You are right that zero padding is not the correct way to combine those features.

There are two reasons why you are getting different lengths for the phonation and fbank features:

  1. Phonation features are only computed for speech segments where there are F0 values, i.e., only for voiced segments.

Check the following code, which only adds feature vectors when F0 != 0:

Amp, logE, apq, ppq = [], [], [], []
lnz = 0  # counter of voiced (nonzero-F0) frames seen so far
for l in range(nF):
    data_frame = data_audio[int(l*size_stepS):int(l*size_stepS+size_frameS)]
    energy = 10*logEnergy(data_frame)
    if F0[l] != 0:  # only voiced frames contribute feature vectors
        Amp.append(np.max(np.abs(data_frame)))
        logE.append(energy)
        if lnz >= 12:  # APQ needs the 12 most recent voiced amplitudes
            amp_arr = np.asarray([Amp[j] for j in range(lnz-12, lnz)])
            apq.append(APQ(amp_arr))
        if lnz >= 6:   # PPQ needs the 6 most recent voiced F0 values
            f0arr = np.asarray([F0nz[j] for j in range(lnz-6, lnz)])
            ppq.append(PPQ(1/f0arr))
        lnz = lnz+1

In case you want to combine the features, you should add an else: statement that appends zero values to the variables Amp, logE, apq, and ppq.
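The suggested else: branch can be sketched like this (a minimal, self-contained mock: the F0 contour and the appended values are placeholders standing in for the real measurements in the loop above):

```python
import numpy as np

# Toy F0 contour: zeros mark unvoiced frames (placeholder values, not real pitch).
F0 = np.array([0.0, 120.0, 121.0, 0.0, 119.0, 0.0])

Amp, logE = [], []
for l in range(len(F0)):
    if F0[l] != 0:
        # Voiced frame: append the real measurements (placeholders here).
        Amp.append(1.0)
        logE.append(-10.0)
    else:
        # Proposed else-branch: append zeros for unvoiced frames so the
        # phonation time axis stays aligned with the fbank frames.
        Amp.append(0.0)
        logE.append(0.0)

print(len(Amp) == len(F0))  # True: one value per frame
```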

  2. In addition, you should consider that apq is only computed after the 12th frame, because it is a long-term perturbation measure with respect to the 11 previous frames; thus you have to pad 11 zeros at the beginning for this feature.

  The same occurs for ppq, but in this case with the first five frames.

If you add these pads at the beginning of apq and ppq, you should also remove the slicing in this line, which keeps only the frames after the 12th, in order to properly merge apq and ppq with the rest of the features:

feat_mat=np.vstack((DF0[11:], DDF0[10:], Jitter[12:], Shimmer[12:], apq, ppq[6:], logE[12:])).T
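Padding apq and ppq at the front and stacking without the per-feature slicing could then look like this (a sketch with placeholder arrays; the frame counts and the 11/5 leading offsets for apq/ppq are assumptions following the explanation above, not the exact shapes in phonation.py):

```python
import numpy as np

nF = 20  # hypothetical number of frames
# Placeholder per-frame features, all of length nF after the else-branch change.
Jitter = np.ones(nF)
Shimmer = np.ones(nF)
logE = np.ones(nF)
# apq/ppq start later because they need a history of previous frames:
apq = np.ones(nF - 11)  # assumed: first value available at frame 11
ppq = np.ones(nF - 5)   # assumed: first value available at frame 5

# Pad the missing leading frames with zeros instead of slicing the others.
apq = np.pad(apq, (11, 0))
ppq = np.pad(ppq, (5, 0))

feat_mat = np.vstack((Jitter, Shimmer, apq, ppq, logE)).T
print(feat_mat.shape)  # (20, 5): one row per frame, ready to concatenate with fbank
```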

If you have further questions, let me know and I can help you

@ASR2020Guru
Author

Hi @jcvasquezc ,

Thanks for your quick and helpful reply.

Now I managed to combine these features correctly.

I will let you know if I have any further questions.

Cheers
