speechbrain/tts-hifigan-libritts-16kHz · Question about reconstruction

Hi, thanks for sharing this vocoder.

I am using this vocoder to reconstruct a waveform from its mel spectrogram. I have a 16 kHz wav with 65280 samples. I first extract the mel spectrogram with hopsize=256 and windowsize=1024, which gives a melspec of dimension [80, 245]. When I convert it back to a wav with this vocoder, the reconstructed wav has length 62720, which does not match the input! The difference is always 2560 samples.
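
For reference, the numbers above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the vocoder emits exactly one hop (256 samples) per mel frame, which is consistent with 245 frames giving 62720 samples, and it assumes n_fft equals the window size of 1024. The two frame-count formulas are the usual STFT conventions (centered with padding, and uncentered without padding), not something confirmed for this particular extraction pipeline. The function names are made up for illustration.

```python
def frames_centered(num_samples: int, hop: int) -> int:
    """Frame count when the signal is padded by n_fft // 2 on each side."""
    return 1 + num_samples // hop


def frames_uncentered(num_samples: int, n_fft: int, hop: int) -> int:
    """Frame count when no padding is applied before framing."""
    return 1 + (num_samples - n_fft) // hop


def vocoder_length(num_frames: int, hop: int) -> int:
    """Assumed output length: one hop of samples per mel frame."""
    return num_frames * hop


wav_len, hop, n_fft = 65280, 256, 1024  # values reported in the post
observed_frames = 245                   # second dimension of the melspec

reconstructed_len = vocoder_length(observed_frames, hop)
missing = wav_len - reconstructed_len

print(reconstructed_len)                        # 62720
print(missing, missing // hop)                  # 2560 samples, i.e. 10 hops
print(frames_centered(wav_len, hop))            # 256
print(frames_uncentered(wav_len, n_fft, hop))   # 252
```

Under these assumptions the 2560-sample gap corresponds to exactly 10 frames. Neither common framing convention yields 245 frames for a 65280-sample input, so it may be worth checking whether the extraction step trims or pads the signal in some other way.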

I checked the vocoder config, including hop size and window size, and they are the same as in my mel extraction. Although there is no significant audible difference, the objective metrics such as STOI, SNR, and SDR are very bad (STOI is only 0.15, and SI-SNR and SDR are negative!). I think this is caused by the misalignment between the input and the output, but how can I fix this problem?
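
One way to test the misalignment hypothesis is to estimate the time offset between the two signals and trim both to a common length before computing a metric. The sketch below does this with a brute-force cross-correlation search and a plain NumPy SI-SNR. The helper names (`si_snr`, `estimate_lag`, `align`) and the synthetic test signal are made up for illustration and are not part of SpeechBrain.

```python
import numpy as np


def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant SNR in dB for two equal-length 1-D signals."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))


def estimate_lag(reference, estimate, max_lag):
    """Search lags in [-max_lag, max_lag]; positive means `estimate` is delayed."""
    n = min(len(reference), len(estimate))
    ref, est = reference[:n], estimate[:n]
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            score = np.dot(ref[: n - lag], est[lag:])
        else:
            score = np.dot(ref[-lag:], est[: n + lag])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align(reference, estimate, max_lag):
    """Remove the estimated offset and trim both signals to the same length."""
    lag = estimate_lag(reference, estimate, max_lag)
    if lag >= 0:
        estimate = estimate[lag:]
    else:
        reference = reference[-lag:]
    n = min(len(reference), len(estimate))
    return reference[:n], estimate[:n], lag


# Synthetic check: a delayed, shortened, slightly noisy copy of the reference.
rng = np.random.default_rng(0)
reference = rng.standard_normal(4000)
estimate = np.concatenate([np.zeros(37), reference])[:3800]
estimate = estimate + 0.01 * rng.standard_normal(3800)

naive = si_snr(reference[:3800], estimate)           # truncation only
ref_a, est_a, lag = align(reference, estimate, max_lag=100)
aligned = si_snr(ref_a, est_a)                       # after lag compensation
print(lag, len(ref_a), naive < 0, aligned > 30)
```

Keep in mind that a mel spectrogram discards phase, so the vocoder is not guaranteed to reproduce the original waveform sample by sample. Sample-level metrics such as SI-SNR and SDR may therefore stay low even after alignment, while the audio still sounds the same.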

Hi there! I am having the same issue with reconstructions from the VCTK dataset. For instance, for an input of length 92167, HiFi-GAN returns a waveform of length 94976. Did you manage to resolve this? What kind of pipeline are you using to convert waveforms to mel spectrograms? I do it with librosa.

In general, if someone could provide some guidance on the best pipeline to follow (e.g. finetuning, changing some parameters, etc.) for improving reconstructions, that would be amazing. Thank you!