cdminix (Christoph Minixhofer)

Posts 2

Post

I just added 5 more models to my open source TTS model benchmark, ttsds/benchmark.
Let's talk about the results!

Over the last couple of days, I added jbetker/tortoise-tts-v2, metavoiceio/metavoice-1B-v0.1, audo/HierSpeechpp, and the unofficial implementations of amphion/NaturalSpeech2 and amphion/valle by https://huggingface.co/amphion.

Takeaways:
- TorToiSe does very well, falling into second place after StyleTTS 2, which is also ranked first in the human evaluation at TTS-AGI/TTS-Arena.
- MetaVoice-1B's overall score is dragged down by its Intelligibility Score (probably due to utterances being cut short), but it achieves #3 in Speaker Score, which indicates good voice cloning ability.
- HierSpeech++ lands in the middle of the pack overall, but excels at the Environment Score, achieving #2. This means the model is especially good at modeling recording conditions such as microphone and background noise.
- The Amphion models achieve relatively low scores, possibly because they were not trained for as long as in the papers. However, they seem to struggle for different reasons: the autoregressive VALLE models have low Intelligibility Scores (possibly due to "babbling" or early stop tokens), while NaturalSpeech2 has low Speaker and Prosody Scores.
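
To make the "dragged down" point concrete, here's a minimal sketch of how a system can rank near the top in one category and still land low overall. It assumes the overall score is simply the unweighted mean of the category scores, which is an assumption for illustration (see the paper for the exact aggregation). The system names and all numbers are made-up placeholders, not benchmark results.

```python
from statistics import mean

# Hypothetical per-category scores on a 0-100 scale.
# Placeholder values for illustration only, NOT results from the benchmark.
scores = {
    "system_a": {"intelligibility": 90.0, "speaker": 80.0, "prosody": 85.0, "environment": 75.0},
    "system_b": {"intelligibility": 50.0, "speaker": 95.0, "prosody": 85.0, "environment": 80.0},
    "system_c": {"intelligibility": 85.0, "speaker": 70.0, "prosody": 80.0, "environment": 90.0},
}


def overall(category_scores):
    """Assumed aggregation: unweighted mean over all categories."""
    return mean(category_scores.values())


def rank_by(all_scores, key):
    """Return system names sorted best-first according to `key`."""
    return sorted(all_scores, key=lambda name: key(all_scores[name]), reverse=True)


overall_ranking = rank_by(scores, overall)
speaker_ranking = rank_by(scores, lambda s: s["speaker"])

for name in overall_ranking:
    print(f"{name}: overall={overall(scores[name]):.2f}")
print("Best speaker score:", speaker_ranking[0])
```

In this toy setup, the system with the best speaker score ends up last overall because a single weak category pulls its mean down, which is the same pattern described for MetaVoice-1B above.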

What's next?
I'm planning to add more open source TTS models like suno/bark, CAMB-AI/MARS5-TTS and fishaudio/fish-speech-1.2. I'll also write an article on these and all the other results soon, since our paper, TTSDS -- Text-to-Speech Distribution Score (2407.12707), mostly focused on establishing the benchmark itself rather than the individual TTS systems.

Post

Since new TTS (Text-to-Speech) systems are coming out what feels like every day, and it's currently hard to compare them, my latest project has focused on doing just that.

I was inspired by the TTS-AGI/TTS-Arena (definitely check it out if you haven't), which compares recent TTS systems using crowdsourced A/B testing.

I wanted to see if we could do a similar evaluation with objective metrics, and it's now available here:
ttsds/benchmark
Anyone can submit a new TTS model, and I hope this provides a way to see in which areas models perform well or poorly.

The paper with all the details is available here: https://arxiv.org/abs/2407.12707