VoiceRestore: Flow-Matching Transformers for Speech Recording Quality Restoration

VoiceRestore is a speech restoration model designed to enhance the quality of degraded voice recordings. Built on flow-matching transformers, it addresses a wide range of audio imperfections commonly found in speech, including background noise, reverberation, distortion, and signal loss.

It is based on the VoiceRestore repository, which also includes a demo of audio restorations.

Usage - using Transformers 🤗

!git lfs install
!git clone https://huggingface.co/jadechoghari/VoiceRestore
%cd VoiceRestore
!pip install -r requirements.txt

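from transformers import AutoModel

# Path to the cloned model folder (on Colab it is as follows)
checkpoint_path = "/content/VoiceRestore"
# trust_remote_code=True lets Transformers load the custom model code shipped in the repository
model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True)

# Restore a short clip: model(input_path, output_path)
model("test_input.wav", "test_output.wav")

# Pass short=False if the audio is longer than 10 seconds
model("long.mp3", "long_output.mp3", short=False)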
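The 10-second rule in the snippet above can be automated for WAV inputs. The following is a minimal sketch and not part of VoiceRestore: the helper names `wav_duration_seconds` and `restore_wav` are hypothetical, and the only assumption about the model is the call pattern `model(input_path, output_path)` with an optional `short=False`, as shown above. It relies on the standard-library `wave` module, so it handles WAV files only.

```python
import wave

# Threshold taken from the usage comment ("short=False if audio is > 10 seconds").
SHORT_CLIP_LIMIT_SECONDS = 10.0


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds (hypothetical helper)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def restore_wav(model, input_path, output_path):
    """Call the model, adding short=False only for clips longer than the limit.

    `model` is assumed to be callable as in the usage snippet; this wrapper
    is an illustration, not an API provided by the repository.
    """
    duration = wav_duration_seconds(input_path)
    if duration > SHORT_CLIP_LIMIT_SECONDS:
        model(input_path, output_path, short=False)
    else:
        model(input_path, output_path)
    return duration
```

With the model loaded as above, `restore_wav(model, "test_input.wav", "test_output.wav")` would then pick the mode from the file length.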

Example

Degraded Input:

Restored (steps=32, cfg=1.0):

Restored audio - 16 steps, strength 0.5:

Key Features

Universal Restoration: The model is designed to handle a wide range of levels and types of voice recording degradation.
Easy to Use: Simple interface for processing degraded audio files.
Pretrained Model: Includes a 301-million-parameter transformer model with pretrained weights. The model is still being trained, and further checkpoint updates will follow.

Model Details

Architecture: Flow-matching transformer
Parameters: approximately 301 million
Input: Degraded speech audio (various formats supported)
Output: Restored speech
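
To make the `steps` and `cfg` settings in the examples more concrete: a flow-matching model is typically sampled by integrating a learned velocity field from t=0 to t=1 with a fixed number of solver steps, optionally combining conditional and unconditional predictions (classifier-free guidance). The sketch below is a toy illustration in plain Python, not VoiceRestore's code. The names `euler_sample`, `guided_velocity`, and `toy_velocity` are hypothetical, the solver is assumed to be plain Euler, and the guidance formula is one common convention that this document does not confirm for VoiceRestore.

```python
def guided_velocity(v_cond, v_uncond, cfg_strength):
    """Classifier-free guidance, one common convention (an assumption here):
    push the conditional prediction away from the unconditional one."""
    return v_cond + cfg_strength * (v_cond - v_uncond)


def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with `steps` Euler steps."""
    x = x0
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t never reaches 1.0 inside the loop
        x = x + dt * velocity_fn(x, t)
    return x


# Toy scalar stand-in for a learned model: the straight-line (optimal-transport)
# conditional field (target - x) / (1 - t) carries any start point to `target`.
target = 0.8


def toy_velocity(x, t):
    return (target - x) / (1.0 - t)


restored = euler_sample(toy_velocity, x0=-1.5, steps=16)
```

In this toy field any step count ends at the target. With a learned field, more steps generally means a more accurate integration at a higher compute cost.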

Limitations and Future Work

Current model is optimized for speech; may not perform optimally on music or other audio types.
Ongoing research to improve performance on extreme degradations.
Future updates may include real-time processing capabilities.

Citation

If you use VoiceRestore in your research, please cite our paper:

@article{kirdey2024voicerestore,
  title={VoiceRestore: Flow-Matching Transformers for Speech Recording Quality Restoration},
  author={Kirdey, Stanislav},
  journal={arXiv},
  year={2024}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Based on the E2-TTS implementation by Lucidrains
Special thanks to the open-source community for their invaluable contributions.