Model Overview

Description:

BigVGAN is a generative AI model that synthesizes audio waveforms from mel spectrogram inputs.

BigVGAN is a fully convolutional architecture built from several upsampling blocks. Each block applies a transposed convolution followed by multiple residual dilated convolution layers.

BigVGAN includes a novel module, called anti-aliased multi-periodicity composition (AMP), which is designed specifically for waveform generation. Drawing on audio signal processing principles, AMP targets the synthesis of high-frequency and periodic sound waves.

AMP applies a periodic activation function, called Snake, which gives the architecture an inductive bias toward generating periodic sound waves. It also applies anti-aliasing filters to reduce undesired artifacts in the generated waveforms.
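As an illustration of the periodic activation described above, here is a minimal sketch of the Snake function in its commonly published form, snake(x) = x + (1/alpha) * sin^2(alpha * x). It is not taken from the BigVGAN code base: the function name `snake`, the scalar `alpha`, and the sample values are assumptions made for this example, and the anti-aliasing filters that AMP wraps around the activation are omitted.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    `alpha` sets the frequency of the periodic term. It is a fixed
    scalar here (an assumption for illustration); in a trained model
    it would typically be a learned parameter.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + math.sin(alpha * x) ** 2 / alpha


# Evaluate the activation on a small grid of hypothetical inputs.
samples = [snake(t / 4.0, alpha=2.0) for t in range(-8, 9)]
```

The linear term keeps the function monotonic on average, while the squared-sine term adds a non-negative periodic ripple, which is one way to read the "inductive bias" toward periodic signals.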

This model is ready for commercial use.

Reference(s):

Model Architecture:

Architecture Type: Convolutional Neural Network (CNN)
Network Architecture: Details of the model are available at https://github.com/NVIDIA/BigVGAN and in the related paper at https://arxiv.org/abs/2206.04658
Model Version: 2.0

Input:

Input Type: Audio
Input Format: Mel Spectrogram
Input Parameters: None
Other Properties Related to Input: The input mel spectrogram has shape [batch, channels, frames], where channels is the number of mel bands defined by the model and frames is the temporal length. The model accepts an arbitrarily large number of frames, as long as the input fits into GPU memory.

Output:

Output Type: Audio
Output Format: Audio Waveform
Output Parameters: None
Other Properties Related to Output: The output audio waveform has shape [batch, 1, time], where 1 is the single (mono) audio channel and time is the temporal length. time is a fixed integer multiple of the number of input frames; the multiplier is the upsampling ratio of the model (time = upsampling ratio * frames). The output audio waveform consists of float values in the range [-1, 1].
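The shape relationship between input and output can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic stated above, not part of the BigVGAN API: the helper names `expected_waveform_shape` and `in_valid_range` are hypothetical, and the 80 mel bands and upsampling ratio of 256 in the example are assumed values that depend on the specific checkpoint.

```python
from typing import Iterable, Tuple


def expected_waveform_shape(
    mel_shape: Tuple[int, int, int], upsampling_ratio: int
) -> Tuple[int, int, int]:
    """Map a mel shape [batch, channels, frames] to [batch, 1, time]."""
    batch, _channels, frames = mel_shape
    if upsampling_ratio <= 0:
        raise ValueError("upsampling_ratio must be a positive integer")
    # time = upsampling ratio * frames; the output is always mono.
    return (batch, 1, upsampling_ratio * frames)


def in_valid_range(samples: Iterable[float]) -> bool:
    """Check that every waveform sample lies in [-1, 1]."""
    return all(-1.0 <= s <= 1.0 for s in samples)


# Hypothetical example: 2 clips, 80 mel bands, 100 frames, ratio 256.
shape = expected_waveform_shape((2, 80, 100), upsampling_ratio=256)
```

With these assumed values, 100 input frames correspond to 25,600 output samples per clip.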

Software Integration:

Runtime Engine(s): PyTorch

Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Lovelace, NVIDIA Turing, NVIDIA Volta

Preferred/Supported Operating System(s):

Linux

Model Version(s):

v2.0

Training, Testing, and Evaluation Datasets:

Training Dataset:

The dataset contains diverse audio types, including speech in multiple languages, environmental sounds, and instruments.

Links:

Data Collection Method by dataset:

  • Human

Labeling Method by dataset (for those with labels):

  • Hybrid: Automated, Human, Unknown

Evaluation Dataset:

Properties: The audio generation quality of BigVGAN is evaluated on the dev splits of the LibriTTS and Hi-Fi TTS datasets. Both datasets contain English speech with an equal balance of genders.

Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • Automated

Inference:

Engine: PyTorch
Test Hardware: NVIDIA A100 GPU

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.