Transcription normalization

#2
by qanastek - opened

Thank you very much for your contribution to the community in sharing both the models and the training scripts.

You have mentioned that the training dataset consists of a private subset of 40K hours of English speech plus 25K hours from the following public datasets:

  • Librispeech 960 hours of English speech
  • Fisher Corpus
  • Switchboard-1 Dataset
  • WSJ-0 and WSJ-1
  • National Speech Corpus (Part 1, Part 6)
  • VCTK
  • VoxPopuli (EN)
  • Europarl-ASR (EN)
  • Multilingual Librispeech (MLS EN) - 2,000 hour subset
  • Mozilla Common Voice (v7.0)
  • People's Speech - 12,000 hour subset

However, you haven't mentioned any of the normalization steps applied to the transcriptions, even though each corpus has its own annotation protocol. Do you share these pre-processing steps anywhere? I cannot find them in the NeMo GitHub repository.
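
To be concrete, this is the kind of cleanup I mean (a minimal sketch of typical ASR transcript normalization, not your actual pipeline; the function name is hypothetical):

```python
import re
import unicodedata


def normalize_transcript(text: str) -> str:
    """Hypothetical per-corpus transcript normalization."""
    # Unicode normalization so accented characters are folded consistently
    text = unicodedata.normalize("NFKC", text)
    # Lowercase, since the listed corpora use different casing conventions
    text = text.lower()
    # Strip corpus-specific annotation markers such as [noise] or <unk>
    text = re.sub(r"\[[^\]]*\]|<[^>]*>", " ", text)
    # Drop punctuation except apostrophes used in contractions
    text = re.sub(r"[^a-z' ]", " ", text)
    # Collapse repeated whitespace
    return " ".join(text.split())


print(normalize_transcript("Uh-huh [noise], it's FINE."))  # -> "uh huh it's fine"
```

Knowing which of these choices (casing, punctuation, annotation markers, number expansion, etc.) were applied per corpus would make it much easier to reproduce or fine-tune the model consistently.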

Regards.

NVIDIA org • edited Jan 3

Some of the dataset preprocessing scripts are made available here: https://github.com/NVIDIA/NeMo/tree/main/scripts/dataset_processing

Eventually, we will make all of the public dataset pre-processing scripts available.
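
For reference, the scripts in that directory generally write NeMo-style training manifests, with the normalized transcript stored in the `text` field of each entry. A rough illustration (the path and values below are made up):

```python
import json

# One JSON object per line describes a single utterance in a NeMo manifest.
entry = {
    "audio_filepath": "/data/librispeech/train-clean-100/103-1240-0000.wav",
    "duration": 14.08,
    "text": "chapter one missus rachel lynde is surprised",
}

with open("train_manifest.json", "a") as f:
    f.write(json.dumps(entry) + "\n")
```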

smajumdar94 changed discussion status to closed
