KBLab/wav2vec2-large-voxrex-swedish

Automatic Speech Recognition

hf-asr-leaderboard


marma committed on Aug 13, 2021

Commit 15bce1b • 1 Parent(s): cb38569

Update README.md

Files changed (1)

README.md +3 -3


@@ -24,16 +24,16 @@ model-index:
     metrics:
        - name: Test WER
          type: wer
-         value: 10.72
 ---
 # Wav2vec 2.0 large VoxRex Swedish
-Finetuned version of KBs [VoxRex large](https://huggingface.co/KBLab/wav2vec2-large-voxrex) model using Swedish radio broadcasts, NST and Common Voice data. Evalutation without a language model gives the following: WER for NST + Common Voice test set (2% of total sentences) is **3.40%**. WER for Common Voice test set is **10.72%** directly and **8.71%** with a 4-gram language model.
 When using this model, make sure that your speech input is sampled at 16kHz.
 ## Training
-This model has additionally pretrained on 3500h of a mix of Swedish local radio broadcasts, audio books and other audio sources. It has been fine-tuned for 120000 updates on NST + CommonVoice<!-- and then for an additional 20000 updates on CommonVoice only. The additional fine-tuning on CommonVoice hurts performance on the NST+CommonVoice test set somewhat and, unsurprisingly, improves it on the CommonVoice test set. It seems to perform generally better though [citation needed]-->.
 ![WER during training](chart_1.svg "WER")

     metrics:
        - name: Test WER
          type: wer
+         value: 9.914
 ---
 # Wav2vec 2.0 large VoxRex Swedish
+Finetuned version of KBs [VoxRex large](https://huggingface.co/KBLab/wav2vec2-large-voxrex) model using Swedish radio broadcasts, NST and Common Voice data. Evalutation without a language model gives the following: WER for NST + Common Voice test set (2% of total sentences) is **3.617%**. WER for Common Voice test set is **9.914%** directly and **7.77%** with a 4-gram language model.
 When using this model, make sure that your speech input is sampled at 16kHz.
 ## Training
+This model has additionally pretrained on 3500h of a mix of Swedish local radio broadcasts, audio books and other audio sources. It has been fine-tuned for 120000 updates on NST + CommonVoice and then for an additional 20000 updates on CommonVoice only. The additional fine-tuning on CommonVoice hurts performance on the NST+CommonVoice test set somewhat and, unsurprisingly, improves it on the CommonVoice test set. It seems to perform generally better though [citation needed].
 ![WER during training](chart_1.svg "WER")
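
The figures changed in this commit are word error rates (WER). As a rough illustration of how such a number is derived, here is a minimal sketch that assumes plain whitespace tokenization and the standard word-level edit distance. The function names `word_error_rate` and `relative_reduction` are hypothetical, and the evaluation script actually used for this model is not shown in the diff, so its text normalization may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old: float, new: float) -> float:
    """Fraction by which `new` is lower than `old`."""
    return (old - new) / old


if __name__ == "__main__":
    # Toy example: one substitution and one deletion over four reference words.
    print(word_error_rate("a b c d", "a x c"))
    # Common Voice test WER before and after this commit, taken from the diff.
    print(relative_reduction(10.72, 9.914))
```

Applied to the Common Voice test WER in the diff (10.72 before, 9.914 after), `relative_reduction` gives a relative improvement of roughly 7.5%.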