vineelpratap committed
Commit
c71c2f4
1 Parent(s): 21fed1f

Update README.md

Files changed (1):
  1. README.md +130 -3
README.md CHANGED
@@ -1,3 +1,130 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ tags:
+ - mms
+ - xlsr
+ license: cc-by-nc-4.0
+ datasets:
+ - google/fleurs
+ - mozilla-foundation/common_voice_8_0
+ metrics:
+ - wer
+ - cer
+ ---
+
+ # Massively Multilingual Speech (MMS) - Finetuned ASR - ALL
+
+ This checkpoint is a model fine-tuned for multilingual ASR and is part of Facebook's [Massive Multilingual Speech project](https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/).
+ It is based on the [Wav2Vec2 architecture](https://huggingface.co/docs/transformers/model_doc/wav2vec2) and makes use of adapter models to transcribe 1000+ languages.
+ The checkpoint consists of **1 billion parameters** and has been fine-tuned from [facebook/mms-1b](https://huggingface.co/facebook/mms-1b) on 1162 languages.
+
+ ## Table of Contents
+
+ - [Example](#example)
+ - [Supported Languages](#supported-languages)
+ - [Model details](#model-details)
+ - [Additional links](#additional-links)
+
+ ## Example
+
+ This MMS checkpoint can be used with [Transformers](https://github.com/huggingface/transformers) to transcribe audio in 1107 different
+ languages. Let's look at a simple example.
+
+ First, we install `transformers` and some other libraries:
+ ```
+ pip install torch accelerate torchaudio datasets
+ pip install --upgrade transformers
+ ```
+
+ **Note**: In order to use MMS you need to have at least `transformers >= 4.30` installed. If version `4.30`
+ is not yet available [on PyPI](https://pypi.org/project/transformers/), make sure to install `transformers` from
+ source:
+ ```
+ pip install git+https://github.com/huggingface/transformers.git
+ ```
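+
+ To check that the installed version is recent enough, here is a quick sanity check (a minimal sketch; it uses the `packaging` helper, which ships as a dependency of `transformers`):
+
+ ```py
+ from packaging import version
+ import transformers
+
+ # MMS support requires transformers >= 4.30
+ assert version.parse(transformers.__version__) >= version.parse("4.30"), "please upgrade transformers"
+ ```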
+
+ Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16 kHz.
+
+ ```py
+ from datasets import load_dataset, Audio
+
+ # English
+ stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
+ stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
+ en_sample = next(iter(stream_data))["audio"]["array"]
+
+ # French
+ stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "fr", split="test", streaming=True)
+ stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
+ fr_sample = next(iter(stream_data))["audio"]["array"]
+ ```
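+
+ If your audio lives in a local file instead, a minimal sketch with `torchaudio` (the path `my_audio.wav` is a placeholder) that resamples to the required 16 kHz:
+
+ ```py
+ import torchaudio
+
+ # load a local file and resample it to the 16 kHz the model expects
+ waveform, sample_rate = torchaudio.load("my_audio.wav")  # placeholder path
+ if sample_rate != 16_000:
+     waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
+ sample = waveform[0].numpy()  # first channel as a 1-D array, like the samples above
+ ```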
+
+ Next, we load the model and processor:
+
+ ```py
+ from transformers import Wav2Vec2ForCTC, AutoProcessor
+ import torch
+
+ model_id = "facebook/mms-1b-all"
+
+ processor = AutoProcessor.from_pretrained(model_id)
+ model = Wav2Vec2ForCTC.from_pretrained(model_id)
+ ```
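+
+ If the target language is known up front, the adapter can also be selected directly at load time (this follows the Transformers MMS documentation; `ignore_mismatched_sizes=True` is needed because the language-specific output head is resized):
+
+ ```py
+ # load the checkpoint with the French ("fra") adapter already active
+ processor = AutoProcessor.from_pretrained(model_id, target_lang="fra")
+ model = Wav2Vec2ForCTC.from_pretrained(model_id, target_lang="fra", ignore_mismatched_sizes=True)
+ ```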
+
+ Now we process the audio data, pass the processed audio data to the model, and transcribe the model output, just like we usually do for Wav2Vec2 models such as [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h):
+
+ ```py
+ inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
+
+ with torch.no_grad():
+     outputs = model(**inputs).logits
+
+ ids = torch.argmax(outputs, dim=-1)[0]
+ transcription = processor.decode(ids)
+ # 'joe keton disapproved of films and buster also had reservations about the media'
+ ```
+
+ We can now keep the same model in memory and simply switch out the language adapters by calling the convenient `load_adapter()` function for the model and `set_target_lang()` for the tokenizer. We pass the target language as an input: "fra" for French.
+
+ ```py
+ processor.tokenizer.set_target_lang("fra")
+ model.load_adapter("fra")
+
+ inputs = processor(fr_sample, sampling_rate=16_000, return_tensors="pt")
+
+ with torch.no_grad():
+     outputs = model(**inputs).logits
+
+ ids = torch.argmax(outputs, dim=-1)[0]
+ transcription = processor.decode(ids)
+ # "ce dernier est volé tout au long de l'histoire romaine"
+ ```
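+
+ To avoid repeating these steps for every language, the switch-and-transcribe pattern can be wrapped in a small helper (a sketch; `transcribe` is a hypothetical name, not a library function):
+
+ ```py
+ def transcribe(audio_array, lang):
+     """Swap in the adapter for `lang` (an ISO 639-3 code) and transcribe one sample."""
+     processor.tokenizer.set_target_lang(lang)
+     model.load_adapter(lang)
+
+     inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
+     with torch.no_grad():
+         logits = model(**inputs).logits
+
+     ids = torch.argmax(logits, dim=-1)[0]
+     return processor.decode(ids)
+
+ print(transcribe(en_sample, "eng"))
+ print(transcribe(fr_sample, "fra"))
+ ```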
+
+ In the same way, the language can be switched out for all other supported languages. To list the supported language codes, have a look at:
+ ```py
+ processor.tokenizer.vocab.keys()
+ ```
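+
+ For example, a quick sketch to count the available adapters and check whether a particular code is supported (here "deu" for German, assuming the tokenizer's vocab keys are the language codes, as above):
+
+ ```py
+ supported = processor.tokenizer.vocab.keys()
+ print(len(supported))      # number of available language adapters
+ print("deu" in supported)  # True if a German adapter can be loaded
+ ```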
+
+ For more details, please have a look at [the official docs](https://huggingface.co/docs/transformers/main/en/model_doc/mms).
+
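+ The official docs also describe a higher-level alternative via `pipeline`; a minimal sketch (the `model_kwargs` shown follow the Transformers MMS documentation):
+
+ ```py
+ from transformers import pipeline
+
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model="facebook/mms-1b-all",
+     model_kwargs={"target_lang": "fra", "ignore_mismatched_sizes": True},
+ )
+ print(pipe(fr_sample)["text"])
+ ```
+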
+ ## Model details
+
+ - **Developed by:** Vineel Pratap et al.
+ - **Model type:** multilingual Automatic Speech Recognition model
+ - **License:** CC-BY-NC 4.0 license
+ - **Num parameters**: 1 billion
+ - **Cite as:**
+
+     @article{pratap2023scaling,
+       title={Scaling Speech Technology to 1,000+ Languages},
+       author={Pratap, Vineel and Tjandra, Andros and Shi, Bowen and Tomasello, Paden and Babu, Arun and Kundu, Sayani and Elkahky, Ali and Ni, Zhaoheng and Vyas, Apoorv and Fazel-Zarandi, Maryam and Baevski, Alexei and Adi, Yossi and Zhang, Xiaohui and Hsu, Wei-Ning and Conneau, Alexis and Auli, Michael},
+       journal={arXiv preprint arXiv:2305.13516},
+       year={2023}
+     }
+
+ ## Additional Links
+
+ - [Paper](https://arxiv.org/abs/2305.13516)
+ - [GitHub Repository](https://github.com/facebookresearch/fairseq/tree/main/examples/mms)
+ - [Official Space](https://huggingface.co/spaces/facebook/MMS)