Automatic Speech Recognition
Transformers
Safetensors
wav2vec2
mms
xlsr
Inference Endpoints
vineelpratap committed
Commit 27decb0
1 Parent(s): c71c2f4

Update README.md

Files changed (1)
  1. README.md +4 -84
README.md CHANGED
@@ -14,99 +14,19 @@ metrics:
 
 # Massively Multilingual Speech (MMS) - Finetuned ASR - ALL
 
- This checkpoint is a model fine-tuned for multilingual ASR and is part of Facebook's [Massive Multilingual Speech project](https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/).
- This checkpoint is based on the [Wav2Vec2 architecture](https://huggingface.co/docs/transformers/model_doc/wav2vec2) and makes use of adapter models to transcribe 1000+ languages.
- The checkpoint consists of **1 billion parameters** and has been fine-tuned from [facebook/mms-1b](https://huggingface.co/facebook/mms-1b) on 1162 languages.
 
 ## Table of Contents
 
 - [Example](#example)
- - [Supported Languages](#supported-languages)
 - [Model details](#model-details)
 - [Additional links](#additional-links)
 
 ## Example
 
- This MMS checkpoint can be used with [Transformers](https://github.com/huggingface/transformers) to transcribe audio in 1107 different
- languages. Let's look at a simple example.
- 
- First, we install transformers and some other libraries:
- ```
- pip install torch accelerate torchaudio datasets
- pip install --upgrade transformers
- ```
- 
- **Note**: In order to use MMS you need to have at least `transformers >= 4.30` installed. If the `4.30` version
- is not yet available [on PyPI](https://pypi.org/project/transformers/), make sure to install `transformers` from
- source:
- ```
- pip install git+https://github.com/huggingface/transformers.git
- ```
- 
- Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).
- 
- ```py
- from datasets import load_dataset, Audio
- 
- # English
- stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
- stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
- en_sample = next(iter(stream_data))["audio"]["array"]
- 
- # French
- stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "fr", split="test", streaming=True)
- stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
- fr_sample = next(iter(stream_data))["audio"]["array"]
- ```
- 
- Next, we load the model and processor:
- 
- ```py
- from transformers import Wav2Vec2ForCTC, AutoProcessor
- import torch
- 
- model_id = "facebook/mms-1b-all"
- 
- processor = AutoProcessor.from_pretrained(model_id)
- model = Wav2Vec2ForCTC.from_pretrained(model_id)
- ```
- 
- Now we process the audio data, pass the processed audio data to the model, and transcribe the model output, just like we usually do for Wav2Vec2 models such as [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h):
- 
- ```py
- inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
- 
- with torch.no_grad():
-     outputs = model(**inputs).logits
- 
- # Greedy CTC decoding: take the most likely token at each frame
- ids = torch.argmax(outputs, dim=-1)[0]
- transcription = processor.decode(ids)
- # 'joe keton disapproved of films and buster also had reservations about the media'
- ```
- 
- We can now keep the same model in memory and simply switch out the language adapters by calling the convenient `load_adapter()` function for the model and `set_target_lang()` for the tokenizer. We pass the target language as an input: "fra" for French.
- 
- ```py
- # Switch the tokenizer vocabulary and the model's adapter weights to French
- processor.tokenizer.set_target_lang("fra")
- model.load_adapter("fra")
- 
- inputs = processor(fr_sample, sampling_rate=16_000, return_tensors="pt")
- 
- with torch.no_grad():
-     outputs = model(**inputs).logits
- 
- ids = torch.argmax(outputs, dim=-1)[0]
- transcription = processor.decode(ids)
- # "ce dernier est volé tout au long de l'histoire romaine"
- ```
- 
- The language can be switched out in the same way for all other supported languages. To list all supported language codes, have a look at:
- ```py
- processor.tokenizer.vocab.keys()
- ```
- 
- For more details, please have a look at [the official docs](https://huggingface.co/docs/transformers/main/en/model_doc/mms).
- 
 
 ## Model details
 
 
 
 # Massively Multilingual Speech (MMS) - Finetuned ASR - ALL
 
+ This is a checkpoint of the [MMS Zero-shot project](https://arxiv.org/abs/2407.17852): a model that can transcribe speech in almost any language using only a small amount of unlabeled text in the new language.
+ The approach is based on a multilingual acoustic model trained on data in 1,150 languages (leveraging the data of [MMS](https://ai.meta.com/blog/multilingual-model-speech-recognition/)), which outputs transcriptions in an intermediate representation ([uroman](https://github.com/isi-nlp/uroman) tokens).
+ A small amount of text in the new, unseen language is also mapped to this intermediate representation; at inference time, this mapping, together with an optional language model, makes it possible to transcribe the new language.
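
To make the intermediate representation concrete, here is a minimal sketch of the romanization step, assuming the Python port of [uroman](https://github.com/isi-nlp/uroman) (`pip install uroman`). The `Uroman` class and `romanize_string()` call follow that package's README, but treat the exact names as an assumption and check the repo if they differ:

```py
# Minimal sketch: map raw text in a new language to uroman tokens, the
# Latin-script intermediate representation the acoustic model is trained to emit.
# Assumes the Python port of isi-nlp/uroman (`pip install uroman`).
import uroman as ur

romanizer = ur.Uroman()

# A few words of unlabeled text in the target language (Greek as a stand-in)
words = ["Νέα", "Υόρκη"]

# Pairs of (original word, uroman form) are the kind of lexicon entries
# that let decoding map the model's uroman output back to real words.
lexicon = {w: romanizer.romanize_string(w) for w in words}
print(lexicon)
```

In the zero-shot pipeline, such a romanized word list, optionally combined with a language model, constrains decoding so that the model's uroman-token output is resolved into words of the new language.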
 
 ## Table of Contents
 
 - [Example](#example)
 - [Model details](#model-details)
 - [Additional links](#additional-links)
 
 ## Example
 
+ Please have a look at [the official space](https://huggingface.co/spaces/mms-meta/mms-zeroshot/tree/main) for an example of how to use the model.
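
For programmatic access, the Space can also be queried with the `gradio_client` package (`pip install gradio_client`). The snippet below only connects and lists the exposed endpoints, since the Space's exact input signature is not documented here; inspect it before calling `predict()`:

```py
# Minimal sketch: inspect the demo Space's API rather than guessing its
# endpoint names and parameters. Assumes `pip install gradio_client`.
from gradio_client import Client

client = Client("mms-meta/mms-zeroshot")
client.view_api()  # prints the callable endpoints and their parameters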
 
 ## Model details