---
license: cc-by-nc-4.0
tags:
- speech processing
- self-supervision
- african languages
---
## Model description
This self-supervised speech model, SSA-HuBERT-base-60k, is based on the HuBERT Base architecture (~95M parameters) [1].
It was trained on nearly 60,000 hours of speech segments and covers 21 languages and variants spoken in Sub-Saharan Africa.
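For illustration, here is a minimal sketch of extracting frame-level representations with the `transformers` library. It assumes the checkpoint is published in the standard HuBERT/transformers format with a preprocessor config; the repository id below is a placeholder, not the official one.

```python
# Minimal sketch (assumptions: transformers-format checkpoint, placeholder repo id).
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel

repo_id = "<org>/SSA-HuBERT-base-60k"  # placeholder repository id
feature_extractor = AutoFeatureExtractor.from_pretrained(repo_id)
model = HubertModel.from_pretrained(repo_id)
model.eval()

# Load a mono waveform and resample to 16 kHz (HuBERT Base expects 16 kHz input).
waveform, sample_rate = torchaudio.load("example.wav")
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = feature_extractor(
    waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt"
)
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (1, frames, 768)
print(hidden_states.shape)
```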
	
### Pretraining data
- Dataset: The training dataset was composed of both studio recordings (controlled environment, prepared talks) and street interviews (noisy environment, spontaneous speech). 

- Languages: Bambara (bam), Dyula (dyu), French (fra), Fula (ful), Fulfulde (ffm), Fulfulde (fuh), Gulmancema (gux), Hausa (hau), Kinyarwanda (kin), Kituba (ktu), Lingala (lin), Luba-Lulua (lua), Mossi (mos), Maninkakan (mwk), Sango (sag), Songhai (son), Swahili (swc), Swahili (swh), Tamasheq (taq), Wolof (wol), Zarma (dje).

## ASR fine-tuning
The SpeechBrain toolkit (Ravanelli et al., 2021) is used to fine-tune the model.
Fine-tuning is done for each language using the FLEURS dataset [2].
The pretrained model (SSA-HuBERT-base-60k) is used as a speech encoder and is fully fine-tuned, with two 1024-unit linear layers and a softmax output on top.
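As an illustration of the head described above, here is a plain PyTorch sketch (not the SpeechBrain recipe itself). The activation function and vocabulary size are assumptions, and CTC-style log-softmax outputs are assumed for greedy decoding.

```python
# Sketch of the ASR head: two 1024-unit linear layers and a softmax output on
# top of the (fully fine-tuned) SSA-HuBERT encoder. Activation and vocab size
# are illustrative assumptions, not taken from the recipe.
import torch
import torch.nn as nn

class ASRHead(nn.Module):
    def __init__(self, encoder_dim: int = 768, hidden_dim: int = 1024, vocab_size: int = 32):
        super().__init__()
        self.dnn = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LeakyReLU(),
        )
        self.out = nn.Linear(hidden_dim, vocab_size)  # vocab_size includes the CTC blank

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, frames, encoder_dim) from the speech encoder
        logits = self.out(self.dnn(encoder_states))
        return torch.log_softmax(logits, dim=-1)  # log-probabilities for CTC loss / greedy decoding
```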
 
## License
This model is released under the CC BY-NC 4.0 license.

## Publication
This model was presented at AfricaNLP 2024.
The associated paper is available here: [Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context](https://openreview.net/forum?id=zLOhcft2E7)

### Citation
Please cite our paper when using the SSA-HuBERT-base-60k model:

	Caubrière, A., & Gauthier, E. (2024). Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context. In 5th Workshop on African Natural Language Processing (AfricaNLP 2024).

**Bibtex citation:**
```
@inproceedings{caubriere2024ssaspeechssl,
    title={Africa-Centric Self-Supervised Pretraining for Multilingual Speech Representation in a Sub-Saharan Context},
    author={Antoine Caubri{\`e}re and Elodie Gauthier},
    booktitle={5th Workshop on African Natural Language Processing},
    year={2024},
    url={https://openreview.net/forum?id=zLOhcft2E7}
}
```

## Results
The following results are obtained with greedy decoding **(no language model rescoring)**.
Character error rates (CER) and word error rates (WER), in %, are given in the table below for the 20 Sub-Saharan African languages of the FLEURS dataset; a minimal sketch of the metric computation follows the table.

| **Language**      | **CER (%)**   | **CER (%), joint fine-tuning**   | **WER (%)**   | **WER (%), joint fine-tuning**   |
| :----------------- | :--------- | :--------- | :--------- | :--------- |
| **Afrikaans**     | 23.3      | 20.3 | 68.4      | 62.6 | 
| **Amharic**       | 15.9      | 14.9 | 52.7      | 49.0 |
| **Fula**          | 21.2      | 17.8 | 61.9      | 56.4 |
| **Ganda**         | 11.5      | 10.7 | 52.8      | 50.3 |
| **Hausa**         | 10.5      |  9.0 | 32.5      | 29.4 |
| **Igbo**          | 19.7      | 17.2 | 57.5      | 52.9 |
| **Kamba**         | 16.1      | 15.6 | 53.9      | 53.7 |
| **Lingala**       | 8.7       |  6.9 | 24.7      | 20.9 |
| **Luo**           | 9.9       |  8.2 | 38.9      | 34.9 |
| **Northern Sotho** | 13.5      | 11.7 | 43.2      | 38.9 |
| **Nyanja**        | 13.3      | 10.9 | 54.2      | 48.3 |
| **Oromo**         | 22.8      | 20.1 | 78.1      | 74.8 |
| **Shona**         | 11.6      |  8.3 | 50.2      | 39.3 |
| **Somali**        | 21.6      | 19.7 | 64.9      | 60.3 |
| **Swahili**       | 7.1       |  5.5 | 23.8      | 20.3 |
| **Umbundu**       | 21.7      | 18.8 | 61.7      | 54.2 |
| **Wolof**         | 19.4      | 17.0 | 55.0      | 50.7 |
| **Xhosa**         | 11.9      |  9.9 | 51.6      | 45.9 |
| **Yoruba**        | 24.3      | 23.5 | 67.5      | 65.7 |
| **Zulu**          | 12.2      |  9.6 | 53.4      | 44.9 |
| *Overall average* | *15.8*    | *13.8* | *52.3*    | *47.7* |
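The sketch below illustrates how such scores can be computed: greedy (argmax) CTC decoding without any language model, followed by CER/WER computation with the `jiwer` package. The toy vocabulary and the dummy log-probabilities are illustrative assumptions, not part of the released recipe.

```python
# Greedy CTC decoding + CER/WER scoring sketch (toy inputs for illustration).
import torch
import jiwer

def ctc_greedy_decode(log_probs: torch.Tensor, id2char: dict, blank_id: int = 0) -> str:
    """Greedy CTC decoding for one utterance. log_probs: (frames, vocab)."""
    ids = log_probs.argmax(dim=-1).tolist()
    # Standard CTC post-processing: merge repeated tokens, then drop blanks.
    collapsed = [cur for cur, prev in zip(ids, [None] + ids[:-1]) if cur != prev]
    return "".join(id2char[i] for i in collapsed if i != blank_id)

# Toy vocabulary and dummy frame-level log-probabilities.
id2char = {0: "<blank>", 1: "a", 2: "b", 3: " "}
log_probs = torch.log_softmax(torch.randn(20, len(id2char)), dim=-1)

reference = "ab a"
hypothesis = ctc_greedy_decode(log_probs, id2char)
print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
```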


## Reproducibility
We provide a notebook to reproduce the ASR experiments reported in our paper; see `SB_ASR_FLEURS_finetuning.ipynb`.
Using the `ASR_FLEURS-swahili_hf.yaml` config file, you can run the recipe on Swahili.
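As an optional sanity check before running the notebook, the snippet below loads the FLEURS Swahili test split with the Hugging Face `datasets` library; the `sw_ke` configuration name is assumed from the public `google/fleurs` dataset card.

```python
# Sketch: fetch the FLEURS Swahili test split (config name "sw_ke" is an assumption).
from datasets import load_dataset

fleurs_sw = load_dataset("google/fleurs", "sw_ke", split="test")
sample = fleurs_sw[0]
print(sample["transcription"])           # reference text
print(sample["audio"]["sampling_rate"])  # FLEURS audio is distributed at 16 kHz
```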

## References
[1] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021. doi: 10.1109/TASLP.2021.3122291.
[2] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 798–805, 2022. doi: 10.1109/SLT54892.2023.10023141.