---
datasets:
- mozilla-foundation/common_voice_13_0
language:
- zh
base_model:
- openai/whisper-large-v3-turbo
pipeline_tag: automatic-speech-recognition
---

# Model Card for sandy1990418/whisper-large-v3-turbo-chinese


This model is a fine-tuned version of Whisper-large-v3-turbo for Mandarin automatic speech recognition (ASR). It was fine-tuned on the Common Voice 13.0 dataset using PEFT with LoRA, which keeps training efficient while preserving the capabilities of the base model. Word error rate (WER) on the evaluation sets, before and after fine-tuning:

- Common Voice 13.0 (test):
  - WER before fine-tuning: 77.08
  - WER after fine-tuning: 41.47
- Common Voice 16.1 (test):
  - WER before fine-tuning: 77.57
  - WER after fine-tuning: 41.66
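
The card does not say how the error rates above were computed, so the following is only a minimal sketch of an edit-distance-based error rate, not the evaluation script used for this model. The function names (`edit_distance`, `error_rate`, `relative_reduction`) and the choice to tokenize Mandarin text per character are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="char"):
    """Error rate in percent: edit distance divided by reference length.

    unit="char" treats every non-space character as a token (an assumption
    that suits unsegmented Mandarin text); unit="word" splits on whitespace.
    """
    if unit == "char":
        ref = [c for c in reference if not c.isspace()]
        hyp = [c for c in hypothesis if not c.isspace()]
    else:
        ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(before, after):
    """Relative error reduction in percent."""
    return 100.0 * (before - after) / before


# Relative improvement implied by the Common Voice 13.0 numbers reported above.
print(round(relative_reduction(77.08, 41.47), 1))
```

With the reported scores, the relative reduction works out to roughly 46% on both test sets.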

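
To illustrate why LoRA fine-tuning is parameter-efficient, here is a small pure-Python sketch of the low-rank update `W + (alpha / r) * B @ A`. The rank, scaling factor, and layer shape below are hypothetical and are not the settings used to train this model. The 1280-wide projection assumes the hidden size of the Whisper large family.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


def lora_merge(W, A, B, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) using plain nested lists.

    W is d_out x d_in (frozen base weight), B is d_out x rank,
    and A is rank x d_in (the two trainable low-rank factors).
    """
    scale = alpha / rank
    d_out, d_in = len(W), len(W[0])
    return [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
            for j in range(d_in)
        ]
        for i in range(d_out)
    ]


# Hypothetical setting: rank 8 on a single 1280 x 1280 projection.
full = 1280 * 1280
lora = lora_trainable_params(1280, 1280, rank=8)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4f}")
```

Under these assumed values the adapter trains about 1.25% as many parameters as the full matrix for that layer, and the merged weight can be used at inference time without extra latency.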

## Uses

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# Use the GPU with half precision when available, otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "sandy1990418/whisper-large-v3-turbo-chinese"

# Load the fine-tuned checkpoint and move it to the selected device.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Demo audio: this LibriSpeech sample is English speech.
# Swap in a Mandarin recording to exercise the fine-tuned model.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

# Transcribe the sample and print the recognized text.
result = pipe(sample)
print(result["text"])
```