metadata
license: apache-2.0
datasets:
  - common_voice
language:
  - ja
tags:
  - audio

Fine-tuned Japanese Whisper model for speech recognition using whisper-small

This model is openai/whisper-small fine-tuned on Japanese speech from Common Voice, JVS, and JSUT. When using this model, make sure that your speech input is sampled at 16 kHz.
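Since the model expects 16 kHz input, it can help to check a file's sample rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module, and it assumes the input is an uncompressed WAV file. The helper names (`wav_sample_rate`, `needs_resampling`, `make_silent_wav`) are hypothetical and not part of this model or of transformers.

```python
import io
import wave

TARGET_SAMPLE_RATE = 16_000  # rate this model card asks for


def wav_sample_rate(source):
    """Return the sample rate (Hz) stored in a WAV header.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the WAV file is not already at the target rate."""
    return wav_sample_rate(source) != target_rate


def make_silent_wav(sample_rate, seconds=0.1):
    """Build a small in-memory mono 16-bit WAV of silence (demo input only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for rate in (16_000, 44_100):
        print(rate, "needs resampling:", needs_resampling(make_silent_wav(rate)))
```

Files that do need resampling can be loaded with `librosa.load(path, sr=16_000)`, which resamples to the requested rate while loading.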

Usage

The model can be used directly as follows.

from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
import librosa
import torch

LANG_ID = "ja"
MODEL_ID = "Ivydata/whisper-small-japanese"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
# Load the processor from the checkpoint this model was fine-tuned from
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="ja", task="transcribe"
)
model.config.suppress_tokens = []

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    batch["sampling_rate"] = sampling_rate
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
sample = test_dataset[0]
input_features = processor(sample["speech"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
# ['<|startoftranscript|><|ja|><|transcribe|><|notimestamps|>木村さんに電話を貸してもらいました。<|endoftext|>']

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
# ['木村さんに電話を貸してもらいました。']
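
The two `batch_decode` calls differ only in whether Whisper's `<|...|>` control tokens are kept in the output. As a rough illustration of what `skip_special_tokens=True` removes, the sketch below strips those tokens from an already decoded string with a regular expression. This is not how the tokenizer does it internally, and `strip_special_tokens` is a hypothetical helper for illustration only.

```python
import re

# Whisper control tokens are written as <|name|>, e.g. <|ja|> or <|endoftext|>.
SPECIAL_TOKEN_PATTERN = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(text):
    """Remove <|...|> control tokens from a decoded string (illustrative helper)."""
    return SPECIAL_TOKEN_PATTERN.sub("", text).strip()


# A raw string of the kind returned with skip_special_tokens=False
raw = (
    "<|startoftranscript|><|ja|><|transcribe|><|notimestamps|>"
    "木村さんに電話を貸してもらいました。<|endoftext|>"
)
clean = strip_special_tokens(raw)
print(clean)
```

In practice, passing `skip_special_tokens=True` to `batch_decode` is the simpler route; the helper is only useful when raw strings have already been stored.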