
reazonspeech-nemo-v2

reazonspeech-nemo-v2 is an automatic speech recognition model trained on the ReazonSpeech v2.0 corpus.

This model supports inference on long-form Japanese audio clips up to several hours long.

Model Architecture

The model uses the improved Conformer architecture described in "Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition".

  • Subword-based RNN-T model. The total parameter count is 619M.

  • The encoder uses Longformer attention with a local context size of 256 and a single global token.

  • The decoder has a vocabulary of 3000 tokens built with a SentencePiece unigram tokenizer.
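
To illustrate the attention pattern named above, here is a minimal sketch of a Longformer-style mask: each frame attends to a local window around itself, and one global token attends to, and is attended by, every position. The function name `local_global_mask` and its parameters are hypothetical and are not part of NeMo or reazonspeech. How the context size of 256 is split between left and right context is not stated here, so the window is left as a parameter and a tiny sequence is used for readability.

```python
import numpy as np

def local_global_mask(seq_len, one_sided_window, num_global=1):
    """Boolean mask where mask[i, j] is True if position i may attend to j.

    Illustrative only: assumes a symmetric window and global tokens placed
    at the start of the sequence.
    """
    idx = np.arange(seq_len)
    # Local band: |i - j| <= one_sided_window
    mask = np.abs(idx[:, None] - idx[None, :]) <= one_sided_window
    # Global tokens see every position and are seen by every position.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask

mask = local_global_mask(seq_len=8, one_sided_window=2)
print(mask.astype(int))
```

Because each row has a bounded number of True entries apart from the global column, the attention cost grows linearly with sequence length rather than quadratically, which is what makes multi-hour inputs tractable.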

We trained this model for 1 million steps using the AdamW optimizer with a Noam annealing schedule.
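
For readers unfamiliar with the Noam schedule, the sketch below shows its commonly used form: linear warmup followed by inverse-square-root decay. The function `noam_lr` and the values of `d_model`, `warmup_steps`, and `scale` are assumptions chosen for illustration; they are not the hyperparameters used to train this model.

```python
def noam_lr(step, d_model=512, warmup_steps=4000, scale=1.0):
    """Noam learning rate at a given step (hypothetical defaults)."""
    step = max(step, 1)  # avoid division by zero at step 0
    # Linear ramp while step < warmup_steps, then decay as step ** -0.5.
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 100_000, 1_000_000):
    print(s, noam_lr(s))
```

The rate peaks exactly at `warmup_steps`, where the two branches of the `min` are equal, and decreases monotonically afterwards.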

Usage

We recommend using this model through our reazonspeech library.

from reazonspeech.nemo.asr import load_model, transcribe, audio_from_path

audio = audio_from_path("speech.wav")
model = load_model()
ret = transcribe(model, audio)
print(ret.text)

License

Apache License 2.0
