Model Card for Model ID

This is the baseline model for the Khummuang dialect in the Thai-dialect corpus.

The training recipe is based on the WSJ recipe in ESPnet.

Model Description

This model is a hybrid CTC/attention model with a pre-trained HuBERT encoder.
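
For readers unfamiliar with the term, hybrid CTC/attention models are usually trained with an interpolation of a CTC loss and an attention-decoder loss. The formula below is the standard formulation, not something stated in this card, and the weight $\lambda$ used for this model is not given here.

```latex
\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}}, \qquad 0 \le \lambda \le 1
```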

The model was pre-trained on Thai-Central, Khummuang, Korat, and Pattani, then fine-tuned on Khummuang, Korat, and Pattani (Experiment 3 in the paper).

We provide demo code for running inference with this model architecture on Colab here. (The code targets Thai-Central; please select the correct model accordingly.)

Evaluation

The evaluation metrics are CER and WER. Before WER evaluation, transcriptions were re-tokenized using the newmm tokenizer in PyThaiNLP.

In this repository, we also provide the vocabulary for building the newmm tokenizer, which can be done with this script:

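from pythainlp import Tokenizer

def get_tokenizer(vocab):
    # Build a newmm tokenizer whose dictionary is the provided vocabulary.
    custom_vocab = set(vocab)
    custom_tokenizer = Tokenizer(custom_vocab, engine='newmm')
    return custom_tokenizer

# Read the vocabulary file, one entry per line.
# Replace <vocab_path> with the path to the vocabulary file in this repository.
with open(<vocab_path>, 'r', encoding='utf-8') as f:
    vocab = [line.strip() for line in f]

custom_tokenizer = get_tokenizer(vocab)

# Replace <your_sentence> with the transcription string to tokenize.
tokenized_sentence_list = custom_tokenizer.word_tokenize(<your_sentence>)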
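
Once reference and hypothesis transcriptions are tokenized, WER and CER can be computed from the edit distance between the two sequences. The following is a minimal sketch, not the scoring script used for the paper: `edit_distance` and `error_rate` are hypothetical helper names, and treating every character (including spaces) as a unit for CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (tokens or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_unit in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_unit in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_unit == hyp_unit else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Error rate in percent: edit distance divided by reference length."""
    if len(ref_units) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)
```

For WER, pass the token lists returned by `custom_tokenizer.word_tokenize(...)` for the reference and the hypothesis. For CER, pass `list(reference_text)` and `list(hypothesis_text)`.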

The CER and WER results on the test set are:

| Micro CER | Macro CER | Survival CER | E-commerce CER | Micro WER | Macro WER | Survival WER | E-commerce WER |
|---|---|---|---|---|---|---|---|
| 5.35 | 5.65 | 6.29 | 5.02 | 7.53 | 8.73 | 11.38 | 6.09 |
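
The card does not define micro and macro averaging. The reported macro scores are, however, consistent with the unweighted mean of the Survival and E-commerce scores, up to rounding. The sketch below checks that arithmetic and shows the usual pooled definition of a micro average. The function names are hypothetical, and the micro definition is an assumption rather than something confirmed by this card.

```python
def macro_average(domain_rates):
    """Unweighted mean of per-domain error rates (in percent)."""
    if not domain_rates:
        raise ValueError("need at least one domain")
    return sum(domain_rates) / len(domain_rates)


def micro_average(domain_counts):
    """Pooled error rate in percent from (errors, reference_length) pairs."""
    total_errors = sum(errors for errors, _ in domain_counts)
    total_ref = sum(ref_len for _, ref_len in domain_counts)
    if total_ref == 0:
        raise ValueError("total reference length must be positive")
    return 100.0 * total_errors / total_ref


# Scores copied from the results above; 5.02 is read as the E-commerce CER.
reported = {
    "CER": {"macro": 5.65, "survival": 6.29, "e_commerce": 5.02},
    "WER": {"macro": 8.73, "survival": 11.38, "e_commerce": 6.09},
}

# Recompute the macro scores from the two domain scores.
macro_check = {
    metric: macro_average([scores["survival"], scores["e_commerce"]])
    for metric, scores in reported.items()
}
```

The micro scores cannot be re-derived the same way, because pooling needs the per-domain error counts and reference lengths, which this card does not list.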

Acknowledgement

We would like to thank the PMU-C grant (Thai Language Automatic Speech Recognition Interface for Community E-Commerce, C10F630122) for supporting this research. We would also like to acknowledge the Apex compute cluster team for providing compute support for this project.

Paper

Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition

@inproceedings{suwanbandit23_interspeech,
  author={Artit Suwanbandit and Burin Naowarat and Orathai Sangpetch and Ekapol Chuangsuwanich},
  title={{Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={4069--4073},
  doi={10.21437/Interspeech.2023-1828}
}