AI & ML interests

Whisper fine-tuning, Hungarian-specific (LLM) models

The Hungarians Organization

We created this organization to collect the latest usable Hungarian-specific fine-tuned LLMs (Whisper, BART, Llama, etc.). Feel free to join the organization and push your models.

About the models

Hungarian-specific comparison test results (on google/fleurs):

| Original models | WER | CER | Normalized_WER | Normalized_CER | Dataset | Split | Runtime |
|---|---|---|---|---|---|---|---|
| openai/whisper-tiny | 102.46 | 50.31 | 103.37 | 50.19 | google/fleurs | test | 60.44 |
| openai/whisper-base | 89.08 | 41.3 | 93.13 | 41.56 | google/fleurs | test | 89.66 |
| openai/whisper-small | 48.67 | 15.1 | 45.55 | 15.39 | google/fleurs | test | 175.03 |
| openai/whisper-medium | 32.49 | 9.58 | 29.04 | 10.05 | google/fleurs | test | 393.56 |
| openai/whisper-large | 28.2 | 7.77 | 24.76 | 8.31 | google/fleurs | test | 675.77 |
| openai/whisper-large-v2 | 23.14 | 5.94 | 19.83 | 6.48 | google/fleurs | test | 772.64 |
| openai/whisper-large-v3 | 18.88 | 4.56 | 15.48 | 5.2 | google/fleurs | test | 667.66 |

| Finetuned models | WER | CER | Normalized_WER | Normalized_CER | Dataset | Split | Runtime |
|---|---|---|---|---|---|---|---|
| Hungarians/whisper-small-cv17-hu | 188.94 | 75.87 | 188.21 | 77.32 | google/fleurs | test | 472.43 |
| Hungarians/whisper-tiny-cv16-hu-v3 | 75.9 | 50.61 | 85.55 | 50.91 | google/fleurs | test | 65.17 |
| Hungarians/whisper-tiny-cv16-hu-v2 | 72.13 | 41.71 | 71.13 | 41.45 | google/fleurs | test | 50.48 |
| Hungarians/whisper-tiny-cv16-hu-final | 68.43 | 38.24 | 64.07 | 38.14 | google/fleurs | test | 41.48 |
| Hungarians/whisper-tiny-cv16-hu | 64.7 | 28.02 | 60.9 | 27.7 | google/fleurs | test | 42.35 |
| Hungarians/whisper-tiny-hu-cleaned | 59.67 | 26.01 | 54.72 | 25.73 | google/fleurs | test | 33.72 |
| Hungarians/whisper-tiny-cv17-hu | 58.76 | 24.86 | 56.1 | 24.72 | google/fleurs | test | 39.57 |
| sarpba/whisper-tiny-cv18-hu-cleaned | 52.74 | 24.02 | 50.09 | 23.91 | google/fleurs | test | 40.16 |
| Hungarians/whisper-base-cv16-hu-v2 | 51.41 | 20.97 | 46.79 | 20.93 | google/fleurs | test | 70.57 |
| Hungarians/whisper-base-hu-cleaned | 51.38 | 20.05 | 46.54 | 20.14 | google/fleurs | test | 70.84 |
| Hungarians/whisper-base-cv16-hu | 50.06 | 17.71 | 44.83 | 17.44 | google/fleurs | test | 65.49 |
| Hungarians/whisper-medium-cv16-hu | 49.77 | 24.98 | 47.79 | 25.4 | google/fleurs | test | 498.53 |
| Hungarians/whisper-base-cv16-hu-final | 48.37 | 16.28 | 43.84 | 16.31 | google/fleurs | test | 67.07 |
| Hungarians/whisper-base-cv17-hu | 45.61 | 14.95 | 40.79 | 14.94 | google/fleurs | test | 64.15 |
| sarpba/whisper-base-cv18-hu-cleaned | 42.09 | 13.67 | 36.66 | 13.53 | google/fleurs | test | 54.7 |
| Hungarians/whisper-small-cv16-hu-v2 | 41.07 | 13.16 | 36.59 | 13.21 | google/fleurs | test | 201.28 |
| Hungarians/Whisper-small-hu-cleaned | 39.12 | 13.91 | 41.15 | 14.11 | google/fleurs | test | 274.09 |
| Hungarians/whisper-small-cv16-hu | 37.5 | 11.31 | 32.54 | 11.35 | google/fleurs | test | 608.28 |
| Hungarians/whisper-small-cv16-hu-v1.5 | 35.61 | 10.99 | 30.33 | 11.04 | google/fleurs | test | 605.69 |
| Hungarians/whisper-medium-hu-cleaned | 26.26 | 6.8 | 21.97 | 7.31 | google/fleurs | test | 442.53 |

| Our best models | WER | CER | Normalized_WER | Normalized_CER | Dataset | Split | Runtime |
|---|---|---|---|---|---|---|---|
| sarpba/whisper-tiny-cv18-hu-cleaned | 52.74 | 24.02 | 50.09 | 23.91 | google/fleurs | test | 40.16 |
| sarpba/whisper-base-cv18-hu-cleaned | 42.09 | 13.67 | 36.66 | 13.53 | google/fleurs | test | 54.7 |
| sarpba/whisper-small-cv18-hu-cleaned | 29.75 | 9.23 | 25.19 | 9.38 | google/fleurs | test | 281.95 |
| sarpba/whisper-medium-cv18-hu-cleaned | 23.89 | 6.79 | 19.81 | 7.3 | google/fleurs | test | 541.17 |
| Hungarians/whisper-large-v2-hu-cleaned | 21.82 | 5.51 | 18.39 | 6.15 | google/fleurs | test | 725.31 |
The last three rows are results for INT8-quantized models.
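For readers unfamiliar with the metrics in the table, here is a minimal sketch of how WER and CER are commonly computed: edit distance between reference and hypothesis, divided by the reference length, times 100. It also shows why WER can exceed 100 (as for openai/whisper-tiny above): inserted words count as errors. The `normalize` function is a hypothetical stand-in; the exact normalizer behind the Normalized_WER and Normalized_CER columns is not stated here.

```python
import string


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference word count."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def normalize(text):
    # Assumed normalization: lowercase, drop ASCII punctuation, collapse spaces.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


reference = "Jó reggelt kívánok, uram!"
hypothesis = "jó reggelt kívánok uram"
print(wer(reference, hypothesis))                        # 75.0
print(wer(normalize(reference), normalize(hypothesis)))  # 0.0
print(wer("a", "a b c"))                                 # 200.0 (two insertions)
```

Casing and punctuation alone account for the gap between the raw and normalized scores in this toy example, which is why both variants are reported.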

Quantization loss example

| Model | WER | CER | Normalized_WER | Normalized_CER | Dataset | Split | Runtime |
|---|---|---|---|---|---|---|---|
| Hungarians/whisper-base-cv17-hu | 45.61 | 14.95 | 40.79 | 14.94 | google/fleurs | test | 243.97 |
| float16 | 50.55 | 21.01 | 46.81 | 20.99 | google/fleurs | test | 301.41 |
| float32 | 49.69 | 20.77 | 47.38 | 20.74 | google/fleurs | test | 339.15 |
| int8_float32 | 46.71 | 16.67 | 42.51 | 16.51 | google/fleurs | test | 246.06 |
| int8_float16 | 46.5 | 17.13 | 42.23 | 16.92 | google/fleurs | test | 242.12 |
| int8_bfloat16 | 45.7 | 15.06 | 41.03 | 15.04 | google/fleurs | test | 148.05 |
| bfloat16 | 45.6 | 15 | 40.88 | 14.97 | google/fleurs | test | 144.87 |
| int8 | 45.54 | 16.55 | 42.4 | 16.44 | google/fleurs | test | 236.97 |

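As you can see, the INT8 quantized model reaches a slightly better WER than the original model (45.54 vs. 45.61), though its CER (16.55 vs. 14.95) and normalized scores are worse.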

Lower values are better!
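To make the quantization comparison easier to read, the following sketch computes each variant's WER and CER difference from the Hungarians/whisper-base-cv17-hu baseline. The numbers are copied from the quantization table above; the `deltas` helper is just an illustrative name.

```python
# Baseline (WER, CER) for Hungarians/whisper-base-cv17-hu, from the table above.
BASELINE = (45.61, 14.95)

# (WER, CER) for each quantization variant, from the same table.
VARIANTS = {
    "float16": (50.55, 21.01),
    "float32": (49.69, 20.77),
    "int8_float32": (46.71, 16.67),
    "int8_float16": (46.5, 17.13),
    "int8_bfloat16": (45.7, 15.06),
    "bfloat16": (45.6, 15.0),
    "int8": (45.54, 16.55),
}


def deltas(baseline, variants):
    """Return {variant: (WER change, CER change)}; negative means better."""
    base_wer, base_cer = baseline
    return {
        name: (round(w - base_wer, 2), round(c - base_cer, 2))
        for name, (w, c) in variants.items()
    }


for name, (d_wer, d_cer) in deltas(BASELINE, VARIANTS).items():
    print(f"{name:14s} WER {d_wer:+.2f}  CER {d_cer:+.2f}")
```

Only `int8` and `bfloat16` come out with a negative WER change, and every variant has a positive CER change, so the gain from quantization is marginal and limited to WER.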

For Home Assistant with faster-whisper, use the int8, fp16, or fp32 models from the subfolders.
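Outside Home Assistant, the same subfolder models can be loaded directly with the faster-whisper Python package. The sketch below rests on assumptions: that the subfolders are literally named `int8`, `fp16`, and `fp32`, that each holds a complete CTranslate2 model, and that the repository id you pass actually ships them. `subfolder_pattern`, `load_model`, and `transcribe` are illustrative helper names, not part of any library.

```python
import os

# Assumed subfolder names, taken from the note above.
PRECISIONS = ("int8", "fp16", "fp32")


def subfolder_pattern(precision):
    """Return the download pattern that selects one precision subfolder."""
    if precision not in PRECISIONS:
        raise ValueError(f"unknown precision: {precision!r}")
    return f"{precision}/*"


def load_model(repo_id, precision="int8", device="cpu"):
    """Download one subfolder of the repo and load it with faster-whisper."""
    # Imported lazily so the helper above works without these packages.
    from huggingface_hub import snapshot_download
    from faster_whisper import WhisperModel

    local_dir = snapshot_download(
        repo_id, allow_patterns=[subfolder_pattern(precision)]
    )
    # "default" keeps the quantization the model was converted with.
    return WhisperModel(
        os.path.join(local_dir, precision), device=device, compute_type="default"
    )


def transcribe(model, audio_path):
    """Transcribe a Hungarian audio file and return the joined text."""
    segments, _info = model.transcribe(audio_path, language="hu")
    return " ".join(segment.text.strip() for segment in segments)


# Example (needs network access and an audio file; both names are placeholders):
# model = load_model("Hungarians/whisper-base-cv17-hu", precision="int8")
# print(transcribe(model, "sample.wav"))
```

Downloading only the matching subfolder keeps the transfer small, since the other precisions are skipped.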

Some further notes:

The finished models are always here; my own (sarpba) repository holds the half-finished or experimental ones.

For Home Assistant faster-whisper, the easiest way to use the int8, fp16, and fp32 CT2-quantized models in the subfolders is with cociweb's custom_whisper add-on.

Community

If you would like to join our Hungarian-language discussion group, where you can ask questions and share your experiences, or if you want to be part of a group of Hungarian LLM experts, join our Facebook group: Hungarian-LLM.