
Mixsmol-4x400M-v0.1 by Ontocord

This is the first checkpoint (epoch 1) of Mixsmol-4x400M-v0.1. Note that this is an experiment in data mixing; we therefore trained the model on only 50B tokens (95% English and 5% Vietnamese) to test the following:

  • Reasoning capabilities through pretraining on high-quality synthetic textbook data
  • Cross-lingual understanding through pretraining on machine translation and multilingual, multi-task data

After verifying our hypotheses with this run, we will schedule a second run with more data and compute so the model can reach its full capability.
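For reference, below is a minimal usage sketch. It assumes the checkpoint loads through the standard transformers `AutoModelForCausalLM` interface; treat it as an illustration, not a tested recipe.

```python
# Minimal usage sketch (assumes a standard transformers causal-LM interface;
# untested against this specific checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vilm/Mixsmol-4x400M-v0.1-epoch1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Generate a short continuation.
inputs = tokenizer("Photosynthesis is the process by which", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```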

Data

  • Synthetic Textbooks: 8M samples
  • RefinedWeb: 1M samples
  • RedPajama-v2: 500K samples
  • MathPile: everything
  • The Pile: MiniPile subset
  • GoodWiki
  • The Stack Smol XL
  • The Vault: train_small split
  • Instruction Pretraining: 250K samples
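
To make the data-mixing idea concrete, here is a hypothetical sketch of a weighted mixture using the Hugging Face datasets library. The toy in-memory datasets stand in for the corpora listed above, and the sampling probabilities are illustrative only, not the exact recipe used for this model.

```python
# Hypothetical data-mixing sketch: toy datasets stand in for the real corpora,
# and the probabilities are illustrative, not the ratios used in training.
from datasets import Dataset, interleave_datasets

textbooks = Dataset.from_dict({"text": ["synthetic textbook passage"] * 80})
web = Dataset.from_dict({"text": ["RefinedWeb document"] * 10})
math = Dataset.from_dict({"text": ["MathPile document"] * 10})

# Sample from each source with fixed probabilities (roughly following the
# relative sample counts above), oversampling until all sources are exhausted.
mix = interleave_datasets(
    [textbooks, web, math],
    probabilities=[0.8, 0.1, 0.1],
    seed=42,
    stopping_strategy="all_exhausted",
)
print(len(mix), mix[0]["text"])
```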
Evaluation

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| arc_challenge | Yaml | none | 25 | acc | 0.1937 | ± 0.0115 |
| arc_challenge | Yaml | none | 25 | acc_norm | 0.2329 | ± 0.0124 |
| hellaswag | Yaml | none | 10 | acc | 0.2856 | ± 0.0045 |
| hellaswag | Yaml | none | 10 | acc_norm | 0.3090 | ± 0.0046 |
| mmlu | N/A | none | 0 | acc | 0.2536 | ± 0.0483 |
| - humanities | N/A | none | 5 | acc | 0.2408 | ± 0.0341 |
| - other | N/A | none | 5 | acc | 0.2475 | ± 0.0443 |
| - social_sciences | N/A | none | 5 | acc | 0.2567 | ± 0.0456 |
| - stem | N/A | none | 5 | acc | 0.2756 | ± 0.0653 |
| truthfulqa_mc2 | Yaml | none | 0 | acc | 0.3909 | ± 0.0148 |
| winogrande | Yaml | none | 5 | acc | 0.5107 | ± 0.0140 |
| gsm8k | Yaml | get-answer | 5 | exact_match | 0 | ± 0 |
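
The table above follows the output format of EleutherAI's lm-evaluation-harness. As a hedged sketch, comparable numbers for one row (25-shot arc_challenge) could be produced through the harness's Python API; the exact harness version and task configs we used may differ.

```python
# Sketch of reproducing the arc_challenge row with lm-evaluation-harness
# (lm_eval >= 0.4 API; the harness version/config used here may differ).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=vilm/Mixsmol-4x400M-v0.1-epoch1,dtype=bfloat16",
    tasks=["arc_challenge"],
    num_fewshot=25,
)
print(results["results"]["arc_challenge"])
```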

Contribution

This work is a shared contribution among Ontocord, BEE-spoke-data, and VILM.

Model size: 1.77B parameters (BF16, safetensors)
