Tags: PEFT, Safetensors, qwen2, alignment-handbook, trl, dpo, Generated from Trainer

Qwen2-7B-Instruct-SPPO-Function-call-v2.12

This model is a fine-tuned version of slm-research-vn/Qwen2-7B-Instruct-SPPO-Function-call-v2.8, trained with DPO on the slm-research-vn/dpo-format-function-calling-v4, slm-research-vn/dpo-format-glaive-code-assistant-v3-with-mistral-large-slm-iter4, and argilla/dpo-mix-7k datasets. It achieves the following results on the evaluation set:

  • Loss: 0.3322
  • Rewards/chosen: 0.5523
  • Rewards/rejected: -0.7005
  • Rewards/accuracies: 0.9017
  • Rewards/margins: 1.2528
  • Logps/rejected: -278.7327
  • Logps/chosen: -129.0717
  • Logits/rejected: -0.5984
  • Logits/chosen: -0.7738
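
Since this repository contains a PEFT adapter on top of slm-research-vn/Qwen2-7B-Instruct-SPPO-Function-call-v2.8, it can be loaded with the peft and transformers libraries. Below is a minimal inference sketch; the prompt and generation settings are illustrative, and if the tokenizer is not bundled with the adapter repository it can be loaded from the base model instead.

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "khongtrunght/Qwen2-7B-Instruct-SPPO-Function-call-v2.12"

# Loads the base model and applies the adapter weights in one call.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id, torch_dtype="auto", device_map="auto"
)
# Assumes the tokenizer (and chat template) ships with the adapter repo;
# otherwise load it from the base model id.
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

messages = [{"role": "user", "content": "What is the weather in Hanoi today?"}]  # illustrative prompt
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```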

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-06
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • total_eval_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
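
The dpo and trl tags indicate the adapter was trained with Direct Preference Optimization via TRL (following the alignment-handbook recipes). The sketch below shows roughly how the hyperparameters above map onto a trl DPOConfig/DPOTrainer run; it is an assumption-laden outline, not the actual training script: the real PEFT settings, dataset mixing and formatting, and the 8-GPU distributed launch are omitted, and the LoraConfig values shown are illustrative.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "slm-research-vn/Qwen2-7B-Instruct-SPPO-Function-call-v2.8"
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

# One of the three preference datasets listed above; converting it into the
# prompt/chosen/rejected format expected by DPOTrainer is omitted here.
train_dataset = load_dataset("argilla/dpo-mix-7k", split="train")

# Illustrative adapter settings; the actual PEFT configuration is not stated in the card.
peft_config = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32)

args = DPOConfig(
    output_dir="qwen2-7b-dpo",          # illustrative output path
    learning_rate=1e-6,
    per_device_train_batch_size=1,       # 8 GPUs x grad accumulation 4 -> total train batch 32
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```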

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6806 | 0.0916 | 100 | 0.6816 | 0.0303 | 0.0099 | 0.6445 | 0.0205 | -264.5260 | -139.5110 | -0.5879 | -0.7638 |
| 0.5704 | 0.1832 | 200 | 0.5993 | 0.3495 | 0.1473 | 0.8237 | 0.2023 | -261.7780 | -133.1277 | -0.5881 | -0.7638 |
| 0.5032 | 0.2749 | 300 | 0.5313 | 0.5795 | 0.1792 | 0.8526 | 0.4003 | -261.1383 | -128.5271 | -0.5893 | -0.7651 |
| 0.4548 | 0.3665 | 400 | 0.4727 | 0.6406 | 0.0523 | 0.8844 | 0.5884 | -263.6780 | -127.3051 | -0.5901 | -0.7660 |
| 0.3823 | 0.4581 | 500 | 0.4235 | 0.6412 | -0.1314 | 0.8931 | 0.7726 | -267.3507 | -127.2934 | -0.5914 | -0.7672 |
| 0.3513 | 0.5497 | 600 | 0.3843 | 0.6087 | -0.3415 | 0.9133 | 0.9502 | -271.5532 | -127.9448 | -0.5936 | -0.7693 |
| 0.3444 | 0.6413 | 700 | 0.3571 | 0.5871 | -0.5028 | 0.9104 | 1.0898 | -274.7784 | -128.3763 | -0.5965 | -0.7721 |
| 0.3486 | 0.7329 | 800 | 0.3427 | 0.5681 | -0.6155 | 0.9104 | 1.1836 | -277.0341 | -128.7559 | -0.5971 | -0.7725 |
| 0.3317 | 0.8246 | 900 | 0.3349 | 0.5586 | -0.6739 | 0.9133 | 1.2326 | -278.2013 | -128.9451 | -0.5993 | -0.7748 |
| 0.3077 | 0.9162 | 1000 | 0.3328 | 0.5530 | -0.6974 | 0.9075 | 1.2504 | -278.6715 | -129.0585 | -0.5998 | -0.7754 |

Framework versions

  • PEFT 0.12.0
  • Transformers 4.44.0
  • Pytorch 2.3.1+cu121
  • Datasets 2.20.0
  • Tokenizers 0.19.1