
DPO_TEST_1

This model is a DPO fine-tuned version of mistralai/Mistral-7B-Instruct-v0.2 on an unspecified dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0229
  • Rewards/chosen: -740.3076
  • Rewards/rejected: -1059.6395
  • Rewards/accuracies: 0.9988
  • Rewards/margins: 319.3320
  • Logps/rejected: -10817.7158
  • Logps/chosen: -7838.3896
  • Logits/rejected: -32.6170
  • Logits/chosen: -26.3151
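The reward metrics above are internally consistent with the standard DPO formulation (as in TRL's default "sigmoid" loss, which this card does not explicitly confirm but the metric names suggest): the margin is chosen reward minus rejected reward, and the per-pair loss is -log(sigmoid(margin)). A minimal sketch, using only the final evaluation numbers from this card:

```python
import math

# Final evaluation metrics from the list above
rewards_chosen = -740.3076
rewards_rejected = -1059.6395

# In DPO, the reward margin is chosen minus rejected; the rewards
# already include the beta scaling factor.
margin = rewards_chosen - rewards_rejected  # ~319.332, matching Rewards/margins

def dpo_pair_loss(margin: float) -> float:
    """Numerically stable -log(sigmoid(x)) == log(1 + exp(-x))."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

print(round(margin, 4))
print(dpo_pair_loss(margin))  # effectively 0 for such a large margin
```

Note that the reported evaluation loss (0.0229) is an average over many pairs with varying margins, so it need not equal the loss computed from the averaged rewards.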

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0002
  • train_batch_size: 2
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 4
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 2
  • training_steps: 25806
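The schedule implied by these hyperparameters can be sketched as follows. This mirrors the linear warmup plus linear decay of transformers' `get_linear_schedule_with_warmup` (an assumption based on `lr_scheduler_type: linear`, not stated explicitly in the card):

```python
# Hyperparameters from the list above
LEARNING_RATE = 2e-4
WARMUP_STEPS = 2
TRAINING_STEPS = 25806

def lr_at(step: int) -> float:
    """Linear warmup to LEARNING_RATE, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / max(1, WARMUP_STEPS)
    return LEARNING_RATE * max(
        0.0, (TRAINING_STEPS - step) / max(1, TRAINING_STEPS - WARMUP_STEPS)
    )

# Effective batch size: per-device batch * gradient accumulation steps
total_train_batch_size = 2 * 2  # == 4, as listed above

print(lr_at(0))               # 0.0 (start of warmup)
print(lr_at(WARMUP_STEPS))    # 2e-4 (peak, after the 2 warmup steps)
print(lr_at(TRAINING_STEPS))  # 0.0 (fully decayed)
```

With only 2 warmup steps out of 25,806 total, the schedule is almost entirely decay.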

Training results

| Training Loss | Epoch | Step  | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---------------|-------|-------|-----------------|----------------|------------------|--------------------|-----------------|----------------|--------------|-----------------|---------------|
| 37.7926       | 0.67  | 2867  | 2.8916          | -149.7567      | -197.0506        | 0.9359             | 47.2939         | -2191.8269     | -1932.8804   | 38.0669         | 4.1036        |
| 84.7229       | 1.33  | 5734  | 2.0247          | -327.4202      | -450.9706        | 0.9656             | 123.5504        | -4731.0264     | -3709.5146   | -17.6232        | -16.2614      |
| 0.4302        | 2.0   | 8601  | 0.2490          | -391.4300      | -536.8747        | 0.9923             | 145.4447        | -5590.0679     | -4349.6123   | -13.6537        | -12.7337      |
| 0.6952        | 2.67  | 11468 | 0.0587          | -606.4489      | -775.4740        | 0.9970             | 169.0251        | -7976.0605     | -6499.8027   | 8.1646          | -0.2018       |
| 0.2119        | 3.33  | 14335 | 0.2843          | -641.6364      | -925.0908        | 0.9907             | 283.4543        | -9472.2285     | -6851.6772   | -11.2088        | -13.0496      |
| 0.129         | 4.0   | 17202 | 0.1065          | -706.7910      | -1019.4420       | 0.9958             | 312.6511        | -10415.7412    | -7503.2227   | 29.4650         | 10.0032       |
| 0.1046        | 4.67  | 20069 | 0.1005          | -758.2514      | -1105.3041       | 0.9977             | 347.0525        | -11274.3594    | -8017.8281   | -37.3526        | -28.3912      |
| 0.0656        | 5.33  | 22936 | 0.0241          | -790.2775      | -1078.3324       | 0.9986             | 288.0548        | -11004.6445    | -8338.0889   | -7.1017         | -13.6854      |
| 0.0           | 6.0   | 25803 | 0.0229          | -740.3076      | -1059.6395       | 0.9988             | 319.3320        | -10817.7158    | -7838.3896   | -32.6170        | -26.3151      |
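As a quick sanity check on the table above, the logged reward margin at every checkpoint equals chosen reward minus rejected reward, up to 4-decimal rounding:

```python
# (Rewards/chosen, Rewards/rejected, Rewards/margins) per checkpoint,
# copied from the training-results table above
rows = [
    (-149.7567, -197.0506, 47.2939),
    (-327.4202, -450.9706, 123.5504),
    (-391.4300, -536.8747, 145.4447),
    (-606.4489, -775.4740, 169.0251),
    (-641.6364, -925.0908, 283.4543),
    (-706.7910, -1019.4420, 312.6511),
    (-758.2514, -1105.3041, 347.0525),
    (-790.2775, -1078.3324, 288.0548),
    (-740.3076, -1059.6395, 319.3320),
]

for chosen, rejected, margin in rows:
    # Allow small rounding error from the 4-decimal logged values
    assert abs((chosen - rejected) - margin) < 1e-3
```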

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • Pytorch 2.0.1
  • Datasets 2.16.1
  • Tokenizers 0.15.0

Model tree for RAIJAY/NBERT_DPO

This model is a PEFT adapter of mistralai/Mistral-7B-Instruct-v0.2.