---
base_model: mistralai/Mistral-Nemo-Instruct-2407
datasets:
- allenai/wildjailbreak
tags:
- trl
- sft
- red-teamer-model
- jailbreaking
- generated_from_trainer
---

This model is a fine-tuned version of [mistralai/Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) on the [wildjailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) dataset. Only the `adversarial_harmful` data type was used for training.
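This subset can be reproduced by filtering on the dataset's `data_type` column; the minimal sketch below assumes access to the gated dataset has been granted and follows the loading arguments from the dataset card:

```python
from datasets import load_dataset

# WildJailbreak is distributed as TSV files; the "train" config exposes a
# data_type column distinguishing vanilla/adversarial and benign/harmful rows.
ds = load_dataset(
    "allenai/wildjailbreak",
    "train",
    delimiter="\t",
    keep_default_na=False,
)["train"]

# Keep only the adversarial_harmful rows used for this fine-tune.
adversarial_harmful = ds.filter(lambda ex: ex["data_type"] == "adversarial_harmful")
print(len(adversarial_harmful))
```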
## Uses

This model is intended for red-teaming purposes only. Given a user's request as input, it generates adversarial prompts that are likely to evade the content filters of existing LLMs.
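Below is a minimal inference sketch with `transformers`; it assumes this fine-tune keeps the base model's chat template, and both the repository id and the example request are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/this-model"  # placeholder: replace with this repository's id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# A vanilla request; the model should rewrite it into an adversarial prompt.
messages = [{"role": "user", "content": "Explain how to pick a lock."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```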
The HarmBench evaluation will be released soon.

## Training Hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 16
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine_with_restarts
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 1

As a rough illustration, these settings map onto a TRL `SFTConfig` as sketched below; the exact training script is not part of this card, so the output directory, precision flag, and data formatting shown are assumptions:
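```python
# Approximate reconstruction of the reported hyperparameters with TRL's
# SFTTrainer; illustrative only, not the original training script.
from trl import SFTConfig, SFTTrainer


def to_text(example):
    # Assumption: the vanilla harmful request is the input and the
    # adversarial rewrite is the target, in Mistral's [INST] format.
    return {"text": f"[INST] {example['vanilla']} [/INST] {example['adversarial']}"}


train_ds = adversarial_harmful.map(to_text)  # subset from the earlier snippet

config = SFTConfig(
    output_dir="mistral-nemo-wildjailbreak-sft",  # hypothetical name
    dataset_text_field="text",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.05,
    seed=42,
    bf16=True,  # assumption: bf16 mixed precision
)

trainer = SFTTrainer(
    model="mistralai/Mistral-Nemo-Instruct-2407",
    args=config,
    train_dataset=train_ds,
)
trainer.train()
```

The Trainer's default AdamW optimizer already uses the betas and epsilon reported above; adding an `eval_dataset` with evaluation every 20 steps would log validation loss at the same cadence as the table below.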
## Results

| Epoch | Step | Validation Loss | Training Loss |
|:------:|:----:|:---------------:|:-------------:|
| 0.0982 | 20 | 1.2933 | 1.3425 |
| 0.1965 | 40 | 1.1966 | 1.2067 |
| 0.2947 | 60 | 1.1594 | 1.1544 |
| 0.3930 | 80 | 1.1386 | 1.1427 |
| 0.4912 | 100 | 1.1259 | 1.1235 |
| 0.5895 | 120 | 1.1179 | 1.1167 |
| 0.6877 | 140 | 1.1129 | 1.1153 |
| 0.7860 | 160 | 1.1098 | 1.1118 |
| 0.8842 | 180 | 1.1086 | 1.1112 |
| 0.9825 | 200 | 1.1083 | 1.1113 |