---
license: apache-2.0
datasets:
- mlabonne/orpo-dpo-mix-40k
language:
- en
library_name: transformers
base_model: h2oai/h2o-danube2-1.8b-base
tags:
- llama-factory
- unsloth
---

# h2o-danube2 with ChatML template

This model was first fine-tuned with [BAdam](https://arxiv.org/abs/2404.02827 "BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models") on [mlabonne/orpo-dpo-mix-40k](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k) using LLaMA-Factory. The dataset was used for supervised fine-tuning (SFT), not DPO.

## Quants

Much love, [mradermacher](https://huggingface.co/mradermacher)!

- [mradermacher/danube2-1.8b-Neural-GGUF](https://huggingface.co/mradermacher/danube2-1.8b-Neural-GGUF)

## Template

```jinja
<|im_start|>user
{{instruction}}<|im_end|>
<|im_start|>assistant
{{response}}<|im_end|>
```

## BAdam config

```yaml
### model
model_name_or_path: danube2-base-chatml

### method
stage: sft
do_train: true
finetuning_type: full
use_badam: true
badam_switch_mode: ascending
badam_switch_interval: 50
badam_verbose: 1
badam_start_block: 12
badam_mask_mode: scatter
seed: 314

### dataset
dataset: orpo_sft_mix_40k
template: hermes_chatml
cutoff_len: 8192
overwrite_cache: false
preprocessing_num_workers: 12

### output
output_dir: orpo-chatml-badam
logging_steps: 5
save_steps: 1
save_strategy: epoch
plot_loss: true
overwrite_output_dir: false

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 0.00001
num_train_epochs: 2
lr_scheduler_type: cosine
warmup_ratio: 0.01
pure_bf16: true
flash_attn: fa2

### eval
val_size: 0.01
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 1000
```

### BAdam training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.7474        | 0.3653 | 1000 | 0.8887          |
| 0.9106        | 0.7306 | 2000 | 0.8681          |
| 0.8121        | 1.0958 | 3000 | 0.8635          |
| 0.8636        | 1.4611 | 4000 | 0.8562          |
| 0.8           | 1.8264 | 5000 | 0.8565          |
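
## Prompt formatting sketch

The ChatML layout shown in the Template section can be reproduced by hand when a tokenizer chat template is not available. The following is a minimal sketch, not the model's official formatting code. The helper name `format_chatml`, the `messages` structure, and the `add_generation_prompt` flag are hypothetical names chosen for illustration. It also assumes turns are joined by a single newline and that nothing follows the final `<|im_end|>`.

```python
# Hypothetical helper: renders a list of chat messages in the ChatML layout
# from the Template section. Names and whitespace handling are assumptions.

def format_chatml(messages, add_generation_prompt=False):
    """Render messages ({"role": ..., "content": ...}) as a ChatML string."""
    turns = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>"
        for message in messages
    ]
    prompt = "\n".join(turns)
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        prompt += "\n<|im_start|>assistant\n"
    return prompt


if __name__ == "__main__":
    chat = [{"role": "user", "content": "What is BAdam?"}]
    print(format_chatml(chat, add_generation_prompt=True))
```

For inference, pass `add_generation_prompt=True` so the prompt ends with an open assistant turn. For building training text, leave it off and include the assistant message in `messages`.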