---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_gpt2_attn
  results: []
---

# distily_bench_gpt2_attn

This student model is distilled from the teacher model [gpt2](https://huggingface.co/gpt2) on an unspecified dataset, using the [Distily](https://github.com/lapp0/distily) library.

It achieves the following results on the evaluation set:
- eval_enwikippl: 201.1306
- eval_frwikippl: 1264.6479
- eval_zhwikippl: 692.1948
- eval_loss: 1.2818
- eval_runtime: 17.7588
- eval_samples_per_second: 56.31
- eval_steps_per_second: 7.039

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=2.0, loss_fn=jsd, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0

### Resource Usage

Peak GPU Memory: 8.2195 GB

### Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 55429.6875 | 57698.8047 | 6.0515 | 17.8022 | 56.173 | 7.022 | 56988.9141 |
| 1000 | 0.0808 | 674.2750 | 4349.4961 | 1.9954 | 17.6958 | 56.51 | 7.064 | 19961.1016 |
| 2000 | 0.1616 | 486.4362 | 3202.9236 | 1.8123 | 17.6829 | 56.552 | 7.069 | 1855.9937 |
| 3000 | 0.2424 | 398.6795 | 2596.3247 | 1.6971 | 17.7713 | 56.27 | 7.034 | 975.8663 |
| 4000 | 0.3232 | 350.7618 | 2375.6218 | 1.6039 | 17.792 | 56.205 | 7.026 | 869.2946 |
| 5000 | 0.4040 | 302.1302 | 1985.2614 | 1.5168 | 17.7421 | 56.363 | 7.045 | 967.0451 |
| 6000 | 0.4848 | 263.9246 | 1671.2548 | 1.4466 | 17.7779 | 56.25 | 7.031 | 822.3207 |
| 7000 | 0.5657 | 242.7309 | 1513.9550 | 1.3874 | 17.8314 | 56.081 | 7.01 | 750.5385 |
| 8000 | 0.6465 | 221.2715 | 1384.2833 | 1.3367 | 17.7638 | 56.294 | 7.037 | 824.5199 |
| 9000 | 0.7273 | 201.1306 | 1264.6479 | 1.2818 | 17.7588 | 56.31 | 7.039 | 692.1948 |
| 10000 | 0.8081 | 184.5633 | 1112.0341 | 1.2357 | 17.6966 | 56.508 | 7.064 | 578.3190 |
| 11000 | 0.8889 | 171.4912 | 1108.7455 | 1.1873 | 17.8206 | 56.115 | 7.014 | 545.0269 |
| 12000 | 0.9697 | 156.4515 | 982.5362 | 1.1465 | 17.741 | 56.367 | 7.046 | 586.4849 |
| 12375 | 1.0 | 154.3638 | 955.2133 | 1.1337 | 17.7532 | 56.328 | 7.041 | 598.8307 |

### Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.20.0
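
### Distillation objective (illustrative sketch)

The `distillation_objective` above combines a KL-divergence loss on the logits (weight 1) with a Jensen-Shannon divergence loss on the attention maps (weight 2.0); the hidden-state component has weight 0 and is effectively disabled. The following is a minimal PyTorch sketch of such an objective, not Distily's actual implementation. It assumes both models are run with `output_attentions=True` and that student and teacher layers are paired one-to-one (consistent with `layer_mapper=None`).

```python
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits, teacher_logits):
    """KL(teacher || student) over the vocabulary dimension."""
    s = F.log_softmax(student_logits, dim=-1)
    t = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")

def jsd_attn_loss(student_attn, teacher_attn, eps=1e-12):
    """Jensen-Shannon divergence between attention distributions."""
    p = teacher_attn.clamp_min(eps)
    q = student_attn.clamp_min(eps)
    m = 0.5 * (p + q)
    return 0.5 * (F.kl_div(m.log(), p, reduction="batchmean")
                  + F.kl_div(m.log(), q, reduction="batchmean"))

def distillation_loss(student_out, teacher_out,
                      logits_weight=1.0, attn_weight=2.0):
    """Weighted sum of the logits and attention loss components."""
    loss = logits_weight * kl_logits_loss(student_out.logits, teacher_out.logits)
    attn_losses = [jsd_attn_loss(s_a, t_a)
                   for s_a, t_a in zip(student_out.attentions, teacher_out.attentions)]
    return loss + attn_weight * torch.stack(attn_losses).mean()
```

In a training loop, the teacher forward pass would run under `torch.no_grad()`, so the combined loss backpropagates only through the student.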
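
### Perplexity evaluation (illustrative sketch)

The `enwikippl`, `frwikippl`, and `zhwikippl` columns report perplexity on English, French, and Chinese Wikipedia text. The exact evaluation pipeline is Distily's and is not reproduced here; the sketch below only illustrates how a perplexity figure of this kind can be derived from a causal LM's mean cross-entropy loss (the sample text is an assumption).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, texts, device="cpu", max_length=1024):
    """Perplexity of a causal LM: exp of the mean token-level loss over texts."""
    model.to(device).eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length).to(device)
            out = model(**enc, labels=enc["input_ids"])
            losses.append(out.loss)
    return torch.exp(torch.stack(losses).mean()).item()

# Example with the teacher checkpoint; the card's numbers come from Wikipedia subsets.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(perplexity(model, tokenizer, ["The quick brown fox jumps over the lazy dog."]))
```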
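
## How to load the student (illustrative sketch)

The student keeps GPT-2's architecture and tokenizer, so it loads like any other causal LM with 🤗 Transformers. The repository id below is an assumption based on this card's model name; substitute the actual Hub path of the uploaded checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "distily_bench_gpt2_attn"  # assumed repo id; adjust to the actual Hub path
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # the student reuses the gpt2 tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Knowledge distillation compresses a teacher model into"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```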