omidrohanian committed
Commit c254f98
1 Parent(s): 441b0d3

Update README.md

Files changed (1)
  1. README.md +5 -3
README.md CHANGED
@@ -1,16 +1,18 @@
 # Model Description
- DistilBioBERT is a distilled version of the [BioBERT](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2?text=The+goal+of+life+is+%5BMASK%5D.) which is distilled for 100k training steps using a total batch size of 192 on the PubMed dataset.
+ DistilBioBERT is a distilled version of the [BioBERT](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2?text=The+goal+of+life+is+%5BMASK%5D.) model which is distilled for 100k training steps using a total batch size of 192 on the PubMed dataset.
 
 # Distillation Procedure
- This model uses a simple distillation technique, which tries to align the output distribution of the student model with the output distribution of the teacher model on the MLM objective. In addition, it optionally uses another alignment loss for aligning the last hidden state of the student and teacher.
+ This model uses a simple distillation technique, which tries to align the output distribution of the student model with the output distribution of the teacher based on the MLM objective. In addition, it optionally uses another alignment loss for aligning the last hidden state of the student and teacher.
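
As a rough illustration of the procedure described above (a minimal sketch, not the authors' released training code; the function and argument names, the temperature, the loss weights, and the choice of a cosine loss for the hidden-state alignment are all assumptions), the combined objective could look like this in PyTorch:

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha_kl=1.0, alpha_align=1.0):
    """Sketch: match the student's MLM output distribution to the teacher's,
    with an optional alignment loss on the last hidden states."""
    # KL divergence between the temperature-softened output distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Optional alignment of the last hidden states (cosine embedding loss
    # over all token positions, flattened to (batch * seq_len, hidden)).
    flat_student = student_hidden.flatten(0, 1)
    flat_teacher = teacher_hidden.flatten(0, 1)
    target = torch.ones(flat_student.size(0), device=flat_student.device)
    align = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    return alpha_kl * kl + alpha_align * align
```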
 
 # Initialisation
- Following the [DistilBERT](https://huggingface.co/distilbert-base-uncased?text=The+goal+of+life+is+%5BMASK%5D.), for efficient initialising of the student, we take a subset of the larger model by using the same embedding weights and initialising the student from the teacher by taking weights from every other layer.
+ Following [DistilBERT](https://huggingface.co/distilbert-base-uncased?text=The+goal+of+life+is+%5BMASK%5D.), we initialise the student model by taking weights from every other layer of the teacher.
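
A minimal sketch of this layer-selection initialisation, assuming BERT-style checkpoints from the `transformers` library (which layers are kept, and whether the MLM head is copied as well, are assumptions here):

```python
from transformers import AutoModelForMaskedLM, BertConfig, BertForMaskedLM

# Teacher: the BioBERT checkpoint linked above.
teacher = AutoModelForMaskedLM.from_pretrained("dmis-lab/biobert-base-cased-v1.2")

# Student: same configuration but with 6 transformer layers.
student_config = BertConfig.from_pretrained("dmis-lab/biobert-base-cased-v1.2",
                                            num_hidden_layers=6)
student = BertForMaskedLM(student_config)

# Reuse the teacher's embedding weights for the student.
student.bert.embeddings.load_state_dict(teacher.bert.embeddings.state_dict())

# Initialise the student from every other layer of the teacher
# (here layers 0, 2, 4, ..., 10 of the 12-layer teacher).
for student_idx, teacher_idx in enumerate(range(0, teacher.config.num_hidden_layers, 2)):
    student.bert.encoder.layer[student_idx].load_state_dict(
        teacher.bert.encoder.layer[teacher_idx].state_dict()
    )
```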
 
 # Architecture
  In this model, the size of the hidden dimension and the embedding layer are both set to 768. The vocabulary size is 28996 for the cased version which is the one employed in our experiments. The number of transformer layers is 6 and the expansion rate of the feed-forward layer is 4. Overall this model has around 65 million parameters.
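
The numbers above translate roughly into the following configuration (a sketch using `transformers.BertConfig`; the number of attention heads and other unlisted hyperparameters are assumed to follow BERT-base defaults):

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=28996,           # cased vocabulary
    hidden_size=768,            # hidden and embedding dimension
    num_hidden_layers=6,        # 6 transformer layers
    num_attention_heads=12,     # assumption: BERT-base default
    intermediate_size=4 * 768,  # feed-forward expansion rate of 4
)

model = BertForMaskedLM(config)
# Should land in the region of the ~65M parameters quoted above
# (slightly more once the MLM head is included).
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```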
 
 # Citation
+ If you use this model, please consider citing the following paper:
+
  ```bibtex
  @misc{https://doi.org/10.48550/arxiv.2209.03182,
  doi = {10.48550/ARXIV.2209.03182},