mohammadmahdinouri committed
Commit ef1c89a
Parent: c254f98

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
```diff
@@ -8,7 +8,7 @@ This model uses a simple distillation technique, which tries to align the output
 Following [DistilBERT](https://huggingface.co/distilbert-base-uncased?text=The+goal+of+life+is+%5BMASK%5D.), we initialise the student model by taking weights from every other layer of the teacher.
 
 # Architecture
-In this model, the size of the hidden dimension and the embedding layer are both set to 768. The vocabulary size is 28996 for the cased version which is the one employed in our experiments. The number of transformer layers is 6 and the expansion rate of the feed-forward layer is 4. Overall this model has around 65 million parameters.
+In this model, the size of the hidden dimension and the embedding layer are both set to 768. The vocabulary size is 28996. The number of transformer layers is 6 and the expansion rate of the feed-forward layer is 4. Overall this model has around 65 million parameters.
 
 # Citation
 If you use this model, please consider citing the following paper:
```
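
For readers of this diff, here is a minimal sketch of the "every other layer" student initialisation the README describes. It assumes a BERT-style teacher loaded via Hugging Face `transformers`; the helper name `init_student_from_teacher` is hypothetical and not part of this repository.

```python
from transformers import BertConfig, BertModel

def init_student_from_teacher(teacher: BertModel, num_student_layers: int = 6) -> BertModel:
    """Build a student with the teacher's widths but fewer layers,
    initialised from every other teacher layer (hypothetical helper)."""
    t_cfg = teacher.config
    s_cfg = BertConfig(
        vocab_size=t_cfg.vocab_size,
        hidden_size=t_cfg.hidden_size,
        num_hidden_layers=num_student_layers,
        num_attention_heads=t_cfg.num_attention_heads,
        intermediate_size=t_cfg.intermediate_size,
    )
    student = BertModel(s_cfg)

    # Embeddings and pooler have matching shapes, so copy them directly.
    student.embeddings.load_state_dict(teacher.embeddings.state_dict())
    student.pooler.load_state_dict(teacher.pooler.state_dict())

    # Copy every other encoder layer of the teacher: layers 0, 2, 4, ...
    # for a 12-layer teacher and a 6-layer student.
    step = t_cfg.num_hidden_layers // num_student_layers
    for s_idx in range(num_student_layers):
        student.encoder.layer[s_idx].load_state_dict(
            teacher.encoder.layer[s_idx * step].state_dict()
        )
    return student

teacher = BertModel.from_pretrained("bert-base-cased")  # 12-layer cased teacher
student = init_student_from_teacher(teacher)            # 6-layer student
```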
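Likewise, the architecture numbers in the changed paragraph can be reproduced with a generic `BertConfig`. This is only a sanity check of the stated sizes, not the repository's actual model code, and it assumes a standard BERT-base attention head count of 12, which the README does not specify.

```python
from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=28996,           # cased vocabulary, as stated in the README
    hidden_size=768,            # hidden and embedding dimension
    num_hidden_layers=6,        # transformer layers
    num_attention_heads=12,     # assumption: BERT-base default
    intermediate_size=4 * 768,  # feed-forward expansion rate of 4 -> 3072
)
model = BertModel(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~65.8M
```

The printed count comes out just under 66M, consistent with the "around 65 million parameters" figure in the changed paragraph.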