Model Description

BioDistilBERT-cased is the result of training the DistilBERT-cased model in a continual learning fashion for 200k training steps using a total batch size of 192 on the PubMed dataset.

Initialisation

We initialise our model with the pre-trained checkpoints of the DistilBERT-cased model available on the Huggingface.

Architecture

In this model, the size of the hidden dimension and the embedding layer are both set to 768. The vocabulary size is 28996. The number of transformer layers is 6 and the expansion rate of the feed-forward layer is 4. Overall this model has around 65 million parameters.

Citation

If you use this model, please consider citing the following paper:

@misc{https://doi.org/10.48550/arxiv.2209.03182,
  doi = {10.48550/ARXIV.2209.03182},
  url = {https://arxiv.org/abs/2209.03182},
  author = {Rohanian, Omid and Nouriborji, Mohammadmahdi and Kouchaki, Samaneh and Clifton, David A.},
  keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences, 68T50},
  title = {On the Effectiveness of Compact Biomedical Transformers},
  publisher = {arXiv},
  year = {2022}, 
  copyright = {arXiv.org perpetual, non-exclusive license}
}