
🔥 RoBERTa-MLM-based PyTorrent 1M 🔥

Pretrained weights based on the PyTorrent dataset, a curated corpus drawn from large, official Python packages. We use the PyTorrent dataset to train a preliminary DistilBERT Masked Language Modeling (MLM) model from scratch. The trained model, along with the dataset, aims to help researchers work easily and efficiently with a large corpus of Python packages, loading the transformer-based model in only 5 lines of code. We train the model on 1M raw Python scripts from PyTorrent, comprising 12,350,000 lines of code (LOC). We also train a byte-level Byte-Pair Encoding (BPE) tokenizer with a vocabulary of 56,000 tokens; each LOC is truncated to a length of 50 to save computational resources.
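
As a minimal sketch of working with the trained tokenizer (assuming it is published together with the model under Fujitsu/pytorrent, as in the loading example further below), a single line of Python code can be tokenized with the same truncation length used during training:

from transformers import AutoTokenizer

# Load the byte-level BPE tokenizer hosted with the model weights.
tokenizer = AutoTokenizer.from_pretrained("Fujitsu/pytorrent")

# Tokenize one line of Python code; max_length=50 mirrors the truncation
# length described above.
encoded = tokenizer("def add(a, b): return a + b", truncation=True, max_length=50)
print(encoded["input_ids"])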

Training Objective

This model is trained with a Masked Language Model (MLM) objective.
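
As an illustrative (not official) example of the MLM objective, the model can be queried through the Transformers fill-mask pipeline; this assumes the hosted checkpoint includes its MLM head and RoBERTa-style special tokens such as <mask>:

from transformers import pipeline

# Fill-mask pipeline over the pretrained checkpoint (assumes the MLM head
# is available in the published weights).
fill_mask = pipeline("fill-mask", model="Fujitsu/pytorrent")

# Ask the model to complete a masked Python token.
for prediction in fill_mask("import numpy as <mask>"):
    print(prediction["token_str"], prediction["score"])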

How to use the model?

from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and pretrained encoder weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("Fujitsu/pytorrent")
model = AutoModel.from_pretrained("Fujitsu/pytorrent")
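
Once loaded, the encoder can produce contextual embeddings for Python code. A minimal sketch, where the input snippet and printed shape are illustrative:

import torch

# Encode a code snippet and run it through the encoder.
inputs = tokenizer("import numpy as np", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)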

Citation

Preprint: https://arxiv.org/pdf/2110.01710.pdf

@misc{bahrami2021pytorrent,
      title={PyTorrent: A Python Library Corpus for Large-scale Language Models}, 
      author={Mehdi Bahrami and N. C. Shrikanth and Shade Ruangwan and Lei Liu and Yuji Mizobuchi and Masahiro Fukuyori and Wei-Peng Chen and Kazuki Munakata and Tim Menzies},
      year={2021},
      eprint={2110.01710},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      howpublished={https://arxiv.org/pdf/2110.01710},
}