Edit model card

CodeParrot-Multi 🦜 (small)

CodeParrot-Multi 🦜 is a GPT-2 model (110M parameters) trained to generate code in 9 programming languages: "Java", "JavaScript", "PHP", "Python", "C#", "C++", "GO", "Ruby" and "TypeScript".

Usage

You can load the CodeParrot-Multi model and tokenizer directly in transformers:

from transformers import AutoTokenizer, AutoModelWithLMHead
  
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small-multi")
model = AutoModelWithLMHead.from_pretrained("codeparrot/codeparrot-small-multi")

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model(**inputs)

or with a pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model="codeparrot/codeparrot-small-multi")
outputs = pipe("def hello_world():")

Training

The model was trained on the small Github code small after near deduplication, a subset of Github code dataset with the following settings:

Config Value
Batch size 192
Context size 1024
Training steps 300'000
Gradient accumulation 2
Gradient checkpointing False
Learning rate 5e-4
Weight decay 0.1
Warmup steps 2000
Schedule Cosine

The training was executed on 16 x A100 (40GB) GPUs. This setting amounts to roughly 58 billion tokens.

Performance

We evaluated the model on OpenAI's HumanEval benchmark which consists of programming challenges:

Metric Value
pass@1 --%
pass@10 --%
pass@100 --%

The pass@k metric tells the probability that at least one out of k generations passes the tests.

Resources

Downloads last month
960
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Datasets used to train codeparrot/codeparrot-small-multi