
Quantization made by Richard Erkhov.

distilgpt2-base-pretrained-he - GGUF

Original model description:

language: he

thumbnail: https://avatars1.githubusercontent.com/u/3617152?norod.jpg

widget:

  • text: "האיש האחרון עלי אדמות ישב לבד בחדרו כשלפתע נשמעה נקישה"
  • text: "שלום, קרואים לי"
  • text: "הארי פוטר חייך חיוך נבוך"
  • text: "החתול שלך מאוד חמוד ו"

license: mit

distilgpt2-base-pretrained-he

A tiny GPT-2-based Hebrew text generation model. It was initially trained on a TPUv3-8, which was made available to me via the TPU Research Cloud Program, and was then further fine-tuned on a GPU.

Dataset

oscar (unshuffled deduplicated he) - Homepage | Dataset Permalink

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

CC-100 (he) - HomePage

This corpus comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing the January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository.
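The file layout described above (documents separated by a blank line, paragraphs by a single newline) can be read with a few lines of Python. This is only a minimal sketch: the function name `iter_documents` and the sample string are illustrative assumptions, not part of the CC-100 tooling or of this model's training code.

```python
from typing import Iterator, List


def iter_documents(raw_text: str) -> Iterator[List[str]]:
    """Yield each document as a list of its non-empty paragraphs."""
    # Documents are separated by a blank line (two consecutive newlines).
    for block in raw_text.split("\n\n"):
        # Paragraphs inside one document are separated by a single newline.
        paragraphs = [line.strip() for line in block.split("\n") if line.strip()]
        if paragraphs:
            yield paragraphs


# Hypothetical sample in the described layout: two documents, three paragraphs.
sample = (
    "First paragraph of document one.\n"
    "Second paragraph of document one.\n"
    "\n"
    "Only paragraph of document two.\n"
)
documents = list(iter_documents(sample))
print(len(documents), "documents")
```

For a real CC-100 file, which can be very large, reading line by line and starting a new document at each empty line would avoid loading the whole file into memory.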

Misc

  • Hebrew Twitter
  • Wikipedia
  • Various other sources

Training

Usage

Simple usage sample code


from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


def main():
    model_name = "Norod78/distilgpt2-base-pretrained-he"

    # Hebrew prompt, roughly "Hello, my name is"
    prompt_text = "שלום, קוראים לי"
    generated_max_length = 192  # maximum length in tokens, prompt included

    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(model_name)
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    text_generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

    print("Generating text...")
    # Sample a single continuation using top-k and nucleus (top-p) sampling.
    result = text_generator(
        prompt_text,
        num_return_sequences=1,
        batch_size=1,
        do_sample=True,
        top_k=40,
        top_p=0.92,
        temperature=1,
        repetition_penalty=5.0,
        max_length=generated_max_length,
    )

    print("result = " + str(result))


if __name__ == "__main__":
    main()
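
The sample above passes `top_k=40` and `top_p=0.92` to the generator. As a rough illustration of what these two settings do to the next-token distribution, here is a simplified pure-Python sketch. It is not the transformers implementation; the function name `filter_top_k_top_p` and the toy probabilities are assumptions made for the example.

```python
from typing import Dict


def filter_top_k_top_p(probs: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix whose
    renormalized probability mass reaches top_p, and renormalize the result."""
    # Step 1 (top-k): keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)

    # Step 2 (top-p): walk down the ranking until the cumulative mass reaches top_p.
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1 before sampling.
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}


# Toy distribution over four hypothetical tokens.
toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = filter_top_k_top_p(toy_probs, top_k=3, top_p=0.8)
print(filtered)
```

With a larger `top_k` or a `top_p` closer to 1, more low-probability tokens survive the filter, so the generated text becomes more varied.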
GGUF
Model size: 121M params
Architecture: gpt2
Quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit
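
The 121M parameter count and the 2-bit to 6-bit variants listed above allow a back-of-the-envelope estimate of the weight storage each variant needs. The sketch below assumes that every parameter is stored at exactly the nominal bit width. Real GGUF files are typically larger, because quantized formats also store per-block scaling data and file metadata, and some tensors may be kept at higher precision. The function name is hypothetical.

```python
PARAM_COUNT = 121_000_000  # "121M params" from the model listing


def estimated_weight_megabytes(param_count: int, bits_per_weight: int) -> float:
    """Lower-bound estimate: parameters x bits, converted to megabytes (10^6 bytes)."""
    return param_count * bits_per_weight / 8 / 1_000_000


for bits in (2, 3, 4, 5, 6):
    size_mb = estimated_weight_megabytes(PARAM_COUNT, bits)
    print(f"{bits}-bit: at least ~{size_mb:.1f} MB of weights")
```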
