Quantization made by Richard Erkhov.

distilgpt2-base-pretrained-he - GGUF

Model creator: https://huggingface.co/Norod78/
Original model: https://huggingface.co/Norod78/distilgpt2-base-pretrained-he/

Name	Quant method	Size
distilgpt2-base-pretrained-he.Q2_K.gguf	Q2_K	0.06GB
distilgpt2-base-pretrained-he.IQ3_XS.gguf	IQ3_XS	0.07GB
distilgpt2-base-pretrained-he.IQ3_S.gguf	IQ3_S	0.07GB
distilgpt2-base-pretrained-he.Q3_K_S.gguf	Q3_K_S	0.07GB
distilgpt2-base-pretrained-he.IQ3_M.gguf	IQ3_M	0.07GB
distilgpt2-base-pretrained-he.Q3_K.gguf	Q3_K	0.07GB
distilgpt2-base-pretrained-he.Q3_K_M.gguf	Q3_K_M	0.07GB
distilgpt2-base-pretrained-he.Q3_K_L.gguf	Q3_K_L	0.07GB
distilgpt2-base-pretrained-he.IQ4_XS.gguf	IQ4_XS	0.07GB
distilgpt2-base-pretrained-he.Q4_0.gguf	Q4_0	0.08GB
distilgpt2-base-pretrained-he.IQ4_NL.gguf	IQ4_NL	0.08GB
distilgpt2-base-pretrained-he.Q4_K_S.gguf	Q4_K_S	0.08GB
distilgpt2-base-pretrained-he.Q4_K.gguf	Q4_K	0.08GB
distilgpt2-base-pretrained-he.Q4_K_M.gguf	Q4_K_M	0.08GB
distilgpt2-base-pretrained-he.Q4_1.gguf	Q4_1	0.08GB
distilgpt2-base-pretrained-he.Q5_0.gguf	Q5_0	0.09GB
distilgpt2-base-pretrained-he.Q5_K_S.gguf	Q5_K_S	0.09GB
distilgpt2-base-pretrained-he.Q5_K.gguf	Q5_K	0.09GB
distilgpt2-base-pretrained-he.Q5_K_M.gguf	Q5_K_M	0.09GB
distilgpt2-base-pretrained-he.Q5_1.gguf	Q5_1	0.09GB
distilgpt2-base-pretrained-he.Q6_K.gguf	Q6_K	0.1GB

Original model description:

language: he

thumbnail: https://avatars1.githubusercontent.com/u/3617152?norod.jpg widget:

text: "האיש האחרון עלי אדמות ישב לבד בחדרו כשלפתע נשמעה נקישה"
text: "שלום, קרואים לי"
text: "הארי פוטר חייך חיוך נבוך"
text: "החתול שלך מאוד חמוד ו"

license: mit

distilgpt2-base-pretrained-he

A tiny GPT2 based Hebrew text generation model initially trained on a TPUv3-8 which was made avilable to me via the TPU Research Cloud Program. Then was further fine-tuned on GPU.

Dataset

oscar (unshuffled deduplicated he) - Homepage | Dataset Permalink

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

CC-100 (he) - HomePage

This corpus comprises of monolingual data for 100+ languages and also includes data for romanized languages. This was constructed using the urls and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots. Each file comprises of documents separated by double-newlines and paragraphs within the same document separated by a newline. The data is generated using the open source CC-Net repository.

Misc

Hebrew Twitter
Wikipedia
Various other sources

Training

Done on a TPUv3-8 VM using Huggingface's clm-flax example script
I have made a list of items which might make it easier for other to use this script. The list was posted to This discussion forum
Further training was performed on GPU

Usage

Simple usage sample code


from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

def main():
    model_name="Norod78/distilgpt2-base-pretrained-he"

    prompt_text = "שלום, קוראים לי"
    generated_max_length = 192

    print("Loading model...")
    model =  AutoModelForCausalLM.from_pretrained(model_name)
    print('Loading Tokenizer...')
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    text_generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

    print("Generating text...")
    result = text_generator(prompt_text, num_return_sequences=1, batch_size=1, do_sample=True, top_k=40, top_p=0.92, temperature = 1, repetition_penalty=5.0, max_length = generated_max_length)

    print("result = " + str(result))

if __name__ == '__main__':
    main()