---
library_name: transformers
tags:
- unsloth
- llama3
- indonesia
license: llama3
datasets:
- catinthebag/TumpengQA
language:
- id
---
<center>
<img src="https://imgur.com/9nG5J1T.png" alt="Kancil" width="600" height="300">
<p><em>Kancil is a fine-tuned version of Llama 3 8B using synthetic QA dataset generated with Llama 3 70B.</em></p>
</center>
### Introducing the Kancil family of open models
Selamat datang! (Welcome!)
I'm super stoked to announce... the 🦌 Kancil! It's a fine-tuned version of Llama 3 8B trained on TumpengQA, an instruction dataset of 28 million words. Both the model and the dataset are openly available on Hugging Face.
📚 The dataset was synthetically generated with Llama 3 70B. A big problem with existing Indonesian instruction datasets is that they are often poorly translated versions of English datasets. Llama 3 70B can generate fluent Indonesian! (with minor caveats 😔)
🦚 This work was heavily inspired by last year's Merak-7B, a collection of open, fine-tuned Indonesian models. However, Kancil leverages synthetic data in a creative way, which sets it apart from Merak!
### Version 0.0
This is the very first working prototype, Kancil V0. It supports basic question answering only; you cannot chat with it yet.
This model was fine-tuned with QLoRA using the amazing Unsloth framework! It was built on top of [unsloth/llama-3-8b-bnb-4bit](https://huggingface.co/unsloth/llama-3-8b-bnb-4bit), and the LoRA adapter was subsequently merged back into the 4-bit base model (with no visible quality difference compared to merging back to fp16).
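For context, a typical Unsloth QLoRA setup looks like the sketch below. This is a minimal illustration only: the rank, alpha, and target modules shown are common Unsloth defaults, not Kancil's published training hyperparameters.
```
# Sketch of a QLoRA setup with Unsloth (illustrative hyperparameters only).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,           # LoRA rank (assumed, not Kancil's published value)
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = True,
    random_state = 3407,
)
```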
## Uses
### Direct Use
This model is developed for research purposes, aimed at researchers and general AI hobbyists. However, it has one big application: you can have lots of fun with it!
### Out-of-Scope Use
This is a minimally-functional research preview model with no safety curation. Do not use this model for commercial or practical applications.
You are also not allowed to use this model without having fun.
## Getting started
As mentioned, this model was trained with Unsloth. Please use the Unsloth code below for the best experience.
```
# Install dependencies
%%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.26" trl peft accelerate bitsandbytes
```
```
# Load the model
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "catinthebag/Kancil-V0-llama3",
    max_seq_length = 2048, # adjust to your needs, up to the model's context length
    dtype = torch.bfloat16, # will fall back to float16 if bf16 is unavailable
    load_in_4bit = True,
)
```
```
# This model was trained on this specific prompt template. Changing it may degrade performance.
prompt_template = """User: {prompt}
Asisten: {response}"""
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    """Render batched prompt/response pairs into the training template."""
    inputs = examples["prompt"]
    outputs = examples["response"]
    texts = []
    for prompt, response in zip(inputs, outputs):
        # Append EOS so the model learns where a response ends.
        texts.append(prompt_template.format(prompt=prompt, response=response) + EOS_TOKEN)
    return {"text": texts}
```
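To render the training data with this template, the function can be mapped over the dataset in the standard 🤗 Datasets way. A minimal sketch, assuming TumpengQA exposes `prompt` and `response` columns in a `train` split:
```
from datasets import load_dataset

# Load TumpengQA and format every example into the prompt template above.
dataset = load_dataset("catinthebag/TumpengQA", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True)
```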
```
# Start generating!
FastLanguageModel.for_inference(model) # enable Unsloth's fast inference mode
inputs = tokenizer(
    [
        prompt_template.format(
            prompt="Apa itu generative AI?", # "What is generative AI?"
            response="",
        )
    ],
    return_tensors = "pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128, temperature = 0.8, use_cache = True)
print(tokenizer.batch_decode(outputs)[0])
```
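The decoded string includes special tokens (e.g., the BOS marker) and echoes the prompt. Passing `skip_special_tokens=True` to `batch_decode` (a standard 🤗 Transformers option) strips those markers if you want cleaner output:
```
# Decode without special tokens; note the prompt is still echoed at the start.
text = tokenizer.batch_decode(outputs, skip_special_tokens = True)[0]
print(text)
```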
**Note:** There was an issue with the dataset where newline characters were stored as the literal string `\n`, so they may appear verbatim in the model's output. Sorry about that!
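As a workaround, you can unescape those literals in the decoded text before printing. A small post-processing sketch (assuming the artifact appears as the two-character sequence `\n`):
```
# Replace literal backslash-n sequences from the dataset artifact with real newlines.
print(text.replace("\\n", "\n"))
```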
## Acknowledgments
- **Developed by:** Afrizal Hasbi Azizy
- **Funded by:** DF Labs (dflabs.id)
- **License:** Llama 3 Community License Agreement