harithapliyal/llama-3-8b-bnb-4bit-finetuned-SentAnalysis

---
base_model: unsloth/llama-3-8b-bnb-4bit
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
---

# Uploaded model

- Developed by: harithapliyal
- License: apache-2.0
- Finetuned from model: unsloth/llama-3-8b-bnb-4bit

This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

<!-- from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

!pip uninstall -y xformers
!pip install xformers

!python -m xformers.info

!pip install triton -->

```
# Colab only: read the Hugging Face access token stored as a notebook secret
from google.colab import userdata
HF_KEY = userdata.get('HF_KEY')

from unsloth import FastLanguageModel
import torch

# Load model directly
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure the quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit precision
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,  # run matrix multiplications in fp16
)
```
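As a rough sanity check on what the 4-bit setting buys, the sketch below estimates weight storage from the parameter count alone. The figure of roughly 8 billion parameters is read off the model name, the helper name `approx_weight_gib` is hypothetical, and the estimate ignores quantization constants, activations, and the KV cache, so it is closer to a lower bound than a measurement.

```python
def approx_weight_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB, ignoring every kind of overhead."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


# Assumption: about 8 billion parameters, taken from the "8b" in the model name.
NUM_PARAMS = 8e9

fp16_gib = approx_weight_gib(NUM_PARAMS, 16)  # half-precision weights
nf4_gib = approx_weight_gib(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the fp16 footprint, which is what lets an 8B model fit on a single Colab GPU.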

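```
from transformers import AutoTokenizer

# Load the model with quantization
model1 = AutoModelForCausalLM.from_pretrained(
    "harithapliyal/llama-3-8b-bnb-4bit-finetuned-SentAnalysis",
    quantization_config=bnb_config,
)
# ASSUMPTION: the original snippet used `tokenizer` without defining it;
# loading it from the same repository is a guess.
tokenizer = AutoTokenizer.from_pretrained(
    "harithapliyal/llama-3-8b-bnb-4bit-finetuned-SentAnalysis"
)

# ASSUMPTION: `fine_tuned_prompt` is not defined in this card. This Alpaca-style
# template is a placeholder; use the exact template from fine-tuning.
fine_tuned_prompt = "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"

FastLanguageModel.for_inference(model1)  # Enable native 2x faster inference
inputs = tokenizer(
    [
        fine_tuned_prompt.format(
            "Classify the sentiment of the following text.",  # instruction
            "I like play yoga under the rain",  # input
            "",  # output: leave this blank for generation
        )
    ],
    return_tensors="pt",
).to("cuda")

outputs = model1.generate(**inputs, max_new_tokens=64, use_cache=True)
outputs = tokenizer.decode(outputs[0])
print(outputs)
```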
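The decoded string contains the whole prompt followed by the generated continuation, so the predicted label still has to be separated out. A minimal sketch of that step follows, assuming the prompt template ends with an Alpaca-style `### Response:` marker and that the tokenizer emits `<|end_of_text|>` as its end-of-sequence token. The card does not show the actual template, so the marker, the helper name `extract_response`, and the sample string are all hypothetical; the sample is hand-written, not real model output.

```python
def extract_response(decoded: str,
                     marker: str = "### Response:",
                     eos_token: str = "<|end_of_text|>") -> str:
    """Return the text after the last `marker`, without the EOS token."""
    _, found, tail = decoded.rpartition(marker)
    # If the marker is missing, fall back to the full decoded string.
    text = tail if found else decoded
    return text.replace(eos_token, "").strip()


# Hand-written stand-in for a decoded generation (not real model output).
sample = (
    "### Instruction:\nClassify the sentiment of the following text.\n\n"
    "### Input:\nI like play yoga under the rain\n\n"
    "### Response:\npositive<|end_of_text|>"
)

label = extract_response(sample)
print(label)
```

If the fine-tuning template used a different response marker, pass it through the `marker` argument rather than editing the function.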