CCRss committed ed04990 (1 parent: fd8ec75)

Create README.md

Files changed (1): README.md (+75, −0)
# Simple QLoRA Model Inference

This guide demonstrates how to perform inference with a QLoRA (Quantized Low-Rank Adaptation) fine-tuned model using a single code cell.

## Requirements

- Python 3.8+
- PyTorch
- Transformers
- PEFT (Parameter-Efficient Fine-Tuning)
- bitsandbytes
- Accelerate (needed for `device_map="auto"` and 4-bit loading)

Install the required packages:

```
pip install torch transformers peft bitsandbytes accelerate
```
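Optionally, a quick sanity check (a minimal sketch, assuming a CUDA-capable GPU is present) can confirm that the libraries import and a GPU is visible before loading the model:

```python
import torch
import transformers
import peft
import bitsandbytes

# Report library versions; 4-bit bitsandbytes loading expects a CUDA device.
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())
```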
## Inference Code

Copy and paste the following code into a Python script or Jupyter notebook cell:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Set up model paths
BASE_MODEL_PATH = "meta-llama/Meta-Llama-3.1-8B-Instruct"
ADAPTER_PATH = "CCRss/Meta-Llama-3.1-8B-Instruct-qlora-nf-ds_oasst1"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Load quantized model with adapter
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, ADAPTER_PATH)

# Generate text
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)
```
## Usage

1. Replace `BASE_MODEL_PATH` with the path to your base model.
2. Replace `ADAPTER_PATH` with the path to your QLoRA adapter.
3. Set the `prompt` variable to your desired input text.
4. Run the code cell. For running several prompts in a row, see the helper sketch below.
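To generate for several prompts without repeating the tokenize/generate/decode lines, the generation step can be wrapped in a small helper. This is a minimal sketch, assuming `model` and `tokenizer` have already been loaded by the cell above; the `generate_response` name is illustrative and not part of the original code:

```python
def generate_response(prompt: str, max_new_tokens: int = 100, temperature: float = 0.7) -> str:
    """Generate a completion for one prompt with the already-loaded model and tokenizer."""
    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Sample a continuation without tracking gradients.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_response("Explain quantum computing in simple terms:"))
```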
## Customization

- Adjust `max_new_tokens`, `temperature`, and other generation parameters in the `model.generate()` call to control the output; a couple of illustrative variants follow below.
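For example, greedy decoding or nucleus sampling can be selected by changing the sampling arguments. The values here are illustrative rather than recommendations from the original guide, and assume `model` and `inputs` from the cell above:

```python
# Deterministic (greedy) decoding: reproducible output, no sampling.
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Nucleus (top-p) sampling with a mild repetition penalty for more varied text.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
)
```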
## Troubleshooting

- If you encounter CUDA out-of-memory errors, try reducing `max_new_tokens`, using a smaller model, or enabling a more aggressive quantization setting (see the sketch below).
- Ensure your GPU drivers and CUDA toolkit are up to date.
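One further memory-saving option, not used in the original cell, is double quantization in `BitsAndBytesConfig`, which also quantizes the quantization constants for a small additional saving:

```python
import torch
from transformers import BitsAndBytesConfig

# Same 4-bit NF4 setup as above, with double quantization enabled
# to shave extra memory off the quantized weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
```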
For more advanced usage or optimizations, refer to the Hugging Face documentation for Transformers and PEFT.