DipeshChaudhary committed on
Commit 1e5e798
1 Parent(s): 88b16e1

Update README.md

Files changed (1)
  1. README.md +21 -3
README.md CHANGED
@@ -38,9 +38,9 @@ base_model: unsloth/llama-3-8b-Instruct-bnb-4bit
      load_in_4bit=load_in_4bit,
  )
  ```
- ```
- #We now use the Llama-3 format for conversation style finetunes. We use Open Assistant conversations in ShareGPT style.
- **We use our get_chat_template function to get the correct chat template. They support zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old and their own optimized unsloth template**
+ # We now use the Llama-3 format for conversation style finetunes. We use Open Assistant conversations in ShareGPT style.
+ **We use our get_chat_template function to get the correct chat template. They support zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old and their own optimized unsloth template**
+ ```
  from unsloth.chat_templates import get_chat_template
  tokenizer = get_chat_template(
      tokenizer,
@@ -48,6 +48,24 @@ base_model: unsloth/llama-3-8b-Instruct-bnb-4bit
      mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
  )
  ```
+ ## FOR ACTUAL INFERENCE
+ ```
+ FastLanguageModel.for_inference(model) # Enable native 2x faster inference
+
+ messages = [
+     {"from": "human", "value": "I'm worried about my exam."},
+ ]
+ inputs = tokenizer.apply_chat_template(
+     messages,
+     tokenize = True,
+     add_generation_prompt = True, # Must add for generation
+     return_tensors = "pt",
+ ).to("cuda")
+
+ from transformers import TextStreamer
+ text_streamer = TextStreamer(tokenizer)
+ x = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)
+ ```
 
  # Uploaded model
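
The snippets added in this commit assume a model and tokenizer that were already loaded with Unsloth earlier in the README. A minimal end-to-end sketch of the same flow is shown below for context; it is illustrative only. The model id comes from this card's `base_model` field (not the fine-tuned repo), and `max_seq_length`, `dtype`, and `chat_template = "llama-3"` are assumed values that the diff does not show.

```
# Sketch only: load the 4-bit base model with Unsloth, apply a llama-3 chat
# template, and stream a reply. Repo id, max_seq_length, dtype and the
# chat_template name are assumptions, not taken from this commit.
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",  # base model from this card
    max_seq_length = 2048,  # assumed value
    dtype = None,           # auto-detect
    load_in_4bit = True,
)

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3",  # assumed; the diff omits the line that sets this
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},  # ShareGPT style
)

FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

messages = [
    {"from": "human", "value": "I'm worried about my exam."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,  # Must add for generation
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer,
                   max_new_tokens = 128, use_cache = True)
```

On a CUDA machine this streams the generated reply token by token via TextStreamer instead of waiting for the full completion.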