---
license: agpl-3.0
---

This repository contains the unquantized merge of [limarp-llongma2-8k lora](https://huggingface.co/lemonilia/limarp-llongma2-8k) in GGUF format. You can quantize the f16 GGUF to the quantization of your choice by following the steps below:

1. Download and extract the latest [llama.cpp binaries](https://github.com/ggerganov/llama.cpp/releases/download/master-cf658ad/llama-master-cf658ad-bin-win-avx2-x64.zip) ([or compile them yourself if you're on Linux](https://github.com/ggerganov/llama.cpp#build))
2. Move the `quantize` executable to the same folder where you downloaded the f16 GGUF model.
3. Open a command prompt window in that same folder and write the following command, making any changes that you see fit:

```bash
quantize.exe limarp-llongma2-13b.f16.gguf limarp-llongma2-13b.q4_0.gguf q4_0
```

4. Press Enter to run the command; the quantized model will be generated in the folder.

The following are the contents of the original model card:

# Model Card for LimaRP-LLongMA2-8k-v2

LimaRP-LLongMA2-8k is an experimental [Llama2](https://huggingface.co/meta-llama) finetune narrowly focused on novel-style roleplay chatting, and a continuation of the previously released [LimaRP-llama2](https://huggingface.co/lemonilia/limarp-llama2) with a larger number of training tokens (+95%). To considerably facilitate uploading, distribution and merging with other models, LoRA adapters are provided. LimaRP-LLongMA2 LoRA adapters, as their name suggests, are intended to be applied to LLongMA-2 models with 8k context ([7B](https://huggingface.co/conceptofmind/LLongMA-2-7b) and [13B](https://huggingface.co/conceptofmind/LLongMA-2-13b)) and their derivatives. Data updates may be posted in the future. The current version is **v3**.

## Model Details

### Model Description

This is an experimental attempt at creating an RP-oriented fine-tune using a manually curated, high-quality dataset of human-generated conversations.
The main rationale for this is the observations of [Zhou et al.](https://arxiv.org/abs/2305.11206), who suggested that just 1000-2000 carefully curated training examples may yield high-quality output for assistant-type chatbots. This is in contrast with the commonly employed strategy where a very large number of training examples (tens of thousands to even millions) of widely varying quality are used. For LimaRP a similar approach was used, with the difference that the conversational data is almost entirely human-generated. Every training example is manually compiled and selected to comply with subjective quality parameters, with virtually no chance for OpenAI-style alignment responses to come up.

## Uses

The model is intended to approximate the experience of 1-on-1 roleplay as observed on many Internet forums dedicated to roleplaying. It _must_ be used with a specific format similar to that of this template:

```
<>
Character's Persona: {bot character description}
User's Persona: {user character description}
Scenario: {what happens in the story}
Play the role of Character. You must engage in a roleplaying chat with User below this line. Do not write dialogues and narration for User. Character should respond with messages of medium length.
<>
Character: {utterance}
<>
User: {utterance}
[etc.]
```

With `<>`, `<>` and `<>` being special instruct-mode sequences. The text within curly braces must be replaced with appropriate text in _natural language_. Replace `User` and `Character` with actual character names.

This more graphical breakdown of the prompt format with a practical example might make it clearer:

![graphical explanation](https://files.catbox.moe/fq8ner.png)

### More detailed notes on prompt format, usage and other settings

- **The model has been tested mainly using Oobabooga's `text-generation-webui` as a backend.**
- Preferably respect the spacing and newlines shown above. This might not be possible yet with some frontends.
- Replace `Character` and `User` in the above template with your desired names.
- The scenario description has a large influence on what the character will do. Try to keep it more open-ended to lessen its impact.
- **The model expects users and characters to use third-person narration in simple past and to enclose dialogues in standard quotation marks `" "`.** Other formats are not supported (= not in the training data).
- Do not use newlines in Persona and Scenario. Use natural language.
- The last line in `<>` does not need to be written exactly as depicted, but it should mention that `Character` and `User` will engage in roleplay and specify the length of `Character`'s messages.
- The message lengths used during training are: `tiny`, `short`, `average`, `long`, `huge`, `humongous`. However, there might not have been enough training examples for each length for this instruction to have a significant impact. The preferred lengths for this type of roleplay are `average` or `long`.
- Suggested text generation settings:
  - Temperature ~0.70
  - Tail-Free Sampling 0.85
  - Repetition penalty ~1.10 (compared to LLaMAv1, Llama2 appears to require a somewhat higher rep. pen.)
  - Not used: Top-P (disabled/set to 1.0), Top-K (disabled/set to 0), Typical P (disabled/set to 1.0)

### Sample character cards

Here are a few example **SillyTavern character cards** following the required format; download and import them into SillyTavern. Feel free to modify and adapt them to your purposes.
- [Carina, a 'big sister' android maid](https://files.catbox.moe/1qcqqj.png)
- [Charlotte, a cute android maid](https://files.catbox.moe/k1x9a7.png)
- [Etma, an 'aligned' AI assistant](https://files.catbox.moe/dj8ggi.png)
- [Mila, an anthro pet catgirl](https://files.catbox.moe/amnsew.png)
- [Samuel, a handsome vampire](https://files.catbox.moe/f9uiw1.png)

And here is a sample of how the model is intended to behave with proper chat and prompt formatting: https://files.catbox.moe/egfd90.png

### Other tips

It's possible to make the model automatically generate random character information and a scenario by adding just `<>` and the character name in text completion mode in `text-generation-webui`, as done here (click to enlarge). The format generally closely matches that of the training data:

![example](https://files.catbox.moe/5ntmcj.png)

### Out-of-Scope Use

The model has not been tested for:

- IRC-style chat
- Markdown-style roleplay (asterisks for actions, dialogue lines without quotation marks)
- Storywriting
- Usage without the suggested prompt format

Furthermore, the model is not intended nor expected to provide factual and accurate information on any subject.

## Bias, Risks, and Limitations

The model will show biases similar to those observed in niche roleplaying forums on the Internet, in addition to those exhibited by the base model.

### Recommendations

The model may easily output disturbing and socially inappropriate content and therefore should not be used by minors or within environments where a general audience is expected. Its outputs will in general have a strong NSFW bias unless the character card/description de-emphasizes it.

## How to Get Started with the Model

Download and load with `text-generation-webui` as a back-end application. It's suggested to start the `webui` via the command line.
Assuming you have copied the LoRA files under a subdirectory called `lora/limarp-llongma2-7b`, you would use something like this for the 7B model:

```
python server.py --api --verbose --model LLongMA-7B --lora limarp-llongma2-7b
```

When using 4-bit `bitsandbytes`, it is suggested to use double quantization to increase accuracy. The starting command may be something like this:

```
python server.py --verbose --api --model LLongMA-2-13B --lora limarp13-llongma2-13b --load-in-4bit --use_double_quant
```

Then, preferably use [SillyTavern](https://github.com/SillyTavern/SillyTavern) as a front-end with the following settings:

![SillyTavern settings](https://files.catbox.moe/nd8v12.png)

In addition to enabling instruct mode with the correct sequences, it's particularly important to **enable "Include names"**, as the model was trained with them at the start of each utterance. If it's disabled, the model can start getting confused and often writes for the user in its responses.

To take advantage of this model's larger context length, unlock the context size and set it to any length up to 8192 tokens, depending on your VRAM constraints. On most consumer GPUs this will likely need to be set to a lower value.

![Unlock context size](https://files.catbox.moe/wfj8vv.png)

It is **recommended to ban/disable the EOS token**, as it can apparently give [artifacts or tokenization issues](https://files.catbox.moe/cxfrzu.png) when it ends up getting generated close to punctuation or quotation marks, at least in SillyTavern. These would typically happen with AI responses.

![Ban EOS](https://files.catbox.moe/xslnhb.png)

## Training Details

### Training Data

The training data comprises about **1500** manually edited roleplaying conversation threads from various Internet RP forums, for about **24 megabytes** of data. Character and Scenario information was initially filled in for every thread with the help of mainly `gpt-4`.
Later on, this has been accomplished with a custom summarizer. Conversations in the dataset are almost entirely human-generated, except for a handful of messages. Character names in the RP stories have been isolated and replaced with standard placeholder strings. Usernames, out-of-character (OOC) messages and personal information have not been intentionally included.

### Training Procedure

The version of LimaRP uploaded to this repository was trained on a small NVIDIA A40 cluster in 8-bit, with regular LoRA adapters and an 8-bit AdamW optimizer.

#### Training Hyperparameters

The most important settings were as follows:

- `--learning_rate 0.000065`
- `--lr_scheduler_type cosine`
- `--lora_r 8`
- `--lora_alpha 16`
- `--lora_dropout 0.01`
- `--num_train_epochs 2`
- `--bf16 True`
- `--tf32 True`
- `--bits 8`
- `--per_device_train_batch_size 1`
- `--gradient_accumulation_steps 1`
- `--optim adamw_bnb_8bit`

**All linear LoRA layers** were targeted. An effective batch size of 1 was found to yield the lowest loss curves during fine-tuning. It was also found that using `--train_on_source False` with the entire training example at the output yields similar results. These LoRAs have been trained in this way (similar to what was done with [Guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco), or as with unsupervised finetuning).

## Environmental Impact

Finetuning this model on 8 NVIDIA A40 48GB GPUs in parallel takes about 25 minutes (7B) or 45 minutes (13B).
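As a closing illustration, the prompt format described in the Uses section can be assembled programmatically, which is convenient when driving the model through `text-generation-webui`'s API. The sketch below is a hypothetical helper, not part of this repository: the persona and scenario strings are made-up examples (the character name is borrowed from one of the sample cards), and `<>` is left as a stand-in for the special instruct-mode sequence shown in the template above.

```python
# Sketch: build the system portion of a LimaRP-style prompt.
# "<>" is a placeholder for the instruct-mode sequence from the template;
# all persona/scenario strings below are illustrative only.

def build_prompt(char: str, user: str, char_persona: str,
                 user_persona: str, scenario: str,
                 length: str = "average") -> str:
    """Return the system block, following the template from the Uses section."""
    return "\n".join([
        "<>",
        f"{char}'s Persona: {char_persona}",
        f"{user}'s Persona: {user_persona}",
        f"Scenario: {scenario}",
        f"Play the role of {char}. You must engage in a roleplaying chat "
        f"with {user} below this line. Do not write dialogues and narration "
        f"for {user}. {char} should respond with messages of {length} length.",
    ])

prompt = build_prompt(
    char="Charlotte", user="User",
    char_persona="A cute android maid.",
    user_persona="An ordinary person.",
    scenario="Charlotte serves tea in the mansion.",
)
```

Note that, per the notes above, the persona and scenario lines must each stay on a single line and use natural language; the final line only needs to mention the roleplay and the desired message length, not match the template verbatim.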