Formatting Datasets for Chat Template Compatibility

Community Article Published June 28, 2024

When working with datasets for fine-tuning conversational models, it's essential to ensure that the data is formatted correctly to work seamlessly with any chat template. In this article, we'll explore a Python function that transforms the nroggendorff/mayo dataset from Hugging Face into a compatible format.

The format_prompts Function

Here's a breakdown of the format_prompts function:

def format_prompts(examples):
    """Convert raw tagged conversation strings into chat-template-formatted text."""
    texts = []
    for text in examples['text']:
        conversation = []
        # Split on the end-of-turn tag; the final element is empty and is skipped
        # by the `len(parts) - 1` bound below.
        parts = text.split('<|end|>')
        for i in range(0, len(parts) - 1, 2):
            # Even indices hold user turns, odd indices hold bot turns.
            prompt = parts[i].replace("<|user|>", "")
            response = parts[i + 1].replace("<|bot|>", "")
            conversation.append({"role": "user", "content": prompt})
            conversation.append({"role": "assistant", "content": response})
        # NOTE: `tokenizer` must already be defined in the surrounding scope.
        formatted_conversation = tokenizer.apply_chat_template(conversation, tokenize=False)
        texts.append(formatted_conversation)
    return {"text": texts}

The function takes an examples parameter, which is expected to be a dictionary containing a 'text' key with a list of conversation strings.
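For reference, here is a minimal sketch of a batch in that shape. The tag layout is inferred from the `split` and `replace` calls in the function, and the strings themselves are made-up placeholders, not actual rows from the dataset:

```python
# Hypothetical batch in the shape format_prompts expects: a dict whose
# "text" key maps to a list of raw strings using <|user|>/<|bot|>/<|end|> tags.
examples = {
    "text": [
        "<|user|>What is mayo?<|end|><|bot|>A condiment made from egg and oil.<|end|>",
        "<|user|>Hi<|end|><|bot|>Hello!<|end|>",
    ]
}
```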

  1. We initialize an empty list called texts to store the formatted conversations.

  2. We iterate over each text in examples['text']:

    • We split the text using the delimiter '<|end|>' to separate the conversation into parts.
    • We iterate over the parts in steps of 2, assuming that even indices represent user prompts and odd indices represent bot responses.
    • We extract the prompt and response by removing the "<|user|>" and "<|bot|>" tags, respectively.
    • We append the prompt and response to the conversation list as dictionaries with "role" and "content" keys.
  3. After processing all the parts, we apply the chat template to the conversation using tokenizer.apply_chat_template(), with tokenize set to False to avoid tokenization at this stage.

  4. We append the formatted_conversation to the texts list.

  5. Finally, we create an output dictionary with a 'text' key containing the list of formatted conversations and return it.
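The parsing in steps 1 and 2 can be traced on a single sample string (the tag format is assumed from the function above; the `apply_chat_template` call is omitted so the trace stays self-contained):

```python
# Trace the split/replace logic on one made-up conversation string.
text = "<|user|>Hi<|end|><|bot|>Hello!<|end|>"

# Splitting on the end-of-turn tag leaves a trailing empty string.
parts = text.split("<|end|>")  # ["<|user|>Hi", "<|bot|>Hello!", ""]

conversation = []
for i in range(0, len(parts) - 1, 2):
    conversation.append({"role": "user", "content": parts[i].replace("<|user|>", "")})
    conversation.append({"role": "assistant", "content": parts[i + 1].replace("<|bot|>", "")})

# conversation is now:
# [{"role": "user", "content": "Hi"},
#  {"role": "assistant", "content": "Hello!"}]
```

This list-of-dicts structure, with alternating `user` and `assistant` roles, is exactly what `apply_chat_template` consumes.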

Usage

To use the format_prompts function, first load the tokenizer whose chat template you want to apply (the function reads it from the surrounding scope), then map the function over the dataset in batched mode:

from datasets import load_dataset
from transformers import AutoTokenizer

# Load the tokenizer for the model you plan to fine-tune; the checkpoint
# name here is a placeholder — substitute your own.
tokenizer = AutoTokenizer.from_pretrained("your-model-checkpoint")

dataset = load_dataset("nroggendorff/mayo", split="train")
dataset = dataset.map(format_prompts, batched=True)

print(dataset['text'][2])  # Spot-check that the chat template was applied

By applying this formatting step, you can ensure that your dataset is compatible with various chat templates, making it easier to fine-tune conversational models for different use cases.