---
datasets:
- CCRss/small-chatgpt-paraphrases-kz
language:
- kk
library_name: transformers
tags:
- text-generation-inference
license: mit
---
|
## Model Overview |
|
The **qqp_kz** model is a paraphrasing tool tailored for the Kazakh language. It is built on **humarin/chatgpt_paraphraser_on_T5_base**, inheriting that model's architecture and adapting it to the nuances of Kazakh.
|
|
|
### Key Features: |
|
- Language: Specifically designed for paraphrasing in Kazakh. |
|
- Base Model: Derived from **chatgpt_paraphraser_on_T5_base**, a proven model in paraphrasing tasks. |
|
- Tokenizer: Utilizes **CCRss/tokenizer_t5_kz** for Kazakh-specific text processing; a quick tokenizer check is sketched below.
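
The following is a minimal sketch of loading and inspecting the tokenizer; the sample sentence is purely illustrative.

```python
# Quick check of the Kazakh tokenizer (the sample sentence is illustrative)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CCRss/tokenizer_t5_kz")

sample = "Сәлеметсіз бе, қалыңыз қалай?"  # "Hello, how are you?"
print(tokenizer.tokenize(sample))  # subword pieces produced for the sentence
print(tokenizer(sample))           # input_ids and attention_mask the model consumes
```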
|
|
|
### Data Preprocessing
|
The dataset used for training the qqp_kz model is preprocessed so that source and target texts are tokenized, truncated, and padded to a fixed length, giving every example a consistent shape:
|
```python
# Importing the necessary modules
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Loading the paraphrase dataset from the Hugging Face Hub
dataset = load_dataset("CCRss/small-chatgpt-paraphrases-kz")

# Initializing the tokenizer for this model. The tokenizer converts text
# input into a format that is understandable by the model.
tokenizer = AutoTokenizer.from_pretrained("CCRss/tokenizer_t5_kz")

# Preprocessing function: takes an example (source and target texts),
# tokenizes both to a fixed length, and uses the target token ids as labels.
def preprocess_data(example):
    # Extracting the source and target texts from the example
    source = example["src"]
    target = example["trg"]

    # Tokenizing the source text with padding and truncation to a fixed length
    source_inputs = tokenizer(source, padding="max_length", truncation=True, max_length=128)

    # Tokenizing the target text the same way
    target_inputs = tokenizer(target, padding="max_length", truncation=True, max_length=128)

    # Returning the tokenized source as model inputs and the target ids as labels.
    # (Merging target_inputs into the output would overwrite the source
    # input_ids, so only the labels are taken from the target.)
    return {**source_inputs, "labels": target_inputs["input_ids"]}

# Applying the preprocessing function to every example in the dataset
encoded_dataset = dataset.map(preprocess_data)

# Setting the dataset format to PyTorch tensors for the training framework
encoded_dataset.set_format("torch")
```
|
### Model Training |
|
|
|
The model is trained with the following configuration: |
|
|
|
```python
# Importing the training classes from the transformers library
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# Name of the pretrained model to fine-tune
name_of_model = "humarin/chatgpt_paraphraser_on_T5_base"

# Loading the model from the pretrained weights
model = AutoModelForSeq2SeqLM.from_pretrained(name_of_model)

# Setting up training arguments: batch size, learning rate, number of epochs,
# output and logging directories, and the evaluation strategy.
training_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=21,
    gradient_accumulation_steps=3,
    learning_rate=5e-5,
    save_steps=2000,
    num_train_epochs=3,
    output_dir='./results',
    logging_dir='./logs',
    logging_steps=2000,
    eval_steps=2000,
    evaluation_strategy="steps"
)

# Initializing the trainer with the model, training arguments,
# and the training and evaluation splits.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['valid']
)

# Starting the training process
trainer.train()
```
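
With `per_device_train_batch_size=21` and `gradient_accumulation_steps=3`, each optimizer step effectively accumulates gradients over 21 × 3 = 63 examples per device.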
|
|
|
### Usage |
|
The **qqp_kz** model is designed specifically for paraphrasing in the Kazakh language, making it well suited to NLP tasks such as content creation, improving translations, and linguistic research.
|
|
|
To utilize the model: |
|
|
|
- Install the transformers library. |
|
- Load the model using the Hugging Face API. |
|
- Input your Kazakh text for paraphrasing, as in the sketch below.
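
The following is a minimal inference sketch along these lines. The repository id `CCRss/qqp_kz` is an assumption (substitute the actual checkpoint location), and the generation settings are illustrative rather than tuned values.

```python
# Minimal inference sketch; the model repository id below is an assumption
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("CCRss/tokenizer_t5_kz")
model = AutoModelForSeq2SeqLM.from_pretrained("CCRss/qqp_kz")  # hypothetical hub id

# A sample Kazakh sentence to paraphrase
text = "Мен кітап оқығанды жақсы көремін."

# Tokenize the input, generate a paraphrase, and decode it back to text
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
outputs = model.generate(**inputs, max_length=128, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```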
|
|
|
### Example Deployment |
|
For a practical demonstration of the model in action, please refer to our [Google Colab notebook](https://colab.research.google.com/drive/1ieNhrPnh-MEAlmMgGFVffB1LLXtaXsuf?usp=sharing). It provides a complete example of running inference with the qqp_kz model.
|
|
|
### Contributions and Feedback |
|
We welcome contributions to the qqp_kz model. If you have suggestions, improvements, or encounter any issues, please feel free to open an issue in the repository. |