---
datasets:
- CCRss/small-chatgpt-paraphrases-kz
language:
- kk
library_name: transformers
tags:
- text-generation-inference
license: mit
---
## Model Overview
The **qqp_kz** model is a paraphrasing tool tailored for the Kazakh language. It is built upon the **humarin/chatgpt_paraphraser_on_T5_base** model, inheriting its robust architecture and adapting it to the nuances of Kazakh.

### Key Features:
- Language: Specifically designed for paraphrasing in Kazakh.
- Base Model: Derived from **chatgpt_paraphraser_on_T5_base**, a proven model in paraphrasing tasks.
- Tokenizer: Utilizes **CCRss/tokenizer_t5_kz** for optimal Kazakh language processing.

### Data Preprocessing
The dataset used for training the qqp_kz model undergoes rigorous preprocessing to ensure compatibility and optimal performance:
```python
# Importing necessary modules from the datasets and transformers libraries
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Loading the paraphrase dataset from the Hugging Face Hub. The 'train' and 'valid'
# splits referenced during training are assumed to be present in this dataset.
dataset = load_dataset("CCRss/small-chatgpt-paraphrases-kz")

# Initializing the tokenizer for the specific model. This tokenizer is used to convert
# text input into a format that is understandable by the model.
tokenizer = AutoTokenizer.from_pretrained("CCRss/tokenizer_t5_kz")

# Define a function for preprocessing the data. This function takes an example
# (which includes source and target texts) and tokenizes both texts using the tokenizer.
# The tokenized output is then formatted to a fixed length for consistent model input.
def preprocess_data(example):
    # Extracting the source and target texts from the example
    source = example["src"]
    target = example["trg"]
    
    # Tokenizing the source text with padding and truncation to ensure a fixed length
    source_inputs = tokenizer(source, padding="max_length", truncation=True, max_length=128)
    
    # Tokenizing the target text with padding and truncation to ensure a fixed length
    target_inputs = tokenizer(target, padding="max_length", truncation=True, max_length=128)
    
    # Returning the tokenized source text as the model inputs and the tokenized
    # target ids as the labels (the target must not overwrite the source inputs)
    return {**source_inputs, "labels": target_inputs["input_ids"]}

# Applying the preprocessing function to the dataset, effectively transforming all text data
# into a tokenized format suitable for the Seq2Seq model.
encoded_dataset = dataset.map(preprocess_data)
# Setting the format of the dataset to PyTorch tensors for compatibility with the training framework.
encoded_dataset.set_format("torch")

```
### Model Training

The model is trained with the following configuration:

```python

# Importing necessary classes for training from the transformers library
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# Name of the pretrained model to be used for Seq2Seq learning
name_of_model = "humarin/chatgpt_paraphraser_on_T5_base"
# Loading the model from the pretrained weights
model = AutoModelForSeq2SeqLM.from_pretrained(name_of_model)

# Setting up training arguments. This includes batch size, learning rate, number of epochs,
# directories for saving results and logs, and evaluation strategy.
training_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=21,
    gradient_accumulation_steps=3,
    learning_rate=5e-5,
    save_steps=2000,
    num_train_epochs=3,
    output_dir='./results',
    logging_dir='./logs',
    logging_steps=2000,
    eval_steps=2000,
    evaluation_strategy="steps"
)

# Initializing the trainer with the model, training arguments, and the datasets for training and evaluation.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['valid']
)

# Starting the training process of the model using the specified datasets and training arguments.
trainer.train()
```

### Usage
The **qqp_kz** model is specifically designed for paraphrasing in the Kazakh language. It is highly suitable for a variety of NLP tasks such as content creation, enhancing translations, and linguistic research.

To utilize the model:

- Install the transformers library.
- Load the model using the Hugging Face API.
- Input your Kazakh text for paraphrasing, as shown in the sketch below.
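
A minimal inference sketch is shown below. It assumes the model is published under the repository id `CCRss/qqp_kz` (adjust if the actual id differs); the example sentence and generation settings are illustrative only.

```python
# Minimal inference sketch. The repository id "CCRss/qqp_kz" is an assumption;
# replace it with the actual model id if it differs.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("CCRss/tokenizer_t5_kz")
model = AutoModelForSeq2SeqLM.from_pretrained("CCRss/qqp_kz")

# Example Kazakh sentence to paraphrase (illustrative only)
text = "Бүгін ауа райы өте жақсы."

# Tokenize the input and generate a paraphrase with beam search
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
outputs = model.generate(
    **inputs,
    max_length=128,
    num_beams=5,
    num_return_sequences=1,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```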

### Example Deployment
For a practical demonstration of the model in action, please refer to our [Google Colab notebook](https://colab.research.google.com/drive/1ieNhrPnh-MEAlmMgGFVffB1LLXtaXsuf?usp=sharing). This notebook provides a comprehensive example of how to run inference with the qqp_kz model.

### Contributions and Feedback
We welcome contributions to the qqp_kz model. If you have suggestions, improvements, or encounter any issues, please feel free to open an issue in the repository.