CCRss committed on
Commit
b92b6ed
1 Parent(s): e166482

Create README.md

Files changed (1)
  1. README.md +71 -0
README.md ADDED
@@ -0,0 +1,71 @@
+ ## Model Overview
+ The **qqp_kz** model is a paraphrasing model for the Kazakh language. It is built upon the **humarin/chatgpt_paraphraser_on_T5_base** model, keeping its T5 architecture and adapting it to Kazakh.
+
+ ### Key Features:
+ - Language: Designed specifically for paraphrasing in Kazakh.
+ - Base Model: Derived from **humarin/chatgpt_paraphraser_on_T5_base**, a T5-based paraphrasing model.
+ - Tokenizer: Uses **CCRss/tokenizer_kazakh_t5_kz** for Kazakh text.
+
+ ### Data Preprocessing
+ The dataset used to train the qqp_kz model is tokenized before training: each source/target pair is padded or truncated to 128 tokens, and the target token IDs are stored as labels:
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ tokenizer = AutoTokenizer.from_pretrained("CCRss/tokenizer_kazakh_t5_kz")
+
+ def preprocess_data(example):
+     source = example["src"]
+     target = example["trg"]
+     # Tokenize both sides to a fixed length of 128 tokens.
+     source_inputs = tokenizer(source, padding="max_length", truncation=True, max_length=128)
+     target_inputs = tokenizer(target, padding="max_length", truncation=True, max_length=128)
+     # Keep the source encoding as the model inputs. Unpacking target_inputs here as well
+     # would overwrite input_ids and attention_mask with the target's values.
+     return {**source_inputs, "labels": target_inputs["input_ids"]}
+
+ # `dataset` is assumed to be loaded beforehand, with "train" and "valid" splits
+ # and "src" / "trg" text columns.
+ encoded_dataset = dataset.map(preprocess_data)
+ encoded_dataset.set_format("torch")
+ ```
+ ### Model Training
+
+ The model is trained with the following configuration:
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainer, Seq2SeqTrainingArguments
+
+ name_of_model = "humarin/chatgpt_paraphraser_on_T5_base"
+ model = AutoModelForSeq2SeqLM.from_pretrained(name_of_model)
+
+ training_args = Seq2SeqTrainingArguments(
+     per_device_train_batch_size=21,
+     gradient_accumulation_steps=3,  # effective batch size per device: 21 * 3 = 63
+     learning_rate=5e-5,
+     save_steps=2000,
+     num_train_epochs=3,
+     output_dir='./results',
+     logging_dir='./logs',
+     logging_steps=2000,
+     eval_steps=2000,
+     # Evaluate every `eval_steps` steps. Newer transformers releases
+     # name this argument `eval_strategy`.
+     evaluation_strategy="steps"
+ )
+
+ trainer = Seq2SeqTrainer(
+     model=model,
+     args=training_args,
+     train_dataset=encoded_dataset['train'],
+     eval_dataset=encoded_dataset['valid']
+ )
+
+ trainer.train()
+ ```
+
+ ### Usage
+ The **qqp_kz** model is suited to NLP applications that require paraphrasing in Kazakh, including content creation, translation enhancement, and linguistic research.
+
+ To use the model:
+
+ - Install the transformers library.
+ - Load the model and its tokenizer with the Hugging Face API.
+ - Input your Kazakh text for paraphrasing.
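
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the author's published inference code: the repository ID `CCRss/qqp_kz`, the helper names `load_paraphraser` and `paraphrase`, and the generation settings (beam search, three candidates, 128 tokens) are all assumptions. It presumes `pip install transformers` plus a PyTorch backend, and possibly `sentencepiece` for the T5 tokenizer.

```python
def load_paraphraser(model_id="CCRss/qqp_kz",
                     tokenizer_id="CCRss/tokenizer_kazakh_t5_kz"):
    """Load the model and tokenizer. `model_id` is an assumed repository ID."""
    # Imported lazily so the helpers can be defined without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def paraphrase(text, tokenizer, model, num_return_sequences=3, max_length=128):
    """Return `num_return_sequences` paraphrase candidates for `text`."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       max_length=max_length)
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        # Beam search needs at least as many beams as returned sequences.
        num_beams=max(4, num_return_sequences),
        num_return_sequences=num_return_sequences,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Example call (downloads the weights on first use):
# tokenizer, model = load_paraphraser()
# for candidate in paraphrase("<your Kazakh text>", tokenizer, model):
#     print(candidate)
```

The `max_length=128` default mirrors the sequence length used in preprocessing. The other generation settings are illustrative and can be tuned freely.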
+
+ ### Contributions and Feedback
+ Contributions to this model are welcome. For feedback or questions, please open an issue in the repository.