## Model Overview

The **qqp_kz** model is a paraphrasing model tailored for the Kazakh language. It is built upon **humarin/chatgpt_paraphraser_on_T5_base**, inheriting its architecture and adapting it to the nuances of Kazakh.

### Key Features

- Language: specifically designed for paraphrasing in Kazakh.
- Base model: derived from **chatgpt_paraphraser_on_T5_base**, a proven model for paraphrasing tasks.
- Tokenizer: uses **CCRss/tokenizer_kazakh_t5_kz** for Kazakh-language processing.

### Data Preprocessing

Before training, each source/target pair in the dataset is tokenized into fixed-length sequences so the model receives uniform inputs:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CCRss/tokenizer_kazakh_t5_kz")

def preprocess_data(example):
    # Tokenize the source and target sentences to fixed-length sequences.
    source_inputs = tokenizer(example["src"], padding="max_length", truncation=True, max_length=128)
    target_inputs = tokenizer(example["trg"], padding="max_length", truncation=True, max_length=128)
    # The source tokens are the encoder inputs; the target token ids become the labels.
    source_inputs["labels"] = target_inputs["input_ids"]
    return source_inputs

encoded_dataset = dataset.map(preprocess_data)
encoded_dataset.set_format("torch")
```
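
The snippet above assumes `dataset` is already loaded as a Hugging Face `DatasetDict` with `src` and `trg` text columns and `train`/`valid` splits. A minimal sketch of how such a dataset might be loaded (the file names here are placeholders, not the actual training data):

```python
from datasets import load_dataset

# Placeholder file names; any source that yields "src"/"trg" columns
# with "train" and "valid" splits works the same way.
dataset = load_dataset("csv", data_files={"train": "train.csv", "valid": "valid.csv"})
```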
### Model Training

The model is trained with the following configuration:

```python
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

# Load the base paraphraser as the starting checkpoint.
name_of_model = "humarin/chatgpt_paraphraser_on_T5_base"
model = AutoModelForSeq2SeqLM.from_pretrained(name_of_model)

training_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=21,
    gradient_accumulation_steps=3,
    learning_rate=5e-5,
    save_steps=2000,
    num_train_epochs=3,
    output_dir='./results',
    logging_dir='./logs',
    logging_steps=2000,
    eval_steps=2000,
    evaluation_strategy="steps"
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['valid']
)

trainer.train()
```
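
With `per_device_train_batch_size=21` and `gradient_accumulation_steps=3`, each optimizer step accumulates gradients over 21 × 3 = 63 examples per device.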
60 |
+
|
61 |
+
### Usage

The **qqp_kz** model is well suited to NLP applications that require paraphrasing in Kazakh, including, but not limited to, content creation, translation enhancement, and linguistic research.

To use the model (a minimal example follows these steps):

- Install the transformers library.
- Load the model using the Hugging Face API.
- Input your Kazakh text for paraphrasing.
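
A minimal inference sketch is shown below. The repository id `CCRss/qqp_kz` is an assumption based on the tokenizer's namespace (replace it with this model's actual Hub path), and the generation settings are illustrative defaults rather than tuned values:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("CCRss/tokenizer_kazakh_t5_kz")
# Assumed repo id; substitute the actual Hub path of this model.
model = AutoModelForSeq2SeqLM.from_pretrained("CCRss/qqp_kz")

text = "Бүгін ауа райы өте жақсы."  # "The weather is very nice today."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
outputs = model.generate(**inputs, max_length=128, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```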

### Contributions and Feedback

Contributions to this model are welcome. For any feedback or queries, please open an issue in the repository.