---
license: mit
datasets:
- andythetechnerd03/Vietnamese-Poem-5words
language:
- vi
tags:
- art
---

# Vietnamese Text Summarization with Poem

Summarize a piece of text with a poem. Doesn't it sound fun?
## Introduction

Jokes aside, this is a fun project by my team at FPT University on fine-tuning a Large Language Model (LLM) to summarize a long piece of Vietnamese text in the form of **poems**. We call the model **VistralPoem5**.
Here's a little example:

![image](/assets/example_data_transformed.png)

## HuggingFace 🤗

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "andythetechnerd03/VistralPoem5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = [
    {"role": "system", "content": "Bạn là một nhà thơ chuyên nghiệp, nhiệm vụ của bạn là chuyển bài văn này thành 1 bài thơ 5 chữ từ khoảng 1 đến 3 khổ"},
    {"role": "user", "content": "nhớ tới lời mẹ dặn\nsợ mẹ buồn con đau\nnên tự mình đứng dậy\nnhanh như có phép màu"}
]
input_ids = tokenizer.apply_chat_template(inputs, return_tensors="pt").to(model.device)

outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=200,
    do_sample=True,
    top_p=0.95,
    top_k=20,
    temperature=0.1,
    repetition_penalty=1.05,
)
output_str = tokenizer.batch_decode(outputs[:, input_ids.size(1):], skip_special_tokens=True)[0].strip()
print(output_str)
```

## Fine-tuning

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/andythetechnerd03/Vietnamese-Text-Summarization-Poem/blob/main/notebooks/fine_tune_with_axolotl.ipynb)

This is not an easy task. The model we are using is a Vietnamese version of the popular [Mistral-7B](https://arxiv.org/abs/2310.06825), with 7 billion parameters. Fine-tuning it is very computationally expensive, so we applied several state-of-the-art optimization techniques:

- [Flash Attention](https://github.com/Dao-AILab/flash-attention): an IO-aware, exact attention implementation that avoids materializing the full $n \times n$ attention matrix, reducing the memory cost of attention from $O(n^2)$ to $O(n)$ and significantly speeding it up on GPU.
- [QLoRA (Quantized Low-Rank Adaptation)](https://arxiv.org/abs/2305.14314): trains a small "adapter" made of low-rank weight matrices, so only a fraction of the parameters need gradients. On top of that, the base model is quantized to `4-bit`, which also makes large models much cheaper to store. A minimal sketch of this setup is shown after the training command below.
- [Mixed Precision Training](https://arxiv.org/abs/1710.03740): combines the `float32` and `bfloat16` data types for faster training.

To train the LLM as seamlessly as possible, we used a popular open-source fine-tuning platform called [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl). It lets you declare the parameters and config in a file and train quickly without writing much code.

### Code for fine-tuning the model

To customize the configuration, modify the `create_file_config.py` file. After making your changes, run the script to generate a personalized configuration file. The following is an example of how to launch training:

```bash
cd src
export PYTHONPATH="$PWD"
accelerate launch -m axolotl.cli.train config.yaml
```
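For reference, here is a minimal sketch of what this QLoRA setup looks like outside of Axolotl, using `transformers`, `peft`, and `bitsandbytes`. It illustrates the technique rather than reproducing our exact training code (Axolotl builds the equivalent setup from `config.yaml`); the rank and alpha mirror the values listed in the Model section below, while the dropout value is only a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (NF4) with bfloat16 compute, as in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Viet-Mistral/Vistral-7B-Chat",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach a low-rank adapter; only these adapter weights receive gradients.
lora_config = LoraConfig(
    r=32,                         # rank, as in our config
    lora_alpha=16,                # alpha, as in our config
    target_modules="all-linear",  # adapt all linear layers
    lora_dropout=0.05,            # placeholder value, not from our config
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```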
## Data

This part is not easy either. Data where the input is a long text (a newspaper article, a story) and the output is a poem is very hard to find, so we created our own... using *prompt engineering*.

- Collecting poems is straightforward: there are many repositories and prior works that have collected Vietnamese poems, as well as publicly available samples online. We collected ours from [FPT Software AI Lab](https://github.com/fsoft-ailab/Poem-Generator) and [HuggingFace](https://github.com/fsoft-ailab/Poem-Generator).
- From each poem, we use prompt engineering to ask our base model to generate a story (a code sketch of this step follows the list below). The prompt is in the form:

```
Bạn là một nhà kể chuyện phiếm, nhiệm vụ của bạn là hãy kể 1 câu chuyện đơn giản và ngắn gọn từ một bài thơ, câu chuyện nên là 1 bài liền mạch, thực tế

{insert poem here}
```

- Speaking of prompt engineering, there is another prompt to generate a poem from a context:

```
Bạn là một nhà thơ chuyên nghiệp, nhiệm vụ của bạn là chuyển bài văn này thành 1 bài thơ 5 chữ từ khoảng 1 đến 3 khổ:

{insert context here}
```

- The pre-processing step is fairly simple: a bit of lowercasing here, some punctuation removal there, plus trimming each poem to 1-3 random stanzas, and we are done.

In the end we have about 72,101 samples, split with a 0.05 test ratio (68,495 in the train set and 3,606 in the test set). We published the dataset [here](https://huggingface.co/datasets/andythetechnerd03/Vietnamese-Poem-5words).
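Here is a rough sketch of how that story-generation step might look in code, reusing the chat-template interface from the usage example above. The helper name and decoding parameters are ours for illustration, and whether the prompt is sent as a single user turn (as in the template above) or as a system/user pair is an implementation detail; the real pipeline runs this over the whole poem collection.

```python
# Hypothetical helper for the data-generation step; decoding parameters are illustrative.
STORY_PROMPT = (
    "Bạn là một nhà kể chuyện phiếm, nhiệm vụ của bạn là hãy kể 1 câu chuyện "
    "đơn giản và ngắn gọn từ một bài thơ, câu chuyện nên là 1 bài liền mạch, thực tế"
)

def poem_to_story(poem: str, model, tokenizer, max_new_tokens: int = 300) -> str:
    """Ask the base model to turn a poem into a short, coherent story."""
    messages = [{"role": "user", "content": f"{STORY_PROMPT}\n\n{poem}"}]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
    outputs = model.generate(
        input_ids=input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
    )
    return tokenizer.batch_decode(outputs[:, input_ids.size(1):], skip_special_tokens=True)[0].strip()
```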
### Custom Evaluation Data

As part of the final benchmark, we gathered around 27 Vietnamese children's stories and divided them into smaller pieces, for a total of 118 samples. The dataset can be found [here](/data/eval_set.json).

## Model

As mentioned earlier, we use [Vistral-7B-Chat](https://huggingface.co/Viet-Mistral/Vistral-7B-Chat) as the base model and fine-tune it on the dataset curated above. A few configurations:

- The model is based on the Transformer decoder-only architecture:
  - Number of Attention Heads: 32
  - Hidden Size: 4096
  - Vocab size: 38369
  - Data type: bfloat16
  - Number of Hidden Layers (Nx): 32
- Loss function: Cross-entropy
- Parameter-Efficient Fine-tuning: QLoRA (4-bit)
  - Alpha: 16
  - Rank: 32
  - Target: Linear
- Gradient accumulation: 4
- Learning Rate: 0.0002
- Warmup Steps: 10
- LR Scheduler: Cosine
- Max Steps: 400
- Batch size: 16
- Optimizer: AdamW (bnb 8-bit)
- Sequence Len: 1096

The weights can be found [here](https://huggingface.co/andythetechnerd03/VistralPoem5), and the training notebook is at `notebook/Fine_tune_LLMs_with_Axolotl.ipynb`.

## Benchmark

We used the custom evaluation dataset for benchmarking. Since popular metrics such as ROUGE do not apply to the poem format, we chose a simpler approach: counting the probability that the generated output is a 5-word poem (a sketch of this check follows below).
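The check itself is only a few lines of Python. Below is a simplified sketch of our reading of the metric (the actual implementation is in `notebook/probability_5word.ipynb`): a generated poem counts as a 5-word poem only if every non-empty line contains exactly five words, and the reported probability is the fraction of such poems over the evaluation set.

```python
def is_five_word_poem(poem: str) -> bool:
    """A poem counts only if every non-empty line has exactly five words."""
    lines = [line.strip() for line in poem.splitlines() if line.strip()]
    return bool(lines) and all(len(line.split()) == 5 for line in lines)

def five_word_probability(poems: list[str]) -> float:
    """Fraction of generated poems that satisfy the 5-word constraint."""
    return sum(is_five_word_poem(p) for p in poems) / len(poems)

# Toy example: the first output keeps 5 words per line, the second does not.
outputs = [
    "nhớ tới lời mẹ dặn\nsợ mẹ buồn con đau",
    "một câu quá dài nên không được tính là thơ năm chữ",
]
print(five_word_probability(outputs))  # 0.5
```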
Here's the result:

| Model | Number of Parameters | Hardware | Probability of 5-word poems (higher is better) | Average inference time (lower is better) |
|----------------------------|----------------------|----------------------|------------------------------------------------|------------------------------------------|
| Vistral-7B-Chat (baseline) | 7B | 1x Nvidia Tesla A100 | 4.15% | 6.75s |
| Google Gemini Pro* | > 100B | **Multi-TPU** | 18.3% | 3.4s |
| **VistralPoem5 (Ours)** | **7B** | 1x Nvidia Tesla A100 | **61.4%** | **3.14s** |

\* Accessed via API call, so inference time may be affected.

The benchmark code can be found in `notebook/infer_poem_model.ipynb` and `notebook/probability_5word.ipynb`.

## Deployment

We used Gradio for fast deployment on Google Colab. It should be in `notebook/infer_poem_model.ipynb` as well.

![Screenshot 2024-03-09 185803](https://github.com/andythetechnerd03/Vietnamese-Poem-Summarization/assets/101492362/8bd94ed1-bb67-48fb-924e-17ad320e3005)

Docker image coming soon...

## Future Work

- [ ] Design a custom loss function to align rhythm and tones.
- [ ] Use a better metric for evaluating poems (rhythm and content summarization).
- [ ] Use RLHF to align poems with human values.
- [ ] And more...

## Credits

- [Phan Phuc](https://github.com/pphuc25) for doing the fine-tuning.
- [Me](https://github.com/andythetechnerd03) for designing the pipeline and testing the model.
- [Truong Vo](https://github.com/justinvo277) for collecting the data.