usama98 committed on
Commit
6363169
1 Parent(s): 143fb7c

Update README.md

  1. README.md +31 -1
README.md CHANGED
@@ -11,4 +11,34 @@ widget:
11
  - text: "عمرو بنِ قُمَيئَة: خَليلَيَّ لا تَستَعجِلا أَن"
12
 
13
  ---
14
- This is a poem generator that creates poems based on the style of the targeted poet. The model was trained on different poets and their respective poems, and the model's input is the poet's name and a suggestion that the model will strive to develop something that imitates the style of that specific poet.
14
+
15
+ # GPTPoet: Pre-training GPT2 for Arabic Poetry Language Understanding
16
+
17
+ <img src="https://raw.githubusercontent.com/aub-mind/arabert/master/arabert_logo.png" width="100" align="left"/>
18
+
19
+ **GPTPoet** is an Arabic pretrained language model based on the [OpenAI GPT-2 architecture](https://github.com/openai/gpt-2). We use the same GPT2-Base config. More details are available in the Google Colab [].
20
+
21
+ To save computation time, the model was initialized from the pretrained weights of another [model](https://huggingface.co/elgeish/gpt2-medium-arabic-poetry). This allowed us to fine-tune it on our own dataset, which to our knowledge has not been used in an NLP task before.
22
+
23
+ GPTPoet is a poem generator that imitates the style of a targeted poet. It was trained on many poets and their respective poems. The input is the poet's name followed by a suggestion (an opening fragment), which the model tries to develop into verse in the style of that poet.
24
+
25
+ #
26
+
27
+ ## What's New!
28
+
29
+
30
+ All models are available on the Hugging Face model page under the [usama98](https://huggingface.co/usama98/) account. Checkpoints are available in PyTorch format.
31
+
32
+ # Dataset
33
+
34
+ The dataset consists of content scraped mainly from الموسوعة الشعرية (the Poetry Encyclopedia) and الديوان (Al-Diwan). After merging both sources, the dataset contains 1,831,770 poetic verses. Each verse is labeled with its meter, the poet who wrote it, and the era in which it was written. There are 22 meters, 3,701 poets, and 11 eras: Pre-Islamic, Islamic, Umayyad, Mamluk, Abbasid, Ayyubid, Ottoman, Andalusian, the era between Umayyad and Abbasid, Fatimid, and the modern era. We are interested only in the 16 classical meters attributed to Al-Farahidi, which make up the majority of the dataset with around 1.7M verses. Note that the diacritization of the verses is not consistent: a verse may be fully diacritized, partially diacritized, or carry no diacritics at all.
35
+
36
+ - [APCD](https://hci-lab.github.io/LearningMetersPoems/#PCD)
37
+
38
+ # Preprocessing
39
+
40
+ We recommend applying our preprocessing tokenizer to any dataset before training or testing on it.
41
+
42
+ # Contacts
43
+ **Usama Zidan**: [LinkedIn](https://huggingface.co/elgeish/gpt2-medium-arabic-poetry) | [GitHub](https://github.com/usama13o) | <usama.zidan@bcu.ac.uk> | <osama.zadan@gmail.com>
44
+
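The README text in this diff says the model input is a poet's name plus a suggestion, and that verses in the dataset vary in how much they are diacritized. The following is a minimal sketch of both ideas, not the author's preprocessing code. The `poet: opening` prompt layout is an assumption inferred from the widget example in the diff, and the helper names `build_prompt`, `strip_diacritics`, and `diacritic_coverage` are hypothetical. Loading the checkpoint is not shown because the diff does not name the model repository.

```python
import re

# Arabic diacritic marks (tashkeel): fathatan through sukun, plus dagger alif.
DIACRITIC = r"[\u064B-\u0652\u0670]"
# Basic Arabic letters (hamza through yeh), skipping the tatweel at U+0640.
LETTER = r"[\u0621-\u063A\u0641-\u064A]"


def build_prompt(poet: str, opening: str) -> str:
    """Join a poet's name and an opening fragment as 'poet: opening'.

    Assumption: this layout mirrors the widget example; it is not documented.
    """
    return f"{poet.strip()}: {opening.strip()}"


def strip_diacritics(text: str) -> str:
    """Remove diacritic marks so differently diacritized verses compare equal."""
    return re.sub(DIACRITIC, "", text)


def diacritic_coverage(text: str) -> float:
    """Return the fraction of Arabic letters directly followed by a diacritic.

    A rough indicator only: 0.0 means no diacritics, and values near 1.0
    suggest heavy diacritization. It does not decide whether a verse is
    'fully' diacritized in the linguistic sense.
    """
    letters = re.findall(LETTER, text)
    if not letters:
        return 0.0
    marked = re.findall(LETTER + DIACRITIC, text)
    return len(marked) / len(letters)
```

A prompt built this way could be passed to whatever text-generation interface the checkpoint is loaded with. The coverage value may help bucket verses by diacritization level before deciding whether to strip the marks.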