MohamedRashad committed on
Commit
d2eec13
1 Parent(s): d2e2dfe

feat: Add Fertility Score calculation and Tokenize Tashkeel indicator to leaderboard description

Files changed (1): app.py (+2 -1)
app.py CHANGED
@@ -215,10 +215,11 @@ leaderboard_description = """The `Total Number of Tokens` in this leaderboard is
 
  ## Updates
  1. New datasets is added for the evaluation (e.g. [arabic-quotes](https://huggingface.co/datasets/HeshamHaroon/arabic-quotes), [Moroccan_Arabic_Wikipedia_20230101_nobots](https://huggingface.co/datasets/SaiedAlshahrani/Moroccan_Arabic_Wikipedia_20230101_nobots)).
- 1. `Fertility Score` is calculated by dividing the total number of tokens by the total number of words in the dataset (another way to interpret `Total Number of Tokens`).
+ 1. `Fertility Score` is calculated by dividing the total number of tokens by the total number of words in the dataset (Lower is better).
  1. `Tokenize Tashkeel` is an indicator of whether the tokenizer maintains the tashkeel when tokenizing or not (`✅` for yes, `❌` for no).
  1. `Vocab Size` is the total number of tokens in the tokenizer's vocabulary (e.g. `10000` tokens).
  1. `Tokenizer Class` is the class of the tokenizer (e.g. `BertTokenizer` or `GPT2Tokenizer`)
+ 1. `Total Number of Tokens` is the total number of tokens in the dataset after tokenization (Lower is better).
  """
 
  with gr.Blocks() as demo:
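For reference, here is a minimal sketch of how the two metrics described above could be computed for an arbitrary Hugging Face tokenizer. It is an illustration only, not the implementation in app.py: the helper names (`fertility_score`, `keeps_tashkeel`), the whitespace-based word count, the sample text, and the example checkpoint are all assumptions.

```python
# Sketch only (not the leaderboard's exact code): fertility score and
# tashkeel indicator for a Hugging Face tokenizer.
from transformers import AutoTokenizer

def fertility_score(tokenizer, texts):
    """Total tokens divided by total whitespace-separated words (lower is better)."""
    total_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

def keeps_tashkeel(tokenizer, text="مَرْحَبًا"):
    """True if Arabic diacritics (tashkeel) survive an encode/decode round trip."""
    decoded = tokenizer.decode(tokenizer.encode(text, add_special_tokens=False))
    tashkeel = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")  # tanween, harakat, shadda, sukun
    return any(ch in tashkeel for ch in decoded)

if __name__ == "__main__":
    # Example checkpoint chosen for illustration; any tokenizer on the Hub works.
    tok = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")
    sample = ["مَرْحَبًا بِكُمْ فِي لَوْحَةِ الصَّدَارَةِ"]
    print("Fertility Score:", fertility_score(tok, sample))
    print("Tokenize Tashkeel:", "✅" if keeps_tashkeel(tok) else "❌")
```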