1-800-BAD-CODE committed
Commit b629ea4
1 Parent(s): 5c8868f

Update README.md

Files changed (1): README.md (+25, -0)

This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (capitalizes),
and detects sentence boundaries (full stops) in 47 languages.

# Model Architecture

This model implements the following graph, which allows punctuation, true-casing, and full-stop prediction
in every language without language-specific behavior:

![graph.png](https://s3.amazonaws.com/moonup/production/uploads/62d34c813eebd640a4f97587/jpr-pMdv6iHxnjbN4QNt0.png)

We start by tokenizing the text and encoding it with XLM-Roberta, which is the pre-trained portion of this graph.
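
As a concrete illustration, here is a minimal sketch of this encoding step using the 🤗 `transformers` API. The `xlm-roberta-base` checkpoint stands in for the pre-trained portion of the graph; the fine-tuned model's own weights and preprocessing may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumption: the public base checkpoint approximates the pre-trained encoder.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

batch = tokenizer("hola amigo cómo estás", return_tensors="pt")

with torch.no_grad():
    # One contextual embedding per subtoken: shape (1, num_subtokens, 768).
    hidden = encoder(**batch).last_hidden_state
```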

Then we predict punctuation before and after every subtoken.
Predicting before each token allows for Spanish inverted question marks.
Predicting after every token allows for all other punctuation, including punctuation within continuous-script
languages and acronyms.
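
Continuing the sketch above, the two predictions could be implemented as independent token-classification heads. The label inventories here are hypothetical placeholders, not the model's actual punctuation vocabulary.

```python
import torch.nn as nn

# Hypothetical label sets; the real punctuation vocabulary is larger.
PRE_LABELS = ["<null>", "¿", "¡"]                    # inserted before a subtoken
POST_LABELS = ["<null>", ".", ",", "?", "。", "、"]  # inserted after a subtoken

hidden_dim = hidden.size(-1)  # 768 for xlm-roberta-base

# Two independent classification heads over the encoder's subtoken states.
pre_punct_head = nn.Linear(hidden_dim, len(PRE_LABELS))
post_punct_head = nn.Linear(hidden_dim, len(POST_LABELS))

pre_ids = pre_punct_head(hidden).argmax(-1)    # one "before" label per subtoken
post_ids = post_punct_head(hidden).argmax(-1)  # one "after" label per subtoken
```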

We use embeddings to represent the predicted punctuation tokens to inform the sentence boundary head of the
punctuation that will be inserted into the text. This allows proper full stop prediction, since certain punctuation
tokens (periods, question marks, etc.) are strongly correlated with sentence boundaries.
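
In the sketch, the predicted punctuation ids can be re-embedded and fused with the encoder states before the sentence-boundary head sees them; whether the real graph adds or concatenates these features is an assumption here.

```python
# Re-embed the predicted "after" punctuation so the boundary head can see it.
punct_embedding = nn.Embedding(len(POST_LABELS), hidden_dim)
sentence_boundary_head = nn.Linear(hidden_dim, 2)     # full stop here or not

fused = hidden + punct_embedding(post_ids)            # additive fusion (assumed)
boundary = sentence_boundary_head(fused).argmax(-1)   # 1 where a sentence ends
```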

We then shift full stop predictions to the right by one, to inform the true-casing head of where the beginning
of each new sentence is. This is important since true-casing is strongly correlated with sentence boundaries.
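
The shift itself is a one-liner in the sketch: padding on the left moves each boundary flag onto the first token of the following sentence.

```python
import torch.nn.functional as F

# A 1 at position i now means "subtoken i starts a new sentence".
sentence_start = F.pad(boundary, (1, 0))[:, :-1]
```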

For true-casing, we make `N` predictions per subtoken, where `N` is the number of characters in the subtoken.
In practice, `N` is the maximum subtoken length and extra predictions are ignored. Essentially, true-casing is
modeled as a multi-label problem. This allows for upper-casing arbitrary characters, e.g., "NATO", "MacDonald", "mRNA", etc.
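
A sketch of this multi-label head, under these assumptions: a fixed cap of 16 characters per subtoken, an independent sigmoid per character slot, and additive fusion of the shifted sentence-start flags from the previous step.

```python
MAX_SUBTOKEN_LEN = 16  # assumed cap; slots past a subtoken's length are ignored

start_embedding = nn.Embedding(2, hidden_dim)             # sentence-start flag
true_case_head = nn.Linear(hidden_dim, MAX_SUBTOKEN_LEN)  # N slots per subtoken

case_logits = true_case_head(hidden + start_embedding(sentence_start))
# Multi-label: an independent upper-case decision per character, so patterns
# like "mRNA" (lower, UPPER, UPPER, UPPER) are representable.
upper_case = torch.sigmoid(case_logits) > 0.5  # (1, seq, MAX_SUBTOKEN_LEN)
```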

Applying all these predictions to the input text, we can punctuate, true-case, and split sentences in any language.
 
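Putting the sketch together, a toy decoder over subtoken strings might look like the following; real detokenization (SentencePiece merging, whitespace markers) is glossed over, and the inputs are plain Python lists rather than the tensors above.

```python
def apply_predictions(subtokens, pre_ids, post_ids, upper_case):
    """Toy renderer: per-character casing, punctuation before/after each subtoken."""
    pieces = []
    for i, tok in enumerate(subtokens):
        cased = "".join(
            c.upper() if j < len(upper_case[i]) and upper_case[i][j] else c
            for j, c in enumerate(tok)
        )
        pre = "" if PRE_LABELS[pre_ids[i]] == "<null>" else PRE_LABELS[pre_ids[i]]
        post = "" if POST_LABELS[post_ids[i]] == "<null>" else POST_LABELS[post_ids[i]]
        pieces.append(pre + cased + post)
    return " ".join(pieces)

# e.g. apply_predictions(["hola", "amigo"], [0, 0], [1, 0], [[1] + [0] * 15] * 2)
# -> "Hola. amigo"
```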

## Tokenizer