---

# Model Overview
This model accepts as input lower-cased, unpunctuated English text and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).

In contrast to many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, as well as arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.

  # Usage
The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):

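A minimal sketch of loading and running the model through `punctuators` follows; the pretrained name `"pcs_en"` is an assumption here and may differ:

```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Load the ONNX export of this model; "pcs_en" is an assumed
# pretrained name and may differ.
m: PunctCapSegModelONNX = PunctCapSegModelONNX.from_pretrained("pcs_en")

# Input is lower-cased, unpunctuated text; the output for each input
# is a list of punctuated, true-cased sentences.
input_texts: List[str] = [
    "hello friend how are you today its been a long time since we spoke",
]
results: List[List[str]] = m.infer(input_texts)
for sentences in results:
    for sentence in sentences:
        print(sentence)
```
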
  # Model Details

This model implements the graph shown below, with a brief description of each step following. A schematic code sketch of these steps appears after the list.

  ![graph.png](https://s3.amazonaws.com/moonup/production/uploads/1678575121699-62d34c813eebd640a4f97587.png)

1. **Encoding**:
The model begins by tokenizing the text with a subword tokenizer.
The tokenizer used here is a `SentencePiece` model with a vocabulary size of 64k.
Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.

2. **Punctuation**:
The encoded sequence is then fed into a classification network to predict punctuation tokens.
Punctuation is predicted once per subword, to allow acronyms to be properly punctuated.
An indirect benefit of per-subword prediction is that it allows the model to run in a graph generalized for continuous-script languages, e.g., Chinese.

3. **Sentence boundary detection**:
For sentence boundary detection, we condition the model on punctuation via embeddings.
Each punctuation prediction is used to select an embedding for that token, which is concatenated to the encoded representation.
The SBD head analyzes both the encoding of the un-punctuated sequence and the punctuation predictions, and predicts which tokens are sentence boundaries.

4. **Shift and concat sentence boundaries**:
In English, the first character of each sentence should be upper-cased.
Thus, we feed the sentence boundary information to the true-case classification network.
Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, then token `N` is the first word of a sentence.
Concatenating this with the encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.

5. **True-case prediction**:
Armed with the knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subtoken.
(In practice, `N` is the longest possible subword, and the extra predictions are ignored.)
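
To make the data flow concrete, below is a minimal PyTorch sketch of steps 1 through 5. It is an illustration, not the model's actual NeMo implementation: only the 64k vocabulary, the 6-layer, 512-dimensional encoder, and the five punctuation tokens come from this card; the attention-head count, punctuation-embedding size, and maximum subword length are assumptions.

```python
import torch
import torch.nn as nn


class PunctCapSegSketch(nn.Module):
    """Illustrative sketch of the punctuation / SBD / true-casing graph."""

    def __init__(
        self,
        vocab_size: int = 64_000,   # SentencePiece vocabulary (from this card)
        d_model: int = 512,         # base-sized Transformer (from this card)
        num_layers: int = 6,        # (from this card)
        num_punct: int = 5,         # NULL, ACRONYM, ".", ",", "?"
        punct_emb_dim: int = 4,     # assumed punctuation-embedding size
        max_subword_len: int = 16,  # assumed longest subword, for per-char casing
    ):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Step 2: one punctuation prediction per subword.
        self.punct_head = nn.Linear(d_model, num_punct)
        # Step 3: embeddings that condition later heads on predicted punctuation.
        self.punct_emb = nn.Embedding(num_punct, punct_emb_dim)
        self.sbd_head = nn.Linear(d_model + punct_emb_dim, 2)
        # Step 5: multi-label, per-character true-casing (one logit per char).
        self.case_head = nn.Linear(d_model + punct_emb_dim + 1, max_subword_len)

    def forward(self, token_ids: torch.Tensor):
        # Step 1: encode the subword sequence. [B, T] -> [B, T, D]
        encoded = self.encoder(self.embed(token_ids))

        # Step 2: per-subword punctuation logits and hard predictions.
        punct_logits = self.punct_head(encoded)          # [B, T, num_punct]
        punct_ids = punct_logits.argmax(dim=-1)          # [B, T]

        # Step 3: concat punctuation embeddings; predict sentence boundaries.
        conditioned = torch.cat([encoded, self.punct_emb(punct_ids)], dim=-1)
        sbd_logits = self.sbd_head(conditioned)          # [B, T, 2]
        boundaries = sbd_logits.argmax(dim=-1).float()   # [B, T]

        # Step 4: shift boundaries right by one: if token N-1 ends a sentence,
        # token N starts one. The first token always starts a sentence.
        first_word = torch.roll(boundaries, shifts=1, dims=1)
        first_word[:, 0] = 1.0
        case_inputs = torch.cat([conditioned, first_word.unsqueeze(-1)], dim=-1)

        # Step 5: sigmoid => multi-label casing, so e.g. "NATO" can upper-case
        # every character; predictions beyond the subword length are ignored.
        case_probs = torch.sigmoid(self.case_head(case_inputs))
        return punct_logits, sbd_logits, case_probs


# Smoke test with a random batch of 12 subword ids.
model = PunctCapSegSketch()
punct, sbd, casing = model(torch.randint(0, 64_000, (1, 12)))
print(punct.shape, sbd.shape, casing.shape)  # [1,12,5] [1,12,2] [1,12,16]
```

The punctuation embeddings let the sentence-boundary and true-casing heads depend on predicted punctuation while the heads themselves stay lightweight.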
 
This model predicts the following set of punctuation tokens:

| Token | Description |
| ---: | :---------- |
| NULL | Predict no punctuation |
| ACRONYM | Every character in this subword ends with a period |
| . | Latin full stop |
| , | Latin comma |
| ? | Latin question mark |
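
As an illustration of the `ACRONYM` class, here is a hypothetical decoding helper (`apply_punctuation` is an invented name; the real decoding logic lives in `punctuators` and may differ):

```python
def apply_punctuation(subword: str, punct_token: str) -> str:
    """Apply one predicted punctuation token to one subword (sketch only)."""
    if punct_token == "NULL":
        return subword
    if punct_token == "ACRONYM":
        # Every character in this subword ends with a period: "us" -> "u.s."
        return "".join(ch + "." for ch in subword)
    # ".", ",", and "?" are appended after the subword.
    return subword + punct_token


# True-casing is applied separately, e.g., "u.s." -> "U.S.".
assert apply_punctuation("us", "ACRONYM") == "u.s."
assert apply_punctuation("friend", ",") == "friend,"
```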
 
# Training Details
This model was trained in the NeMo framework, using News Crawl data from WMT.

Approximately 10M lines were used from the years 2021 and 2012.
The latter year was used in an attempt to reduce bias: annual news is typically dominated by a few topics, and 2021 was dominated by COVID discussion.

  # Limitations
## Domain
  This model was trained on news data, and may not perform well on conversational or informal data.

## Noisy Training Data
The training data was noisy, and no manual cleaning was performed.

Acronyms and abbreviations are especially noisy; the tables below show how many variations of two common tokens appear in the training data.

| Token | Count |
| ---: | :---- |
| Mr | 115232 |
| Mr. | 108212 |

| Token | Count |
| ---: | :---- |
| U.S. | 85324 |
| US | 37332 |
| U.S | 354 |
| U.s | 108 |
| u.S. | 65 |
| u.s | 2 |

Thus, the model's acronym and abbreviation predictions may be a bit unpredictable.
 
  # Evaluation
  In these metrics, keep in mind that