1-800-BAD-CODE committed on
Commit f852113
Parent(s): c7bbf57

Update README.md

Files changed (1): README.md +33 -13
README.md CHANGED
@@ -22,7 +22,8 @@ The easy way to use this model is to install [punctuators](https://github.com/1-
pip install punctuators
```
 
- Running the following script should load this model and run some texts:
+ Running the following script should load this model and run some random texts I made up:
+
<details open>
 
<summary>Example Usage</summary>
@@ -36,8 +37,10 @@ m = PunctCapSegModelONNX.from_pretrained("pcs_en")
 
# Define some input texts to punctuate
input_texts: List[str] = [
-     "hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in",
-     "i live in the us where george hw bush was once president"
+     "george hw bush was the president of the us for 8 years",
+     "i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends",
+     "despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live",
+     "i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter",
]
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
@@ -49,6 +52,8 @@ for input_text, output_texts in zip(input_texts, results):
 
```
 
+ Exact output may vary based on the model version; here is the current output:
+
</details>
 
<details open>
@@ -56,7 +61,21 @@ for input_text, output_texts in zip(input_texts, results):
<summary>Expected Output</summary>
 
```text
- (previous example output)
+ In: george hw bush was the president of the us for 8 years
+ Out: George H.W. Bush was the president of the U.S. for 8 years.
+
+ In: i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends
+ Out: I woke up at 6 a.m. and took the dog for a hike in the Metacomet Mountains.
+ Out: We like to take morning adventures on the weekends.
+
+ In: despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live
+ Out: Despite being mid March, it snowed overnight and into the morning.
+ Out: Here in Connecticut, it was snowier up in the mountains than in the Farmington Valley where I live.
+
+ In: i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter
+ Out: I saw Mr. Smith at the store he was shopping for a new lawn mower.
+ Out: I suggested he get one of those new battery operated ones.
+ Out: They're so much quieter.
```
 
Note that "Friend" in this context is a proper noun, which is why the model consistently upper-cases tokens in similar contexts.
@@ -88,7 +107,7 @@ The SBD head analyzes both the encoding of the un-punctuated sequence and the pu
7. **Shift and concat sentence boundaries**
In English, the first character of each sentence should be upper-cased.
Thus, we should feed the sentence boundary information to the true-case classification network.
- Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
+ Since the true-case classification network is feed-forward and has no temporal context, each time step must embed whether it is the first word of a sentence.
Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence.
Concatenating this with the encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.
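
To make the shift-and-concat step concrete, here is a minimal sketch of what it might look like. This is an illustration only, with made-up shapes and PyTorch as an assumed framework; it is not code from this model:

```python
import torch

# Hypothetical shapes: batch of 1, six tokens, hidden size 4 (illustrative only).
encoded = torch.randn(1, 6, 4)                   # encoder output per token
boundaries = torch.tensor([[0, 0, 1, 0, 0, 1]])  # 1 = token ends a sentence (SBD head)

# Shift right by one: if token N-1 ends a sentence, token N starts one.
# The first token of the sequence is always a sentence start.
starts = torch.roll(boundaries, shifts=1, dims=1)
starts[:, 0] = 1

# Concatenate the start flag onto each time step's encoding so the
# context-free true-case classifier can see sentence starts.
truecase_in = torch.cat([encoded, starts.unsqueeze(-1).float()], dim=-1)
print(truecase_in.shape)  # torch.Size([1, 6, 5])
```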
 
@@ -132,6 +151,7 @@ This model was trained on news data, and may not perform well on conversational
## Noisy Training Data
The training data was noisy, and no manual cleaning was utilized.
 
+ ### Acronyms and Abbreviations
Acronyms and abbreviations are especially noisy; the table below shows how many variations of each token appear in the training data.
 
| Token | Count |
@@ -149,9 +169,9 @@ Acronyms and abbreviations are especially noisy; the table below shows how many
 
Thus, the model's acronym and abbreviation predictions may be a bit unpredictable.
 
-
- Further, an assumption for sentence boundary detection targets is that each line of the input data is exactly one sentence.
- However, a non-negligible portion of the training data contains multiple sentences in one line.
+ ### Sentence Boundary Detection Targets
+ An assumption for sentence boundary detection targets is that each line of the input data is exactly one sentence.
+ However, a non-negligible portion of the training data contains multiple sentences per line.
Thus, the SBD head may miss an obvious sentence boundary if it's similar to an error seen in the training data.
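
To illustrate why this matters, here is a hypothetical sketch of target generation under the one-sentence-per-line assumption; the variable names and logic are made up for illustration, not taken from the actual data preparation:

```python
# A line that actually contains two sentences.
line = "it snowed overnight it was cold"
tokens = line.split()

# Under the one-sentence-per-line assumption, only the final token
# is labeled as a sentence boundary.
targets = [0] * (len(tokens) - 1) + [1]

# The true boundary after "overnight" is labeled 0, teaching the
# model to miss similar boundaries at inference time.
print(list(zip(tokens, targets)))
```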
 
@@ -196,10 +216,10 @@ We show here the cosine similarity between the embeddings of each token:
| | NULL | ACRONYM | . | , | ? |
| - | - | - | - | - | - |
| NULL | 1.00 | | | | |
- | ACRONYM | -0.93 | 1.00 | | | |
- | . | -1.00 | 0.94 | 1.00 | | |
- | , | 1.00 | -0.94 | -1.00 | 1.00 | |
- | ? | -1.00 | 0.93 | 1.00 | -1.00 | 1.00 |
+ | ACRONYM | -0.49 | 1.00 | | | |
+ | . | -1.00 | 0.48 | 1.00 | | |
+ | , | 1.00 | -0.48 | -1.00 | 1.00 | |
+ | ? | -1.00 | 0.49 | 1.00 | -1.00 | 1.00 |
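
For reference, similarities like those in the table can be computed directly from pairs of embedding vectors. The vectors below are made up to mimic the table's structure; the model's real embeddings are learned and not shown here:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 2-D embeddings: NULL and "," point one way, "." and "?"
# point the opposite way, and ACRONYM sits in between.
emb = {
    "NULL": np.array([1.0, 0.0]),
    ",": np.array([1.0, 0.0]),
    ".": np.array([-1.0, 0.0]),
    "?": np.array([-1.0, 0.0]),
    "ACRONYM": np.array([-0.5, 0.87]),
}
print(f"{cosine(emb['NULL'], emb['.']):.2f}")     # -1.00: opposite directions
print(f"{cosine(emb['ACRONYM'], emb['.']):.2f}")  # ~0.50: similar, not identical
```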
 
Recall that these embeddings are used to predict sentence boundaries... thus we should expect full stops to cluster.
 
@@ -211,7 +231,7 @@ Next, we see that "`.`" and "`?`" are exactly the same, because w.r.t. SBD these
Further, we see that "`.`" and "`?`" are exactly the opposite of `NULL`.
This is expected since these tokens typically imply sentence boundaries, whereas `NULL` and "`,`" never do.
 
- Lastly, we see that `ACRONYM` is very, but not totally, similar to the full stops "`.`" and "`?`",
- and almost, but not totally, the opposite of `NULL` and "`,`".
+ Lastly, we see that `ACRONYM` is similar to, but not the same as, the full stops "`.`" and "`?`",
+ and far from, but not the opposite of, `NULL` and "`,`".
Intuition suggests this is because acronyms can be full stops ("I live in the northern U.S. It's cold here.") or not ("It's 5 a.m. and I'm tired.").
237