1-800-BAD-CODE committed on
Commit f852113
Parent(s): c7bbf57

Update README.md

Files changed (1): README.md +33 -13
README.md CHANGED
@@ -22,7 +22,8 @@ The easy way to use this model is to install [punctuators](https://github.com/1-
pip install punctuators
```
 
- Running the following script should load this model and run some texts:
+ Running the following script should load this model and run some random texts I made up:
+
<details open>
 
<summary>Example Usage</summary>
@@ -36,8 +37,10 @@ m = PunctCapSegModelONNX.from_pretrained("pcs_en")
 
# Define some input texts to punctuate
input_texts: List[str] = [
-     "hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in",
-     "i live in the us where george hw bush was once president"
+     "george hw bush was the president of the us for 8 years",
+     "i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends",
+     "despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live",
+     "i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter",
]
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
@@ -49,6 +52,8 @@ for input_text, output_texts in zip(input_texts, results):
 
```
 
+ Exact output may vary based on the model version; here is the current output:
+
</details>
 
<details open>
@@ -56,7 +61,21 @@ for input_text, output_texts in zip(input_texts, results):
<summary>Expected Output</summary>
 
```text
- (previous example output)
+ In: george hw bush was the president of the us for 8 years
+ Out: George H.W. Bush was the president of the U.S. for 8 years.
+
+ In: i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends
+ Out: I woke up at 6 a.m. and took the dog for a hike in the Metacomet Mountains.
+ Out: We like to take morning adventures on the weekends.
+
+ In: despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live
+ Out: Despite being mid March, it snowed overnight and into the morning.
+ Out: Here in Connecticut, it was snowier up in the mountains than in the Farmington Valley where I live.
+
+ In: i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter
+ Out: I saw Mr. Smith at the store he was shopping for a new lawn mower.
+ Out: I suggested he get one of those new battery operated ones.
+ Out: They're so much quieter.
```
 
Note that "Friend" in this context is a proper noun, which is why the model consistently upper-cases tokens in similar contexts.
@@ -88,7 +107,7 @@ The SBD head analyzes both the encoding of the un-punctuated sequence and the pu
7. **Shift and concat sentence boundaries**
In English, the first character of each sentence should be upper-cased.
Thus, we should feed the sentence boundary information to the true-case classification network.
- Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
+ Since the true-case classification network is feed-forward and has no temporal context, each time step must embed whether it is the first word of a sentence.
Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence.
Concatenating this with the encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.
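
To make the shift-and-concat step concrete, here is a minimal sketch of what it might look like. This is an illustration only, with made-up shapes and PyTorch as an assumed framework; it is not code from this model:

```python
import torch

# Hypothetical shapes: batch of 1, six tokens, hidden size 4 (illustrative only).
encoded = torch.randn(1, 6, 4)                   # encoder output per token
boundaries = torch.tensor([[0, 0, 1, 0, 0, 1]])  # 1 = token ends a sentence (SBD head)

# Shift right by one: if token N-1 ends a sentence, token N starts one.
# The first token of the sequence is always a sentence start.
starts = torch.roll(boundaries, shifts=1, dims=1)
starts[:, 0] = 1

# Concatenate the start flag onto each time step's encoding so the
# context-free true-case classifier can see sentence starts.
truecase_in = torch.cat([encoded, starts.unsqueeze(-1).float()], dim=-1)
print(truecase_in.shape)  # torch.Size([1, 6, 5])
```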
 
@@ -132,6 +151,7 @@ This model was trained on news data, and may not perform well on conversational
## Noisy Training Data
The training data was noisy, and no manual cleaning was utilized.
 
+ ### Acronyms and Abbreviations
Acronyms and abbreviations are especially noisy; the table below shows how many variations of each token appear in the training data.
 
| Token | Count |
@@ -149,9 +169,9 @@ Acronyms and abbreviations are especially noisy; the table below shows how many
 
Thus, the model's acronym and abbreviation predictions may be a bit unpredictable.
 
-
- Further, an assumption for sentence boundary detection targets is that each line of the input data is exactly one sentence.
- However, a non-negligible portion of the training data contains multiple sentences in one line.
+ ### Sentence Boundary Detection Targets
+ An assumption for sentence boundary detection targets is that each line of the input data is exactly one sentence.
+ However, a non-negligible portion of the training data contains multiple sentences per line.
Thus, the SBD head may miss an obvious sentence boundary if it's similar to an error seen in the training data.
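
To illustrate why this matters, here is a hypothetical sketch of target generation under the one-sentence-per-line assumption; the variable names and logic are made up for illustration, not taken from the actual data preparation:

```python
# A line that actually contains two sentences.
line = "it snowed overnight it was cold"
tokens = line.split()

# Under the one-sentence-per-line assumption, only the final token
# is labeled as a sentence boundary.
targets = [0] * (len(tokens) - 1) + [1]

# The true boundary after "overnight" is labeled 0, teaching the
# model to miss similar boundaries at inference time.
print(list(zip(tokens, targets)))
```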
 
@@ -196,10 +216,10 @@ We show here the cosine similarity between the embeddings of each token:
| | NULL | ACRONYM | . | , | ? |
| - | - | - | - | - | - |
| NULL | 1.00 | | | | |
- | ACRONYM | -0.93 | 1.00 | | | |
- | . | -1.00 | 0.94 | 1.00 | | |
- | , | 1.00 | -0.94 | -1.00 | 1.00 | |
- | ? | -1.00 | 0.93 | 1.00 | -1.00 | 1.00 |
+ | ACRONYM | -0.49 | 1.00 | | | |
+ | . | -1.00 | 0.48 | 1.00 | | |
+ | , | 1.00 | -0.48 | -1.00 | 1.00 | |
+ | ? | -1.00 | 0.49 | 1.00 | -1.00 | 1.00 |
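
For reference, similarities like those in the table can be computed directly from pairs of embedding vectors. The vectors below are made up to mimic the table's structure; the model's real embeddings are learned and not shown here:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 2-D embeddings: NULL and "," point one way, "." and "?"
# point the opposite way, and ACRONYM sits in between.
emb = {
    "NULL": np.array([1.0, 0.0]),
    ",": np.array([1.0, 0.0]),
    ".": np.array([-1.0, 0.0]),
    "?": np.array([-1.0, 0.0]),
    "ACRONYM": np.array([-0.5, 0.87]),
}
print(f"{cosine(emb['NULL'], emb['.']):.2f}")     # -1.00: opposite directions
print(f"{cosine(emb['ACRONYM'], emb['.']):.2f}")  # ~0.50: similar, not identical
```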
 
Recall that these embeddings are used to predict sentence boundaries... thus we should expect full stops to cluster.
 
@@ -211,7 +231,7 @@ Next, we see that "`.`" and "`?`" are exactly the same, because w.r.t. SBD these
Further, we see that "`.`" and "`?`" are exactly the opposite of `NULL`.
This is expected since these tokens typically imply sentence boundaries, whereas `NULL` and "`,`" never do.
 
- Lastly, we see that `ACRONYM` is very, but not totally, similar to the full stops "`.`" and "`?`",
- and almost, but not totally, the opposite of `NULL` and "`,`".
+ Lastly, we see that `ACRONYM` is similar to, but not the same as, the full stops "`.`" and "`?`",
+ and far from, but not the opposite of, `NULL` and "`,`".
Intuition suggests this is because acronyms can be full stops ("I live in the northern U.S. It's cold here.") or not ("It's 5 a.m. and I'm tired.").
237