1-800-BAD-CODE committed on
Commit
b51be78
1 Parent(s): 5e360b5

Update README.md

  1. README.md +29 -3
README.md CHANGED
@@ -11,11 +11,10 @@ tags:
11
  ---
12
 
13
  # Model Overview
14
- This model accepts as input lower-cased, unpunctuated English text and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).
15
 
16
  In contrast to many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, as well as arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.
17
 
18
-
19
  # Usage
20
  The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
21
 
@@ -23,7 +22,6 @@ The easy way to use this model is to install [punctuators](https://github.com/1-
23
  pip install punctuators
24
  ```
25
 
26
-
27
  Running the following script should load this model and run some texts:
28
  <details open>
29
 
@@ -185,3 +183,31 @@ The data is a held-out portion of News Crawl, which has been deduplicated.
185
  Examples longer than the model's maximum length (256) were truncated.
186
  The number of affected sentences can be estimated from the "full stop" support: with 2,000 examples and 10 sentences per example, we expect 20,000 full stop targets total.
187
 
 
11
  ---
12
 
13
  # Model Overview
14
+ This model accepts as input lower-cased, unpunctuated English text and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation) in a single pass.
15
 
16
  In contrast to many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, as well as arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.
17
 
 
18
  # Usage
19
  The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
20
 
 
22
  pip install punctuators
23
  ```
24
 
 
25
  Running the following script should load this model and run some texts:
26
  <details open>
27
 
 
183
  Examples longer than the model's maximum length (256) were truncated.
184
  The number of affected sentences can be estimated from the "full stop" support: with 2,000 examples and 10 sentences per example, we expect 20,000 full stop targets total.
185
 
186
+ ## Results
187
+
188
+ # Fun Facts
189
+ Some fun facts are examined in this section.
190
+
191
+ ## Embeddings
192
+ Let's examine the embeddings (see graph above) to see if the model meaningfully employed them.
193
+
194
+ The table below shows the pairwise cosine similarity between the token embeddings:
195
+
196
+ | | NULL | ACRONYM | . | , | ? |
197
+ | - | - | - | - | - | - |
198
+ | NULL | 1.00 | | | | |
199
+ | ACRONYM | -0.93 | 1.00 | | | |
200
+ | . | -1.00 | 0.94 | 1.00 | | |
201
+ | , | 1.00 | -0.94 | -1.00 | 1.00 | |
202
+ | ? | -1.00 | 0.93 | 1.00 | -1.00 | 1.00 |
203
+
204
+ Recall that these embeddings are used to predict sentence boundaries, so we should expect the full-stop tokens to cluster together.
205
+
206
+ Indeed, we see that `NULL` and `,` have a cosine similarity of 1.00, because neither has any implication for sentence boundaries.
207
+
208
+ Next, we see that periods and question marks have a similarity of 1.00 with each other and -1.00 with NULL, i.e., they point in exactly the opposite direction.
209
+ This is expected since these tokens typically imply sentence boundaries, whereas NULL and commas do not.
210
+
211
+ Lastly, we see that ACRONYM is similar, but not identical, to periods and question marks (0.93 to 0.94),
212
+ and nearly, but not exactly, the opposite of NULL and commas (-0.93 to -0.94).
213
+ Intuition suggests this is because an acronym's final period can also end a sentence ("I live in the northern U.S. It's cold here.") or not ("It's 5 a.m. and I'm tired.").
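
For readers who want to reproduce a table like the one in the Embeddings section, here is a minimal sketch of computing a lower-triangular cosine similarity table. The vectors in `toy_embeddings` are hypothetical 2-D values chosen for illustration; they are not the model's learned embeddings. The helper names `cosine_similarity` and `similarity_table` are likewise assumptions and not part of `punctuators`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-D embeddings, NOT the model's learned weights.
# Vectors that point the same way score 1.0 regardless of length;
# vectors that point in opposite directions score -1.0.
toy_embeddings = {
    "NULL": [1.0, 0.0],
    "ACRONYM": [-0.9, 0.4],
    ".": [-1.0, 0.0],
    ",": [2.0, 0.0],
    "?": [-3.0, 0.0],
}


def similarity_table(embeddings):
    """Return labels and the lower triangle of the similarity matrix."""
    labels = list(embeddings)
    rows = []
    for i, a in enumerate(labels):
        # Only compare against labels up to and including the current one.
        row = [
            round(cosine_similarity(embeddings[a], embeddings[b]), 2)
            for b in labels[: i + 1]
        ]
        rows.append(row)
    return labels, rows


labels, rows = similarity_table(toy_embeddings)
for label, row in zip(labels, rows):
    print(label, row)
```

Cosine similarity ignores vector length, so a value of 1.00 means two embeddings point in the same direction, not that they are necessarily identical vectors.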