raynardj committed on
Commit 389c472
1 Parent(s): 8584913

a model to fine disease

Files changed (1)
  1. README.md +94 -0
README.md ADDED
@@ -0,0 +1,94 @@
+ ---
+ language:
+ - en
+ tags:
+ - ner
+ - ncbi
+ - disease
+ - pubmed
+ - bioinfomatics
+ license: apache-2.0
+ datasets:
+ - ncbi-disease
+ - bc5cdr
+ widget:
+ - text: "Hepatocyte nuclear factor 4 alpha (HNF4α) is regulated by different promoters to generate two isoforms, one of which functions as a tumor suppressor. Here, the authors reveal that induction of the alternative isoform in hepatocellular carcinoma inhibits the circadian clock by repressing BMAL1, and the reintroduction of BMAL1 prevents HCC tumor growth."
+
+ ---
+
+ # NER to find Diseases
+ > The model was trained on the ncbi-disease and BC5CDR datasets, starting from this [PubMed-pretrained RoBERTa model](/raynardj/roberta-pubmed).
+ All the labels, i.e. the possible token classes:
+ ```json
+ {"label2id": {
+     "O": 0,
+     "Disease": 1
+   }
+ }
+ ```
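The mapping above can be inverted to turn predicted class ids back into label names. The following is a minimal sketch, assuming the mapping is copied by hand from the JSON above rather than read from the model config; the `predicted_ids` list is made up for illustration.

```python
# Label mapping copied from the JSON above.
label2id = {"O": 0, "Disease": 1}

# Invert it so that a predicted class id can be turned back into a label name.
id2label = {idx: label for label, idx in label2id.items()}

# Hypothetical per-token predictions, for illustration only.
predicted_ids = [0, 1, 1, 0]
predicted_labels = [id2label[i] for i in predicted_ids]
print(predicted_labels)
```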
+
+ Note that we removed the 'B-', 'I-', etc. prefixes from the data labels. 🗡
+
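To illustrate what removing the prefixes means, here is a small sketch of collapsing BIO-style tags into flat labels. It is not the authors' preprocessing code: the helper name `strip_bio_prefix` and the sample tags are assumptions, and only the 'B-' and 'I-' markers named above are handled.

```python
def strip_bio_prefix(label: str) -> str:
    """Drop a leading 'B-' or 'I-' marker; leave other labels (such as 'O') untouched."""
    if label[:2] in ("B-", "I-"):
        return label[2:]
    return label


# Hypothetical BIO-tagged sequence, for illustration only.
bio_labels = ["O", "B-Disease", "I-Disease", "O"]
flat_labels = [strip_bio_prefix(label) for label in bio_labels]
print(flat_labels)
```

After this step, the begin and inside tags of an entity share one class, which matches the two-class `label2id` mapping shown earlier.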
+ ## This is the template we suggest for using the model
+ ```python
+ from transformers import pipeline
+
+ PRETRAINED = "raynardj/ner-disease-ncbi-bionlp-bc5cdr-pubmed"
+ # Load the model and its tokenizer into a token-classification (NER) pipeline
+ ner = pipeline(task="ner", model=PRETRAINED, tokenizer=PRETRAINED)
+ ner("Your text", aggregation_strategy="first")
+ ```
+ And here is how to merge consecutive tokens in the output into whole entities ⭐️
+ ```python
+ import pandas as pd
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)
+
+ def clean_output(outputs):
+     # Expects token-level outputs: each dict must carry
+     # "index", "word", "start", "end" and "entity".
+     results = []
+     current = []
+     last_idx = 0
+     # split the outputs into sub-groups of consecutive token positions
+     for output in outputs:
+         if output["index"] - 1 == last_idx:
+             current.append(output)
+         else:
+             if current:
+                 results.append(current)
+             current = [output]
+         last_idx = output["index"]
+     if len(current) > 0:
+         results.append(current)
+
+     # from tokens to string
+     strings = []
+     for c in results:
+         tokens = []
+         starts = []
+         ends = []
+         for o in c:
+             tokens.append(o["word"])
+             starts.append(o["start"])
+             ends.append(o["end"])
+         new_str = tokenizer.convert_tokens_to_string(tokens)
+         if new_str != "":
+             strings.append(dict(
+                 word=new_str,
+                 start=min(starts),
+                 end=max(ends),
+                 entity=c[0]["entity"],
+             ))
+     return strings
+
+ def entity_table(pipeline, **pipeline_kw):
+     # Assumption: if your transformers version returns grouped results without
+     # "index"/"entity" keys for this strategy, pass aggregation_strategy="none".
+     if "aggregation_strategy" not in pipeline_kw:
+         pipeline_kw["aggregation_strategy"] = "first"
+
+     def create_table(text):
+         return pd.DataFrame(
+             clean_output(
+                 pipeline(text, **pipeline_kw)
+             )
+         )
+     return create_table
+
+ # will return a dataframe
+ entity_table(ner)(YOUR_VERY_CONTENTFUL_TEXT)
+ ```
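The grouping step can be tried without downloading the model. The sketch below is an illustration under stated assumptions, not part of the README: `group_consecutive` and `mock_outputs` are hypothetical names, the token dicts are hand-written, and the tokenizer is left out, so only the character span and label of each group are rebuilt.

```python
def group_consecutive(outputs):
    """Split token-level outputs into runs whose "index" values increase by exactly 1."""
    groups = []
    for output in outputs:
        if groups and output["index"] == groups[-1][-1]["index"] + 1:
            groups[-1].append(output)
        else:
            groups.append([output])
    return groups


# Hypothetical token-level outputs, hand-written for illustration only.
mock_outputs = [
    {"index": 3, "word": "hepat", "start": 10, "end": 15, "entity": "Disease"},
    {"index": 4, "word": "oma", "start": 15, "end": 18, "entity": "Disease"},
    {"index": 9, "word": "cancer", "start": 40, "end": 46, "entity": "Disease"},
]

# One span per group: earliest start, latest end, label of the first token.
spans = [
    {
        "start": min(o["start"] for o in group),
        "end": max(o["end"] for o in group),
        "entity": group[0]["entity"],
    }
    for group in group_consecutive(mock_outputs)
]
print(spans)
```

Tokens 3 and 4 are adjacent and merge into one span, while token 9 starts a new one.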
+ > Check out our NER models for:
+ * [gene and gene products](/raynardj/ner-gene-dna-rna-jnlpba-pubmed)
+ * [chemical substance](/raynardj/ner-chemical-bionlp-bc5cdr-pubmed)
+ * [disease](/raynardj/ner-disease-ncbi-bionlp-bc5cdr-pubmed)