malteos committed on
Commit 18ee1d3
1 Parent(s): 6276f3f
README.md CHANGED
@@ -1,3 +1,47 @@
  ---
  license: mit
+ tags:
+ - feature-extraction
+ language: en
  ---
+
+ # PubMedNCL
+
+ A pretrained language model for document representations of biomedical papers.
+ PubMedNCL is based on [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext), a BERT model pretrained on abstracts and full texts from PubMedCentral, and is fine-tuned via citation neighborhood contrastive learning, as introduced by [SciNCL](https://huggingface.co/malteos/scincl).
+
+ ## How to use the pretrained model
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+
+ # load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained('malteos/PubMedNCL')
+ model = AutoModel.from_pretrained('malteos/PubMedNCL')
+
+ papers = [
+     {'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
+     {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'},
+ ]
+
+ # concatenate title and abstract with the [SEP] token
+ title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
+
+ # preprocess the input
+ inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
+
+ # inference
+ result = model(**inputs)
+
+ # take the first token ([CLS]) of each sequence as its document embedding
+ embeddings = result.last_hidden_state[:, 0, :]
+ ```
+
+ ## Citation
+
+ - [Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022)](https://arxiv.org/abs/2202.06671)
+ - [Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing](https://arxiv.org/abs/2007.15779)
+
+ ## License
+
+ MIT
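A natural follow-up to the card's snippet is comparing the embeddings it produces. The sketch below is not part of the committed card: it assumes the `embeddings` tensor from the README code above and scores the two example papers against each other with cosine similarity.

```python
import torch
import torch.nn.functional as F

# assumes `embeddings` is the (2, 768) tensor computed in the README snippet above
with torch.no_grad():
    normalized = F.normalize(embeddings, p=2, dim=1)  # L2-normalize each row
    similarity = normalized @ normalized.T            # pairwise cosine similarities
    print(similarity[0, 1].item())                    # 'BERT' vs. 'Attention is all you need'
```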
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "_name_or_path": "data/s2orc_with_specter_without_scidocs/specter/corpus_seed_0/seed_0_ep5knn20-25_en3random_without_knn_hn2knn3998-4000/model_BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.5.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
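The added config describes a standard BERT-base encoder: 12 layers, 12 attention heads, hidden size 768, and a 512-token position limit (which is why the README snippet truncates at `max_length=512`). As a minimal sketch of how to inspect these values, assuming only that the model is loadable from the Hub:

```python
from transformers import AutoConfig

# fetches the config.json shown above
config = AutoConfig.from_pretrained('malteos/PubMedNCL')

print(config.model_type)               # bert
print(config.num_hidden_layers)        # 12
print(config.hidden_size)              # 768
print(config.max_position_embeddings)  # 512 -> longer inputs must be truncated
```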
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ee39908b91b5dbf93aa8859ca9e140f7b087f3c09ae05250b45628301dec191b
+ size 438012727
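The three lines above are a Git LFS pointer rather than the weights themselves: they record the LFS spec version, the SHA-256 digest of the actual ~438 MB `pytorch_model.bin`, and its size in bytes. A minimal sketch for verifying a locally downloaded copy against the pointer (the local path is a placeholder):

```python
import hashlib

# placeholder path to a locally downloaded pytorch_model.bin
path = 'pytorch_model.bin'

digest = hashlib.sha256()
with open(path, 'rb') as f:
    for chunk in iter(lambda: f.read(1 << 20), b''):  # hash in 1 MiB chunks
        digest.update(chunk)

# must match the oid recorded in the LFS pointer above
assert digest.hexdigest() == 'ee39908b91b5dbf93aa8859ca9e140f7b087f3c09ae05250b45628301dec191b'
```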
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "data/s2orc_with_specter_without_scidocs/specter/corpus_seed_0/seed_0_ep5knn20-25_en3random_without_knn_hn2knn3998-4000/model_BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext", "do_basic_tokenize": true, "never_split": null}
trainer_state.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.txt ADDED
The diff for this file is too large to render. See raw diff