---
license: mit
---

# Model
The Pandemic PACT Advanced Categorisation Engine (PPACE) is a fine-tuned 8B LLM designed for automatically classifying research abstracts from funded biomedical projects according to WHO-aligned research priorities. Developed as part of the GLOPID-R Pandemic PACT project, PPACE assists in tracking and analysing research funding and clinical evidence for a wide range of diseases with outbreak potential.

The model leverages a human-annotated dataset expanded with rationales generated by a larger LLM. These rationales provide explanations for the chosen labels, enhancing the model's interpretability and accuracy.

# Usage

### Preparing the Input Prompt
```python
def construct_input_prompt(title, abstract):
    categories = """We have a project in the area of biomedical research which we want to classify in terms of the research priorities it related to. We have 12 possible research priorities and a project can be mapped to 1 or more of these priorities. The following is a guide on what each of these 12 categories are alongside the specific areas that they cover. 

1. Pathogen: Natural History, Transmission, and Diagnostics:
    Development of diagnostic tools, understanding pathogen morphology, genomics, and genotyping, studying immunity, using disease models, and assessing the environmental stability of pathogens.

2. Animal and Environmental Research & Research on Diseases Vectors:
    Animal sources, transmission routes, vector biology, and control strategies for vectors.

3. Epidemiological Studies:
    Research on disease transmission dynamics, susceptibility, control measure effectiveness, and disease mapping through surveillance and reporting.

4. Clinical Characterisation and Management in Humans:
    Prognostic factors for disease severity, disease pathogenesis, supportive care and management, long-term health consequences, and clinical trials for disease management.

5. Infection Prevention and Control:
    Research on community restriction measures, barriers and PPE, infection control in healthcare settings, and measures at the human-animal interface.

6. Therapeutics Research, Development, and Implementation:
    Pre-clinical studies for therapeutic development, clinical trials for therapeutic safety and efficacy, development of prophylactic treatments, logistics and supply chain management for therapeutics, clinical trial design for therapeutics, and research on adverse events related to therapeutic administration.

7. Vaccines Research, Development, and Implementation:
    Pre-clinical studies for vaccine development, clinical trials for vaccine safety and efficacy, logistics and distribution strategies for vaccines, vaccine design and administration, clinical trial design for vaccines, research on adverse events related to immunisation, and characterisation of vaccine-induced immunity.

8. Research to Inform Ethical Issues:
    Ethical considerations in research design, ethical issues in public health measures, ethical clinical decision-making, ethical resource allocation, ethical governance, and ethical considerations in social determinants of health.

9. Policies for public health, disease control and  community resilience:
    Approaches to public health interventions, community engagement, communication and infodemic management, vaccine/therapeutic hesitancy, and policy research and interventions.

10. Secondary Impacts of Disease, Response, and Control Measures:
    Indirect health impacts, social impacts, economic impacts, and other secondary impacts such as environmental effects, food security, and infrastructure.

11. Health Systems Research:
    Health service delivery, health financing, access to medicines and technologies, health information systems, health leadership and governance, and health workforce management.

12. Capacity Strengthening:
    Individual capacity building, institutional capacity strengthening, systemic/environmental components, and cross-cutting activities across all levels of capacity building."""

    prompt = "Based on the research categorization guidelines, classify the following project into the appropriate primary research priorities using the categories 1 to 12."
    prompt += f"\n\n{categories.strip()}\n\nProject Information:\n\n"
    prompt += f"### Title:\n'''\n{title.strip()}\n'''\n\n### Abstract:\n'''\n{abstract.strip()}\n'''\n\n"
    prompt += "Based on this information, identify the relevant research categories for this project. Provide clear explanation for your choices. Section your response in the following format:"
    prompt += "\n\n### Explanation: ...\n\n### Categories: ..."
    
    return prompt

title = "Neutralization of Primate Immunodeficiency Viruses"
abstract = "We will repurpose existing assays, techniques and expertise that are central to our project team's virology, structural biology, vaccine development and protein production skill-sets for HIV research, to now also work on SARS-CoV-2 during the COVID-19 pandemic emergency. These interactive research efforts will draw on our established methodologies and should represent a productive use of our existing NIH grant resources. We note that there continue to be institutional restrictions at all three performance sites on the effort that can be applied to our original goals relating to HIV-1 vaccine research and development. Those goals will be unchanged, but will be pursued at a reduced effort during the period when we also work on the new SARS-CoV-2 projects for which we have fewer institutional restrictions due to the COVID-19 pandemic."

input_prompt = construct_input_prompt(title, abstract)
```

### Loading the Model
```python
import torch
import transformers as ts

model_id = "nlpie/ppace-v1.0"

tokenizer = ts.AutoTokenizer.from_pretrained(model_id)

# Pad on the right and reuse the EOS token as the padding token
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True

# flash_attention_2 requires the flash-attn package and a supported GPU
model = ts.AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16)
```
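
The loading snippet above requests `flash_attention_2`, which needs the separate `flash-attn` package. As a minimal sketch, assuming you want to fall back to PyTorch's built-in `sdpa` attention when that package is missing, a small helper can choose the value to pass as `attn_implementation`. The helper name `pick_attn_implementation` is hypothetical and not part of the original card, and results under `sdpa` have not been checked against the authors' setup.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, otherwise "sdpa"."""
    # find_spec returns None when the top-level package is not installed
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as `attn_implementation=attn_implementation` in the `from_pretrained` call.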

### Inference
``` python
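def generateOutput(prompt):
  # Wrap the prompt in the model's chat template, then tokenize the templated text
  inputs = tokenizer(
  [
      tokenizer.decode(tokenizer.apply_chat_template([{"content": prompt, "role": "user"}]))
  ], return_tensors = "pt").to("cuda")

  # Beam search with 4 beams; stop generating at the end-of-turn token
  output = model.generate(**inputs, max_new_tokens = 512, num_beams=4, eos_token_id=tokenizer("<|eot_id|>", add_special_tokens=False)["input_ids"][0])

  # The decoded string contains the prompt followed by the model's response
  return tokenizer.decode(output[0])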

output = generateOutput(input_prompt)
```
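
The prompt asks the model to end its response with a `### Categories:` section. The following is a minimal sketch of how the category numbers might be extracted from the decoded string, assuming the model lists them as numbers from 1 to 12 after that marker. The exact output format is not documented here, so treat the parsing rule as an assumption. The helper `parse_categories` is hypothetical. It reads the last occurrence of the marker because the decoded output also contains the prompt, which includes the same marker as a format template.

```python
import re


def parse_categories(decoded: str) -> list[int]:
    """Return the category numbers (1-12) found after the last '### Categories:' marker."""
    marker = "### Categories:"
    start = decoded.rfind(marker)
    if start == -1:
        return []
    tail = decoded[start + len(marker):]
    # Cut at the end-of-turn token if it is still present in the decoded text
    tail = tail.split("<|eot_id|>")[0]
    numbers = [int(n) for n in re.findall(r"\b(1[0-2]|[1-9])\b", tail)]
    # Remove duplicates while keeping the original order
    return list(dict.fromkeys(numbers))


# Illustrative string only: a prompt template followed by a made-up response
sample = (
    "### Explanation: ...\n\n### Categories: ...<|eot_id|>"
    "### Explanation: The project covers vaccine development.\n\n"
    "### Categories: 1, 7<|eot_id|>"
)
print(parse_categories(sample))  # [1, 7]
```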

### Complete Code
``` python
import torch
import transformers as ts

def construct_input_prompt(title, abstract):
    categories = """We have a project in the area of biomedical research which we want to classify in terms of the research priorities it related to. We have 12 possible research priorities and a project can be mapped to 1 or more of these priorities. The following is a guide on what each of these 12 categories are alongside the specific areas that they cover. 

1. Pathogen: Natural History, Transmission, and Diagnostics:
    Development of diagnostic tools, understanding pathogen morphology, genomics, and genotyping, studying immunity, using disease models, and assessing the environmental stability of pathogens.

2. Animal and Environmental Research & Research on Diseases Vectors:
    Animal sources, transmission routes, vector biology, and control strategies for vectors.

3. Epidemiological Studies:
    Research on disease transmission dynamics, susceptibility, control measure effectiveness, and disease mapping through surveillance and reporting.

4. Clinical Characterisation and Management in Humans:
    Prognostic factors for disease severity, disease pathogenesis, supportive care and management, long-term health consequences, and clinical trials for disease management.

5. Infection Prevention and Control:
    Research on community restriction measures, barriers and PPE, infection control in healthcare settings, and measures at the human-animal interface.

6. Therapeutics Research, Development, and Implementation:
    Pre-clinical studies for therapeutic development, clinical trials for therapeutic safety and efficacy, development of prophylactic treatments, logistics and supply chain management for therapeutics, clinical trial design for therapeutics, and research on adverse events related to therapeutic administration.

7. Vaccines Research, Development, and Implementation:
    Pre-clinical studies for vaccine development, clinical trials for vaccine safety and efficacy, logistics and distribution strategies for vaccines, vaccine design and administration, clinical trial design for vaccines, research on adverse events related to immunisation, and characterisation of vaccine-induced immunity.

8. Research to Inform Ethical Issues:
    Ethical considerations in research design, ethical issues in public health measures, ethical clinical decision-making, ethical resource allocation, ethical governance, and ethical considerations in social determinants of health.

9. Policies for public health, disease control and  community resilience:
    Approaches to public health interventions, community engagement, communication and infodemic management, vaccine/therapeutic hesitancy, and policy research and interventions.

10. Secondary Impacts of Disease, Response, and Control Measures:
    Indirect health impacts, social impacts, economic impacts, and other secondary impacts such as environmental effects, food security, and infrastructure.

11. Health Systems Research:
    Health service delivery, health financing, access to medicines and technologies, health information systems, health leadership and governance, and health workforce management.

12. Capacity Strengthening:
    Individual capacity building, institutional capacity strengthening, systemic/environmental components, and cross-cutting activities across all levels of capacity building."""

    prompt = "Based on the research categorization guidelines, classify the following project into the appropriate primary research priorities using the categories 1 to 12."
    prompt += f"\n\n{categories.strip()}\n\nProject Information:\n\n"
    prompt += f"### Title:\n'''\n{title.strip()}\n'''\n\n### Abstract:\n'''\n{abstract.strip()}\n'''\n\n"
    prompt += "Based on this information, identify the relevant research categories for this project. Provide clear explanation for your choices. Section your response in the following format:"
    prompt += "\n\n### Explanation: ...\n\n### Categories: ..."
    
    return prompt

title = "Neutralization of Primate Immunodeficiency Viruses"
abstract = "We will repurpose existing assays, techniques and expertise that are central to our project team's virology, structural biology, vaccine development and protein production skill-sets for HIV research, to now also work on SARS-CoV-2 during the COVID-19 pandemic emergency. These interactive research efforts will draw on our established methodologies and should represent a productive use of our existing NIH grant resources. We note that there continue to be institutional restrictions at all three performance sites on the effort that can be applied to our original goals relating to HIV-1 vaccine research and development. Those goals will be unchanged, but will be pursued at a reduced effort during the period when we also work on the new SARS-CoV-2 projects for which we have fewer institutional restrictions due to the COVID-19 pandemic."

input_prompt = construct_input_prompt(title, abstract)

model_id = "nlpie/ppace-v1.0"

tokenizer = ts.AutoTokenizer.from_pretrained(model_id)

tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.bos_token, tokenizer.eos_token

model = ts.AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16)

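def generateOutput(prompt):
  # Wrap the prompt in the model's chat template, then tokenize the templated text
  inputs = tokenizer(
  [
      tokenizer.decode(tokenizer.apply_chat_template([{"content": prompt, "role": "user"}]))
  ], return_tensors = "pt").to("cuda")

  # Beam search with 4 beams; stop generating at the end-of-turn token
  output = model.generate(**inputs, max_new_tokens = 512, num_beams=4, eos_token_id=tokenizer("<|eot_id|>", add_special_tokens=False)["input_ids"][0])

  # The decoded string contains the prompt followed by the model's response
  return tokenizer.decode(output[0])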

output = generateOutput(input_prompt)

print(output)
```

# Model Details
PPACE is fine-tuned using Low-Rank Adaptation (LoRA) to ensure efficient training while maintaining high performance. The fine-tuning process involves training the model for 4 epochs on a dataset of 5142 projects, using 8 A100 GPUs with a batch size of 1 per GPU and 4 gradient accumulation steps.
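
As a rough illustration of the batch arithmetic in that setup, the sketch below computes the number of projects contributing to each optimizer update from the figures stated above. The variable names are illustrative, and the updates-per-epoch figure assumes the final partial batch is kept rather than dropped.

```python
import math

num_gpus = 8
per_gpu_batch_size = 1
grad_accum_steps = 4
num_projects = 5142

# Projects processed in one forward/backward pass across all GPUs
per_step_batch_size = num_gpus * per_gpu_batch_size

# Projects contributing to a single optimizer update
effective_batch_size = per_step_batch_size * grad_accum_steps

# Optimizer updates per epoch, assuming the last partial batch is kept
updates_per_epoch = math.ceil(num_projects / effective_batch_size)

print(per_step_batch_size, effective_batch_size, updates_per_epoch)  # 8 32 161
```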

## Hyperparameters

| Hyperparameter            | Value  |
|---------------------------|--------|
| Total Batch Size          | 8      |
| Gradient Accumulation Steps | 4    |
| Learning Rate             | 2e-4   |
| LR Scheduler              | Linear |
| Epochs                    | 2      |
| LoRA Rank                 | 128    |
| LoRA α                    | 256    |
| LoRA Dropout              | 0.05   |
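
To give a sense of what the LoRA rank and α imply, the sketch below computes the scaling factor and the adapter size for a single weight matrix. It assumes the standard LoRA formulation, in which the low-rank update is scaled by α / r. The 4096 x 4096 matrix shape is a hypothetical example and is not taken from the model's actual configuration.

```python
lora_rank = 128
lora_alpha = 256

# Standard LoRA scales the low-rank update B @ A by alpha / rank
scaling = lora_alpha / lora_rank


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical square projection matrix, used for illustration only
d_in = d_out = 4096
adapter = lora_params(d_in, d_out, lora_rank)
full = d_in * d_out

print(scaling)          # 2.0
print(adapter)          # 1048576
print(adapter / full)   # 0.0625
```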


## Citation

If you use this model in your research, please cite:

```bibtex
@misc{rohanian2024rapidbiomedicalresearchclassification,
      title={Rapid Biomedical Research Classification: The Pandemic PACT Advanced Categorisation Engine}, 
      author={Omid Rohanian and Mohammadmahdi Nouriborji and Olena Seminog and Rodrigo Furst and Thomas Mendy and Shanthi Levanita and Zaharat Kadri-Alab and Nusrat Jabin and Daniela Toale and Georgina Humphreys and Emilia Antonio and Adrian Bucher and Alice Norton and David A. Clifton},
      year={2024},
      eprint={2407.10086},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.10086}, 
}
```