# Model description

This model is a BERT-based architecture with 8 layers; the detailed configuration is summarized below. The drug-like molecule BERT is inspired by ["Self-Attention Based Molecule Representation for Predicting Drug-Target Interaction"](https://arxiv.org/abs/1908.06760), with several modifications to the training procedure.

```
from transformers import BertConfig

# vocab_size and max_seq_len come from the tokenizer and dataset preprocessing.
config = BertConfig(
    vocab_size=vocab_size,
    hidden_size=128,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=512,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=max_seq_len + 2,
    type_vocab_size=1,
    pad_token_id=0,
    position_embedding_type="absolute",
)
```
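
As a minimal sketch (assuming the Hugging Face `transformers` library, which provides `BertConfig`), the configuration above can be plugged into a masked-language-model head for pretraining:

```
from transformers import BertForMaskedLM

# Build the 8-layer model from the configuration above and report its size;
# vocab_size and max_seq_len are assumed to come from the tokenizer/dataset setup.
model = BertForMaskedLM(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
```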

# Training and evaluation data

The model is trained on drug-like molecules from the PubChem database. PubChem contains more than 100 M molecules, so we filtered for drug-like molecules using the quantitative estimate of drug-likeness (QED) score; with a QED threshold of 0.7, 4.1 M molecules were retained.
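
As an illustrative sketch of this filtering step (assuming the molecules are handled as SMILES strings and using RDKit's `QED` module; this is not necessarily the exact preprocessing script):

```
from rdkit import Chem
from rdkit.Chem import QED

def is_drug_like(smiles: str, threshold: float = 0.7) -> bool:
    # Parse the molecule and keep it only if its QED score reaches the threshold.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return QED.qed(mol) >= threshold
```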

# Tokenizer

We use a character-level tokenizer. The special tokens are "[SOS]", "[EOS]", "[PAD]", and "[UNK]".
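
A minimal sketch of such a tokenizer, assuming the vocabulary is built from the characters observed in the training molecules (the exact vocabulary is not part of this card); placing "[PAD]" first keeps its id at 0, consistent with `pad_token_id=0` in the configuration above:

```
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[SOS]", "[EOS]"]

def build_vocab(strings):
    # Special tokens first so that "[PAD]" gets id 0, matching pad_token_id=0.
    chars = sorted({ch for s in strings for ch in s})
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + chars)}

def encode(string, vocab):
    # Character-level encoding with start/end markers and unknown fallback.
    unk = vocab["[UNK]"]
    return [vocab["[SOS]"]] + [vocab.get(ch, unk) for ch in string] + [vocab["[EOS]"]]
```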

# Training hyperparameters

The following hyperparameters were used during training (a sketch of the masking step follows the list):

- Optimizer: Adam, learning rate: 5e-4, scheduler: cosine annealing
- Batch size: 2048
- Training steps: 24 K
- Training precision: FP16
- Loss function: cross-entropy
- Training masking rate: 30 %
- Testing masking rate: 15 % (the original molecule BERT used a 15 % masking rate)
- NSP task: none
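
A hedged sketch of the 30 % masking step (not the exact training code; `mask_token_id` and `pad_token_id` are assumed to come from the tokenizer setup above):

```
import torch

def mask_tokens(input_ids, mask_token_id, pad_token_id=0, mask_rate=0.30):
    # Labels keep the original tokens; unmasked positions are ignored (-100).
    labels = input_ids.clone()
    probability_matrix = torch.full(input_ids.shape, mask_rate)
    probability_matrix[input_ids == pad_token_id] = 0.0  # never mask padding
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100
    masked_inputs = input_ids.clone()
    masked_inputs[masked_indices] = mask_token_id
    return masked_inputs, labels
```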

# Performance

- Accuracy: 94.02 %