# Model description

This model is a BERT-based architecture with 8 layers. The detailed config is summarized as follows. The drug-like molecule BERT is inspired by ["Self-Attention Based Molecule Representation for Predicting Drug-Target Interaction"](https://arxiv.org/abs/1908.06760). We modified several points of the training procedure.

```
from transformers import BertConfig

# vocab_size and max_seq_len are determined by the tokenizer and the dataset
config = BertConfig(
    vocab_size=vocab_size,
    hidden_size=128,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=512,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=max_seq_len + 2,
    type_vocab_size=1,
    pad_token_id=0,
    position_embedding_type="absolute",
)
```

# Training and evaluation data

The model is trained on drug-like molecules from the PubChem database. PubChem contains more than 100 M molecules, so we filtered for drug-like molecules using the quantitative estimate of drug-likeness (QED) score. With the QED threshold set to 0.7, 4.1 M molecules were retained.

# Tokenizer

We use a character-level tokenizer. The special tokens are "[SOS]", "[EOS]", "[PAD]", and "[UNK]".

# Training hyperparameters

The following hyperparameters were used during training:
- Optimizer: Adam, learning_rate: 5e-4, scheduler: cosine annealing
- Batch size: 2048
- Training steps: 24 K
- Training precision: FP16
- Loss function: cross-entropy loss
- Training masking rate: 30 %
- Testing masking rate: 15 % (the original molecule BERT used a 15 % masking rate)
- NSP task: none

# Performance

- Accuracy: 94.02 %
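As a rough illustration of the QED-based filtering described under *Training and evaluation data*, the sketch below keeps molecules whose QED score is at least 0.7. It assumes RDKit is available and reads a hypothetical `pubchem_smiles.txt` file with one SMILES string per line; the actual filtering pipeline is not published here.

```
from rdkit import Chem
from rdkit.Chem import QED

QED_THRESHOLD = 0.7  # threshold stated in this card

def is_drug_like(smiles: str) -> bool:
    """Return True if the molecule parses and its QED score is >= the threshold."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return QED.qed(mol) >= QED_THRESHOLD

# pubchem_smiles.txt is a hypothetical input file, one SMILES per line
with open("pubchem_smiles.txt") as f:
    drug_like = [s.strip() for s in f if s.strip() and is_drug_like(s.strip())]
```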
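Below is a minimal sketch of a character-level SMILES tokenizer with the special tokens listed under *Tokenizer*. The character vocabulary and the helper names (`build_vocab`, `encode`) are assumptions for illustration; `[PAD]` is placed at index 0 to match `pad_token_id=0`, and the `[SOS]`/`[EOS]` tokens presumably account for the `max_seq_len + 2` position embeddings in the config.

```
# Special tokens listed in this card; [PAD] at index 0 matches pad_token_id=0
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[SOS]", "[EOS]"]

def build_vocab(smiles_list):
    """Build a character-level vocabulary from a list of SMILES strings."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + chars)}

def encode(smiles, vocab, max_seq_len):
    """Encode one SMILES string as [SOS] + characters + [EOS], padded to max_seq_len + 2."""
    ids = [vocab["[SOS]"]]
    ids += [vocab.get(ch, vocab["[UNK]"]) for ch in smiles[:max_seq_len]]
    ids.append(vocab["[EOS]"])
    ids += [vocab["[PAD]"]] * (max_seq_len + 2 - len(ids))
    return ids
```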
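The 30 % training masking rate could be applied as in the hedged sketch below. The mask token id and the corruption scheme (plain replacement rather than BERT's 80/10/10 split) are assumptions, since the card only states the overall masking rate and the cross-entropy loss; a mask token is not among the listed special tokens, so its id is left as an argument.

```
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int,
                pad_token_id: int = 0, mask_rate: float = 0.30):
    """Randomly corrupt non-padding positions for masked-language-model training.

    Returns (corrupted_inputs, labels); labels are -100 at unmasked positions
    so they are ignored by the cross-entropy loss.
    """
    labels = input_ids.clone()
    # candidate positions: everything except padding (in practice, special
    # tokens such as [SOS]/[EOS] would also be excluded from masking)
    candidates = input_ids != pad_token_id
    mask = (torch.rand(input_ids.shape) < mask_rate) & candidates
    labels[~mask] = -100                  # only masked positions contribute to the loss
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id       # simple replacement with the mask token
    return corrupted, labels
```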