--- library_name: transformers --- # Model Card for Model ID BiRNA-BERT is a BERT-style transformer encoder model that generates embeddings for RNA sequences. BiRNA-BERT has been trained on BPE tokens and individual nucleotides. As a result, it can generate both granular nucleotide-level embeddings and efficient sequence-level embeddings (using BPE). BiRNA-BERT was trained using the MosaicBERT framework - https://huggingface.co/mosaicml/mosaic-bert-base # Usage ## Extracting RNA embeddings ```python import torch import transformers from transformers import AutoModelForMaskedLM, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("buetnlpbio/birna-tokenizer") config = transformers.BertConfig.from_pretrained("buetnlpbio/birna-bert") mysterybert = AutoModelForMaskedLM.from_pretrained("buetnlpbio/birna-bert",config=config,trust_remote_code=True) mysterybert.cls = torch.nn.Identity() # To get sequence embeddings seq_embed = mysterybert(**tokenizer("AGCTACGTACGT", return_tensors="pt")) print(seq_embed.logits.shape) # CLS + 4 BPE token embeddings + SEP # To get nucleotide embeddings char_embed = mysterybert(**tokenizer("A G C T A C G T A C G T", return_tensors="pt")) print(char_embed.logits.shape) # CLS + 12 nucleotide token embeddings + SEP ``` ## Explicitly increasing max sequence length ```python config = transformers.BertConfig.from_pretrained("buetnlpbio/birna-bert") config.alibi_starting_size = 2048 # maximum sequence length updated to 2048 from config default of 1024 mysterybert = AutoModelForMaskedLM.from_pretrained("buetnlpbio/birna-bert",config=config,trust_remote_code=True) ```