buetnlpbio/nuc-only-dna-bert was trained only on a human genome DNA dataset for 1 epoch (as an ablation). Performance on other DNA types may be limited.

Model Card for buetnlpbio/nuc-only-dna-bert

BiRNA-BERT is a BERT-style transformer encoder model that generates embeddings for RNA sequences. It was trained on both BPE tokens and individual nucleotides, so it can generate granular nucleotide-level embeddings as well as more efficient sequence-level embeddings (using BPE).
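
For nucleotide-level embeddings, the usage example in this card passes the sequence with a space between every nucleotide, so that each nucleotide becomes its own token. The sketch below shows one way to prepare such an input from a raw sequence string. The helper names `to_nucleotide_input` and `expected_token_count` are hypothetical (they are not part of this repository), and the upper-casing step assumes the tokenizer vocabulary uses upper-case nucleotides.

```python
def to_nucleotide_input(sequence: str) -> str:
    """Return the sequence with one space between nucleotides.

    Hypothetical helper: strips existing whitespace and upper-cases the
    sequence (assumption: the tokenizer expects upper-case nucleotides).
    """
    cleaned = "".join(sequence.split()).upper()
    return " ".join(cleaned)


def expected_token_count(sequence: str) -> int:
    """Tokens expected at nucleotide level: one per nucleotide, plus CLS and SEP."""
    cleaned = "".join(sequence.split())
    return len(cleaned) + 2


spaced = to_nucleotide_input("agctacgtacgt")
token_count = expected_token_count("agctacgtacgt")
print(spaced)
print(token_count)
```

The resulting string can be passed to the tokenizer in place of a hand-spaced literal. A sequence of 12 nucleotides yields 14 positions, matching the CLS + 12 + SEP layout noted in the usage example.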

BiRNA-BERT was trained using the MosaicBERT framework - https://huggingface.co/mosaicml/mosaic-bert-base

Usage

Extracting RNA embeddings

import torch
import transformers
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("buetnlpbio/bidna-tokenizer")

config = transformers.BertConfig.from_pretrained("buetnlpbio/nuc-only-dna-bert")
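mysterybert = AutoModelForMaskedLM.from_pretrained("buetnlpbio/nuc-only-dna-bert", config=config, trust_remote_code=True)
# Replace the masked-LM head with an identity so that `logits` holds the encoder's hidden states
mysterybert.cls = torch.nn.Identity()

# To get nucleotide embeddings, separate the nucleotides with spaces
char_embed = mysterybert(**tokenizer("A G C T A C G T A C G T", return_tensors="pt"))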
print(char_embed.logits.shape) # CLS + 12 nucleotide token embeddings + SEP
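
If a single vector per sequence is needed, one common option is to average the token embeddings while ignoring padding. The sketch below is not taken from this model card. It assumes the embeddings arrive as a padded tensor of shape `(batch, seq_len, hidden)` alongside the tokenizer's `attention_mask`, and it uses random tensors with a hidden size of 768 purely as stand-ins for real model output. `mean_pool` is a hypothetical helper name.

```python
import torch


def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence axis, skipping padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1), cast so it can scale the embeddings
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(dim=1)
    # Guard against division by zero for sequences with no unmasked tokens
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Stand-in data (assumption): 2 sequences, 14 positions, hidden size 768
embeddings = torch.randn(2, 14, 768)
attention_mask = torch.ones(2, 14, dtype=torch.long)
attention_mask[1, 10:] = 0  # pretend the second sequence has 4 padded positions

pooled = mean_pool(embeddings, attention_mask)
print(pooled.shape)
```

With real model output, `embeddings` would be replaced by `char_embed.logits` and `attention_mask` by the mask returned from the tokenizer, provided the output has the padded shape assumed here.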