	---
	library_name: transformers
	---

	# Model Card for BiRNA-BERT

	BiRNA-BERT is a BERT-style transformer encoder model that generates embeddings for RNA sequences. It was trained on both BPE tokens and individual nucleotide tokens. As a result, it can generate granular nucleotide-level embeddings as well as efficient sequence-level embeddings (using BPE).

	BiRNA-BERT was trained using the MosaicBERT framework: https://huggingface.co/mosaicml/mosaic-bert-base


	# Usage
	## Extracting RNA embeddings

	```python
	import torch
	import transformers
	from transformers import AutoModelForMaskedLM, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("buetnlpbio/birna-tokenizer")

	# trust_remote_code=True allows loading the custom model code shipped in the model repository
	config = transformers.BertConfig.from_pretrained("buetnlpbio/birna-bert")
	mysterybert = AutoModelForMaskedLM.from_pretrained("buetnlpbio/birna-bert", config=config, trust_remote_code=True)
	# Replace the masked-LM head with an identity so that `.logits` holds the encoder's token embeddings
	mysterybert.cls = torch.nn.Identity()

	# To get sequence embeddings: an unspaced sequence is split into BPE tokens
	seq_embed = mysterybert(**tokenizer("AGCTACGTACGT", return_tensors="pt"))
	print(seq_embed.logits.shape) # CLS + 4 BPE token embeddings + SEP

	# To get nucleotide embeddings: a space-separated sequence yields one token per nucleotide
	char_embed = mysterybert(**tokenizer("A G C T A C G T A C G T", return_tensors="pt"))
	print(char_embed.logits.shape) # CLS + 12 nucleotide token embeddings + SEP
	```
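
The nucleotide-level call above differs from the BPE call only in that the input string has a space between every nucleotide. A minimal sketch of a helper that prepares such input is shown below; the function name `to_nucleotide_input` is hypothetical and not part of the model repository or the tokenizer.

```python
def to_nucleotide_input(sequence: str) -> str:
    """Return the sequence with a single space between every nucleotide.

    Hypothetical helper: it only reformats the string so that the tokenizer
    sees one character per token, as in the nucleotide example above.
    """
    # Drop any whitespace already present, then re-join character by character.
    compact = "".join(sequence.split())
    return " ".join(compact)


print(to_nucleotide_input("AGCTACGTACGT"))  # A G C T A C G T A C G T
```

The returned string can be passed to `tokenizer(..., return_tensors="pt")` in place of the hand-spaced literal.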

	## Explicitly increasing max sequence length

	```python
	import transformers
	from transformers import AutoModelForMaskedLM

	config = transformers.BertConfig.from_pretrained("buetnlpbio/birna-bert")
	config.alibi_starting_size = 2048 # maximum sequence length updated to 2048 from config default of 1024

	mysterybert = AutoModelForMaskedLM.from_pretrained("buetnlpbio/birna-bert", config=config, trust_remote_code=True)
	```
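
To decide whether a sequence needs a larger `alibi_starting_size`, it can help to estimate its token count first. The sketch below covers nucleotide-level input only, where the example above shows one token per nucleotide plus CLS and SEP. The helper names are hypothetical, and it assumes that the 1024 default counts the two special tokens; BPE token counts depend on the tokenizer and are not estimated here.

```python
DEFAULT_MAX_LEN = 1024  # config default mentioned above


def nucleotide_token_count(sequence: str) -> int:
    """Estimate tokens for nucleotide-level input: one per nucleotide + CLS + SEP."""
    compact = "".join(sequence.split())  # ignore spaces between nucleotides
    return len(compact) + 2


def exceeds_max_length(sequence: str, max_len: int = DEFAULT_MAX_LEN) -> bool:
    """Return True if the estimated token count is above max_len.

    Assumption: max_len includes the CLS and SEP tokens.
    """
    return nucleotide_token_count(sequence) > max_len


print(nucleotide_token_count("AGCTACGTACGT"))  # 14
print(exceeds_max_length("A" * 1500))          # True
```

A sequence flagged by `exceeds_max_length` is a candidate for either raising `alibi_starting_size` as shown above or switching to the shorter BPE tokenization.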