Fine-tuning the model

by hkhho - opened

Thank you for sharing this amazing model! I'm new to NLP models, so I'm a bit lost; hopefully I can get some help here.

I would like to fine-tune this model, but my case is a bit complicated. I want to perform a classification task: given a pair of genomic sequences (Sequence A and Sequence B) in FASTA format, I have 3 classes based on how the sequences match.

My thought is to feed Sequence A to gena-lm to get outputA, then feed Sequence B to gena-lm to get outputB, concatenate the two outputs, and make the final prediction from that. Does this sound doable, or is it too complicated? Roughly something like the sketch below.
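This is just an untested sketch of what I have in mind (the class name and the linear head are placeholders I made up):

import torch
from torch import nn
from transformers import AutoModel

class PairClassifier(nn.Module):
    # Untested sketch: encode A and B with a shared GENA-LM encoder,
    # concatenate the two [CLS] embeddings, then predict one of 3 classes.
    def __init__(self, model_name='AIRI-Institute/gena-lm-bigbird-base-t2t', num_classes=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # shared weights for both sequences
        self.head = nn.Linear(2 * self.encoder.config.hidden_size, num_classes)

    def forward(self, inputs_a, inputs_b):
        cls_a = self.encoder(**inputs_a).last_hidden_state[:, 0]  # [CLS] embedding of Sequence A
        cls_b = self.encoder(**inputs_b).last_hidden_state[:, 0]  # [CLS] embedding of Sequence B
        return self.head(torch.cat([cls_a, cls_b], dim=-1))       # logits over the 3 classes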

And for the input and label, would this model work if my input and label looked like the ones below?

inputs = tokenizer("TTTTAAAGACCAGATCAGTATTTTCTTGATACGCTTGTCACCATTTTTGTTCTCACGACA", return_tensors="pt")
labels = tokenizer(1, return_tensors="pt")["input_ids"]

AIRI - Artificial Intelligence Research Institute org
edited Apr 27, 2023

You can use the sequence classification model for both single sequences and pairs of sequences:

from transformers import AutoTokenizer, BigBirdForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
model = BigBirdForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')

and feed both sequences to the model at once, separated by the special token [SEP]. This forms the input [CLS] SeqA [SEP] SeqB [SEP]:

inp = tokenizer('ATGC [SEP] GCTA')

This HF doc page might help you set up training: https://huggingface.co/docs/transformers/tasks/sequence_classification
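For your 3-class task, a minimal sketch of a single forward pass could look like this; note that the label is an integer class id, not a tokenized string:

import torch
from transformers import AutoTokenizer, BigBirdForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
model = BigBirdForSequenceClassification.from_pretrained(
    'AIRI-Institute/gena-lm-bigbird-base-t2t', num_labels=3
)

inputs = tokenizer('ATGC [SEP] GCTA', return_tensors='pt')
labels = torch.tensor([0])                # class id for this pair (0, 1 or 2)
outputs = model(**inputs, labels=labels)  # returns loss (for training) and logits (for prediction)
print(outputs.loss, outputs.logits)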

Thank you for replying! It's good to know we can use pairs of sequences as input.

I tried the code below and it returned a KeyError.

from transformers import AutoTokenizer, BigBirdForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
model = BigBirdForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
inputs = tokenizer("TTTTAAAGACCAGATCAGTATTTTCTTGATACGCTTGTCACCATTTTTGTTCTCACGACA [SEP] ACCATTTTTGTTCTCACGACA [SEP]", return_tensors="pt")
model(inputs)


KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py in __getattr__(self, item)
    247         try:
--> 248             return self.data[item]
    249         except KeyError:

KeyError: 'size'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
in <cell line: 2>()
      1 inputs = tokenizer("TTTTAAAGACCAGATCAGTATTTTCTTGATACGCTTGTCACCATTTTTGTTCTCACGACA [SEP] ACCATTTTTGTTCTCACGACA [SEP]", return_tensors="pt")
----> 2 model(inputs)

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.10/dist-packages/transformers/models/big_bird/modeling_big_bird.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
   2743         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
   2744
-> 2745         outputs = self.bert(
   2746             input_ids,
   2747             attention_mask=attention_mask,

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.10/dist-packages/transformers/models/big_bird/modeling_big_bird.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   2029             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
   2030         elif input_ids is not None:
-> 2031             input_shape = input_ids.size()
   2032         elif inputs_embeds is not None:
   2033             input_shape = inputs_embeds.size()[:-1]

/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py in __getattr__(self, item)
    248             return self.data[item]
    249         except KeyError:
--> 250             raise AttributeError
    251
    252     def __getstate__(self):

AttributeError:

AIRI - Artificial Intelligence Research Institute org

Use

model(**inputs)

to call the model.
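That is, unpack the tokenizer output into keyword arguments:

outputs = model(**inputs)  # ** expands the BatchEncoding into input_ids, attention_mask, etc.
logits = outputs.logits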
