Batch input not working parallel

#187
by leonshub - opened

Hello, I am trying to run multiple inputs on the GPU for parallel processing, but a batch of n inputs just takes n times the time needed for a single input.
So if a single input takes 13 seconds, a batch of 10 takes roughly 130 seconds.
Is my code wrong, or is there something I don't understand about batch processing?

import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, StoppingCriteria, StoppingCriteriaList

# 4-bit NF4 quantization with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

texts = ["text1", "text2"]

# Tokenize the whole batch at once, padding to the longest sequence
inputs = tokenizer(texts, return_tensors="pt", padding=True).to("cuda:0")

start_time = time.time()
outputs = model.generate(**inputs, max_new_tokens=512)
end_time = time.time()
execution_time = end_time - start_time
print(f"Generation time: {execution_time} seconds")

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
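As a side note on the batched-generation setup itself (not the timing): decoder-only models are usually batched with left padding so that generated tokens follow each prompt directly, whereas the tokenizer here defaults to right padding. A minimal sketch along those lines, reusing the model and tokenizer objects from above:

# Sketch only: left padding is the usual recommendation for batched
# generation with decoder-only models.
tokenizer.padding_side = "left"
batch = tokenizer(texts, return_tensors="pt", padding=True).to("cuda:0")

with torch.no_grad():
    out = model.generate(
        **batch,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id,  # avoid the missing-pad-token warning
    )

# Decode only the newly generated tokens, skipping the padded prompt part
new_tokens = out[:, batch["input_ids"].shape[1]:]
print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))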

leonshub changed discussion title from "Batch inference not working" to "Batch input not working parallel"

Solved my problem.

leonshub changed discussion status to closed
