Batch mode vs. single mode: what performance to expect?


Yesterday I experimented with batch-mode analysis using the batch.py script and observed lower throughput in batch mode than in single mode (60s per image in batch mode vs. 20s per image in single mode). I used a single NVIDIA T4 for inference.

The VRAM usage ranged from 9GB/16GB in single mode to 14GB/16GB with an 8-image batch, so VRAM doesn't seem to be the limiting factor. However, core usage was around 99% in all cases.

What could be done to speed up the processing? And why is the text generation step that uses inputs_embeds so slow with the int4-quantized model on my hardware - is this normal?
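For reference, the step I mean is roughly the following. This is only a simplified sketch of the generation call, not the actual batch.py code, and the model ID is a placeholder; in the real pipeline the prompt embeddings are concatenated with the projected image features before generate() is called.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder, not necessarily the LLM batch.py loads

# int4 quantization via bitsandbytes, with float16 compute (the T4 has no bfloat16)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=quant_config, device_map="cuda:0"
)

# Embed the prompt and drive generation through inputs_embeds
enc = tokenizer("Describe the image.", return_tensors="pt").to("cuda:0")
inputs_embeds = model.get_input_embeddings()(enc.input_ids)

with torch.inference_mode():
    out = model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=enc.attention_mask,
        max_new_tokens=300,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```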

Hi, you’d expect a 2-5x throughput increase depending on GPU power. Can you please let me know your CPU, CUDA, and PyTorch versions? Additionally, can you please try batch sizes of 2, 3, and 4 and let me know the s/it?
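For example, something like this prints the relevant details (just a convenience snippet; it assumes a single CUDA device):

```python
import platform
import torch

print("PyTorch:", torch.__version__)
print("CUDA (PyTorch build):", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
# platform.processor() can be empty on some Linux setups;
# /proc/cpuinfo has the full CPU model name if so.
print("CPU:", platform.processor())
```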

Hi! Curious... so my numbers are indeed a bit off.

Update:
I ran batch mode with batch sizes of 1, 2, 3, 4, and 6. The per-image times were:

- batch size 1 (single mode): 22 s/image
- batch size 2: 66 s/image
- batch size 3: ~48 s/image
- batch size 4: 36 s/image
- batch size 6: 27 s/image

So throughput drops sharply at batch size 2 and then improves again as the batch size grows.
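For anyone reproducing this, a helper along these lines (hypothetical, not the actual script) is how I'd compute seconds per image from a batched captioning call:

```python
import time
import torch

def seconds_per_image(run_batch, batch, n_warmup=1, n_runs=3):
    """Average wall-clock seconds per image for a callable that captions one batch."""
    for _ in range(n_warmup):
        run_batch(batch)              # warm-up run so CUDA init isn't counted
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_batch(batch)
    torch.cuda.synchronize()          # make sure all GPU work has finished
    return (time.perf_counter() - start) / (n_runs * len(batch))
```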

CUDA: 12.2
PyTorch: 2.3.0+cu121
CPU: 4 virtual cores of an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
(AWS instance g4dn.xlarge with Ubuntu 22.04 + CUDA 12.2)

What caught my attention is that inference for Llava 1.5 with 4-bit quantization on the same hardware is several times faster, taking roughly 2.5-3s per image. As both text-generation models are similar in size, I expected something comparable. (Although the captions generated by joy caption do seem about 2x more elaborate.)

Just returning to the thread to report that with an NVIDIA A10G (which supports torch.bfloat16), the scaling behaves as expected. So the unexpected behaviour on the NVIDIA T4 may have something to do with it supporting only torch.float16.
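A runtime check like this (just a sketch) makes the difference visible and could be used to pick the compute dtype automatically:

```python
import torch

# bfloat16 on Ampere (A10G) and newer; Turing (T4) falls back to float16
compute_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print("Compute dtype:", compute_dtype)
```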

Hi, thanks for the update. I did check this out (must have forgotten to comment), and it comes down to how the Turing architecture handles BnB int4 - there's some evidence from other users that using a non-quantized model would work.
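If you want to try that, loading the language model without the BnB config would look roughly like this (a sketch; the model ID is a placeholder, and full float16 weights need roughly 2 bytes of VRAM per parameter):

```python
import torch
from transformers import AutoModelForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder for the actual LLM

# No quantization_config, so no bitsandbytes int4 kernels are involved
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)
```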

Additionally, your benchmark for the T4 on a single image was about 20s, while the expected time for a T4 is around 13-16s. If you were using a cloud service like AWS, there is some evidence to suggest that strict power limits (little to no power spikes, TDP capped at 70W) have restricted your performance.
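You can check what limit is actually being enforced with nvidia-smi (assuming it is on the PATH), for example:

```python
import subprocess

# Reports the card's enforced power limit, current draw and SM clock
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=name,power.limit,power.draw,clocks.sm",
     "--format=csv"],
    capture_output=True, text=True,
).stdout)
```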
