Latency issue during inference using HuggingFacePipeline from LangChain

#2
by prasoons075 - opened

It takes around 5 minutes on average to respond. Are there any hacks to reduce the model's response time?

H2O.ai org

The best way is to use https://github.com/huggingface/text-generation-inference to host an optimized endpoint and then use that endpoint from LangChain. The built-in HF pipeline inference is incredibly slow, especially for Falcon models.
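A minimal sketch of what that can look like, assuming a text-generation-inference server is already running (for example via its Docker image) and serving the model at a local URL; the endpoint address and generation parameters below are illustrative, not taken from this thread:

```python
# Sketch: point LangChain at a running text-generation-inference server
# instead of using the in-process HuggingFacePipeline.
from langchain.llms import HuggingFaceTextGenInference

llm = HuggingFaceTextGenInference(
    inference_server_url="http://localhost:8080/",  # assumed local TGI endpoint
    max_new_tokens=512,
    temperature=0.1,
    repetition_penalty=1.1,
)

# Use it like any other LangChain LLM (e.g. inside a chain or agent).
print(llm("Summarize what text-generation-inference does in one sentence."))
```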

psinger changed discussion status to closed
