cutoff score to consider for LLM call

#11
by karthikfds - opened

Hi,

I am planning to use a reranker for my RAG model. Once I retrieve the documents from the VectorDB, I aim to send only valid documents to the LLM. What cutoff score should I consider?

Just to understand the scores, I've put together the following example:
query = "Where does John live and where does he work?"

documents = [
    "John is a software engineer at Apple.",
    "New York is a bustling metropolis located in the northeastern United States.",
    "John lives in New York with his parents and family for the last 35 years.",
    "Apple is a mobile manufacturing company.",
    "John is from New York which is know for skyscrappers.",
    "John is from New York",
    "John hails from New York."
]

Model: BAAI/bge-reranker-v2-m3 with normalize=True

Output: [0.0418315718328972, 0.002715773148516757, 0.8457385438206448, 2.441601463355055e-05, 0.13356921483618522, 0.2132427860507904, 0.37600044875040023]
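(For reference, as far as I understand, normalize=True just maps the raw relevance logits through a sigmoid, so the values land in 0-1 but are not calibrated probabilities. A minimal sketch of that mapping, with made-up raw scores purely to illustrate:)

import math

raw_scores = [-3.13, 1.71, 0.54]  # made-up raw logits, only to illustrate the sigmoid mapping
normalized = [1 / (1 + math.exp(-s)) for s in raw_scores]
print(normalized)  # each value squashed into (0, 1)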

Observations:

  • The sentence "John is a software engineer at Apple.", which directly answers the second half of the question (where John works), was given a score of only 0.04.
  • When a sentence provides additional information beyond the query, we receive lower scores. For instance, the sentence with the extra detail about skyscrapers scored 0.13, compared to 0.21 for the same statement without it.
  • Sentences conveying the same information received notably different scores: "John is from New York" scored 0.21, while "John hails from New York." scored 0.38.

@karthikfds interesting points!

I think that it's better to just keep the top k sentences by similarity score rather than set an absolute threshold, since that threshold would have to be changed depending on the domain/sentence length/other criteria.
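To make that concrete, here is a rough sketch of what I mean by top-k selection; it assumes `documents` is the list from your first message and `scores` is the output list you posted:

# assuming `documents` and `scores` are the lists from your first message
ranked = sorted(zip(scores, documents), reverse=True)  # best reranker score first
top_k = 3
selected = [doc for _, doc in ranked[:top_k]]  # keep the k best, whatever their absolute values
print(selected)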

Could you provide the code snippet that you used?

Meanwhile I'll try to answer your observations.
On the first point: indeed, I don't know why that sentence was scored so low. One possible explanation: would the score have increased if you had asked "Where does John live and at which company does he work?" instead?
On the second: the effect seems legitimate to me: including irrelevant information dilutes the meaning of the original sentence, so a lower score is expected.
Finally, it also seems strange to me that "hails" gets a better score, given that "hail from" indicates being born somewhere rather than currently living there (cf. the Cambridge Dictionary).

Thanks @m-ric for your inputs.

Here is the code snippet used (with FlagEmbedding==1.2.8):
from FlagEmbedding import FlagReranker
reranker = FlagReranker('BAAI/bge-reranker-v2-m3')

query = "Where does John live and where does he work?"
documents = [
    "John is a software engineer at Apple.",
    "New York is a bustling metropolis located in the northeastern United States.",
    "John lives in New York with his parents and family for the last 35 years.",
    "Apple is a mobile manufacturing company.",
    "John is from New York which is know for skyscrappers.",
    "John is from New York",
    "John hails from New York."
]

pairs = []  # query-document pairs for the cross-encoder
for doc in documents:
    pairs.append([query, doc])

scores = reranker.compute_score(pairs, normalize=True)
print(scores)

Regarding the similarity score vs top k: I am trying to minimize costs by sending only relevant chunks to the LLM. Additionally, I have another use case, source linking, where we must display only the chunks that genuinely support the answer.

Comments on other observations:

  1. When I changed the query to: "Where does John live and at which company does he work?" here are the scores:
    [0.32725257710586336, 0.0014769222686033384, 0.8135377079545972, 0.002375353690290117, 0.04760177047170493, 0.07414538192317419, 0.11335805713249769]
    The score for the first sentence jumped from 0.04 to 0.33, while the scores for the last three sentences dropped.
  2. On "irrelevant information dilutes the meaning of your original sentence": the 3rd document also contains extra information ("with his parents and family for the last 35 years"), yet it was still scored high (0.85).
  3. Agree with you.

@karthikfds your code seems correct! So it's more an issue with the reranker itself.

Just to be sure, could you also try the LLM-based reranker as follows:

from FlagEmbedding import FlagLLMReranker
reranker = FlagLLMReranker('BAAI/bge-reranker-v2-gemma', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
# reranker = FlagLLMReranker('BAAI/bge-reranker-v2-gemma', use_bf16=True) # You can also set use_bf16=True to speed up computation with a slight performance degradation

score = reranker.compute_score(['query', 'passage'])
print(score)

scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores)

If performance is not improved, I'd suggest sticking with ColBERTv2 reranking, and cutting tables if they are too long.
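For reference, here is a rough sketch of ColBERTv2 reranking through the RAGatouille wrapper; I'm writing its rerank call from memory, so double-check the exact signature against the library docs:

from ragatouille import RAGPretrainedModel

colbert = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
query = "Where does John live and where does he work?"
documents = [
    "John is a software engineer at Apple.",
    "John lives in New York with his parents and family for the last 35 years.",
    "John hails from New York."
]
results = colbert.rerank(query=query, documents=documents, k=2)  # passages with their ColBERT relevance scores
print(results)

Like the cross-encoder outputs, these late-interaction scores are relative rather than calibrated, so they are better suited to ranking than to an absolute cutoff.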

Also, I did not understand your comment about similarity score vs top k to minimize costs: whether you use a similarity-score threshold or top k, won't you end up selecting the same number of documents to send to the Reader LLM?

@m-ric I checked out BAAI/bge-reranker-v2-gemma, but it's a large model and might not be suitable for production. Nevertheless, I'll give it another shot.

Regarding your comment about similarity score versus top k to minimize costs: not necessarily. Instead of a fixed top k, I'm considering selecting only the documents with a very high confidence score, so the number of selected documents varies per query. For a given query, if only one document contains the answer, I'd send just that single document rather than the full top k, as in the sketch below.
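Concretely, the selection I have in mind looks roughly like this (the 0.8 cutoff is only a placeholder; choosing that number is exactly what I'm unsure about):

# assuming `documents` and `scores` are the lists from my snippet above
CUTOFF = 0.8  # placeholder value; picking this cutoff is the open question
selected = [doc for doc, score in zip(documents, scores) if score >= CUTOFF]
if not selected:
    # fall back to the single best-scoring document so the LLM always gets some context
    best_score, best_doc = max(zip(scores, documents))
    selected = [best_doc]
print(selected)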

Also, any thoughts on my earlier point about "irrelevant information dilutes the meaning of your original sentence"? The 3rd document also contains the extra information "with his parents and family for the last 35 years.", yet it was scored high.

Regards,
Karthik
