Long context embedding & chunking on same document with bge-m3

#59 · opened by DracoDev

Given that bge-m3 is a long-context model (8192 tokens) and many documents could be embedded without chunking, I would like to compare three possibilities (a rough code sketch of each setup follows the list).

  1. Embed the Entire Document. If there are downsides to this, do they become visible at a particular content length, say 4k, 5k, or 6k tokens? At what length, if any, are there downsides to not chunking? In other words, if the document is under 1k tokens, is there still any objection to embedding the whole thing? At what size does whole-document embedding have measurable downsides aside from memory?

  2. Chunk the Document. Is there a particular chunk size, or a few sizes, that tend to work better: 512, 1024? (I know all content is different, but are there basic rules of thumb?)

  3. Embed the Entire Document & Chunk the Same Document.
    Let's leave memory out as a factor, or keep it secondary; assume ample memory is available. There is also the theoretical question of accuracy, which can be more important for certain types of documents. Moreover, this does not need to imply that every section of the document gets chunked; it could be just clear subsections, for example. For the sake of discussion, let's assume that each chunk can or does have a metadata link to the parent document. Will chunking combined with full-document embedding create redundancy in retrieval? How would it affect accuracy? Again, is there an approximate document size that determines when this is or is not a good strategy?
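For concreteness, here is roughly what I mean by the three setups. This is only a sketch: the `BGEM3FlagModel` usage follows the bge-m3 model card, while the naive token-count chunker, the 512-token chunk size, and the `records` layout are just placeholder assumptions.

```python
# Rough sketch of the three setups (not a tested implementation).
# Assumes: pip install -U FlagEmbedding transformers
from FlagEmbedding import BGEM3FlagModel
from transformers import AutoTokenizer

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

def chunk_by_tokens(text, max_tokens=512):
    """Naive chunker: fixed token windows, no overlap or sentence awareness."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    windows = [ids[i:i + max_tokens] for i in range(0, len(ids), max_tokens)]
    return [tokenizer.decode(w) for w in windows]

document = "... a long document, possibly several thousand tokens ..."

# 1. Embed the entire document (bge-m3 accepts up to 8192 tokens).
doc_vec = model.encode([document], max_length=8192)["dense_vecs"][0]

# 2. Chunk the document and embed each chunk.
chunks = chunk_by_tokens(document, max_tokens=512)
chunk_vecs = model.encode(chunks, max_length=512)["dense_vecs"]

# 3. Store both, with each chunk carrying a metadata link to its parent.
records = [{"id": "doc-1", "level": "document", "vector": doc_vec}]
records += [
    {"id": f"doc-1#chunk-{i}", "parent": "doc-1", "level": "chunk", "vector": v}
    for i, v in enumerate(chunk_vecs)
]
```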

Beijing Academy of Artificial Intelligence org

@DracoDev , splitting the entire document into multiple chunks can often improve retrieval performance, and I think a chunk size of 512 is enough.
Encoding both the entire doc and the chunks is worth trying; I think the effect differs depending on the downstream task.
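If you do store both, a minimal sketch of query time is below, built on the hypothetical `records` list from the earlier sketch. It keeps only the best-scoring hit per parent document, which is one possible way to avoid a chunk and its full document showing up as duplicate results; bge-m3 dense vectors are normalized, so the dot product here is the cosine similarity.

```python
# Minimal query-time sketch over the hypothetical `records` list above:
# score every vector, then keep only the best hit per parent document so a
# chunk and its full document do not appear as duplicate results.
import numpy as np
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

def search(query, records, top_k=5):
    q = model.encode([query], max_length=512)["dense_vecs"][0]
    best = {}  # parent doc id -> (score, record)
    for r in records:
        score = float(np.dot(q, r["vector"]))  # normalized dense vecs -> cosine
        parent = r.get("parent", r["id"])      # chunks roll up to their parent doc
        if parent not in best or score > best[parent][0]:
            best[parent] = (score, r)
    return sorted(best.values(), key=lambda x: x[0], reverse=True)[:top_k]

# Example: hits = search("what does the report say about chunking?", records)
```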
