The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community Paper • 2408.08291 • Published Aug 15 • 9
Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation Paper • 2407.13696 • Published Jul 18 • 5
Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP Paper • 2407.00402 • Published Jun 29 • 22
Large Language Model Confidence Estimation via Black-Box Access Paper • 2406.04370 • Published Jun 1 • 19