Reconsidering LLM Uncertainty Estimation Methods in the Wild

Abstract

This paper studies practical deployment challenges for LLM uncertainty estimation beyond standard short-form QA evaluation. It analyzes threshold sensitivity under calibration shift, robustness to query perturbations (including typos, adversarial prompts, and chat history), transfer to long-form generation, and aggregation of multiple uncertainty scores. Evaluations across 19 methods show strong threshold sensitivity and notable vulnerability to adversarial prompts, while score ensembling provides consistent gains. The results highlight key gaps between benchmark performance and real-world reliability requirements.
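To make the score-aggregation idea concrete, below is a minimal sketch of one plausible ensembling scheme: min-max normalize each method's uncertainty scores so their ranges are comparable, then average them per generation. The function names, the normalization choice, and the toy scores are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def minmax_normalize(scores: np.ndarray) -> np.ndarray:
    """Rescale one method's uncertainty scores to [0, 1] so that
    methods with different natural ranges can be combined."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:  # constant scores carry no ranking information
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

def ensemble_uncertainty(score_lists):
    """Average min-max-normalized scores from several estimators.

    score_lists: one 1-D array per uncertainty method, each holding
    one score per generation. Returns the ensembled score per generation.
    """
    normalized = [minmax_normalize(np.asarray(s, dtype=float)) for s in score_lists]
    return np.mean(normalized, axis=0)

# Toy usage: three hypothetical methods scoring four generations.
semantic_entropy = [0.2, 1.5, 0.9, 0.1]
token_perplexity = [12.0, 40.0, 25.0, 10.0]
verbalized_conf = [0.9, 0.3, 0.5, 0.95]  # confidence, so flip to uncertainty
scores = [semantic_entropy, token_perplexity, [1 - c for c in verbalized_conf]]
print(ensemble_uncertainty(scores))
```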

Publication
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yavuz Faruk Bakman
PhD Student in Computer Science · Capital One Responsible AI Fellow

My research interests include Trustworthy LLMs, Continual Learning, and Federated Learning.