This paper studies practical deployment challenges for LLM uncertainty estimation beyond standard short-form QA evaluation. It analyzes threshold sensitivity under calibration shift; robustness to query perturbations such as typos, adversarial prompts, and chat history; transfer to long-form generation; and aggregation of multiple uncertainty scores. Evaluations across 19 methods reveal substantial threshold sensitivity and notable vulnerability to adversarial prompts, while ensembling multiple scores yields consistent gains. The results highlight key gaps between benchmark performance and real-world reliability requirements.
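To make the two central ideas concrete, the sketch below illustrates (a) combining several uncertainty scores by simple min-max normalization and averaging, and (b) selective answering with a fixed uncertainty threshold, whose coverage/accuracy trade-off degrades when scores shift. This is a minimal illustration under assumed choices (min-max scaling, mean aggregation, hypothetical score names `entropy` and `p_true`), not the paper's exact aggregation or evaluation protocol.

```python
import numpy as np

def ensemble_uncertainty(scores: dict) -> np.ndarray:
    """Combine several per-query uncertainty scores by min-max
    normalizing each method's scores and averaging them.
    (Illustrative aggregation only; not the paper's exact method.)"""
    normalized = []
    for name, s in scores.items():
        s = np.asarray(s, dtype=float)
        span = s.max() - s.min()
        normalized.append((s - s.min()) / span if span > 0 else np.zeros_like(s))
    return np.mean(normalized, axis=0)

def selective_accuracy(uncertainty, correct, threshold):
    """Answer only when uncertainty <= threshold; return coverage and
    accuracy on the answered subset (NaN if nothing is answered)."""
    answered = uncertainty <= threshold
    coverage = answered.mean()
    accuracy = correct[answered].mean() if answered.any() else float("nan")
    return coverage, accuracy

# Toy usage with synthetic scores: a threshold tuned on one score
# distribution behaves differently after a calibration shift
# (here simulated by a constant offset to the ensembled scores).
rng = np.random.default_rng(0)
scores = {"entropy": rng.random(1000), "p_true": rng.random(1000)}
correct = rng.random(1000) > 0.4
u = ensemble_uncertainty(scores)
print(selective_accuracy(u, correct, threshold=0.5))
print(selective_accuracy(u + 0.2, correct, threshold=0.5))  # same threshold, shifted scores
```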