This paper studies practical deployment challenges for LLM uncertainty estimation beyond standard short-form QA evaluation. It analyzes threshold sensitivity under calibration shift; robustness to query perturbations such as typos, adversarial prompts, and chat history; transfer to long-form generation; and aggregation of multiple uncertainty scores. Evaluations across 19 methods reveal substantial threshold sensitivity and notable vulnerability to adversarial prompts, while ensembling multiple scores yields consistent gains. The results highlight key gaps between benchmark performance and real-world reliability requirements.
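To make the two central ideas concrete, the sketch below illustrates (a) combining several uncertainty scores by simple min-max normalization and averaging, and (b) selective answering with a fixed uncertainty threshold, whose coverage/accuracy trade-off degrades when scores shift. This is a minimal illustration under assumed choices (min-max scaling, mean aggregation, hypothetical score names `entropy` and `p_true`), not the paper's exact aggregation or evaluation protocol.

```python
import numpy as np

def ensemble_uncertainty(scores: dict) -> np.ndarray:
    """Combine several per-query uncertainty scores by min-max
    normalizing each method's scores and averaging them.
    (Illustrative aggregation only; not the paper's exact method.)"""
    normalized = []
    for name, s in scores.items():
        s = np.asarray(s, dtype=float)
        span = s.max() - s.min()
        normalized.append((s - s.min()) / span if span > 0 else np.zeros_like(s))
    return np.mean(normalized, axis=0)

def selective_accuracy(uncertainty, correct, threshold):
    """Answer only when uncertainty <= threshold; return coverage and
    accuracy on the answered subset (NaN if nothing is answered)."""
    answered = uncertainty <= threshold
    coverage = answered.mean()
    accuracy = correct[answered].mean() if answered.any() else float("nan")
    return coverage, accuracy

# Toy usage with synthetic scores: a threshold tuned on one score
# distribution behaves differently after a calibration shift
# (here simulated by a constant offset to the ensembled scores).
rng = np.random.default_rng(0)
scores = {"entropy": rng.random(1000), "p_true": rng.random(1000)}
correct = rng.random(1000) > 0.4
u = ensemble_uncertainty(scores)
print(selective_accuracy(u, correct, threshold=0.5))
print(selective_accuracy(u + 0.2, correct, threshold=0.5))  # same threshold, shifted scores
```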