Yavuz Bakman
Robustness
Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
Large language models are repeatedly updated after deployment, yet alignment is commonly assessed with static black-box evaluations. …
Yavuz Faruk Bakman, Duygu Nur Yaldiz, Salman Avestimehr, Sai Praneeth Karimireddy