Recent Research by WAISI Members

Can Your Uncertainty Scores Detect Hallucinated Entity?
Min-Hsuan Yeh, Max Kamachee, Seongheon Park, Yixuan Li
Feb 17, 2025
A Unified Understanding and Evaluation of Steering Methods
Shawn Im, Yixuan Li
Feb 4, 2025
How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence
Hyeong Kyu Choi, Maxim Khanov, Hongxin Wei, Yixuan Li
Feb 2, 2025
Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach
Changdae Oh, Zhen Fang, Shawn Im, Xuefeng Du, Yixuan Li
Feb 1, 2025
Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel
Jan 27, 2025
Err on the Side of Texture: Texture Bias on Real Data
Blaine Hoak, Ryan Sheatsley, Patrick McDaniel
Dec 13, 2024
Improving Bilingual Capabilities of Language Models to Support Diverse Linguistic Practices in Education
Anand Syamkumar, Nora Tseng, Kaycie Barron, Shanglin Yang, Shamya Karumbaiah, Rheeya Uppaal, Junjie Hu
Nov 6, 2024
Safety-Aware Fine-Tuning of Large Language Models
Hyeong Kyu Choi, Xuefeng Du, Yixuan Li
Oct 13, 2024
Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition
Zheyang Xiong, Ziyang Cai, John Cooper, Albert Ge, Vasilis Papageorgiou, Zack Sifakis, Angeliki Giannou, Ziqian Lin, Liu Yang, Saurabh Agarwal, Grigorios G. Chrysos, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos
Oct 8, 2024
On the Generalization of Preference Learning with DPO
Shawn Im, Yixuan Li
Aug 6, 2024
PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences
Daiwei Chen, Yi Chen, Aniket Rege, Ramya Korlakai Vinayak
Jun 12, 2024
Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
Rheeya Uppaal, Apratim Dey, Yiting He, Yiqiao Zhong, Junjie Hu
May 22, 2024
PICLe: Eliciting Diverse Behaviors from Large Language Models with Persona In-Context Learning
Hyeong Kyu Choi, Yixuan Li
May 3, 2024
Understanding the Learning Dynamics of Alignment with Human Feedback
Shawn Im, Yixuan Li
Mar 27, 2024
ARGS: Alignment as Reward-Guided Search
Maxim Khanov, Jirayu Burapacheep, Yixuan Li
Jan 23, 2024
Debate Helps Supervise Unreliable Experts
Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, Samuel R. Bowman
Nov 15, 2023
The Efficacy of Transformer-based Adversarial Attacks in Security Domains
Kunyang Li, Kyle Domico, Jean-Charles Noirot Ferrand, Patrick McDaniel
Oct 17, 2023
The Space of Adversarial Strategies
Ryan Sheatsley, Blaine Hoak, Eric Pauley, Patrick McDaniel
Aug 10, 2023
Is Fine-tuning Needed? Pre-trained Language Models Are Near Perfect for Out-of-Domain Detection
Rheeya Uppaal, Junjie Hu, Yixuan Li
May 22, 2023
The Trade-off between Universality and Label Efficiency of Representations from Contrastive Learning
Zhenmei Shi, Jiefeng Chen, Kunyang Li, Jayaram Raghuram, Xi Wu, Yingyu Liang, Somesh Jha
Feb 28, 2023