Recent Research by WAISI Members

Can Your Uncertainty Scores Detect Hallucinated Entity?
Min-Hsuan Yeh, Max Kamachee, Seongheon Park, Yixuan Li
Feb 17, 2025
A Unified Understanding and Evaluation of Steering Methods
Shawn Im, Yixuan Li
Feb 4, 2025
How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence
Hyeong Kyu Choi, Maxim Khanov, Hongxin Wei, Yixuan Li
Feb 2, 2025
Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach
Changdae Oh, Zhen Fang, Shawn Im, Xuefeng Du, Yixuan Li
Feb 1, 2025
Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel
Jan 27, 2025
Err on the Side of Texture: Texture Bias on Real Data
Blaine Hoak, Ryan Sheatsley, Patrick McDaniel
Dec 13, 2024
Improving Bilingual Capabilities of Language Models to Support Diverse Linguistic Practices in Education
Anand Syamkumar, Nora Tseng, Kaycie Barron, Shanglin Yang, Shamya Karumbaiah, Rheeya Uppaal, Junjie Hu
Nov 6, 2024
Safety-Aware Fine-Tuning of Large Language Models
Hyeong Kyu Choi, Xuefeng Du, Yixuan Li
Oct 13, 2024
Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition
Zheyang Xiong, Ziyang Cai, John Cooper, Albert Ge, Vasilis Papageorgiou, Zack Sifakis, Angeliki Giannou, Ziqian Lin, Liu Yang, Saurabh Agarwal, Grigorios G. Chrysos, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos
Oct 8, 2024
On the Generalization of Preference Learning with DPO
Shawn Im, Yixuan Li
Aug 6, 2024
PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences
Daiwei Chen, Yi Chen, Aniket Rege, Ramya Korlakai Vinayak
Jun 12, 2024
Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
Rheeya Uppaal, Apratim Dey, Yiting He, Yiqiao Zhong, Junjie Hu
May 22, 2024
PICLe: Eliciting Diverse Behaviors from Large Language Models with Persona In-Context Learning
Hyeong Kyu Choi, Yixuan Li
May 3, 2024
Understanding the Learning Dynamics of Alignment with Human Feedback
Shawn Im, Yixuan Li
Mar 27, 2024
ARGS: Alignment as Reward-Guided Search
Maxim Khanov, Jirayu Burapacheep, Yixuan Li
Jan 23, 2024
Debate Helps Supervise Unreliable Experts
Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, Samuel R. Bowman
Nov 15, 2023
The Efficacy of Transformer-based Adversarial Attacks in Security Domains
Kunyang Li, Kyle Domico, Jean-Charles Noirot Ferrand, Patrick McDaniel
Oct 17, 2023
The Space of Adversarial Strategies
Ryan Sheatsley, Blaine Hoak, Eric Pauley, Patrick McDaniel
Aug 10, 2023
Is Fine-tuning Needed? Pre-trained Language Models Are Near Perfect for Out-of-Domain Detection
Rheeya Uppaal, Junjie Hu, Yixuan Li
May 22, 2023
The Trade-off between Universality and Label Efficiency of Representations from Contrastive Learning
Zhenmei Shi, Jiefeng Chen, Kunyang Li, Jayaram Raghuram, Xi Wu, Yingyu Liang, Somesh Jha
Feb 28, 2023