WAISI & XLab
WAISI Technical AI Safety Workshop Program
Most AI Safety communities introduce members interested in technical AI safety through the pipeline of Intro Technical Fellowship → Paper Reading Sessions → Alignment Research Engineer Accelerator (ARENA) → Research Programs (SPAR, XLab SRF, MATS). However, most university groups have struggled to run ARENA sessions for a few key reasons: the steep learning curve, the significant time commitment, and the lack of experienced TAs. The technical workshop program aims to address these issues by creating ARENA-style workshops on AI Safety topics that focus on shorter, more manageable exercises while still preserving the rigor of research-style work.
Transferable Adversarial Materials (TAM): Defeating ISR AUASs and LAWSs via Disruptive and Adversarial Material
Within the past decade, small, portable Unmanned Aerial Systems (UASs) operated by individual infantry units have proven to be vital battlefield assets in intelligence, surveillance, and reconnaissance (ISR) roles, as one-way attack platforms (loitering munitions), and as reusable bomb-dropping aircraft. Many countries are attempting to integrate AI vision models into these systems to automate navigation and target identification and to reduce vulnerability to jamming. We aim to demonstrate the effectiveness of a Transferable Adversarial Material (TAM): a deformable material that could be deployed in a variety of settings to deceive military-purpose computer vision models analogous to those being fielded in autonomous UASs (AUASs).
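To make the attack family concrete, the sketch below is a minimal, hypothetical illustration of how a transferable adversarial texture can be optimized against an ensemble of surrogate vision models, in the spirit of adversarial-patch attacks. It is not the TAM project's actual method: the torchvision classifiers standing in for military-purpose models, the fixed patch location, the brightness/noise jitter standing in for expectation over physical transformations (EOT), and the target class are all illustrative assumptions.

```python
# Hypothetical sketch: optimize a small adversarial texture so that an ensemble
# of surrogate classifiers misclassifies any scene containing it.
# Assumes a recent PyTorch + torchvision (weights enum API, >= 0.13).
import torch
import torch.nn.functional as F
from torchvision import models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Surrogate ensemble; attacking several models at once encourages transfer
# to unseen models (the "transferable" part).
surrogates = [
    models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval().to(device),
    models.mobilenet_v3_small(
        weights=models.MobileNet_V3_Small_Weights.DEFAULT).eval().to(device),
]
for m in surrogates:
    for p in m.parameters():
        p.requires_grad_(False)

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

# The "material": a texture optimized directly in pixel space.
patch = torch.rand(1, 3, 64, 64, device=device, requires_grad=True)
opt = torch.optim.Adam([patch], lr=0.05)

def apply_patch(image, patch):
    """Paste the patch with random brightness/noise jitter, a crude stand-in
    for expectation over physical transformations (lighting, wear, deformation)."""
    jittered = (patch.clamp(0, 1) * (0.8 + 0.4 * torch.rand(1, device=device))
                + 0.05 * torch.randn_like(patch)).clamp(0, 1)
    out = image.clone()
    out[:, :, 80:144, 80:144] = jittered  # fixed location, for simplicity
    return out

scene = torch.rand(1, 3, 224, 224, device=device)  # placeholder background
target_class = 859  # arbitrary "benign" ImageNet class the patch pushes toward

for step in range(200):
    adv = apply_patch(scene, patch)
    # Sum targeted losses across surrogates so the patch fools all of them.
    loss = sum(F.cross_entropy(m(normalize(adv)),
                               torch.tensor([target_class], device=device))
               for m in surrogates)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        patch.clamp_(0, 1)  # keep the texture a valid image
```

A physical material adds constraints this sketch ignores (printability, viewpoint, deformation of fabric), and transferability would be measured against held-out detectors rather than the surrogates used for optimization.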
Our Research Catalog
| Title | Authors | Venue | Date |
|---|---|---|---|
| Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models? | Hyeong Kyu Choi, Xiaojin Zhu, Yixuan Li | NeurIPS 2025 Spotlight | Aug 24, 2025 |
| Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders | James Oldfield, Shawn Im, Yixuan Li, Mihalis A Nicolaou, Ioannis Patras, Grigorios G Chrysos | NeurIPS 2025 | May 27, 2025 |
| Visual Instruction Bottleneck Tuning | Changdae Oh, Jiatong Li, Shawn Im, Yixuan Li | NeurIPS 2025 | May 20, 2025 |
| On the Robustness Tradeoff in Fine-Tuning | Kunyang Li, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Blaine Hoak, Yohan Beugin, Eric Pauley, Patrick McDaniel | ICCV 2025 | Mar 19, 2025 |
| Alignment and Adversarial Robustness: Are More Human-Like Models More Secure? | Blaine Hoak, Kunyang Li, Patrick McDaniel | | Feb 17, 2025 |
| Can Your Uncertainty Scores Detect Hallucinated Entity? | Min-Hsuan Yeh, Max Kamachee, Seongheon Park, Yixuan Li | TMLR 2025 | Feb 17, 2025 |
| A Unified Understanding and Evaluation of Steering Methods | Shawn Im, Yixuan Li | | Feb 4, 2025 |
| How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence | Hyeong Kyu Choi, Maxim Khanov, Hongxin Wei, Yixuan Li | ICML 2025 | Feb 2, 2025 |
| Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach | Changdae Oh, Zhen Fang, Shawn Im, Xuefeng Du, Yixuan Li | ICML 2025 | Feb 1, 2025 |
| Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs | Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel | ACM 2024 | Jan 27, 2025 |
| Err on the Side of Texture: Texture Bias on Real Data | Blaine Hoak, Ryan Sheatsley, Patrick McDaniel | SaTML 2025 | Dec 13, 2024 |
| Improving Bilingual Capabilities of Language Models to Support Diverse Linguistic Practices in Education | Anand Syamkumar, Nora Tseng, Kaycie Barron, Shanglin Yang, Shamya Karumbaiah, Rheeya Uppaal, Junjie Hu | ACM 2024 | Nov 6, 2024 |
| Safety-Aware Fine-Tuning of Large Language Models | Hyeong Kyu Choi, Xuefeng Du, Yixuan Li | NeurIPS 2024, Workshop on Safe Generative AI | Oct 13, 2024 |
| Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition | Zheyang Xiong, Ziyang Cai, John Cooper, Albert Ge, Vasilis Papageorgiou, Zack Sifakis, Angeliki Giannou, Ziqian Lin, Liu Yang, Saurabh Agarwal, Grigorios G Chrysos, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos | ICML 2025 Spotlight | Oct 8, 2024 |
| On the Generalization of Preference Learning with DPO | Shawn Im, Yixuan Li | | Aug 6, 2024 |
| PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences | Daiwei Chen, Yi Chen, Aniket Rege, Ramya Korlakai Vinayak | ICLR 2025 | Jun 12, 2024 |
| Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity | Rheeya Uppaal, Apratim Dey, Yiting He, Yiqiao Zhong, Junjie Hu | ICLR 2025 | May 22, 2024 |
| PICLe: Eliciting Diverse Behaviors from Large Language Models with Persona In-Context Learning | Hyeong Kyu Choi, Yixuan Li | ICML 2024 | May 3, 2024 |
| Understanding the Learning Dynamics of Alignment with Human Feedback | Shawn Im, Yixuan Li | ICML 2024 | Mar 27, 2024 |
| ARGS: Alignment as reward-guided search | Maxim Khanov, Jirayu Burapacheep, Yixuan Li | ICLR 2024 | Jan 23, 2024 |
| Debate Helps Supervise Unreliable Experts | Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, Samuel R. Bowman | | Nov 15, 2023 |
| The Efficacy of Transformer-based Adversarial Attacks in Security Domains | Kunyang Li, Kyle Domico, Jean-Charles Noirot Ferrand, Patrick McDaniel | | Oct 17, 2023 |
| The Space of Adversarial Strategies | Ryan Sheatsley, Blaine Hoak, Eric Pauley, Patrick McDaniel | | Aug 10, 2023 |
| Is Fine-tuning Needed? Pre-trained Language Models Are Near Perfect for Out-of-Domain Detection | Rheeya Uppaal, Junjie Hu, Yixuan Li | ACL 2023 | May 22, 2023 |
| The Trade-off between Universality and Label Efficiency of Representations from Contrastive Learning | Zhenmei Shi, Jiefeng Chen, Kunyang Li, Jayaram Raghuram, Xi Wu, Yingyu Liang, Somesh Jha | ICLR 2023 | Feb 28, 2023 |