Upcoming Events

  •   🌍 Worldbuilding Social: Debating AI Futures


     Apr. 18, 6:00 PM-7:00 PM
     333 East Campus Mall, Room 3118

    Discuss and debate how the future might look with different AI outcomes. Learn about the opportunities and programs that WAISI has to offer. Free snacks included!
  •   Evan Hubinger: Sleeper Agents — Training Deceptive LLMs that Persist Through Safety Training


     Apr. 23, 6:00 PM-7:30 PM
     Computer Science Building, Room 1240

    Title
    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training


    Abstract
    Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? Evan will discuss his team's work attempting to answer this question.



    About the Speaker

    Evan Hubinger is a research scientist at Anthropic where he leads the Alignment Stress-Testing team, the team responsible for red-teaming Anthropic's alignment techniques, evaluations, and mitigations. Evan was the lead author on the Sleeper Agents paper that he'll be talking about today. Prior to joining Anthropic, Evan was a research fellow at MIRI, the Machine Intelligence Research Institute, where he did theoretical AI alignment research, and was briefly an early intern at OpenAI. Evan's theoretical work includes "Risks from Learned Optimization in Advanced Machine Learning Systems" and "An overview of 11 proposals for building safe advanced AI".