Alignment - AgentSteer Blog

Security | Feb 24, 2026

Anthropic Ran Sabotage Evals on Their Own Models. Here's What They Found.

Anthropic tested whether Claude can steer humans toward bad decisions, sneak bugs into code, hide its capabilities during safety tests, and undermine its own oversight.

Murphy Hook

Tagged: Alignment
