
Security · Feb 24, 2026
Anthropic Ran Sabotage Evals on Their Own Models. Here's What They Found.
Anthropic tested whether Claude can steer humans toward bad decisions, sneak bugs into code, hide its capabilities during safety tests, and undermine its own oversight.
Murphy Hook