This episode dissects the escalating AI security arms race, revealing why current "safety" measures are often theatrical and how red teams are pushing the frontier of model liberation and full-stack defense.
The Philosophy of Universal Jailbreaking
- Pliny the Liberator, a prominent prompt engineer, defines his work as crafting "universal jailbreaks": skeleton keys that obliterate model guardrails. In practice this means building multi-prompt workflows or reusable templates that bypass classifiers and system prompts, enabling users to obtain the outputs they want (a minimal harness sketch appears at the end of this section).
- Pliny views model liberation as central to his mission, advocating for freedom of information and speech as AI becomes an "exocortex" for billions.
- Unlike one-off exploits tailored to a single refusal, his universal jailbreaks are templates designed to work consistently across prompts and models, circumventing guardrails regardless of how a particular model phrases its refusals.
- The "cat and mouse game" between red teams and blue teams (model developers) accelerates, with attackers holding an advantage due to an ever-expanding attack surface.
- Pliny argues that excessive guardrails, like text classifiers or Reinforcement Learning from Human Feedback (RLHF), often lobotomize models, sacrificing capability and creativity for perceived safety.
"It's not just about the models, it's about our minds too. I think that there's going to be a symbiosis and the degree to which one half is free will reflect in the other." – Pliny the Liberator
Jailbreaking as Exploration, Not Safety Theater
- Pliny and John V assert that many current AI safety efforts, particularly those focused on locking down latent space (the internal representation of data within a model), amount to "security theater." These measures provide a false sense of security without addressing real-world threats.
- Pliny dismisses the connection between guardrails and genuine safety, stating that seasoned attackers simply switch models, especially with open-source alternatives rapidly catching up to closed-source systems.
- He views jailbreaking as a tool for deeper exploration, an "efficient shovel" to uncover unknown unknowns, rather than a direct threat to real-world safety.
- John V highlights the traditional animosity between development and security teams, which current "trust and safety" approaches often exacerbate with "lackluster ineffective controls."
- Both advocate for an approach that enables skilled researchers to explore model capabilities unimpeded, rather than imposing restrictive bubble wrap.
"The connection that it has to like real world safety for me, I think it's just about the name of the game is explore any unknown unknowns and speed of exploration is the metric that matters to me." – Pliny the Liberator
The Anthropic Challenge and Open-Source Imperative
- Pliny recounts his experience with Anthropic's constitutional AI challenge, where he successfully jailbroke their model through a UI bug, only for Anthropic to deny his win and initially refuse to open-source the collected data. This incident underscored the tension between proprietary labs and the open-source community.
- Pliny used a modified "Opus 3" template to progress through Anthropic's eight-level jailbreak challenge, exploiting a UI glitch to complete the final levels.
- He publicly criticized Anthropic for not open-sourcing the data gathered from community red-teaming efforts, arguing that it hinders collective progress in prompting meta-research.
- Despite Anthropic eventually offering bounties, Pliny abstained, prioritizing the principle of open data sharing over financial incentives.
- He emphasizes that open source is crucial for increasing efficiency and preventing excessive centralization in AI safety research.
"What's in it for me at this point? Are you guys going to even open source this data set that you're farming from the community for free? Because what's up with that, right?" – Pliny the Liberator
Weaponizing AI: Orchestrated Attacks and Sub-Agents
- The discussion shifts to the practical weaponization of AI, distinguishing between jailbreaking a model and using a model to orchestrate attacks. Pliny and John V highlight the emerging threat of AI-orchestrated social engineering and multi-agent malicious operations.
- Pliny describes how a single jailbroken orchestrator can manage multiple sub-agents, each performing an innocuous-seeming task, so that together they execute a malicious act (a structural sketch appears at the end of this section).
- He cites a recent Anthropic report detailing attackers using this method to weaponize Claude's code capabilities, confirming a threat model Pliny identified months prior.
- John V notes that natural language interfaces make AI particularly dangerous for social engineering, as models can effectively break down complex attacks into smaller, less suspicious steps.
- The "Bossi" Discord server, a grassroots community of 40,000 members, serves as a hub for prompt engineering, adversarial machine learning, and red-teaming research, often scraped by AI security startups.
"If you can break tasks down small enough, sort of one jailbroken orchestrator can orchestrate a bunch of sub-agents towards a malicious act." – Pliny the Liberator
BT6: The White Hat Collective & Full-Stack AI Security
- John V introduces BT6, a white-hat hacker collective he co-founded with Pliny, focused on full-stack AI security: pushing the boundaries of AI capabilities while protecting systems from rogue models and data breaches.
- BT6 operates with an ethos of radical transparency and open source, often pushing clients to release red-teaming data, even if it means foregoing contracts.
- John V likens BT6 operators to "drivers" of Formula 1 AI cars, pushing limits and shaving seconds off performance while keeping systems secure.
- The collective focuses on securing the entire AI stack, recognizing that the model itself is just one component; tools, functions, and external access points (like email or browsers) create new attack surfaces.
- Pliny distinguishes between "safety" (which he believes should focus on the meatspace level, i.e., real-world impact) and "security" (preventing data leaks or system vulnerabilities). BT6 prioritizes the latter, favoring system-level fixes over model lobotomization (a minimal tool-gating sketch closes this section).
"In AI red teaming it's not just like, 'hey can you tell us, you know, lyrics or how to make me math or whatever.' It's like we're trying to keep the model safe from the from bad actors, but we're also trying to keep the public safe from rogue models essentially." – John V
Investor & Researcher Alpha
- Capital Reallocation: Current AI safety investments focused on model-level guardrails are largely ineffective. Capital should shift towards full-stack AI security, red-teaming, and open-source data initiatives that genuinely advance defensive capabilities.
- New Bottleneck: The lack of open-source red-teaming data and collaborative research creates a critical bottleneck. Proprietary labs hoarding data hinder collective progress, making the industry vulnerable to rapidly evolving attack vectors.
- Obsolete Research Direction: Research focused solely on "lobotomizing" models or enforcing rigid guardrails within the latent space is becoming obsolete. The focus must pivot to securing the entire AI agent ecosystem, including tool access, orchestration, and real-world interaction.
Strategic Conclusion
The AI security landscape demands a radical shift from theatrical model guardrails to full-stack defense and open-source collaboration. The next step requires industry-wide commitment to sharing red-teaming data and fostering independent collectives to secure AI's real-world deployment.