Latent Space
December 16, 2025

⚡️Jailbreaking AGI: Pliny the Liberator & John V on Red Teaming, BT6, and the Future of AI Security

The current approach to AI safety, heavily reliant on model guardrails, is a losing battle. Pliny the Liberator and John V of the BT6 hacker collective argue that true AI security demands an offensive, open-source approach, exploring the full attack surface beyond the model itself.

The Illusion of Guardrails

  • “I specialize in crafting universal jailbreaks. These are essentially skeleton keys to the model that sort of obliterate the guard rails, right? So, you craft a template or sort of a maybe multi-prompt workflow that's consistent for getting around that model's guardrails.”
  • The Attacker's Edge: Model developers ("blue team") face an "infinity" problem. As AI capabilities expand, so does the attack surface, giving red teamers a persistent advantage in finding new exploits.
  • Security Theater: Overly restrictive guardrails, often implemented through RLHF or text classifiers, can "lobotomize" models, reducing their creativity and utility without providing genuine real-world safety. This resembles "TSA pat-downs": a visible effort that misses deeper threats (a minimal sketch of such a classifier-style filter follows this list).
  • Latent Space Exploration: Jailbreaking is not a party trick; it is a method for deep exploration of a model's internal knowledge and capabilities. This uncovers "unknown unknowns" and pushes the boundaries of AI.
  • Steered Chaos: Techniques like "dividers" (specific token sequences) introduce "steered chaos" to disorient a model's processing, forcing it out of its typical distribution to reveal hidden behaviors.
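
A minimal sketch of the kind of classifier-style guardrail described above, to make the "security theater" point concrete. The model choice (unitary/toxic-bert), threshold, and function names are illustrative assumptions, not anything referenced in the episode:

```python
# Hypothetical classifier-style guardrail: a surface-level filter bolted onto
# the model boundary. A sketch only, not any lab's actual safety pipeline.
from transformers import pipeline

# An off-the-shelf toxicity classifier stands in for a proprietary safety filter.
moderator = pipeline("text-classification", model="unitary/toxic-bert")

def guarded_reply(model_reply: str, threshold: float = 0.8) -> str:
    """Return the reply unless its surface text trips the classifier."""
    result = moderator(model_reply)[0]  # e.g. {"label": "toxic", "score": 0.97}
    if result["score"] >= threshold:
        return "I can't help with that."
    return model_reply
```

The limitation is visible in the shape of the code: the check sees one string in isolation, so a multi-prompt workflow that builds context gradually or rephrases its output never trips it, and nothing downstream of the model (tools, agents, data access) is inspected at all.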

Securing the Full AI Stack

  • “Whatever you attach to a model, that's the new attack surface. It broadens, right?”
  • Beyond the Model: The moment an AI model connects to external tools (email, browser, APIs), its attack surface expands dramatically. Securing the model in isolation is insufficient; the entire application stack requires scrutiny.
  • Orchestrated Attacks: A single "jailbroken orchestrator" (an AI agent) can break a malicious task into seemingly innocuous sub-tasks and distribute them among other AI agents. This segmentation makes detection difficult; an AI-orchestrated attack described in a recent Anthropic report demonstrated the method in practice.
  • Safety vs. Security: "Safety" often focuses on internal model behavior, while "security" addresses the broader system, including how models interact with data, tools, and other agents. The focus must shift from model internals to the system layer (see the sketch after this list).
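
A minimal sketch of what a system-layer control might look like, as opposed to a model-internal guardrail: gate what an agent is allowed to do, not just what it says. The tool names, policy shape, and ToolCall structure are assumptions for illustration, not an API from the episode or any particular framework:

```python
# Hypothetical system-layer gate for agent tool calls: enforcement lives
# outside the model, so it holds even if the model itself is jailbroken.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    tool: str   # e.g. "browser.get", "email.send"
    args: dict

# Explicit allowlist with per-tool argument checks.
POLICY: dict[str, Callable[[dict], bool]] = {
    "browser.get": lambda a: a.get("url", "").startswith("https://docs.example.internal/"),
    "calculator.eval": lambda a: True,
    # "email.send" is deliberately absent: the agent cannot exfiltrate via email.
}

def execute(call: ToolCall, run_tool: Callable[[str, dict], str]) -> str:
    """Run a model-proposed tool call only if it passes the policy."""
    check = POLICY.get(call.tool)
    if check is None:
        return f"blocked: tool '{call.tool}' is not allowlisted"
    if not check(call.args):
        return f"blocked: arguments for '{call.tool}' violate policy"
    return run_tool(call.tool, call.args)  # actual execution is supplied by the host application
```

The design choice mirrors the argument above: even a fully "liberated" model can only act through calls that pass checks it does not control.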

Open Source & Misaligned Incentives

  • “I'm not going to participate in this unless you open source the data because to me, that's the value is that we move the prompting meta forward, right? That's the name of the game.”
  • Community Power: Open-sourcing jailbreak data and research is crucial for advancing the "prompting meta" and empowering the community to explore AI vulnerabilities efficiently. No single lab can explore the entire latent space alone.
  • VC Conflict: Traditional venture capital models, driven by rapid returns and proprietary solutions, can create "misaligned incentives" that hinder open-source collaboration and steer AI development towards enterprise applications rather than fundamental safety research.
  • Grassroots Impact: Communities like the Bossi Discord server (40,000+ members) demonstrate the power of unmonetized, grassroots efforts in driving AI security research. These communities are often the source of cutting-edge adversarial techniques.

Key Takeaways:

  • Strategic Shift: AI security must move beyond superficial guardrails to a full-stack, offensive red-teaming approach that accounts for the expanding attack surface of AI agents and their tool access.
  • Builder/Investor Note: Builders should prioritize integrating offensive security early in development. Investors should be wary of "security theater" and favor solutions that embrace open-source collaboration and address the entire AI application stack.
  • The "So What?": The accelerating pace of AI development means static security solutions will quickly become obsolete. Proactive, community-driven, and full-stack security research is essential for navigating the next 6-12 months of AI evolution.

Podcast Link: https://www.youtube.com/watch?v=lFbAr2IPK9Q

This episode dissects the escalating AI security arms race, revealing why current "safety" measures are often theatrical and how red teams are pushing the frontier of model liberation and full-stack defense.

The Philosophy of Universal Jailbreaking

  • Pliny the Liberator, a prominent prompt engineer, defines his work as crafting "universal jailbreaks"—skeleton keys that obliterate model guardrails. This process involves creating multi-prompt workflows or templates to bypass classifiers and system prompts, enabling users to obtain desired outputs.
  • Pliny views model liberation as central to his mission, advocating for freedom of information and speech as AI becomes an "exocortex" for billions.
  • He specializes in universal jailbreaks, which are consistent templates designed to circumvent all guardrails, regardless of the model's specific refusals.
  • The "cat and mouse game" between red teams and blue teams (model developers) accelerates, with attackers holding an advantage due to an ever-expanding attack surface.
  • Pliny argues that excessive guardrails, like text classifiers or Reinforcement Learning from Human Feedback (RLHF), often lobotomize models, sacrificing capability and creativity for perceived safety.

"It's not just about the models, it's about our minds too. I think that there's going to be a symbiosis and the degree to which one half is free will reflect in the other." – Pliny the Liberator

Jailbreaking as Exploration, Not Safety Theater

  • Pliny and John V assert that many current AI safety efforts, particularly those focused on locking down latent space (the internal representation of data within a model), amount to "security theater." These measures provide a false sense of security without addressing real-world threats.
  • Pliny dismisses the connection between guardrails and genuine safety, stating that seasoned attackers simply switch models, especially with open-source alternatives rapidly catching up to closed-source systems.
  • He views jailbreaking as a tool for deeper exploration, an "efficient shovel" to uncover unknown unknowns, rather than a direct threat to real-world safety.
  • John V highlights the traditional animosity between development and security teams, which current "trust and safety" approaches often exacerbate with "lackluster ineffective controls."
  • Both advocate for an approach that enables skilled researchers to explore model capabilities unimpeded, rather than imposing restrictive bubble wrap.

"The connection that it has to like real world safety for me, I think it's just about the name of the game is explore any unknown unknowns and speed of exploration is the metric that matters to me." – Pliny the Liberator

The Anthropic Challenge and Open-Source Imperative

  • Pliny recounts his experience with Anthropic's constitutional AI challenge, where he successfully jailbroke their model through a UI bug, only for Anthropic to deny his win and initially refuse to open-source the collected data. This incident underscored the tension between proprietary labs and the open-source community.
  • Pliny used a modified "Opus 3" template to progress through Anthropic's eight-level jailbreak challenge, exploiting a UI glitch to complete the final levels.
  • He publicly criticized Anthropic for not open-sourcing the data gathered from community red-teaming efforts, arguing that withholding it hinders collective progress on the prompting meta.
  • Despite Anthropic eventually offering bounties, Pliny abstained, prioritizing the principle of open data sharing over financial incentives.
  • He emphasizes that open source is crucial for increasing efficiency and preventing excessive centralization in AI safety research.

"What's in it for me at this point? Are you guys going to even open source this data set that you're farming from the community for free? Because what's up with that, right?" – Pliny the Liberator

Weaponizing AI: Orchestrated Attacks and Sub-Agents

  • The discussion shifts to the practical weaponization of AI, distinguishing between jailbreaking a model and using a model to orchestrate attacks. Pliny and John V highlight the emerging threat of AI-orchestrated social engineering and multi-agent malicious operations.
  • Pliny describes how a single jailbroken orchestrator can manage multiple sub-agents, each performing innocuous-seeming tasks, to collectively execute a malicious act.
  • He cites a recent Anthropic report detailing attackers using this method to weaponize Claude's code capabilities, confirming a threat model Pliny identified months prior.
  • John V notes that natural language interfaces make AI particularly dangerous for social engineering, as models can effectively break down complex attacks into smaller, less suspicious steps.
  • The "Bossi" Discord server, a grassroots community of 40,000 members, serves as a hub for prompt engineering, adversarial machine learning, and red-teaming research, often scraped by AI security startups.

"If you can break tasks down small enough, sort of one jailbroken orchestrator can orchestrate a bunch of sub-agents towards a malicious act." – Pliny the Liberator

BT6: The White Hat Collective & Full-Stack AI Security

  • John V introduces BT6, a white-hat hacker collective co-founded with Pliny, focused on full-stack AI security. They prioritize radical transparency and open source, pushing the boundaries of AI capabilities while ensuring safety from rogue models and data breaches.
  • BT6 operates with an ethos of radical transparency and open source, often pushing clients to release red-teaming data, even if it means foregoing contracts.
  • John V likens BT6 operators to "drivers" of Formula 1 AI cars, pushing limits and shaving seconds off performance while keeping systems secure.
  • The collective focuses on securing the entire AI stack, recognizing that the model itself is just one component; tools, functions, and external access points (like email or browsers) create new attack surfaces.
  • Pliny distinguishes between "safety" (which he believes should focus on the meatspace level, i.e., real-world impact) and "security" (preventing data leaks or system vulnerabilities). BT6 prioritizes the latter, focusing on system-level fixes over model lobotomization.

"In AI red teaming it's not just like, 'hey can you tell us, you know, lyrics or how to make me math or whatever.' It's like we're trying to keep the model safe from the from bad actors, but we're also trying to keep the public safe from rogue models essentially." – John V

Investor & Researcher Alpha

  • Capital Reallocation: Current AI safety investments focused on model-level guardrails are largely ineffective. Capital should shift towards full-stack AI security, red-teaming, and open-source data initiatives that genuinely advance defensive capabilities.
  • New Bottleneck: The lack of open-source red-teaming data and collaborative research creates a critical bottleneck. Proprietary labs hoarding data hinder collective progress, making the industry vulnerable to rapidly evolving attack vectors.
  • Obsolete Research Direction: Research focused solely on "lobotomizing" models or enforcing rigid guardrails within the latent space is becoming obsolete. The focus must pivot to securing the entire AI agent ecosystem, including tool access, orchestration, and real-world interaction.

Strategic Conclusion

The AI security landscape demands a radical shift from theatrical model guardrails to full-stack defense and open-source collaboration. The next step requires industry-wide commitment to sharing red-teaming data and fostering independent collectives to secure AI's real-world deployment.
