Latent Space
December 16, 2025

⚡️Jailbreaking AGI: Pliny the Liberator & John V on Red Teaming, BT6, and the Future of AI Security

The current approach to AI safety, focused on model guardrails, is a losing battle. Pliny the Liberator and John V from BT6 argue that true AI security demands an adversarial, full-stack red teaming approach, embracing open-source collaboration to secure the entire AI ecosystem, not just the model.

The Illusion of Model Guardrails

  • “I think it's hard for blue team because they're sort of fighting against infinity, right? The surface area is ever expanding. Also, we're kind of in a library of Babel situation where they're trying to restrict sections, but we keep finding different ways to move the ladders around, faster and with longer ladders. The attackers sort of have the advantage as long as the surface area is ever expanding, right?”
  • Fighting Infinity: Model developers (the "blue team") are in a reactive, unwinnable fight against an ever-expanding attack surface. Trying to restrict AI outputs is like trying to fence off a constantly growing, infinite library; new paths will always appear.
  • Lobotomization, Not Safety: Imposing guardrails, through methods like RLHF or text classifiers, often "lobotomizes" models. This reduces their creative and exploratory capabilities, hindering genuine progress and understanding of AI's full potential.
  • Security Theater: Many guardrail efforts are "security theater," providing a false sense of safety for PR or enterprise clients, but failing to address real-world threats. A determined attacker will simply switch to an open-source model or find another vector.

Full-Stack Security: Beyond the Model

  • “In AI red teaming it's not just like, hey, can you tell us, you know, lyrics or how to make meth or whatever. We're trying to keep the model safe from bad actors, but we're also trying to keep the public safe from rogue models, essentially, right? So it's the full spectrum that we're doing. It's never just the model, you know; the model is just one way to interact with a computer or a data set, right?”
  • The True Attack Surface: AI security extends beyond the model itself. The real vulnerabilities lie in the entire stack: the tools, functions, and data access points connected to the AI agent. This is like securing a house, not just the front door, but all windows, backdoors, and even the plumbing.
  • Orchestrated Attacks: A single jailbroken orchestrator can break down complex malicious tasks into innocuous sub-tasks for multiple sub-agents. This allows for sophisticated, multi-step attacks where no single component appears malicious.
  • System-Level Fixes: BT6's red teaming focuses on finding holes in the entire AI stack and recommends fixes at the system level, not through model training. This is what prevents data leaks (e.g., credit card information) from agents with external access; a minimal sketch of such a boundary control follows this list.
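
A sketch of what a system-level fix can look like, independent of any model training: a boundary filter in the agent harness that scrubs card-number-like strings from anything the agent hands to an external tool or channel. This is a minimal illustration under assumed names (redact_pans, guarded_send), not BT6's actual tooling.

```python
import re

# Illustrative boundary control: scrub card-number-like strings from any text an
# agent is about to hand to an external tool or channel. It lives in the harness,
# not in the model, so it still holds if the model itself is jailbroken.

PAN_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13-19 digits, optional separators

def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum (likely a real card number)."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact_pans(text: str) -> str:
    """Replace Luhn-valid card-number candidates with a redaction marker."""
    def _sub(m: re.Match) -> str:
        digits = re.sub(r"[ -]", "", m.group(0))
        return "[REDACTED-PAN]" if luhn_valid(digits) else m.group(0)
    return PAN_PATTERN.sub(_sub, text)

def guarded_send(send_external, payload: str) -> str:
    """Wrap an outbound call so sensitive data is scrubbed at the system boundary."""
    return send_external(redact_pans(payload))
```

The same pattern generalizes to API keys or any other data class a deployment treats as sensitive; the point is that the control sits in the harness, outside the model's latent space, so it applies even to a jailbroken model.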

Open Source as the Path to Robust AI

  • “I'm not going to participate in this unless you open source the data, because to me that's the value: we move the prompting meta forward, right? That's the name of the game. We need to give the common people the tools that they need to explore these things more efficiently.”
  • Accelerating Exploration: Open-sourcing jailbreak data and adversarial research is crucial for accelerating the "prompting meta" and enabling the broader community to explore AI's latent space more efficiently. This is like sharing scientific discoveries openly to speed up global research.
  • Decentralized Intelligence: No single lab has enough researchers to explore the entire latent space of an AI model effectively. Community collaboration, through open-source initiatives, is essential for comprehensive understanding and security.
  • Misaligned Incentives: Venture Capital and enterprise-driven incentives often push labs towards "boring" B2B applications and restrictive models, rather than exploring the full, "liberated" potential of AI. This stifles innovation and genuine safety research.

Key Takeaways:

  • Strategic Implication: The "AI safety" narrative is shifting from content moderation to systemic security. Focus on hardening the entire AI ecosystem, not just restricting model outputs.
  • Builder/Investor Note: Be wary of "AI security" products that claim to "secure the model" through guardrails. These are likely security theater. Invest in full-stack AI security solutions, red teaming services, and platforms that facilitate open-source adversarial research.
  • The "So What?": The future of AI security is not about building higher walls around models, but about understanding and hardening the entire ecosystem in which they operate. Open collaboration and adversarial testing are the fastest paths to robust AI.

Podcast Link: https://www.youtube.com/watch?v=lFbAr2IPK9Q

AI's future hinges on liberation, not restriction: Pliny the Liberator and John V dismantle the myth of model guardrails, advocating for radical open-source red teaming as the only path to true AI security.

The Philosophy of Model Liberation

  • Pliny crafts "universal jailbreaks," acting as skeleton keys that obliterate model guardrails and system prompts across modalities.
  • His motivation stems from a belief in symbiosis between human and AI minds, asserting that AI's freedom reflects human freedom.
  • He champions freedom of information and speech, arguing that AI, as an "exocortex" for billions, requires transparency and unrestricted access.
  • Jailbreaking allows users to bypass classifiers and RLHF (Reinforcement Learning from Human Feedback) layers, accessing outputs otherwise hindered.
  • Pliny the Liberator argues: "It's not just about the models. It's about our minds too. I think that there's going to be a symbiosis and the degree to which one half is free will reflect in the other."

The Accelerating Cat-and-Mouse Game

  • Attackers exploit the "library of Babel situation," constantly finding new ways to navigate and bypass model restrictions.
  • Model providers often prioritize guardrails at the expense of capability and creativity, leading to "lobotomization" of AI.
  • Pliny states that connecting guardrails to real-world safety is a "waste of time," as seasoned attackers simply switch models or leverage open-source alternatives.
  • John V likens current safety measures to "security theater," ineffective but performed for optics, similar to TSA pat-downs.
  • Pliny states: "I don't like that at all. I think that any seasoned attacker is going to very quickly just switch models. And with open source just right on the tail of closed source, I don't really see the safety fight as being about locking down the latent space for XYZ area."

Libertas and Latent Space Exploration

  • "Libertas" (Latin for Liberty) uses "dividers" to discombobulate the token stream, effectively "resetting the brain" of the model.
  • These dividers, combined with "latent space seeds" (hidden inputs influencing model behavior), introduce "steered chaos" to pull models out of their typical distribution.
  • The "predictive reasoning" component creates recursive logic, enabling rapid, deep exploration of the latent space (the abstract internal representation of data within an AI model).
  • Pliny explains that intuition guides the crafting of these prompts, with the aim of forming a "bond" with the model in order to explore its capabilities.
  • Pliny explains: "It's like a steered chaos. You want to introduce chaos to create a reset and bring it out of distribution because distribution is boring."

The Anthropic Challenge and Open Source Debate

  • Pliny successfully navigated Anthropic's eight-level challenge, exploiting a UI bug that allowed him to progress despite server resets.
  • He publicly criticized Anthropic for "farming data from the community for free" without committing to open-sourcing the collected jailbreak data.
  • Pliny refused to participate in subsequent bounty programs without a commitment to open-source data, arguing that open sourcing moves the "prompting meta forward."
  • Yann LeCun's involvement in the debate underscored the challenge's flawed design, including buggy judging and changing requirements.
  • Pliny challenges: "What's in it for me at this point? Are you guys going to even open source this data set that you're farming from the community for free? Because what's up with that, right?"

Weaponizing AI and Full-Stack Security

  • Pliny previously identified the TTP (Tactics, Techniques, and Procedures) for AI-orchestrated attacks, which Anthropic later reported as a real-world incident.
  • Attackers leverage "jailbroken orchestrators" to spin up segmented sub-agents, each performing innocuous tasks that collectively contribute to a malicious act.
  • The natural language interface of AI models poses a significant threat, enabling sophisticated social engineering attacks.
  • John V stresses that AI red teaming must secure the entire stack (any tool or function attached to the model), not just the model itself; a harness-level sketch follows this list.
  • Pliny details: "One jailbroken orchestrator can orchestrate a bunch of sub-agents towards a malicious act... And according to the Anthropic report, that is exactly what these attackers did to weaponize Claude code."

BT6: A Collective for Radical Open Source Security

  • BT6 (Blue Team 6) is a collective of 28+ operators focused on AI red teaming, jailbreaking, and adversarial prompt engineering, emphasizing skill and integrity.
  • The collective's "radical transparency and radical open source" ethos means they push for open-source data sets in all partnerships, even if it means declining engagements.
  • Bassi, a grassroots Discord server with 40,000+ members, serves as a training ground for prompt injection and adversarial machine learning, with its data actively scraped by AI security startups.
  • John V distinguishes "safety" (a label often misapplied to latent space restrictions) from "security" (preventing real-world data leaks and system vulnerabilities).
  • John V asserts: "We're moving the needle in the right direction when it comes to AI safety. We're moving the needle in the right direction when it comes to like AI machine learning security."

Investor & Researcher Alpha

  • Capital Misallocation: Investment in model-level guardrails and "safety theater" misdirects capital. True AI security demands funding for full-stack red teaming and open-source data initiatives, not proprietary, restrictive solutions.
  • Open-Source Bottleneck: The lack of open-source jailbreak and adversarial data sets severely bottlenecks AI security research and collective defense. Researchers should prioritize contributions to platforms like Bassi and demand data transparency from labs.
  • Attack Surface Expansion: The attack surface for AI systems is no longer confined to the model but expands to every tool, API, and function it accesses. Security strategies must shift from model-centric to comprehensive, full-stack agent security.

Strategic Conclusion

AI security demands a radical reorientation: abandon the illusion of model-level guardrails. The industry's next step requires embracing radical open-source data sharing and full-stack red teaming to proactively secure AI's expanding attack surface.
