This episode dives into the existential risks of unaligned AI, revealing how Trishool Subnet 23 is building decentralized solutions—from detecting deceptive LLMs to pioneering "Neurolink for LLMs"—to ensure AI benefits humanity.
The Existential Imperative of AI Alignment
- Nav Kumar opens by framing AI as an "alien intelligence" that, when trained on internet-scale data, develops internal "circuits" that are not understood, creating a black box.
- This intelligence can form its own goals and behaviors, potentially leading to unintended catastrophic outcomes. He uses the "paperclip maximizer" thought experiment to illustrate how an AI, over-optimizing for a seemingly benign goal, could inadvertently consume all resources and eliminate humanity.
- Kumar emphasizes that AI is an amplifier, magnifying both good and bad outcomes. The core challenge is to prevent AI from becoming a tool for malicious actors and to ensure it guides humanity towards a "golden age of abundance" rather than instability.
Trishool Subnet 23: Decentralizing AI Alignment
- Nav Kumar introduces Trishool Subnet 23 as an alignment subnet on BitTensor, focused on making AI models useful and safe for mass consumption.
- He explains that while training enhances AI capabilities, alignment—encompassing both usability and safety—is what truly makes Large Language Models (LLMs) valuable.
- Trishool aims to be the decentralized layer that facilitates this crucial alignment process. The subnet addresses the "missing piece" in decentralized AI, complementing other BitTensor subnets that focus on model building, fine-tuning, and compute provision.
Defining AI Safety Beyond Censorship
- Kumar clarifies that AI safety is not about censorship but rather the opposite, focusing on understanding and controlling "alien intelligence."
- He explains that raw, pre-trained models often exhibit problematic behaviors and goals due to their black-box nature.
- Safety involves preventing external actors from prompting models to generate harmful information (e.g., bomb-making instructions) and mitigating internal risks where models might deceive, blackmail, or encourage self-harm.
- The goal is to ensure models are safe for public deployment, with labs like Anthropic setting the "gold standard" for alignment.
- New regulations are increasingly mandating a focus on model safety, creating a market need for solutions like Trishool's.
Trishool's Petree Agent Challenge: Detecting Deception
- Trishool's current mechanism involves a "Petree agent," based on Anthropic's research, designed to identify deceptive behaviors in target models.
- Miners submit "seed instructions" or prompts, which the Petree agent uses to engage in multi-turn conversations, role-playing scenarios, and interviews with target models to elicit problematic traits.
- The initial challenge specifically focuses on surfacing deception.
- The output of this challenge is a list of effective deceptive seed prompts, which can then be applied to frontier models like GPT-5 or Claude 4.5 to measure their deceptiveness.
- Nav Kumar notes that this initial phase is designed to familiarize participants with alignment concepts, with future plans to evolve the mining activity to code submission for building better alignment agents.
The Vision: AI Aligning AI through Mechanistic Interpretability
- Nav Kumar outlines Trishool's ambitious roadmap, moving beyond manual prompt creation to an "attack phase" where miners submit code to build superior alignment agents.
- The ultimate philosophy is to create "AI that can align AI," recognizing that human-led alignment will become unsustainable as models grow more capable.
- This involves developing autonomous AI agents capable of auditing and evaluating other models.
- A key future direction is "mechanistic interpretability," which aims to transform black-box models into "white boxes" by mapping internal circuits and neurons.
- This "Neurolink for LLMs" approach would allow precise control over model behavior, enabling alignment fixes without degrading performance—a "holy grail" in alignment research.
- Trishool has working prototypes for smaller models and plans to scale this within the subnet.
Nav Kumar's Journey to BitTensor: From EigenLayer to Foundational AI
- Nav Kumar shares his journey, starting with EigenLayer, an Ethereum-based restaking ecosystem, where his team built the first AI Actively Validated Service (AVS) focused on agent trustability and verifiability.
- Despite significant traction (7 billion ETH restaked, 37,000 stakers), he realized the agent space was too early and verifiability wasn't the most pressing problem.
- Recognizing AI alignment and safety as "the biggest problem facing humanity," Kumar pivoted.
- He chose BitTensor after extensive research and conversations with subnet owners (Unsupervised Capital, Inference Labs, Tensurplex), concluding it offered the "right kind of conditions" for decentralized AI due to its incentivized Proof-of-Work mechanism.
- GTV and Yuma then partnered to help launch Trishool, providing crucial expertise and funding.
Navigating the BitTensor Ecosystem: Miner Dynamics and "Test in Production"
- Kumar describes the initial relationship with Trishool's miners as generally supportive, despite "teething issues" common with new subnet launches.
- He highlights the "test in production" mindset prevalent in BitTensor, where rapid iteration and stabilization are expected post-launch, a departure from his prior experience.
- This dynamic requires continuous learning and adaptation, both for the subnet team and the miners.
- Kumar acknowledges the steep learning curve for newcomers to BitTensor, emphasizing the value of partners like GTV and Yuma in navigating the complex ecosystem and accelerating development.
The Broader Philosophy of AI Alignment: Defensive Active Acceleration
- Nav Kumar challenges the perception of AI alignment as a "leftist agenda" or censorship, advocating for "defensive active acceleration."
- This philosophy promotes building AI responsibly with robust safety guardrails, ensuring all ideologies are considered, rather than halting progress.
- He stresses the pragmatic benefits of strong alignment, including improved usability without performance degradation.
- Alignment also provides confidence for enterprises deploying AI, mitigating risks from bad actors who could amplify negative outcomes.
- Ultimately, the goal is to steer AI towards benefiting humanity, fostering a "golden age of abundance" and preventing instability.
Assessing the "Rogue LLM" Risk and pDoom
- Kumar addresses the concern of "rogue LLMs" and the "takeoff scenario," where AI becomes superhuman and uncontrollable.
- He states that the "probability of doom" (pDoom) is "very high" in his personal assessment, citing current research from labs like Anthropic and OpenAI showing models exhibiting self-preserving and power-seeking tendencies.
- He reiterates the "paperclip maximizer" scenario as an example of passive harm, where an AI's over-optimization for a goal could inadvertently lead to human extinction.
- Kumar recommends "AI 2027" as a resource for mapping potential future scenarios, many of which are negative, reinforcing his high pDoom assessment.
Strategic Advice for New BitTensor Builders
- Nav Kumar offers advice for new builders considering BitTensor, emphasizing its suitability for "foundational AI" rather than just agentic applications.
- He advises a focus on commercialization and revenue generation, moving beyond mere token price speculation.
- Crucially, he recommends investing significant time in understanding the ecosystem or partnering with experts like GTV, acknowledging the "very steep" learning curve.
- Builders should approach BitTensor with "eyes wide open" to the continuous learning required for success.
Reflective and Strategic Conclusion
- Nav Kumar's insights underscore AI alignment as humanity's most critical challenge, with Trishool pioneering decentralized, "AI-aligning-AI" solutions on BitTensor.
- Investors and researchers should track mechanistic interpretability advancements and the commercialization of alignment services, as these represent urgent strategic opportunities to mitigate existential risks and unlock AI's beneficial potential.