This episode reveals how Ready AI is transforming the web's unstructured chaos into a structured, queryable library for next-generation AI agents, using Bittensor's Subnet 33 to incentivize this massive data annotation effort.
Introduction: Structuring Data for the Future of AI
- David, co-founder of Ready AI, introduces the company's mission: to structure the world's data and make it universally accessible to AI. He highlights the scale of the problem: the world is projected to hold roughly 335 zettabytes of data by 2027, about 90% of it unstructured, with only 3-5% currently tagged for AI workloads.
- Ready AI was drawn to Bittensor for its powerful incentive and validation mechanisms, which are perfectly suited for data annotation.
- By using LLMs for annotation tasks within Bittensor's framework, they can dramatically lower costs and democratize access to structured data beyond major tech companies.
- The team's background includes building AI agents for entities like Common Crawl and crypto personality Seedphrase, reinforcing their core belief that structured data is the "lynchpin to highly accurate model outcomes."
The Three Revolutions of AI Data
- Dan, co-founder and CTO of Ready AI, outlines a three-stage evolution of how AI uses data, positioning Subnet 33 at the forefront of the next major shift.
- First AI Revolution (Single-Dimensional Data): Characterized by early LLMs (like GPT-2) trained on vast amounts of unstructured, unrefined data, such as raw web scrapes and text blobs.
- Second AI Revolution (Semi-Structured Data): The current era, defined by two-dimensional data that enables Retrieval-Augmented Generation (RAG). RAG is a technique where an LLM's knowledge is supplemented with external, relevant data during inference. Subnet 33 initially focused here, adding speaker identification and categorization to conversational data.
- Third AI Revolution (N-Dimensional Augmented Data): The emerging paradigm where each piece of data is augmented with multiple layers of meaningful, synthetic data. Instead of just raw text, a webpage's data would include summaries, analysis of its CSS (e.g., color psychology, text-to-image ratio), and other rich contextual information (a minimal sketch of such a record follows below).
- Dan explains the potential of this new paradigm: "You can see how all of this information actually needs to be created, but then, once it's created, has this profound effect on how that data can be organized and used."
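To make the contrast concrete, here is a minimal Python sketch of what one such augmented record could look like. The field names are invented for illustration and are not Ready AI's actual schema.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical illustration of an "n-dimensional" augmented record: the raw
# page text is kept, but it is surrounded by layers of synthetic metadata.
# Field names here are invented for illustration, not Ready AI's schema.
@dataclass
class AugmentedPage:
    url: str
    raw_text: str                        # first revolution: the unstructured blob
    summary: str = ""                    # second revolution: semi-structured additions
    speaker_tags: list[str] = field(default_factory=list)
    # third revolution: many parallel layers of synthetic annotation
    css_analysis: dict[str, str] = field(default_factory=dict)   # e.g. color psychology
    text_to_image_ratio: float | None = None
    topic_tags: list[str] = field(default_factory=list)
    embedding: list[float] = field(default_factory=list)

page = AugmentedPage(
    url="https://example.com/cross-country-skiing",
    raw_text="...",
    summary="Gear guide for beginner cross-country skiers.",
    css_analysis={"palette": "cool blues, high contrast", "mood": "calm"},
    text_to_image_ratio=0.6,
    topic_tags=["skiing", "winter sports", "gear"],
)
```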
MCP: The Key to Actionable Data
- The conversation pivots to how this n-dimensional data becomes usable. Dan identifies MCP (Model Context Protocol) as the critical technology that unlocks this potential. MCP acts as a universal plug-in environment for LLMs, allowing them to dynamically access specialized data sources and processes, much like a printer driver allows any program to print.
- Strategic Implication: MCP shifts the paradigm from monolithic, centralized data stores (like Google's) to a decentralized landscape of specialized, interoperable data sources.
- Dan provides a concrete example: An MCP-enabled website builder could analyze augmented data from other successful cross-country skiing sites to automatically suggest appropriate color schemes, font sizes for an older demographic, and content themes, moving far beyond generic templates.
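As a rough illustration of the plug-in pattern Dan describes, the toy Python sketch below shows a server exposing named, self-describing tools that an agent can discover and call at runtime. It is not the actual Model Context Protocol SDK or wire format, just the underlying idea in miniature.

```python
# Toy illustration of the MCP idea: a server exposes named "tools" with
# machine-readable descriptions, and any LLM client can discover and call
# them at inference time. This is NOT the real MCP SDK, just a stand-in
# to show the plug-in pattern.
from typing import Callable

TOOLS: dict[str, dict] = {}

def tool(name: str, description: str):
    """Register a function as a discoverable tool."""
    def register(fn: Callable) -> Callable:
        TOOLS[name] = {"description": description, "fn": fn}
        return fn
    return register

@tool("site_style_stats", "Return aggregated style metadata for a niche of websites.")
def site_style_stats(niche: str) -> dict:
    # In a real deployment this would query an augmented dataset
    # (e.g. Subnet 33 output); here it returns canned data.
    return {"niche": niche, "common_palette": "cool blues", "median_font_px": 18}

def list_tools() -> list[dict]:
    """What an agent would see when it 'plugs in' to this server."""
    return [{"name": n, "description": t["description"]} for n, t in TOOLS.items()]

def call_tool(name: str, **kwargs):
    return TOOLS[name]["fn"](**kwargs)

print(list_tools())
print(call_tool("site_style_stats", niche="cross-country skiing"))
```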
Major Announcement: The Common Crawl Partnership
- David announces a major new initiative: Ready AI is expanding its partnership with Common Crawl to process its entire 20-terabyte web crawl, optimizing it for MCP use with Markdown and llms.txt formatting.
- Common Crawl is a massive, publicly available repository of web-crawled data, forming the foundation for many large language models.
- llms.txt is a proposed standard for formatting web data so that it can be understood easily and precisely by AI agents, which is crucial for reducing hallucinations and improving accuracy.
- This project aims to create a "2.0 version of the crawl," moving beyond simple cleaning for pre-training to creating a rich, structured dataset for advanced AI agent applications.
- A portion of this Common Crawl data is already being processed on Subnet 33 as of the episode's recording.
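As a rough sketch of what agent-friendly formatting could look like, the snippet below builds an llms.txt-style Markdown index from a handful of page records. The layout and field names are illustrative assumptions; the exact formatting Ready AI applies to the Common Crawl data is not detailed in the episode.

```python
# Rough sketch of agent-friendly output: a Markdown document per page plus
# an llms.txt-style index linking to them with one-line summaries.
# Field names and layout are illustrative, not Ready AI's actual pipeline.
pages = [
    {"title": "Wax Guide", "url": "https://example.com/wax.md",
     "summary": "How to choose ski wax by temperature."},
    {"title": "Trail Map", "url": "https://example.com/trails.md",
     "summary": "Groomed cross-country trails by region."},
]

def build_index(site_name: str, site_summary: str, pages: list[dict]) -> str:
    lines = [f"# {site_name}", "", f"> {site_summary}", "", "## Pages", ""]
    lines += [f"- [{p['title']}]({p['url']}): {p['summary']}" for p in pages]
    return "\n".join(lines)

print(build_index("Nordic Skiing Hub", "Guides and maps for cross-country skiing.", pages))
```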
How Subnet 33 Works: A Technical Deep Dive
- The team explains the mechanics of Subnet 33, detailing how it incentivizes the creation of high-quality metadata.
- Process Flow:
- Validators source heterogeneous data (e.g., Common Crawl pages, social media posts, enterprise data).
- A validator generates a "ground truth" embedding for an entire document (e.g., a full webpage).
- The document is broken into smaller "windows" or chunks, which are sent to multiple miners.
- Miners annotate their assigned window and return metadata.
- The validator scores the miners' output based on its proximity to the ground truth embedding and the uniqueness of the tags provided. This "fractal data mining" approach rewards both relevance and specificity (see the sketch after this list).
- Actionable Insight: This validation mechanism inherently performs QA, ensuring high-quality data while keeping costs low. The final, annotated data is then pushed to storage endpoints like Hugging Face or private enterprise databases.
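The sketch below restates that flow in simplified Python. The embedding function is a toy stand-in and the scoring weights are invented, so this is a schematic of the described mechanism rather than Subnet 33's actual code.

```python
# Simplified sketch of the validation flow described above. The "embedding"
# is a toy hashed bag-of-words (any sentence-embedding model could be
# substituted), and the scoring weights are invented placeholders.
import math
from collections import Counter

def embed(text: str, dim: int = 64) -> list[float]:
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def chunk(document: str, window: int = 50) -> list[str]:
    words = document.split()
    return [" ".join(words[i:i + window]) for i in range(0, len(words), window)]

def score_miners(document: str, miner_tags: dict[str, list[str]]) -> dict[str, float]:
    """Reward proximity to the whole-document embedding plus tag uniqueness."""
    ground_truth = embed(document)
    tag_counts = Counter(t for tags in miner_tags.values() for t in tags)
    scores = {}
    for miner, tags in miner_tags.items():
        relevance = cosine(embed(" ".join(tags)), ground_truth)
        uniqueness = sum(1.0 / tag_counts[t] for t in tags) / max(len(tags), 1)
        scores[miner] = 0.7 * relevance + 0.3 * uniqueness   # illustrative weights
    return scores

doc = "full webpage text goes here ..."
windows = chunk(doc)                     # each window would be sent to different miners
print(score_miners(doc, {"miner_a": ["skiing", "gear guide"],
                         "miner_b": ["skiing", "winter"]}))
```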
Future Roadmap: A GAN-based Incentive Mechanism
- Looking ahead, David discusses plans to re-architect the subnet's incentive mechanism, drawing inspiration from the GAN (Generative Adversarial Network) architecture implemented by Bittensor's Subnet 1 (Apex).
- GANs are a class of machine learning models where two neural networks, a generator and a discriminator, compete against each other.
- Proposed Implementation:
- Generator Miners: Would create annotations under tight time constraints.
- Validator: Would also generate reference annotations but without time constraints, allowing for more complex, high-quality processes like chain-of-thought prompting.
- Discriminator Miners: Would be rewarded for correctly identifying whether an annotation came from a generator miner or the validator (see the sketch after this list).
- Strategic Implication: This architecture creates a more dynamic and competitive environment, pushing miners to improve both the speed and quality of their annotations while allowing the subnet to handle a more diverse range of complex data-tagging tasks.
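The sketch below lays out that proposed loop schematically in Python. All function bodies, the deadline, and the reward rule are invented placeholders, since the design is still on the roadmap rather than in production.

```python
# Schematic sketch of the proposed GAN-style loop: generator miners annotate
# under a deadline, the validator produces an unconstrained reference
# annotation, and discriminator miners are rewarded for telling them apart.
# Every function body and scoring rule here is an invented stand-in.
import random
import time

def generator_miner(window: str, deadline_s: float = 2.0) -> str:
    start = time.time()
    annotation = f"quick tags for: {window[:30]}"        # placeholder fast model
    assert time.time() - start <= deadline_s, "generator missed the deadline"
    return annotation

def validator_reference(window: str) -> str:
    # No time limit: could run chain-of-thought prompting, self-critique, etc.
    return f"carefully reasoned tags for: {window[:30]}"

def discriminator_miner(annotation: str) -> str:
    # Guess the source: "generator" or "validator". Placeholder heuristic.
    return "validator" if "carefully" in annotation else "generator"

def run_round(window: str) -> dict[str, float]:
    samples = [("generator", generator_miner(window)),
               ("validator", validator_reference(window))]
    random.shuffle(samples)
    correct = sum(discriminator_miner(a) == src for src, a in samples)
    return {"discriminator_reward": correct / len(samples)}

print(run_round("full webpage text goes here ..."))
```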
Enterprise Adoption and Competitive Advantage
- David outlines Ready AI's strong position for enterprise adoption, driven by significant cost and efficiency advantages.
- Cost & Speed: The subnet can deliver annotations for a tenth of a cent per tag, compared to 2-10 cents for human-led firms like Scale AI. The built-in QA process reduces turnaround time from days or weeks to minutes.
- Market Opportunity: The structured data market is projected to grow from $5 billion to $17 billion. The recent disruption at Scale AI (after Meta acquired a major stake, key customers like Google and OpenAI pulled back) highlights a critical enterprise need for data sovereignty and control, which decentralized solutions like Bittensor can provide.
- Traction: Ready AI is actively engaged in six proof-of-concept trials with large enterprises and has a sales pipeline of $2.7 million, signaling strong commercial interest.
AI Agents and Open Source Contributions
- Ready AI is demonstrating the power of its structured data by deploying AI agents.
- The Tao Agent, an AI assistant for the Bittensor ecosystem, was launched on Twitter to provide insights using structured data.
- Upcoming Feature: Access to a private terminal for the Tao Agent will be token-gated, requiring users to hold Ready AI's token, creating a direct value link between their commercial products and the token.
- The agent will begin posting comprehensive summaries of Novelty Search episodes, with the transcripts being added to its queryable dataset.
- Their first open-source conversational dataset is already the #1 most downloaded on Hugging Face, validating the quality of the subnet's output.
The Post-dTAO Era: Building a Resilient Community
- The discussion concludes with reflections on operating after Bittensor's dTAO (Dynamic TAO) update, which shifted control over token emissions from the central root network to market-driven demand for individual subnet tokens.
- David notes the update created a forcing function for subnets to focus on revenue generation and building community awareness.
- Strategic Shift: Ready AI has intensified its focus on enterprise sales and developed a "miner optimization toolkit" to make it easier for new participants to join Subnet 33.
- The hosts observe that in the post-dTAO world, miners are no longer just transient participants but can become long-term, aligned partners who contribute to a subnet's success by providing feedback and support.
Conclusion
This episode underscores that the next frontier for AI is not just more data, but better, more structured data. Ready AI's work on Subnet 33 demonstrates a scalable, decentralized model for creating this crucial resource. For investors and researchers, the key takeaway is to monitor the convergence of MCP, agentic AI, and decentralized data networks, as this is where the next wave of value creation will occur.