ReadyAI co-founders David and Dan outline their mission to structure the world's data for the "third AI revolution," where AI agents require deeply contextualized, multi-dimensional information. They detail how Subnet 33 is evolving from a data-tagging engine into a foundational layer for this agentic future, backed by a major partnership with Common Crawl.
From Data Piles to Data Libraries
- "Our mission here at Ready is to structure all the world's data and make it universally accessible to AI... you're turning it into a library and not just a pile of trash."
ReadyAI argues we are entering the third wave of AI, moving beyond unstructured text (1D) and simple RAG tagging (2D). The new paradigm is built on "n-dimensional data," where each data point is augmented with rich, synthetic metadata. Instead of just scraping a webpage, SN33 analyzes its CSS to infer mood from the color scheme, its text-to-image ratio to gauge the target audience, and its content to generate summaries, producing a high-fidelity data asset suited to precise agentic workflows.
The Annotation Engine
- "We're able to deliver tags at a tenth of a cent... per tag, which is dramatically lower than these human annotation firms like Scale AI at 2 to 10 cents."
- "I have been really closely following the implementation of the GAN architecture by Subnet 1... I think that is a really elegant way to enable a much more diverse set of annotation tasks."
Subnet 33 operates as a decentralized annotation engine. Validators break down documents (like webpages) into smaller "windows" and send them to miners. Miners are incentivized to produce metadata that is both unique to the window and semantically relevant to the entire document. This "fractal mining" approach has proven to be over 50% more accurate than GPT-4 tagging on ReadyAI’s internal evals. The team is now planning to re-architect the incentive mechanism using a GAN model, enabling more complex tasks and time-based competition.
The Enterprise Play & The Agentic Web
- "Today we're excited to announce that we've actually rolled out the second part of our Common Crawl partnership where we're now doing a Common Crawl data set that's optimized for MCP."
- "We have six active proofs-of-concept right now with large enterprises. We've got about 2.7 million in sales in the pipeline."
ReadyAI is actively commercializing, targeting the $5B structured-data market. They have six active enterprise POCs and a $2.7M sales pipeline, leveraging the subnet to provide data structuring at a fraction of the cost of incumbents. The centerpiece of their strategy is a new partnership to process the entire 20TB Common Crawl dataset, optimizing it for the Model Context Protocol (MCP) with formats like LLM.txt. The result is a public, high-quality structured dataset poised to become a go-to resource for the emerging agentic web.
Key Takeaways
- ReadyAI is building a foundational layer for the next wave of AI by transforming the unstructured web into a high-fidelity, queryable library, balancing open-source contributions with a clear path to enterprise revenue. Their TAO Agent will now summarize Novelty Search podcasts and will feature a token-gated terminal, creating a direct value-accrual mechanism.
- The Gold Standard Dataset: The Common Crawl partnership is a massive value-add, creating a premium, open-source dataset structured for agentic use that could become a global standard for pre-training and RAG.
- Enterprise Adoption is Here: With 6 active POCs and a $2.7M pipeline, ReadyAI proves clear commercial demand for decentralized data structuring, offering a 95%+ cost reduction over firms like Scale AI.
- Direct Token Utility: The TAO Agent's new token-gated private terminal is a powerful experiment in direct value accrual, linking product utility to token value—a model for the entire ecosystem.
For further insights, watch the full podcast: Link

This episode reveals how ReadyAI's Subnet 33 is tackling the monumental task of structuring the web's chaotic data, creating a new paradigm of 'n-dimensional' data to power the next generation of AI agents.
Introduction: The Mission to Structure AI Data
- David, co-founder of ReadyAI, introduces the company's mission: to structure all the world's data and make it universally accessible to AI. He explains that their journey into Bittensor was driven by the need for high-quality structured data, which they identified as the critical bottleneck for building accurate AI models and agents.
- The core problem is the sheer scale of unstructured data. By 2027, 90% of the world's 335 zettabytes of data will be unstructured, with only 3-5% currently tagged and usable by AI.
- ReadyAI's experience building AI agents for groups like Common Crawl and creators like Seedphrase reinforced a key insight: structured data is the lynchpin for accurate model outcomes. Bittensor's incentive and validation mechanism presented a perfect framework to solve this problem at scale.
- David highlights the potential of using performant Large Language Models (LLMs) for data annotation, which can dramatically lower costs and democratize access beyond major tech companies.
The Three Revolutions of AI Data
- Dan, ReadyAI's co-founder and CTO, outlines a three-stage evolution of how AI uses data, framing the strategic position of Subnet 33.
- First AI Revolution (Single-Dimensional Data): This era was defined by early LLMs like GPT-2, which were trained on vast, unstructured datasets like scraped web pages and raw text blobs. The data was minimally cleaned and lacked specific context.
- Second AI Revolution (Semi-Structured Data): This is the current era, characterized by two-dimensional data that enables Retrieval-Augmented Generation (RAG)—a technique where an LLM's knowledge is supplemented with external data. Subnet 33 initially focused here, adding metadata like speaker identification and topic categorization to conversational data, making it far more useful for fine-tuning and RAG.
- Third AI Revolution (N-Dimensional Augmented Data): This emerging paradigm involves augmenting each piece of data with meaningful, context-specific synthetic data. Instead of just raw text from a webpage, this includes summaries, analysis of its CSS and JavaScript (e.g., color schemes, text-to-image ratios), and other qualitative layers. This creates a rich, multi-faceted data source for sophisticated AI agents.
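To make "n-dimensional" concrete, here is a minimal sketch of what one augmented record for a single webpage might look like. The field names and heuristics are illustrative assumptions, not ReadyAI's actual Subnet 33 schema.

```python
from dataclasses import dataclass

# Illustrative sketch only: field names and heuristics below are assumptions,
# not the actual Subnet 33 metadata schema.

@dataclass
class AugmentedPage:
    url: str
    raw_text: str
    summary: str                 # synthetic: LLM-generated abstract of the page
    topics: list[str]            # synthetic: topic tags
    dominant_colors: list[str]   # extracted from the page's CSS
    inferred_mood: str           # e.g. "calm", derived from the color scheme
    text_to_image_ratio: float   # words per image; a proxy for audience and reading depth
    inferred_audience: str       # e.g. "older recreational skiers"

def words_per_image(word_count: int, image_count: int) -> float:
    """Crude density metric: higher values indicate text-heavy pages."""
    return word_count / max(image_count, 1)

record = AugmentedPage(
    url="https://example-ski-shop.com",
    raw_text="...",
    summary="Retailer of cross-country ski gear aimed at recreational skiers.",
    topics=["cross-country skiing", "outdoor retail"],
    dominant_colors=["#2b4c7e", "#f4f4f4"],
    inferred_mood="calm",
    text_to_image_ratio=words_per_image(word_count=1800, image_count=6),
    inferred_audience="older recreational skiers",
)
```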
MCP and the Future of Actionable Data
- Dan explains that this "n-dimensional" data becomes truly powerful when combined with the Model Context Protocol (MCP), a standard that creates a plug-in environment for LLMs, allowing them to seamlessly access and use diverse, specialized data sources without being retooled.
- He provides a practical example: building a cross-country skiing website.
- Without MCP: You pick a generic template.
- With MCP and N-Dimensional Data: An AI agent accesses augmented data from other cross-country skiing sites. It suggests a calmer color scheme, durable clothing (for an older demographic), slower animations, and higher text density, all based on metadata processed by Subnet 33.
- Dan states, "This whole example would be possible with a simple MCP plug-in [sourcing] data from Common Crawl that was processed through Subnet 33."
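A rough sketch of how an agent could act on that metadata is below. It deliberately skips the MCP wire format and stands in a plain function for the plug-in's tool call; every name and field here is hypothetical, not part of any published ReadyAI or MCP interface.

```python
from statistics import mean, mode

# Hypothetical sketch: fetch_peer_metadata() stands in for an MCP tool call that
# returns Subnet 33-style augmented records for comparable sites. All names are illustrative.

def fetch_peer_metadata(niche: str) -> list[dict]:
    # In a real agent this would be a tool call against the structured Common Crawl data.
    return [
        {"inferred_mood": "calm", "dominant_colors": ["#2b4c7e"],
         "text_to_image_ratio": 310.0, "inferred_audience": "older recreational skiers"},
        {"inferred_mood": "calm", "dominant_colors": ["#3a5a40"],
         "text_to_image_ratio": 280.0, "inferred_audience": "older recreational skiers"},
    ]

def design_brief(niche: str) -> dict:
    """Derive site-design recommendations from peer metadata instead of a generic template."""
    peers = fetch_peer_metadata(niche)
    return {
        "mood": mode(p["inferred_mood"] for p in peers),
        "palette_examples": [p["dominant_colors"][0] for p in peers],
        "target_words_per_image": mean(p["text_to_image_ratio"] for p in peers),
        "audience": mode(p["inferred_audience"] for p in peers),
    }

print(design_brief("cross-country skiing"))
```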
Partnership with Common Crawl
- ReadyAI announces a major expansion of its partnership with Common Crawl, the non-profit that crawls and archives the public web.
- They are now processing the entire 20-terabyte Common Crawl dataset to optimize it for MCP, using Markdown and LLM.txt, a format designed for precise, machine-readable data that minimizes AI hallucinations (a rough conversion sketch follows this list).
- This initiative aims to create a "version 2.0" of the web crawl, moving beyond simple cleaning (like FineWeb) to produce data specifically structured for the next generation of AI agents.
- The host, Jake, clarifies ReadyAI's strategic position: "You guys are basically positioning yourself in between [LLMs and raw data] and saying well we can turn that into a perfectly distributed problem... which is how can I structure and append metadata... that makes it actually valuable. So you're turning it into a library and not just a pile of trash."
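For a sense of what this conversion involves in practice, here is a minimal, standard-library-only sketch that strips a page down to Markdown-style text of the kind an LLM.txt-like format implies. It is an illustration of the idea, not ReadyAI's actual pipeline.

```python
from html.parser import HTMLParser

# Minimal illustration: convert raw HTML into Markdown-ish plain text, dropping
# boilerplate tags. A toy stand-in for the real cleaning pipeline.

class Markdownish(HTMLParser):
    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self) -> None:
        super().__init__()
        self.out: list[str] = []
        self._skip_depth = 0
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in {"h1", "h2", "h3"}:
            self._prefix = "#" * int(tag[1]) + " "   # heading level -> Markdown prefix
        elif tag == "li":
            self._prefix = "- "

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in {"h1", "h2", "h3", "p", "li", "ul", "ol"}:
            self.out.append("\n")

    def handle_data(self, data):
        if self._skip_depth or not data.strip():
            return
        self.out.append(self._prefix + data.strip())
        self._prefix = ""

html = "<h1>Ski Waxes</h1><script>track()</script><p>Grip wax basics.</p><ul><li>Blue: cold snow</li></ul>"
converter = Markdownish()
converter.feed(html)
print("".join(converter.out))
```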
Subnet 33: Technical Architecture and Validation
- Dan provides a high-level overview of how Subnet 33 functions:
- Data Sources: Validators pull heterogeneous data from sources like the ReadyAI endpoint, Common Crawl, and private enterprise data.
- Ground Truth: The validator processes a full document (e.g., a webpage) to generate a "ground truth" set of tags and embeddings that represent the entire piece of content.
- Fractal Data Mining: The document is broken into smaller "windows" and sent to multiple miners. Miners annotate their specific window, and their returned metadata is scored against the ground truth embedding of the full document.
- Scoring: The scoring mechanism rewards miners for producing embeddings that are semantically close to the overall document's context while also providing unique, specialized tags relevant to their specific window. This incentivizes a rich, multi-layered understanding of the data.
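A stripped-down sketch of this windowing and scoring flow is shown below. The placeholder embedder, window size, and the 70/30 relevance-to-uniqueness weighting are assumptions for illustration; the production validator logic is more involved.

```python
import numpy as np

# Simplified sketch of the windowing and relevance-plus-uniqueness scoring described above.
# The embedder is a random placeholder; the weights and window size are assumed values.

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: in practice a sentence-embedding model would be used here."""
    rng = np.random.default_rng(abs(hash(tuple(texts))) % 2**32)
    return rng.normal(size=(len(texts), 384))

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def make_windows(document: str, size: int = 200) -> list[str]:
    """Break a document into fixed-size word windows, one per miner."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score_miner(miner_tags: list[str], doc_embedding: np.ndarray,
                other_miners_tags: list[list[str]]) -> float:
    """Reward closeness to the whole-document ground truth, plus tags no one else produced."""
    tag_embedding = embed(miner_tags).mean(axis=0)
    relevance = cosine(tag_embedding, doc_embedding)
    seen_elsewhere = {t for tags in other_miners_tags for t in tags}
    uniqueness = len(set(miner_tags) - seen_elsewhere) / max(len(miner_tags), 1)
    return 0.7 * relevance + 0.3 * uniqueness   # assumed weighting

document = "full webpage text ..."
doc_embedding = embed([document])[0]    # validator's ground-truth embedding
windows = make_windows(document)        # each window goes to a different miner
```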
Future Roadmap: GANs, Enterprise, and New Data
- David discusses the future evolution of the subnet's incentive mechanism, drawing inspiration from Subnet 1's implementation of a Generative Adversarial Network (GAN). A GAN is a system of two competing neural networks—a generator and a discriminator—used here to create a more robust and flexible validation process.
- Generator Miners will produce tags under time constraints.
- The validator will generate a higher-quality reference answer without time constraints.
- Discriminator Miners will be rewarded for identifying the difference between the two. This architecture allows for more diverse and complex annotation tasks (a minimal sketch of such a round appears at the end of this section).
- The team is heavily focused on enterprise adoption, with six active Proof-of-Concepts (POCs) and a $2.7 million sales pipeline. They offer a private API for enterprises to process proprietary data, such as social media feeds or financial documents in XBRL (eXtensible Business Reporting Language) format.
- The launch of the TAO Agent, an AI agent for the Bittensor ecosystem, serves as a showcase for the quality of their structured data. Access to its private terminal will be token-gated, creating a direct value sink for the TAO token.
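Since the GAN-style mechanism is still planned rather than deployed, the sketch below is heavily hedged: it only illustrates the discriminator side of a round, and all structures and scoring values are invented for the example.

```python
import random

# Toy sketch of the planned GAN-style round: generator miners tag under a deadline,
# the validator produces an unconstrained reference answer, and discriminator miners
# are rewarded for correctly picking out the reference. All scoring here is invented.

def run_round(generator_tags: list[str], validator_tags: list[str], discriminator) -> dict:
    presentations = [(generator_tags, False), (validator_tags, True)]
    random.shuffle(presentations)               # hide which tag set is the reference
    rewards = {}
    for i, (tags, is_reference) in enumerate(presentations):
        guess = discriminator(tags)             # miner's call: "is this the reference answer?"
        rewards[f"presentation_{i}"] = 1.0 if guess == is_reference else 0.0
    return rewards

# Toy discriminator: assumes richer, more specific tag sets look like the validator's work.
toy_discriminator = lambda tags: len(tags) >= 5

print(run_round(
    generator_tags=["skiing", "retail", "winter"],               # produced under time pressure
    validator_tags=["cross-country skiing", "nordic gear retail", "grip wax guides",
                    "older recreational skiers", "calm color palette"],
    discriminator=toy_discriminator,
))
```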
The Strategic Advantage Over Centralized Solutions
- ReadyAI's decentralized approach offers a significant competitive advantage over centralized services like Scale AI.
- Cost: They can deliver tags at a tenth of a cent, compared to 2-10 cents per tag for human annotation firms, roughly a 95-99% cost reduction.
- Speed: The built-in QA process of the validator-miner system reduces processing time from days or weeks to seconds or minutes.
- Data Control: The recent disruption at Scale AI (where major clients left after Meta acquired a large stake in the company) highlights a critical enterprise need: to own and control their data. David argues that enterprises will increasingly look to host their own infrastructure on Bittensor to maintain data sovereignty.
Impact of DTAO and Community Alignment
- The conversation shifts to the impact of Dynamic TAO (DTAO) on the subnet's operations. David notes that DTAO created a forcing function for two key strategic shifts:
- Revenue Focus: A stronger push towards enterprise sales and monetization through their jobs interface and API.
- Community and Marketing: A more active approach to marketing and fostering the miner community, exemplified by their "Miner Op Kit" and one-click miner.
- The host observes that post-DTAO, miners who remain with a subnet become more aligned and act as part of the team, providing valuable feedback and support. David agrees, emphasizing the long-term value of building these relationships.
Conclusion
ReadyAI is building a foundational data supply chain for the agentic web, transforming raw information into a high-value, structured commodity. Investors and researchers should monitor the adoption of MCP and these new structured datasets, as they represent a fundamental shift in how AI will access and utilize information, creating new market opportunities.