taostats
July 4, 2025

Novelty Search July 3, 2025

David and Dan from Ready AI outline their ambitious plan to structure the world's data for the next wave of AI. They detail how Bittensor Subnet 33 is creating 'n-dimensional' data to power a future of hyper-specific, intelligent agents.

The Third AI Revolution: N-Dimensional Data

  • "In this new paradigm, in this n-dimensional data world, we launch into a world where each piece of data is augmented with meaningful synthetic data that's generated specifically for it."
  • "We see a much broader impact for that... Business data could be analyzed across time to include summary trends per company or per industry. Stock analysis could be attached and correlated to industry trends on a granular level."
  • Ready frames the AI landscape in three revolutions. The first was training on raw, unstructured data. The second, which we're in now, is semi-structured data for RAG. The third will be built on "n-dimensional data," where every data point is enriched with layers of AI-generated context.
  • This isn't just cleaning data; it's creating new, qualitative dimensions. For example, analyzing a website’s CSS to infer its target audience and tone, or its JavaScript to understand user experience preferences—data that's impossible to get from raw text alone.
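
To make this concrete, here is a minimal sketch of what CSS-based inference could look like. The heuristics are hypothetical illustrations of the idea, not Ready AI's actual pipeline:

```python
import re

# Hypothetical heuristics: infer audience hints from a page's stylesheet,
# the kind of qualitative dimension described above.
def css_audience_hints(css: str) -> dict:
    # Collect declared font sizes (in pixels) and hex colors.
    font_sizes = [int(m) for m in re.findall(r"font-size:\s*(\d+)px", css)]
    colors = re.findall(r"#[0-9a-fA-F]{6}", css)

    hints = {}
    if font_sizes:
        # Larger default type often targets readability-sensitive audiences.
        avg = sum(font_sizes) / len(font_sizes)
        hints["likely_older_audience"] = avg >= 18
    # A narrow palette can signal a minimalist, professional tone.
    hints["palette_size"] = len({c.lower() for c in colors})
    return hints

print(css_audience_hints("body { font-size: 20px; color: #1a1a2e; }"))
# {'likely_older_audience': True, 'palette_size': 1}
```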

Subnet 33: The Engine of Structure

  • "You guys are basically positioning yourself in between [LLMs and unstructured data] and saying we can turn that into a perfectly distributed problem, an incentivized problem... you're turning it into a library and not just a pile of trash."
  • Subnet 33 is the decentralized engine making this n-dimensional data a reality. It incentivizes miners to annotate and add rich metadata to datasets, from conversations to the entire Common Crawl web archive.
  • The validation uses a "fractal data mining" approach. Miners analyze small windows of a document and are scored on generating metadata that is both relevant to the full document and unique to their specific window, creating a rich, multi-faceted picture.
  • This process creates datasets superior to general search APIs, which are often optimized for advertising rather than precision. Ready demonstrated the difference with its Tao Agent, which provides more accurate, context-aware answers than standard web searches.

From Open Source to Enterprise

  • "Today we're excited to announce that we've actually rolled out the second part of our Common Crawl partnership... we're now doing a Common Crawl data set that's optimized for MCP."
  • "We have six active POCs right now with large enterprises. We've got about $2.7 million in sales in the pipeline... feel like we're in a strong position to really start delivering meaningful revenue generation."
  • Ready is aggressively commercializing, reporting a $2.7 million sales pipeline and six active Proof-of-Concepts with large enterprises. The subnet provides data tagging at $0.001 per tag, a fraction of the cost of centralized services like Scale AI.
  • The flagship initiative is a partnership with Common Crawl. Subnet 33 has officially begun processing the massive web archive to create the first MCP-optimized version, formatting it with Markdown and `llms.txt` to make it readily consumable by AI agents.

Key Takeaways:

  • Ready’s vision moves beyond simply retrieving data to fundamentally enhancing it, creating a new market for intelligent, structured information. This positions Subnet 33 not just as a data provider, but as a foundational layer for the emerging agentic web, with a clear strategy for turning decentralized work into enterprise revenue.
  • Data Is The New Enhanced Asset: The future isn't just accessing data, but accessing data that has been intelligently processed. Ready is turning unstructured archives like Common Crawl into the highest-quality pre-training and agentic datasets ever created.
  • The Future Is A Network of Niches: Forget one monolithic Google-like index. The agentic web will run on a network of specialized, MCP-enabled data sources. Subnet 33 is building the reference platform for this new, decentralized data economy.
  • The Bridge to Revenue Is Built: With a $2.7M sales pipeline and active enterprise pilots, Ready is demonstrating a tangible path from decentralized network incentives to real-world revenue, creating a playbook for monetizing Bittensor commodities.

For further insights and detailed discussions, watch the full video: Link

This episode reveals how Ready AI is transforming the web's unstructured chaos into a structured, queryable library for next-generation AI agents, using Bittensor's Subnet 33 to incentivize this massive data annotation effort.

Introduction: Structuring Data for the Future of AI

  • David, co-founder of Ready AI, introduces the company's mission: to structure the world's data and make it universally accessible to AI. He highlights the scale of the problem, noting that 90% of the world's 335 zettabytes of data will be unstructured by 2027, with only 3-5% currently tagged for AI workloads.
  • Ready AI was drawn to Bittensor for its powerful incentive and validation mechanisms, which are perfectly suited for data annotation.
  • By using LLMs for annotation tasks within Bittensor's framework, they can dramatically lower costs and democratize access to structured data beyond major tech companies.
  • The team's background includes building AI agents for entities like Common Crawl and crypto personality Seedphrase, reinforcing their core belief that structured data is the "lynchpin to highly accurate model outcomes."

The Three Revolutions of AI Data

  • Dan, co-founder and CTO of Ready AI, outlines a three-stage evolution of how AI uses data, positioning Subnet 33 at the forefront of the next major shift.
  • First AI Revolution (Single-Dimensional Data): Characterized by early LLMs (like GPT-2) trained on vast amounts of unstructured, unrefined data, such as raw web scrapes and text blobs.
  • Second AI Revolution (Semi-Structured Data): The current era, defined by two-dimensional data that enables Retrieval-Augmented Generation (RAG). RAG is a technique where an LLM's knowledge is supplemented with external, relevant data during inference. Subnet 33 initially focused here, adding speaker identification and categorization to conversational data.
  • Third AI Revolution (N-Dimensional Augmented Data): The emerging paradigm where each piece of data is augmented with multiple layers of meaningful, synthetic data. Instead of just raw text, a webpage's data would include summaries, analysis of its CSS (e.g., color psychology, text-to-image ratio), and other rich contextual information.

Dan explains the potential of this new paradigm: "You can see how all of this information actually needs to be created, but then, once it's created, [it] has this profound effect in [enabling] that data to be organized and used."
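
As a concrete illustration, an n-dimensionally augmented record might look like the sketch below. The field names are hypothetical, not Subnet 33's actual schema:

```python
# A hypothetical n-dimensional record: one raw data point wrapped in layers
# of synthetic, AI-generated context. All field names are illustrative.
augmented_page = {
    "raw_text": "Welcome to Nordic Trails, your cross-country skiing resource...",
    "dimensions": {
        "summary": "Beginner-friendly guide to cross-country skiing gear.",
        "css_analysis": {
            "color_psychology": "cool blues and whites; calm, outdoorsy tone",
            "text_to_image_ratio": 0.7,
        },
        "inferred_audience": "active adults, 45+",
        "industry_trend_links": ["winter-sports-retail", "outdoor-recreation"],
    },
}
```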

MCP: The Key to Actionable Data

  • The conversation pivots to how this n-dimensional data becomes usable. Dan identifies MCP (Model Context Protocol) as the critical technology that unlocks this potential. MCP acts as a universal plug-in layer for LLMs, allowing them to dynamically access specialized data sources and processes, much like a printer driver lets any program print.
  • Strategic Implication: MCP shifts the paradigm from monolithic, centralized data stores (like Google's) to a decentralized landscape of specialized, interoperable data sources.
  • Dan provides a concrete example: An MCP-enabled website builder could analyze augmented data from other successful cross-country skiing sites to automatically suggest appropriate color schemes, font sizes for an older demographic, and content themes, moving far beyond generic templates.
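
A minimal sketch of such an MCP-enabled data source, using the reference MCP Python SDK (`pip install mcp`); the tool name and returned fields are hypothetical:

```python
from mcp.server.fastmcp import FastMCP

# Sketch of an MCP server exposing augmented (n-dimensional) data as a tool.
# The tool and its payload are illustrative, not Ready AI's actual interface.
mcp = FastMCP("augmented-data")

@mcp.tool()
def get_site_dimensions(url: str) -> dict:
    """Return synthetic dimensions for a page (stubbed lookup)."""
    return {
        "url": url,
        "summary": "Cross-country skiing retailer, beginner focus.",
        "audience": "active adults, 45+",
        "suggested_palette": ["#1b3a4b", "#e8f1f2"],
    }

if __name__ == "__main__":
    mcp.run()  # serves over stdio, so any MCP-capable agent can connect
```

Any MCP-capable agent, such as the website builder in Dan's example, could discover and call a tool like this at inference time instead of relying on a monolithic index.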

Major Announcement: The Common Crawl Partnership

  • David announces a major new initiative: Ready AI is expanding its partnership with Common Crawl to process its entire 20-terabyte web crawl, optimizing it for MCP with Markdown and `llms.txt` formatting.
  • Common Crawl is a massive, publicly available repository of web-crawled data, forming the foundation for many large language models.
  • `llms.txt` is a proposed standard for formatting web content so that AI agents can parse it easily and precisely, which is crucial for reducing hallucinations and improving accuracy.
  • This project aims to create a "2.0 version of the crawl," moving beyond simple cleaning for pre-training to creating a rich, structured dataset for advanced AI agent applications.
  • A portion of this Common Crawl data is already being processed on Subnet 33 as of the episode's recording.
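
A sketch of the kind of transformation described, assuming the `html2text` library for the HTML-to-Markdown step; the `llms.txt` entry follows the proposed standard's Markdown link-list convention:

```python
import html2text  # pip install html2text

# Illustrative conversion of one crawled page into agent-friendly Markdown
# plus an llms.txt index entry. Not Ready AI's actual pipeline.
converter = html2text.HTML2Text()
converter.ignore_images = True

def to_agent_format(url: str, title: str, html: str) -> tuple[str, str]:
    markdown = converter.handle(html)
    # llms.txt entries are Markdown links with a short description.
    description = markdown.strip().splitlines()[-1][:120]
    llms_entry = f"- [{title}]({url}): {description}"
    return markdown, llms_entry

md, entry = to_agent_format(
    "https://example.com/skiing",
    "Nordic Trails",
    "<h1>Nordic Trails</h1><p>Beginner cross-country skiing guides.</p>",
)
print(entry)
# - [Nordic Trails](https://example.com/skiing): Beginner cross-country skiing guides.
```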

How Subnet 33 Works: A Technical Deep Dive

  • The team explains the mechanics of Subnet 33, detailing how it incentivizes the creation of high-quality metadata.
  • Process Flow:
    1. Validators source heterogeneous data (e.g., Common Crawl pages, social media posts, enterprise data).
    2. A validator generates a "ground truth" embedding for an entire document (e.g., a full webpage).
    3. The document is broken into smaller "windows" or chunks, which are sent to multiple miners.
    4. Miners annotate their assigned window and return metadata.
    5. The validator scores each miner's output based on its proximity to the ground truth embedding and the uniqueness of the tags provided. This "fractal data mining" approach rewards both relevance and specificity (see the sketch after this list).
  • Actionable Insight: This validation mechanism inherently performs QA, ensuring high-quality data while keeping costs low. The final, annotated data is then pushed to storage endpoints like Hugging Face or private enterprise databases.
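
The sketch below illustrates this relevance-plus-uniqueness scoring. The embedding model and the equal weighting are assumptions for illustration, not Subnet 33's actual validator code:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative validator scoring in the spirit of "fractal data mining":
# reward metadata that is close to the whole document yet not redundant
# with what other miners produced for their windows.
model = SentenceTransformer("all-MiniLM-L6-v2")

def score_miner(document: str, miner_tags: list[str],
                peer_tags: list[list[str]]) -> float:
    doc_emb = model.encode(document)                   # "ground truth" embedding
    tag_emb = model.encode(" ".join(miner_tags))
    relevance = float(util.cos_sim(doc_emb, tag_emb))  # proximity to full document
    # Uniqueness: fraction of this miner's tags that no peer already produced.
    seen = {t.lower() for tags in peer_tags for t in tags}
    uniqueness = sum(t.lower() not in seen for t in miner_tags) / len(miner_tags)
    return 0.5 * relevance + 0.5 * uniqueness          # assumed equal weighting
```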

Future Roadmap: A GAN-based Incentive Mechanism

  • Looking ahead, David discusses plans to re-architect the subnet's incentive mechanism, drawing inspiration from the GAN (Generative Adversarial Network) architecture implemented by Bittensor's Subnet 1 (Apex).
  • GANs are a class of machine learning models where two neural networks, a generator and a discriminator, compete against each other.
  • Proposed Implementation:
    • Generator Miners: Would create annotations under tight time constraints.
    • Validator: Would also generate reference annotations but without time constraints, allowing for more complex, high-quality processes like chain-of-thought prompting.
    • Discriminator Miners: Would be rewarded for correctly identifying whether an annotation came from a generator miner or the validator.
  • Strategic Implication: This architecture creates a more dynamic and competitive environment, pushing miners to improve both the speed and quality of their annotations while allowing the subnet to handle a more diverse range of complex data-tagging tasks.
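
A toy sketch of the discriminator side of this proposal (purely illustrative, since the mechanism is still on the roadmap):

```python
# The discriminator miner is rewarded for correctly attributing each
# annotation to a generator miner or to the validator's reference process.
def discriminator_reward(guesses: list[str], true_sources: list[str]) -> float:
    # Entries are "miner" or "validator"; reward is the accuracy of the guesses.
    correct = sum(g == t for g, t in zip(guesses, true_sources))
    return correct / len(true_sources)

# One round: the discriminator sees shuffled annotations without labels.
sources = ["miner", "validator", "miner", "miner"]
guesses = ["miner", "miner", "miner", "validator"]
print(discriminator_reward(guesses, sources))  # 0.5
```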

Enterprise Adoption and Competitive Advantage

  • David outlines Ready AI's strong position for enterprise adoption, driven by significant cost and efficiency advantages.
  • Cost & Speed: The subnet can deliver annotations for a tenth of a cent per tag, compared to 2-10 cents for human-led firms like Scale AI. The built-in QA process reduces turnaround time from days or weeks to minutes.
  • Market Opportunity: The structured data market is projected to grow from $5 billion to $17 billion. The recent disruption at Scale AI (after Meta acquired a major stake, key customers like Google and OpenAI pulled out) highlights a critical enterprise need for data sovereignty and control, which decentralized solutions like Bittensor can provide.
  • Traction: Ready AI is actively engaged in six proof-of-concept trials with large enterprises and has a sales pipeline of $2.7 million, signaling strong commercial interest.
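
At the quoted rates, the gap is stark. A back-of-the-envelope comparison for one million tags:

```python
# Rough cost comparison at the per-tag rates quoted above (illustrative only).
tags = 1_000_000
subnet_cost = tags * 0.001                          # $0.001 per tag on Subnet 33
human_low, human_high = tags * 0.02, tags * 0.10    # 2-10 cents per tag
print(f"Subnet 33: ${subnet_cost:,.0f} vs. human-led: "
      f"${human_low:,.0f}-${human_high:,.0f}")
# Subnet 33: $1,000 vs. human-led: $20,000-$100,000
```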

AI Agents and Open Source Contributions

  • Ready AI is demonstrating the power of its structured data by deploying AI agents.
  • The Tao Agent, an AI assistant for the Bittensor ecosystem, was launched on Twitter to provide insights using structured data.
  • Upcoming Feature: Access to a private terminal for the Tao Agent will be token-gated, requiring users to hold Ready AI's token, creating a direct value link between their commercial products and the token.
  • The agent will begin posting comprehensive summaries of Novelty Search episodes, with the transcripts being added to its queryable dataset.
  • Their first open-source conversational dataset is already the #1 most downloaded on Hugging Face, validating the quality of the subnet's output.

The Post-dτao Era: Building a Resilient Community

  • The discussion concludes with reflections on operating after Bittensor's dτao (Dynamic TAO) update, which shifted control of token emissions from the central root network to market demand for each individual subnet's token.
  • David notes the update created a forcing function for subnets to focus on revenue generation and building community awareness.
  • Strategic Shift: Ready AI has intensified its focus on enterprise sales and developed a "miner optimization toolkit" to make it easier for new participants to join Subnet 33.
  • The hosts observe that in the post-dτao world, miners are no longer just transient participants but can become long-term, aligned partners who contribute to a subnet's success by providing feedback and support.

Conclusion

This episode underscores that the next frontier for AI is not just more data, but better, more structured data. Ready AI's work on Subnet 33 demonstrates a scalable, decentralized model for creating this crucial resource. For investors and researchers, the key takeaway is to monitor the convergence of MCP, agentic AI, and decentralized data networks, as this is where the next wave of value creation will occur.
