This episode reveals how Macrocosmos is pioneering decentralized AI training by combining pipeline and data parallelism, a world first that could democratize the creation of large-scale models beyond the reach of centralized giants.
Introduction to Macrocosmos and the IOTA Transition
- The Old Approach (Subnet 9 on Bittensor): Steffen explains their previous model was like a "Kaggle competition" where miners trained entire models in isolation. They could copy from each other by downloading models from HuggingFace, but this led to duplicated work and became prohibitively expensive for larger models.
- The New Vision (IOTA): The team shifted to a collaborative, distributed training model. The goal is to have many participants contribute a small piece of the work, which is then combined into a final product that represents the sum of all efforts, rather than just the best individual effort.
A New Architecture: Pipeline and Data Parallelism
- Model Layers: Felix clarifies that a large model, like a 70-billion parameter LLM, is composed of many "transformer blocks," or layers. Instead of one miner hosting the entire 80-layer model, the model is split, and a pool of miners is assigned to each individual layer. This dramatically lowers the hardware and capital requirements for participation.
- Pipeline Parallelism Explained: Steffen uses a car assembly line analogy. Data flows through the model layer by layer, with each miner group performing a specific task before passing it on. This is pipeline parallelism: the model itself is chopped into sequential pieces. The technique is standard in centralized data centers for training models like GPT-4 but is a world first in a decentralized setting (see the sketch after this list).
- Data Parallelism Explained: The system also uses data parallelism, a more common federated learning technique where the dataset is split into chunks. Each miner in a layer group processes a different piece of data simultaneously.
- Strategic Implication: By combining both methods, Macrocosmos creates a system with a low barrier to entry (from pipeline parallelism) while accelerating the learning process (from data parallelism). Steffen notes, "to our knowledge, no one has done a pipeline parallel decentralized training run in history. So this is actually a world first."
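To make the combination concrete, below is a minimal Python sketch of how layer pools (pipeline parallelism) and per-sample shard assignment (data parallelism) could fit together. The constants, miner names, and round-robin routing are assumptions for illustration, not details of the IOTA implementation.

```python
# Illustrative sketch (not Macrocosmos's code) of combining pipeline and data
# parallelism: the model's layers are split across miner pools, and samples
# within a pool are sharded across miners.

NUM_LAYERS = 80          # e.g. an 80-layer, 70-billion-parameter LLM
MINERS_PER_LAYER = 4     # hypothetical pool size assigned to each layer

# Pipeline parallelism: each layer index maps to its own pool of miner IDs.
layer_pools = {
    layer: [f"miner-{layer}-{i}" for i in range(MINERS_PER_LAYER)]
    for layer in range(NUM_LAYERS)
}

# Data parallelism: within a pool, different miners handle different data shards.
def route_sample(sample_id: int) -> list[str]:
    """Return the pathway of miners a single data sample visits, layer by layer."""
    pathway = []
    for layer in range(NUM_LAYERS):
        pool = layer_pools[layer]
        miner = pool[sample_id % len(pool)]  # simple round-robin shard assignment
        pathway.append(miner)
    return pathway

if __name__ == "__main__":
    print(route_sample(0)[:3])  # first three hops of sample 0's pathway
    print(route_sample(1)[:3])  # sample 1 takes a different pathway
```

The key property is that any single miner only needs to hold one layer's weights, while different samples still move through the pipeline concurrently.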
The Orchestrator: The Brains of the Network
- State and Task Management: When miners join, the Orchestrator assigns them to a specific layer. It then manages the flow of data, ensuring each data sample is routed through every layer in the correct sequence (a toy sketch of this bookkeeping follows this list).
- Preventing Inefficiencies: Miners communicate with a central storage bucket (an S3 bucket) via special, authenticated URLs provided by the Orchestrator. This prevents issues like collusion or miners wasting effort by processing the same data sample.
- Coordinating Learning Cycles: The Orchestrator also signals when the network should pause training and enter a "merging phase." During this phase, miners within the same layer share and synchronize their results before resuming training.
- Investor Insight: The Orchestrator is the core intellectual property and the basis for a potential "training-as-a-service" product. Customers could define their model and data needs, and the Orchestrator would configure the network to execute the training run.
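The sketch below shows, under assumptions, what the Orchestrator's bookkeeping might look like: balancing layer assignments as miners join, issuing per-task storage URLs, and deciding when to enter a merging phase. The class, method names, and URL format are hypothetical, not the production Orchestrator.

```python
import itertools
import random

class Orchestrator:
    """Toy sketch of the coordination logic described above (assumed design)."""

    def __init__(self, num_layers: int):
        self.num_layers = num_layers
        self.layer_assignments = {layer: [] for layer in range(num_layers)}
        self._task_counter = itertools.count()

    def register_miner(self, miner_id: str) -> int:
        # Assign the new miner to the least-populated layer to keep pools balanced.
        layer = min(self.layer_assignments, key=lambda l: len(self.layer_assignments[l]))
        self.layer_assignments[layer].append(miner_id)
        return layer

    def issue_task(self, sample_id: int, layer: int) -> dict:
        # In the real system this would be an authenticated (e.g. presigned S3) URL;
        # here it is just a placeholder string.
        task_id = next(self._task_counter)
        miner = random.choice(self.layer_assignments[layer])
        return {
            "task_id": task_id,
            "miner": miner,
            "upload_url": f"https://example-bucket/activations/{sample_id}/{layer}",
        }

    def should_merge(self, steps_done: int, merge_every: int = 100) -> bool:
        # Signal the network to pause training and enter a merging phase.
        return steps_done > 0 and steps_done % merge_every == 0
```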
Validation: Ensuring Honest Work with Shadow Audits
- How It Works: A validator spot-checks a miner's work by taking its place, processing the exact same data, and comparing the results (a simplified sketch follows this list). This method is highly secure and reproducible.
- The Efficiency Problem: While effective, this is computationally expensive. If validators had to redo all work, it would negate the benefit of a distributed network. They currently only check about 10% of the work to maintain a balance.
- The North Star for Research: Steffen emphasizes the core challenge: "If the validators were outnumbered 100 to one by miners or a thousand to one, what could we do to allow them to keep up with the network so that they don't slow everything down to a halt?"
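A shadow audit can be pictured as a probabilistic spot check: roughly one task in ten is recomputed by the validator and compared against the miner's submission. The function below is a simplified illustration, assuming numeric outputs and a fixed tolerance; it is not the actual validator code.

```python
import random

def shadow_audit(miner_output, recompute_fn, inputs, tolerance=1e-4, audit_rate=0.1):
    """Spot-check a miner's submission (illustrative sketch, not the IOTA validator).

    With probability `audit_rate`, the validator takes the miner's place, reprocesses
    the same inputs, and accepts the work only if the outputs match within tolerance.
    """
    if random.random() > audit_rate:
        return True  # not audited this round; accepted on trust
    reference = recompute_fn(inputs)
    return all(abs(a - b) <= tolerance for a, b in zip(miner_output, reference))

# Hypothetical usage: the "layer" here is just a doubling function.
honest = shadow_audit([2.0, 4.0], lambda xs: [2 * x for x in xs], [1.0, 2.0], audit_rate=1.0)
cheater = shadow_audit([0.0, 0.0], lambda xs: [2 * x for x in xs], [1.0, 2.0], audit_rate=1.0)
print(honest, cheater)  # True False
```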
The Future of Validation: CLASP
- The Core Idea: Because miners are randomly paired in pathways through the network, they don't know whose work they are processing. If a bad actor consistently produces poor results (e.g., high "loss," which is the model's error rate), it creates a statistical anomaly.
- Asymmetric Validation: A validator can detect these anomalies by simply reading logs of pathway performance. This is highly asymmetric: the validator can verify work without redoing the expensive computation, performing only a lightweight statistical analysis (sketched after this list).
- Maturity Requirement: Felix notes that CLASP only works after the model has "warmed up." Initially, all outputs are random, so a bad actor is indistinguishable. As the model learns and loss decreases, cheaters become statistically obvious.
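Here is a rough sketch of the statistical idea, assuming the validator only sees logs pairing each pathway's miners with the loss that pathway produced: average the losses attributed to each miner and flag outliers. The z-score thresholding here is a placeholder, not the published CLASP method.

```python
from collections import defaultdict
from statistics import mean, stdev

def flag_suspect_miners(pathway_logs, z_threshold=3.0):
    """Flag miners that consistently appear in high-loss pathways (illustrative only).

    `pathway_logs` is a list of (miners_in_pathway, pathway_loss) tuples, i.e. the
    kind of lightweight log a validator could read without redoing any computation.
    """
    per_miner = defaultdict(list)
    for miners, loss in pathway_logs:
        for m in miners:
            per_miner[m].append(loss)

    miner_means = {m: mean(losses) for m, losses in per_miner.items()}
    overall = list(miner_means.values())
    mu, sigma = mean(overall), stdev(overall)  # needs at least two miners
    # A miner whose average pathway loss sits far above its peers is a statistical anomaly.
    return [m for m, v in miner_means.items() if sigma > 0 and (v - mu) / sigma > z_threshold]
```

Note how the check matches the maturity requirement: early in training, when every pathway's loss is similarly high, the per-miner averages are indistinguishable and nothing is flagged.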
Overcoming Bottlenecks: Compression and Activations
- The Speed Problem: Felix states that the system's core limitation is internet speed, with more time currently spent on data uploads than on computation. The goal is to flip this ratio.
- 128x Compression: The team modified the LLaMA architecture to reduce the amount of data that must be passed between layers. This was an architectural change, not just a simple compression trick, effectively giving the model less "space" and forcing it to use that space more efficiently (see the toy sketch after this list).
- Researcher Insight: This innovation in reducing activation size is critical for making decentralized pipeline parallelism viable. The team's empirical approach was later validated by a theoretical paper from another team (Pluralis), proving that the training process naturally encourages models to find compact representations.
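The sketch below illustrates only the general idea of a narrow inter-layer representation, using a toy PyTorch bottleneck; the actual 128x reduction was achieved by changing the LLaMA architecture itself, and the dimensions and module layout here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BottleneckedBlock(nn.Module):
    """Toy illustration of shrinking the activations passed between pipeline stages."""

    def __init__(self, hidden_dim: int = 8192, compression: int = 128):
        super().__init__()
        bottleneck_dim = hidden_dim // compression  # e.g. 8192 -> 64
        self.block = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # this small tensor is what crosses the internet
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # expanded again by the next stage's miner

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        compressed = self.down(self.block(x))
        return self.up(compressed)

if __name__ == "__main__":
    layer = BottleneckedBlock()
    out = layer(torch.randn(1, 16, 8192))  # (batch, sequence, hidden)
    print(out.shape)
```

The design trade-off is exactly the one described above: the model has less representational "space" between stages, so training pushes it to pack information into that space more efficiently.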
Efficient Merging: Butterfly All-Reduce
- How It Works: Felix explains that if there are 10 miners, each miner splits its gradient vector into 10 sections. The first miner is responsible for averaging the first section from all other miners, the second miner averages the second section, and so on (a single-process sketch follows this list).
- Scalability and Fault Tolerance: This method has constant complexity, meaning the amount of data each miner transfers remains the same regardless of how many miners join the network. They have also added redundancy, where multiple miners are responsible for each section, allowing for the detection of bad actors who might try to "pollute" the model with bad gradients.
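A minimal single-process sketch of the butterfly all-reduce idea follows; it simulates the sectioning and averaging in one place rather than over a network, and assumes the gradient length divides evenly among the miners.

```python
def butterfly_all_reduce(gradients):
    """Average gradients the butterfly way (illustrative, single-process sketch).

    With N miners, each gradient vector is split into N sections; miner i gathers
    section i from every peer, averages it, and the averaged sections are then
    reassembled so every miner ends up with the same merged vector.
    """
    n = len(gradients)
    section = len(gradients[0]) // n  # assume the vector divides evenly

    averaged_sections = []
    for i in range(n):  # miner i is responsible for section i
        start, end = i * section, (i + 1) * section
        chunk = [sum(g[j] for g in gradients) / n for j in range(start, end)]
        averaged_sections.append(chunk)

    merged = [x for chunk in averaged_sections for x in chunk]
    return [merged[:] for _ in range(n)]  # identical averaged copy for each miner

if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]  # two miners, 4-dim gradients
    print(butterfly_all_reduce(grads)[0])  # [2.0, 3.0, 4.0, 5.0]
```

Because each miner uploads only its own sections and downloads only the averaged ones, its traffic stays roughly constant as the pool grows, which is the scalability property described above; assigning each section to more than one miner adds the redundancy used to catch polluted gradients.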
Future Vision and Defining Success
- Short-Term (Next 3 Months): The immediate focus is on refining the design and hardening the research code into a production-ready system. They are currently running dozens of short experiments to calibrate the 15-billion parameter model before committing to a full, multi-week training run.
- Long-Term Vision:
- Accessibility: Lowering hardware requirements from high-end A100 GPUs to consumer-grade hardware like a MacBook, enabling massive, "Folding@home" scale participation.
- Model Size: Leveraging the architecture's unique ability to scale model size indefinitely. Steffen notes, "training a 100, 400, or above billion parameter model is just not possible with any other distributed training network in the world."
- Strategic Implication: Success means creating a permissionless, democratized platform for training state-of-the-art AI models, complete with novel IP models where contributors earn a perpetual license to the models they help create.
Advice for Innovators: Go for the Moonshot
- Go Big or Go Home: Steffen argues that aiming for incremental improvements is a "death sentence" in the current landscape. A well-funded team will likely outcompete you. Instead, teams need a "moonshot" dream that is almost unachievable, as the passion required to pursue it is contagious and motivating.
- First-Mover Advantage: Felix adds that the rapid pace of AI creates unique opportunities for small teams to develop truly novel ideas. He states, "you just have the opportunity to quite easily... come up with something that's like truly novel which I think is if you look at the history of AI is quite like a unique spot to be in."
Conclusion
This conversation highlights a potential paradigm shift in AI development, moving from centralized data centers to decentralized, global collaboration. For investors and researchers, Macrocosmos's work on IOTA is a critical experiment to watch, as its success could unlock new markets and research avenues for truly democratized AI.