This episode reveals how Macrocosmos is pioneering decentralized AI training by combining pipeline and data parallelism, a world first that could democratize the creation of large-scale models beyond the reach of centralized giants.
Introduction to Macrocosmos and the IOTA Transition
- The Old Approach (Subnet 9 on Bittensor): Steffen explains their previous model was like a "Kaggle competition" where miners trained entire models in isolation. They could copy from each other by downloading models from HuggingFace, but this led to duplicated work and became prohibitively expensive for larger models.
- The New Vision (IOTA): The team shifted to a collaborative, distributed training model. The goal is to have many participants contribute a small piece of the work, which is then combined into a final product that represents the sum of all efforts, rather than just the best individual effort.
A New Architecture: Pipeline and Data Parallelism
- Model Layers: Felix clarifies that a large model, like a 70-billion parameter LLM, is composed of many "transformer blocks," or layers. Instead of one miner hosting the entire 80-layer model, the model is split, and a pool of miners is assigned to each individual layer. This dramatically lowers the hardware and capital requirements for participation.
- Pipeline Parallelism Explained: Steffen uses a car assembly line analogy. Data flows through the model layer by layer, with each miner group performing a specific task before passing it on. This is pipeline parallelism: the model itself is chopped into sequential pieces. The technique is standard in centralized data centers for training models like GPT-4 but is a world first in a decentralized setting (see the sketch after this list).
- Data Parallelism Explained: The system also uses data parallelism, a more common federated learning technique where the dataset is split into chunks. Each miner in a layer group processes a different piece of data simultaneously.
- Strategic Implication: By combining both methods, Macrocosmos creates a system with a low barrier to entry (from pipeline parallelism) while accelerating the learning process (from data parallelism). Steffen notes, "to our knowledge, no one has done a pipeline parallel decentralized training run in history. So this is actually a world first."
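To make the combination concrete, below is a minimal Python sketch of how layer pools (pipeline parallelism) and per-sample shard assignment (data parallelism) could fit together. The constants, miner names, and round-robin routing are assumptions for illustration, not details of the IOTA implementation.

```python
# Illustrative sketch (not Macrocosmos's code) of combining pipeline and data
# parallelism: the model's layers are split across miner pools, and samples
# within a pool are sharded across miners.

NUM_LAYERS = 80          # e.g. an 80-layer, 70-billion-parameter LLM
MINERS_PER_LAYER = 4     # hypothetical pool size assigned to each layer

# Pipeline parallelism: each layer index maps to its own pool of miner IDs.
layer_pools = {
    layer: [f"miner-{layer}-{i}" for i in range(MINERS_PER_LAYER)]
    for layer in range(NUM_LAYERS)
}

# Data parallelism: within a pool, different miners handle different data shards.
def route_sample(sample_id: int) -> list[str]:
    """Return the pathway of miners a single data sample visits, layer by layer."""
    pathway = []
    for layer in range(NUM_LAYERS):
        pool = layer_pools[layer]
        miner = pool[sample_id % len(pool)]  # simple round-robin shard assignment
        pathway.append(miner)
    return pathway

if __name__ == "__main__":
    print(route_sample(0)[:3])  # first three hops of sample 0's pathway
    print(route_sample(1)[:3])  # sample 1 takes a different pathway
```

The key property is that any single miner only needs to hold one layer's weights, while different samples still move through the pipeline concurrently.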
The Orchestrator: The Brains of the Network
- State and Task Management: When miners join, the Orchestrator assigns them to a specific layer. It then manages the flow of data, ensuring each data sample is routed through every layer in the correct sequence (a toy sketch of this bookkeeping follows this list).
- Preventing Inefficiencies: Miners communicate with a central storage bucket (an S3 bucket) via special, authenticated URLs provided by the Orchestrator. This prevents issues like collusion or miners wasting effort by processing the same data sample.
- Coordinating Learning Cycles: The Orchestrator also signals when the network should pause training and enter a "merging phase." During this phase, miners within the same layer share and synchronize their results before resuming training.
- Investor Insight: The Orchestrator is the core intellectual property and the basis for a potential "training-as-a-service" product. Customers could define their model and data needs, and the Orchestrator would configure the network to execute the training run.
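The sketch below shows, under assumptions, what the Orchestrator's bookkeeping might look like: balancing layer assignments as miners join, issuing per-task storage URLs, and deciding when to enter a merging phase. The class, method names, and URL format are hypothetical, not the production Orchestrator.

```python
import itertools
import random

class Orchestrator:
    """Toy sketch of the coordination logic described above (assumed design)."""

    def __init__(self, num_layers: int):
        self.num_layers = num_layers
        self.layer_assignments = {layer: [] for layer in range(num_layers)}
        self._task_counter = itertools.count()

    def register_miner(self, miner_id: str) -> int:
        # Assign the new miner to the least-populated layer to keep pools balanced.
        layer = min(self.layer_assignments, key=lambda l: len(self.layer_assignments[l]))
        self.layer_assignments[layer].append(miner_id)
        return layer

    def issue_task(self, sample_id: int, layer: int) -> dict:
        # In the real system this would be an authenticated (e.g. presigned S3) URL;
        # here it is just a placeholder string.
        task_id = next(self._task_counter)
        miner = random.choice(self.layer_assignments[layer])
        return {
            "task_id": task_id,
            "miner": miner,
            "upload_url": f"https://example-bucket/activations/{sample_id}/{layer}",
        }

    def should_merge(self, steps_done: int, merge_every: int = 100) -> bool:
        # Signal the network to pause training and enter a merging phase.
        return steps_done > 0 and steps_done % merge_every == 0
```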
Validation: Ensuring Honest Work with Shadow Audits
- How It Works: A validator spot-checks a miner's work by taking its place, processing the exact same data, and comparing the results (a simplified sketch follows this list). This method is highly secure and reproducible.
- The Efficiency Problem: While effective, this is computationally expensive. If validators had to redo all work, it would negate the benefit of a distributed network. They currently only check about 10% of the work to maintain a balance.
- The North Star for Research: Steffen emphasizes the core challenge: "If the validators were outnumbered 100 to one by miners or a thousand to one, what could we do to allow them to keep up with the network so that they don't slow everything down to a halt?"
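A shadow audit can be pictured as a probabilistic spot check: roughly one task in ten is recomputed by the validator and compared against the miner's submission. The function below is a simplified illustration, assuming numeric outputs and a fixed tolerance; it is not the actual validator code.

```python
import random

def shadow_audit(miner_output, recompute_fn, inputs, tolerance=1e-4, audit_rate=0.1):
    """Spot-check a miner's submission (illustrative sketch, not the IOTA validator).

    With probability `audit_rate`, the validator takes the miner's place, reprocesses
    the same inputs, and accepts the work only if the outputs match within tolerance.
    """
    if random.random() > audit_rate:
        return True  # not audited this round; accepted on trust
    reference = recompute_fn(inputs)
    return all(abs(a - b) <= tolerance for a, b in zip(miner_output, reference))

# Hypothetical usage: the "layer" here is just a doubling function.
honest = shadow_audit([2.0, 4.0], lambda xs: [2 * x for x in xs], [1.0, 2.0], audit_rate=1.0)
cheater = shadow_audit([0.0, 0.0], lambda xs: [2 * x for x in xs], [1.0, 2.0], audit_rate=1.0)
print(honest, cheater)  # True False
```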
The Future of Validation: CLASP
- The Core Idea: Because miners are randomly paired in pathways through the network, they don't know whose work they are processing. If a bad actor consistently produces poor results (e.g., high "loss," which is the model's error rate), it creates a statistical anomaly.
- Asymmetric Validation: A validator can detect these anomalies by simply reading logs of pathway performance. This is highly asymmetric: the validator can verify work without redoing the expensive computation, performing only a lightweight statistical analysis (sketched after this list).
- Maturity Requirement: Felix notes that CLASP only works after the model has "warmed up." Initially, all outputs are random, so a bad actor is indistinguishable. As the model learns and loss decreases, cheaters become statistically obvious.
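Here is a rough sketch of the statistical idea, assuming the validator only sees logs pairing each pathway's miners with the loss that pathway produced: average the losses attributed to each miner and flag outliers. The z-score thresholding here is a placeholder, not the published CLASP method.

```python
from collections import defaultdict
from statistics import mean, stdev

def flag_suspect_miners(pathway_logs, z_threshold=3.0):
    """Flag miners that consistently appear in high-loss pathways (illustrative only).

    `pathway_logs` is a list of (miners_in_pathway, pathway_loss) tuples, i.e. the
    kind of lightweight log a validator could read without redoing any computation.
    """
    per_miner = defaultdict(list)
    for miners, loss in pathway_logs:
        for m in miners:
            per_miner[m].append(loss)

    miner_means = {m: mean(losses) for m, losses in per_miner.items()}
    overall = list(miner_means.values())
    mu, sigma = mean(overall), stdev(overall)  # needs at least two miners
    # A miner whose average pathway loss sits far above its peers is a statistical anomaly.
    return [m for m, v in miner_means.items() if sigma > 0 and (v - mu) / sigma > z_threshold]
```

Note how the check matches the maturity requirement: early in training, when every pathway's loss is similarly high, the per-miner averages are indistinguishable and nothing is flagged.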
Overcoming Bottlenecks: Compression and Activations
- The Speed Problem: Felix states that the system's core limitation is internet speed, with more time currently spent on data uploads than on computation. The goal is to flip this ratio.
- 128x Compression: The team modified the LLaMA architecture to reduce the amount of data that must be passed between layers. This was an architectural change, not just a simple compression trick, effectively giving the model less "space" and forcing it to use that space more efficiently (see the toy sketch after this list).
- Researcher Insight: This innovation in reducing activation size is critical for making decentralized pipeline parallelism viable. The team's empirical approach was later validated by a theoretical paper from another team (Pluralis), proving that the training process naturally encourages models to find compact representations.
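The sketch below illustrates only the general idea of a narrow inter-layer representation, using a toy PyTorch bottleneck; the actual 128x reduction was achieved by changing the LLaMA architecture itself, and the dimensions and module layout here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BottleneckedBlock(nn.Module):
    """Toy illustration of shrinking the activations passed between pipeline stages."""

    def __init__(self, hidden_dim: int = 8192, compression: int = 128):
        super().__init__()
        bottleneck_dim = hidden_dim // compression  # e.g. 8192 -> 64
        self.block = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # this small tensor is what crosses the internet
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # expanded again by the next stage's miner

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        compressed = self.down(self.block(x))
        return self.up(compressed)

if __name__ == "__main__":
    layer = BottleneckedBlock()
    out = layer(torch.randn(1, 16, 8192))  # (batch, sequence, hidden)
    print(out.shape)
```

The design trade-off is exactly the one described above: the model has less representational "space" between stages, so training pushes it to pack information into that space more efficiently.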
Efficient Merging: Butterfly All-Reduce
- How It Works: Felix explains that if there are 10 miners, each miner splits its gradient vector into 10 sections. The first miner is responsible for averaging the first section from all other miners, the second miner averages the second section, and so on (a single-process sketch follows this list).
- Scalability and Fault Tolerance: This method has constant complexity, meaning the amount of data each miner transfers remains the same regardless of how many miners join the network. They have also added redundancy, where multiple miners are responsible for each section, allowing for the detection of bad actors who might try to "pollute" the model with bad gradients.
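A minimal single-process sketch of the butterfly all-reduce idea follows; it simulates the sectioning and averaging in one place rather than over a network, and assumes the gradient length divides evenly among the miners.

```python
def butterfly_all_reduce(gradients):
    """Average gradients the butterfly way (illustrative, single-process sketch).

    With N miners, each gradient vector is split into N sections; miner i gathers
    section i from every peer, averages it, and the averaged sections are then
    reassembled so every miner ends up with the same merged vector.
    """
    n = len(gradients)
    section = len(gradients[0]) // n  # assume the vector divides evenly

    averaged_sections = []
    for i in range(n):  # miner i is responsible for section i
        start, end = i * section, (i + 1) * section
        chunk = [sum(g[j] for g in gradients) / n for j in range(start, end)]
        averaged_sections.append(chunk)

    merged = [x for chunk in averaged_sections for x in chunk]
    return [merged[:] for _ in range(n)]  # identical averaged copy for each miner

if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]  # two miners, 4-dim gradients
    print(butterfly_all_reduce(grads)[0])  # [2.0, 3.0, 4.0, 5.0]
```

Because each miner uploads only its own sections and downloads only the averaged ones, its traffic stays roughly constant as the pool grows, which is the scalability property described above; assigning each section to more than one miner adds the redundancy used to catch polluted gradients.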
Future Vision and Defining Success
- Short-Term (Next 3 Months): The immediate focus is on refining the design and hardening the research code into a production-ready system. They are currently running dozens of short experiments to calibrate the 15-billion parameter model before committing to a full, multi-week training run.
- Long-Term Vision:
- Accessibility: Lowering hardware requirements from high-end A100 GPUs to consumer-grade hardware like a MacBook, enabling massive, "Folding@home" scale participation.
- Model Size: Leveraging the architecture's unique ability to scale model size indefinitely. Steffen notes, "training a 100, 400, or above billion parameter model is just not possible with any other distributed training network in the world."
- Strategic Implication: Success means creating a permissionless, democratized platform for training state-of-the-art AI models, complete with novel IP models where contributors earn a perpetual license to the models they help create.
Advice for Innovators: Go for the Moonshot
- Go Big or Go Home: Steffen argues that aiming for incremental improvements is a "death sentence" in the current landscape. A well-funded team will likely outcompete you. Instead, teams need a "moonshot" dream that is almost unachievable, as the passion required to pursue it is contagious and motivating.
- First-Mover Advantage: Felix adds that the rapid pace of AI creates unique opportunities for small teams to develop truly novel ideas. He states, "you just have the opportunity to quite easily... come up with something that's like truly novel which I think is if you look at the history of AI is quite like a unique spot to be in."
Conclusion
This conversation highlights a potential paradigm shift in AI development, moving from centralized data centers to decentralized, global collaboration. For investors and researchers, Macrocosmos's work on IOTA is a critical experiment to watch, as its success could unlock new markets and research avenues for truly democratized AI.