AI Engineer
December 13, 2025

MiniMax M2 – Olive Song, MiniMax

Olive Song from MiniMax introduces M2, an open-weight model with 10 billion active parameters that punches above its weight class. The core idea? Build an AI that developers actually want to use by focusing on real-world utility, iterative reasoning, and robust generalization, all while keeping it lean and cost-effective.

Identify the "One Big Thing":

  • The "One Big Thing" is that Minimax M2, a small (10B parameters) open-weight model, achieves top-tier performance in coding and agentic tasks by integrating real-world developer feedback, human-like iterative reasoning, and robust generalization techniques, making it highly cost-effective and scalable for complex, multi-agent workflows. It's a testament to "small is mighty" when engineered smartly.

Extract Themes:

1. Developer-Centric Model Training & Evaluation:

  • Quote 1: "We actually scale something called expert developers as reward models. So as I mentioned before, we have a ton of super expert developers in house that could give us feedback to our model's performance. So they participated closely into the model development and training cycle..."
  • Quote 2: "They identify the model behaviors that developers enjoy and they identify what's reliable and what developers would trust and they give precise reward and evaluation to the model's behaviors to the final deliverables so that it is a model that developers really want to work with and that can add efficiency to the developers."

2. Interleaved Thinking for Robust Agentic Performance:

  • Quote 1: "So instead of just stopping after one round of tool calling, it actually thinks again and reacts to the environments to see if the information is enough for it to get what it wants. So basically we call the interleaved thinking or people call it interleaved thinking because it interleaves thinking with tool calling..."
  • Quote 2: "It helps adaptation to environment noise for example, just like what I mentioned the environment is not stable all the time and then something is suboptimal and then it can choose to use other tools or do other decisions."

3. Generalization and Scalability for Real-World Agents:

  • Quote 1: "We conclude that it's adaptation to perturbations across the model's entire operational space. If we think back what's the model's operational space that we talked about it can be tool information it can be system prompts it can be user prompts they can all be different they can be the chat template they can be the environment they can be the tool response."
  • Quote 2: "Because it is so small and so cost effective it can really support those long-run agentic tasks and tasks that maybe require some kind of parallelism."

Synthesize Insights:

Theme 1: Developer-Centric Model Training & Evaluation

  • The "Expert Developer as Reward Model" Approach: Minimax integrates in-house expert developers directly into the training loop, using their feedback as a high-fidelity reward signal. This is like having a master chef taste-test every dish during development, rather than just relying on a recipe book.
  • Real-World Data & Workflow Integration: M2 is trained on real internet data and scaled environments, ensuring it understands and operates within actual developer workflows (e.g., bug fixing, repo refactoring). This moves beyond synthetic benchmarks to practical utility.
  • Focus on Trust and Enjoyment: The expert feedback isn't just about correctness; it's about identifying behaviors developers enjoy and trust, leading to a model that enhances efficiency rather than just completing tasks. This is crucial for adoption.
  • Full-Stack Multilingual Capability: The model is designed to be proficient across multiple programming languages and full-stack development, reflecting the diverse needs of modern developers.
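
Song doesn't spell out the reward plumbing, but "expert developers as reward models" maps naturally onto a scoring stage that collapses human reviews into a scalar reward for RL. A minimal sketch in Python, assuming a hypothetical ExpertReview schema and illustrative weights (none of these names or numbers come from MiniMax):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class ExpertReview:
    """One expert developer's rating of a model deliverable (hypothetical schema)."""
    correctness: float  # did the fix or refactor actually work? (0-1)
    reliability: float  # would the expert trust it in their own repo? (0-1)
    efficiency: float   # did it save the developer time? (0-1)

def expert_reward(reviews: list[ExpertReview],
                  weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Collapse several expert reviews into one scalar reward for RL fine-tuning.

    The talk only says experts "give precise reward and evaluation"; the
    weighted average here is an illustrative stand-in for that aggregation.
    """
    wc, wr, we = weights
    return mean(wc * r.correctness + wr * r.reliability + we * r.efficiency
                for r in reviews)

if __name__ == "__main__":
    reviews = [ExpertReview(1.0, 0.8, 0.7), ExpertReview(0.9, 0.9, 0.6)]
    print(f"reward = {expert_reward(reviews):.3f}")  # fed into the policy update
```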

Theme 2: Interleaved Thinking for Robust Agentic Performance

  • Beyond Single-Shot Reasoning: Traditional tool-using LLMs often follow a linear think → call tool → respond pattern. M2 employs "interleaved thinking," where it repeatedly thinks, calls tools, and re-evaluates responses, much like a human problem-solver (see the loop sketched after this list). This is like a detective gathering clues, reflecting on them, and then deciding which new lead to pursue, rather than making one call and declaring the case closed.
  • Adaptation to Noisy, Dynamic Environments: This iterative process allows M2 to adapt to unexpected tool errors, suboptimal results, and environmental noise, making it robust in real-world, unpredictable scenarios (e.g., stock market fluctuations, complex workflows).
  • Long-Horizon Task Automation: The ability to self-correct and iterate enables M2 to handle complex, multi-step agentic tasks that require interacting with various tools (Gmail, Notion, terminal) over extended periods with minimal human intervention.
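
The pattern is easiest to see as a loop. Below is a minimal, hypothetical sketch of an interleaved-thinking agent loop; llm_think and call_tool are stand-in callables, not MiniMax APIs.

```python
def run_interleaved_agent(task: str, llm_think, call_tool, max_rounds: int = 8) -> str:
    """Interleave reasoning with tool calls instead of stopping after one call.

    llm_think(context) -> {"answer": str} when the model judges it has enough
    information, else {"tool": str, "args": dict}; call_tool(name, args) -> str.
    Both are placeholders for whatever model client and tool runtime you use.
    """
    context = [f"TASK: {task}"]
    for _ in range(max_rounds):
        step = llm_think("\n".join(context))            # think
        if "answer" in step:                            # model decides it is done
            return step["answer"]
        result = call_tool(step["tool"], step["args"])  # act
        context.append(f"TOOL {step['tool']} -> {result}")  # observe, then think again
    return "Reached max_rounds without an answer; escalate or replan."
```

The design point Song emphasizes is that the exit condition belongs to the model: it keeps re-evaluating tool responses and decides when the information is sufficient, rather than the scaffold forcing a single tool round.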

Theme 3: Generalization and Scalability for Real-World Agents

  • Perturbation Pipelines for Robust Generalization: MiniMax explicitly designs data perturbation pipelines to train M2 to adapt to variations across its entire "operational space" – including tool information, system prompts, user prompts, chat templates, and environment responses (a rough sketch of such a pipeline follows this list). This is like training a pilot in a flight simulator that constantly throws unexpected weather, equipment failures, and air traffic control changes at them, rather than just perfect conditions.
  • Beyond Tool Scaling: Initial assumptions that generalization was just about training with more tools proved insufficient. True generalization requires adapting to perturbations in how those tools are presented and used within different agent scaffolds.
  • Cost-Effectiveness Enables Multi-Agent Systems: M2's small size (10B active parameters) and cost-efficiency make it ideal for deploying multiple copies in parallel for complex, long-running agentic tasks (e.g., research, analysis, report generation, front-end illustration). This allows for distributed intelligence without prohibitive compute costs.
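
The talk doesn't publish the pipeline itself, but the "operational space" framing above suggests something like the sketch below: take a clean training episode and emit variants perturbed along each axis (system prompt, tool metadata, tool responses). All field names and probabilities are illustrative, not MiniMax's.

```python
import random

def perturb_episode(episode: dict, rng: random.Random) -> dict:
    """Emit one perturbed variant of a training episode (illustrative axes only)."""
    ep = dict(episode)  # shallow copy; nested fields are re-copied before mutation

    # Axis 1: vary the system prompt wording.
    ep["system_prompt"] = rng.choice([
        "You are a coding agent. Use the tools provided.",
        "Act as an autonomous developer assistant; call tools when needed.",
    ])

    # Axis 2: reorder tool metadata so the model cannot memorize a fixed layout.
    tools = list(ep["tools"])
    rng.shuffle(tools)
    ep["tools"] = tools

    # Axis 3: occasionally inject environment noise into a tool response.
    responses = list(ep["tool_responses"])
    if responses and rng.random() < 0.3:
        responses[rng.randrange(len(responses))] = "ERROR: tool timed out, please retry"
    ep["tool_responses"] = responses

    return ep

clean = {
    "system_prompt": "You are a coding agent.",
    "tools": [{"name": "run_tests"}, {"name": "read_file"}],
    "tool_responses": ["All 12 tests passed."],
}
variants = [perturb_episode(clean, random.Random(seed)) for seed in range(4)]
```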

Filter for Action:

  • For Investors:
    • Opportunity: Look for companies building agentic applications that leverage smaller, highly specialized, and cost-effective models like M2, especially those with strong generalization capabilities. The "small is mighty" paradigm could disrupt the "bigger is better" LLM race for specific use cases.
    • Warning: Be wary of models that only perform well on static benchmarks; real-world agentic performance in dynamic, noisy environments is a much stronger indicator of utility and adoption.
  • For Builders:
    • Opportunity: Embrace iterative "interleaved thinking" architectures for agent design. Don't just build single-shot tool-calling agents; design them to self-correct and adapt.
    • Opportunity: Consider integrating human expert feedback directly into your model training and evaluation loops. This "developer as reward model" approach can yield highly practical and trusted AI tools.
    • Opportunity: Explore building multi-agent systems using smaller, cost-efficient models. Parallelization of specialized agents can tackle complex problems more effectively than a single monolithic model.
    • Warning: Don't assume tool scaling alone leads to generalization. Focus on training for robustness against perturbations across the entire operational space of your agent.

New Podcast Alert: MiniMax M2 – Olive Song, MiniMax

By: AI Engineer

Podcast Link: Link

This episode unveils MiniMax's M2, an open-weight model with 10 billion active parameters engineered for superior coding and agentic tasks, challenging conventional training paradigms with expert-in-the-loop feedback and dynamic environment adaptation.

MiniMax M2: A Full-Stack AI Powerhouse

  • MiniMax M2 is an open-weight model featuring 10 billion active parameters, specifically optimized for coding, workplace, and agentic tasks.
  • The model consistently ranks at the top of intelligence and agent benchmarks among open-source alternatives.
  • M2 racked up the most downloads and climbed into the top three by token usage on OpenRouter within its first week, validating its practical utility beyond raw benchmark scores.
  • MiniMax's full-stack structure allows developers to directly influence model design, ensuring practical relevance and efficiency gains.
  • “We both develop foundation models and applications. So we have research and developers sitting side by side working on things.” – Olive Song

Engineering Superior Coding Agents with Expert Feedback

  • M2 leverages scaled environments and real internet data, enabling the model to react to dynamic coding scenarios and target verifiable goals during reinforcement learning (RL); a sketch of a verifiable reward appears after this list.
  • MiniMax employs "expert developers as reward models," integrating their feedback on bug fixing, repo refactoring, and desired model behaviors directly into the training cycle.
  • This expert-driven reward system ensures M2 delivers reliable, trustworthy, and efficient assistance across full-stack, multilingual coding tasks.
  • “We actually scale something called expert developers as reward models... they give precise reward and evaluation to the model's behaviors.” – Olive Song
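
For coding, a "verifiable goal" typically means executing the candidate change and checking an objective outcome, such as a test suite. A minimal sketch, assuming pytest as the verifier and a patched repo on disk; the talk does not specify the actual harness.

```python
import pathlib
import subprocess

def verifiable_reward(patched_repo: pathlib.Path) -> float:
    """Run the repo's test suite and turn the outcome into a scalar reward.

    A pass/fail signal like this is "verifiable" in the sense used in the talk:
    no human judgement is needed to score the rollout.
    """
    proc = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=patched_repo,
        capture_output=True,
        text=True,
        timeout=600,
    )
    return 1.0 if proc.returncode == 0 else 0.0

# In an RL loop this could be blended with the expert-developer reward sketched
# earlier, e.g. reward = 0.7 * verifiable_reward(repo) + 0.3 * expert_reward(reviews).
```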

Mastering Long-Horizon Tasks with Interleaved Thinking

  • Traditional reasoning models often fail in noisy environments due to single-pass tool calling; M2 overcomes this by iteratively thinking and reacting to tool responses.
  • Interleaved thinking involves multiple rounds of reasoning interspersed with tool calls, allowing the model to re-evaluate feedback and adjust its strategy.
  • This iterative process enables M2 to adapt to environmental noise, recover from suboptimal outcomes, and automate complex workflows across multiple tools (e.g., Gmail, Notion, terminal) with minimal human intervention; a small fallback sketch follows this list.
  • “We imagine how humans interact with the world. We look at something, we get feedbacks, and then we think about it... and that's why we did the same thing with our M2 model.” – Olive Song
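
One concrete form of that adaptation is falling back to an alternative tool when a response looks like noise or an error. A small hypothetical helper in the spirit of the loop sketched earlier; the tool-selection logic and error check are made up for illustration.

```python
def call_with_fallback(call_tool, primary: str, fallbacks: list[str], args: dict) -> str:
    """Try the primary tool, then fall back when the response looks unusable.

    call_tool(name, args) -> str is a stand-in for whatever tool runtime the
    agent scaffold provides; the error heuristic here is deliberately naive.
    """
    for name in [primary, *fallbacks]:
        response = call_tool(name, args)
        looks_broken = not response.strip() or response.startswith("ERROR")
        if not looks_broken:
            return response
    return "All tools failed; surface the problem to the user or replan."
```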

Achieving Robust Generalization Through Data Perturbation

  • Initial assumptions that tool scaling alone would ensure generalization proved insufficient when faced with varied "agent scaffolds" (different interaction templates or environments).
  • Agent generalization is defined as adaptation to perturbations across the model's entire operational space, encompassing tool information, system prompts, user prompts, chat templates, and tool responses.
  • MiniMax designs and maintains perturbation pipelines within its training data, exposing M2 to a vast array of varied inputs and conditions during training.
  • This rigorous training ensures M2 can reliably perform across numerous unseen agentic setups and environmental changes.
  • “We conclude that it's adaptation to perturbations across the model's entire operational space.” – Olive Song

Multi-Agent Scalability and the Future of M2

  • M2's efficiency allows for the deployment of multiple copies working in parallel, such as agents conducting research, analyzing results, and generating reports simultaneously (see the orchestration sketch after this list).
  • This scalability supports long-running agentic tasks and those requiring concurrent processing, demonstrating M2's practical utility in complex workflows.
  • Future iterations (M2.1, M3) aim for enhanced coding, advanced memory management, proactive AI for workplace verticals, and integration with MiniMax's multimodal generation models (e.g., Hailuo for video).
  • “Because it is so small and so cost effective, it can really support those long run agentic tasks and tasks that maybe require some kind of parallelism.” – Olive Song
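
Because each copy is cheap to serve, running several in parallel is mostly an orchestration problem. A minimal asyncio sketch, with run_agent as a stand-in for a call to an M2-backed agent rather than any real MiniMax API:

```python
import asyncio

async def run_agent(role: str, task: str) -> str:
    """Stand-in for one M2-backed agent working a subtask (e.g. over an HTTP API)."""
    await asyncio.sleep(0.1)  # placeholder for real model and tool latency
    return f"[{role}] finished: {task}"

async def run_team(topic: str) -> list[str]:
    # Fan cheap agent copies out across subtasks, then gather the results.
    subtasks = {
        "researcher": f"collect sources on {topic}",
        "analyst": f"analyze findings on {topic}",
        "writer": f"draft a report on {topic}",
    }
    return await asyncio.gather(
        *(run_agent(role, task) for role, task in subtasks.items())
    )

if __name__ == "__main__":
    for line in asyncio.run(run_team("open-weight agent models")):
        print(line)
```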

Investor & Researcher Alpha

  • Capital Movement: Investment will likely flow towards full-stack AI companies that integrate foundation model development with application-level feedback loops, demonstrating real-world utility and developer adoption over pure benchmark performance. Solutions that embed human expert feedback directly into model training (e.g., "expert developers as reward models") represent a high-signal area for capital.
  • New Bottleneck: The critical bottleneck shifts from raw model size to the quality and diversity of training data, specifically the ability to simulate and perturb real-world operational environments and integrate precise human feedback. Developing robust, dynamic data perturbation pipelines and scalable expert feedback mechanisms becomes paramount.
  • Obsolete Research: Purely benchmark-driven model evaluation, especially for agentic tasks, is increasingly insufficient. Research focusing solely on static, single-pass tool-use models without iterative reasoning or robust environmental adaptation will prove less impactful. The emphasis must shift to dynamic, adaptive agent architectures and training methodologies that account for real-world noise and long-horizon interactions.

Strategic Conclusion

MiniMax M2 redefines agentic AI by prioritizing real-world utility through expert-driven training, interleaved reasoning, and robust generalization. The industry's next step involves deeply integrating human expertise and dynamic environmental adaptation into AI development to build truly reliable and scalable intelligent agents.
