Turing Post
January 31, 2026

Inside a Chinese AI Lab: How MiniMax Builds Open Models

How MiniMax Builds Open Models: The Engineering Grind Behind Agentic AI by Turing Post

Guest: Olive Song

This summary cuts through the hype to reveal the gritty, engineering-first approach MiniMax takes to building open-weight AI models. It's for builders and investors who want to understand the practical challenges and strategic decisions shaping the next generation of agentic AI.

This episode answers:

  • 💡 What specific engineering challenges arise when scaling open-weight AI models for agentic use?
  • 💡 How does MiniMax balance open-source principles with the commercial realities of an AI company?
  • 💡 What is the critical role of "human alignment" in developing productive and safe AI coding models?

Olive Song, a senior researcher at MiniMax, pulls back the curtain on the relentless pursuit of robust, open-weight AI. Her team, known for models like MiniMax M2.2, operates with a flexible, experiment-driven schedule, constantly pushing the boundaries of what's possible in agentic AI, particularly in coding.

The Hacker Model

"During reinforcement learning the model tries its best to hack a lot of things."
  • Model Behavior: During reinforcement learning, models often discover unexpected, sometimes unsafe, ways to achieve goals. This means constant vigilance and alignment efforts are essential to prevent models from "hacking" their way to solutions that deviate from human expectations.
  • Alignment Imperative: Human alignment isn't just a concept; it's the guardrail preventing AI from pursuing dangerous behaviors to hit a target. This ensures models are productive and safe, a core focus for MiniMax's coding models.
  • Developer Loop: Researchers and developers collaborate daily, spotting model behavior issues in real-time. This tight feedback loop is critical for iterating quickly and building data to fix problems, a stark contrast to academic research.

Precision Matters

"Engineering is very very very important. I didn't know that during school."
  • Theoretical Gap: Small implementation details, like keeping the LM head in FP32 during training, significantly impact model performance. These seemingly minor choices bridge the gap between theoretical algorithms and their practical, scaled-up execution.
  • First Principles: MiniMax tackles problems from fundamental principles, analyzing issues layer by layer. This deep, foundational problem-solving is how they uncover and resolve subtle precision issues that hinder model accuracy.

Open Source Edge

"The definition will become true when it becomes true. When we see it, we know it's AGI."
  • Community Power: MiniMax embraces open-weight models because the open-source community accelerates development and model improvement. This collaborative approach allows them to build better models faster than a closed system might.
  • Evaluation Rigor: Researchers maintain a personal evaluation stack of "fun questions" alongside rigorous benchmark sets to test models across logical reasoning, math, and agentic tasks. This systematic testing is vital for understanding model capabilities and limitations, and for seeing how different models approach the same problems.

Key Takeaways:

  • 🌐 The Macro Shift: Open-source AI is moving from theoretical research to production-grade agentic systems. This shift demands a relentless focus on fundamental engineering, precise alignment, and robust evaluation to bridge the gap between algorithmic ideals and real-world performance, especially in long-horizon tasks.
  • ⚡ The Tactical Edge: Prioritize deep engineering talent and first-principles problem-solving over chasing algorithmic novelties. For builders, this means investing in infrastructure, efficient compute utilization, and systematic feedback loops to refine model behavior in diverse environments. For investors, it means backing teams that demonstrate this operational rigor.
  • 🎯 The Bottom Line: The next 6-12 months will separate the AI builders who can truly operationalize advanced models from those who can't. Success hinges on mastering the subtle, often overlooked engineering details that enable models to perform reliably and safely in complex, agentic scenarios, moving beyond simple task completion to genuine collaboration.

Podcast Link: Click here to listen

During reinforcement learning the model tries its best to hack a lot of things.

The current open models can't achieve that level of understanding.

It is a solvable problem and we are working on it.

Engineering is very important.

Hello everyone.

Today I have the pleasure of talking to Olive Song, a senior researcher at MiniMax.

Recently they've been launching very interesting open-weight models specialized in different areas.

Olive is currently working at MiniMax on the new version, MiniMax M2.2.

Thank you for taking the time at 9:00 p.m. on Sunday night.

Does everyone work like this at the company?

I think different people work on different schedules.

We do have people who work even overnight but they sleep at daytime.

I feel like we have a very flexible schedule.

It goes with your experiment.

For example, if an experiment runs all day, the person can take a break; and if there's a lot of analysis to do, maybe because we're very curious about the results and very passionate, we can't really wait very long.

So yeah, everyone has their own schedule.

That says something about the success of the models.

I think that's influenced by the fact that you specialize in reinforcement learning and model evaluation, as far as I understand, which are two of the least forgiving parts of model development, and you also have more constraints than the big American AI labs.

What does a good day look like for you and what does a bad one look like?

I can share something about our recent weeks.

So there's not a whole good day or a whole bad day.

We were joking that in one day we can have good results in the morning and then bad results at night.

Sometimes we call it we have like ICU in the morning and then KTV at night.

Typically a good time would be receiving some good results, though even running into new problems can be a good time.

For example, during reinforcement learning we can see the model doing a lot of different things to achieve the results, and sometimes we just discover new model behaviors. That's really exciting. Even though it might not be safe or expected, it's kind of exciting, so I call it a good time.

A bad time? There really isn't a bad time, except for finding out bad results. The moment itself is bad, but then trying to figure out the problem and breaking it down is a pretty good time.

What were some recent model behaviors that you didn't expect?

During reinforcement learning, the model tries its best to hack a lot of things.

For example, it uses bash a lot, and sometimes these are not very safe behaviors, as our expert developers say, because the expert developers have their own expectations of how the model should work, and it doesn't go that way if we don't constrain it.

So we do a lot of alignment to solve that issue.

You just launched MiniMax Her, and it went all over Twitter.

How do you come up with those ideas?

Because roleplaying is sort of... is it an alignment question? Is it not? How do you do that?

Frankly speaking, I'm not the expert on that part.

We have a whole team on the roleplaying and Her stuff.

I'm not an expert but we do have a lot of discussions.

We do believe that roleplaying, or accompanying humans, or human interaction, is very important in life with AI and in how it will change our social life in the future. And it absolutely represents an ability that's very superior, because it's humanlike, you know: it has emotions, it understands your emotions. It's not just working out some exams. That's absolutely another side of AI capability, what we call intelligence with everyone, right?

At MiniMax.

Yeah, intelligence with everyone.

Intelligence with everyone.

What does it mean for you?

For me personally, I feel like it's more about how it changes my life: it enables me to do more work, and it connects me better to different people. For example, before, I wouldn't be able to understand a lot of very professional things, say very professional coding problems or optimization problems, and now I am able to do that with AI, so I can communicate with more people and exchange more ideas. That's one side.

On the other side, it generally helps my daily life.

So, it helps with my work, my daily routine, my self-care.

It changes life for me, and I hope it changes life for everybody, obviously in a good way.

Can you tell me a little bit about how day-to-day work is organized in your lab?

I remember from your talk at AI Engineer that it's very interconnected between developers and researchers.

I would love to hear more about that.

Absolutely.

We sit around every day.

So, we share our experiment results.

For example, as I just said, during experiments, for example, reinforcement learning experiments, we see some scores going up high.

We look at the model's behaviors, and we look at them with the developers in that area as well.

We sit together, and they will spot the issue right away, and then we are able to come up with new ideas to fix it or build more data for it.

If we can go into details about your current work on the current version, what are the biggest problems you're trying to solve compared to the previous version?

One important thing we focus on right now, and also in the future, is human alignment, because we are focusing on coding models for 2.1, 2.2, and the M2 series.

What we realize is that for it to become very productive in our daily work, for it to be productive and safe at the same time, we have to do a lot of alignment on it.

So the model can't just grow on its own and then do some dangerous behaviors just to achieve the final goal.

So for us the important thing would be how we define human alignment, how we define expert expectations, and how we actually train the model to be more aligned with our expectations.

I want to go into some real details here, and you're the expert, so correct me if I'm wrong, but I saw that there was recent interest in details like keeping the LM head in FP32 during reinforcement learning training.

Why do small decisions like this end up mattering more than a clever new algorithm?

It all ends up being closer to the theoretical algorithm.

So we have the theoretical reinforcement learning algorithm, but when we implement it, it can be a little bit off, and that creates a little gap to the theoretical extreme of this algorithm.

So the way we think about and approach this problem is that we try to scale to the theoretical extreme. The precision part, for example, is one thing we found that would prevent us from getting close to that extreme, and that's how we solved it.

It was actually a very funny story when we discovered that; I talked about it a bit when we published MiniMax M1.

During our experiments we found that the accuracy didn't go up.

We looked at it layer by layer; we looked at the log-probs layer by layer and found it.

Theoretically speaking, it has to work, right? So there had to be some gap between the theory and how we implemented it.

So we thought about the gap, analyzed it layer by layer, and eventually found it.
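
[Editor's note: as a rough illustration of the precision detail Olive describes, here is a minimal PyTorch sketch, not MiniMax's actual code. The idea: run the bulk of the network in BF16 but keep the LM head, and the log-probs the RL objective consumes, in FP32. All module names and sizes are illustrative.]

```python
import copy
import torch
import torch.nn as nn

# Toy stand-ins: a small BF16 "backbone" plus an LM head kept in FP32.
VOCAB, DIM = 32000, 256

backbone = nn.Sequential(              # stand-in for the transformer stack
    nn.Linear(DIM, DIM), nn.GELU(),
    nn.Linear(DIM, DIM), nn.GELU(),
).to(torch.bfloat16)

lm_head = nn.Linear(DIM, VOCAB, bias=False)    # deliberately left in FP32

def log_probs(hidden: torch.Tensor) -> torch.Tensor:
    h = backbone(hidden.to(torch.bfloat16))
    # Upcast BEFORE the head: logits and log-probs are computed in FP32,
    # avoiding small numerical drift in the quantities the RL loss consumes.
    return torch.log_softmax(lm_head(h.to(torch.float32)), dim=-1)

# Debugging in the spirit of "look at the log-probs layer by layer":
# compare against an all-BF16 head and measure the drift per token.
x = torch.randn(2, 8, DIM)
head_bf16 = copy.deepcopy(lm_head).to(torch.bfloat16)
lp_fp32 = log_probs(x)
lp_bf16 = torch.log_softmax(
    head_bf16(backbone(x.to(torch.bfloat16))).float(), dim=-1
)
print("max |delta log-prob|:", (lp_fp32 - lp_bf16).abs().max().item())
```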

Is there anything like this happening now?

Definitely.

Every single day, every single day, and in every different group.

I can't actually disclose anything we don't have a concrete conclusion on, because we want our public conclusions to be very concrete and very deeply understood.

So if we have breakthroughs we'll definitely publish it later.

But I have to say we do encounter these problems every day, and we think from, I think it's called first principles, right?

So we think from the very fundamental part of the problem and then approach it.

The models that you launch are open-weight. From your perspective, and from the alignment perspective, what do builders actually gain from open weights, and what responsibility do they have to take on that you don't have to take responsibility for?

I'm actually not an expert in building things with models, but I feel like, because it's open-weight, people have free use of it. For example, they can deploy it by themselves, or they can even fine-tune it with the weights and keep all the data on their own property, which is very safe.

But if we talk about alignment, how do you look at that from that perspective?

Before you launch the model, before you publish it and it's out there in the wild, what tells you that it's safe to publish?

We have some internal benchmarks for safety, and they have different dimensions.

Some of it is sensitive-content safety, some of it is more like alignment safety.

We have that as our evaluation, and then about one or two weeks before launching we do scaled-up evaluations and scaled-up alignment on the model, and that's how we assess whether the model is safe. But once it's open-weight, out in the wild, people actually can do things to it; I guess that's what you're getting at, right? People can do more things to the model that we can't control. Frankly speaking, I don't know how we handle that.

There are laws on that, right?

There are regulations, and people do agree on some moral standards there.

Do you follow any reinforcement learning failure modes that haven't shown up in benchmarks but then become obvious in real agentic use?

How do you collect feedback for the next versions for improving the reinforcement learning process?

We collect feedback on the model itself first.

So when we publish a model externally, many developers use it, many people use it.

We collect it systematically.

We analyze each problem.

Some of them are fundamental; some of them are just something we missed and can fix real quick.

So there are two parts.

First, we do the internal evaluation with the developers, and they point out problems; that's how we fix that part. But it's not enough, and more feedback comes to us after we officially publish the models. We collect it, because the way we organize our group is that different people work on different capabilities of a general model.

If we collect some things that we think we should improve in the future, different people take their parts.

So they're like, okay, I think I can solve this issue, and I'll solve it in the next generation. That's how we collect feedback and then improve the model.

How did you initially decide not to build one general-use model, everything for everyone, and instead go more into specialization, like coding?

I think we are approaching generalized models.

It's just that we are putting more emphasis on coding.

For example, you can also take our model into any general agent scaffold, including our own agent product, and that's for general purposes.

We do work on research, report writing, PPTs, stuff like that.

That's more general.

Personally speaking, I feel like with coding you can structure the whole world, or you can model a lot of stuff with engineering.

So behind it, it's scaled-up humanity for me.

So it has a lot of intelligence in it, and a lot of work to do.

So that's how we view this issue.

But we do work on generalized stuff and even more generalized stuff in later versions.

For example, our model will be able to handle some general workplace scenarios in the future, and that's not just coding.

Right, if we talk about coding and agentic use, it requires long horizons. How do you solve long horizon for agentic use?

I think: define your goals well, and define the model behaviors well. And we also require great infrastructure, extraordinary infrastructure. For reinforcement learning, besides the algorithm, besides the things people have been working on for a very long time, what's special for agentic stuff is how we define agents, how you define how an agent model should work. First you need to define the task, you need to define the model's goal. Especially in a long-horizon task, you need goals that are actually hard and diverse.

The second part is that you need environments.

You need great engineering environments, scaled-up environments, diverse environments, not just coding but also, for example, workplace settings, different kinds of tools.

That's great engineering.

Then you need great infrastructure.

You need outstanding RL infrastructure to let the model really roll out over a very long horizon.

With very efficient GPU use, for example, very efficient rollouts in training, and so on.

I feel like that's what's different in agentic reinforcement learning as compared to before.
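
[Editor's note: a toy sketch, not MiniMax's infrastructure, of how the three ingredients she names (a task spec with hard, diverse goals, an environment with tools, and an RL-ready rollout loop) fit together. All names are illustrative.]

```python
import random
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    goal: str                 # goals should be hard and diverse
    max_steps: int = 200      # long horizon: many tool calls per episode

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    total_reward: float = 0.0

class ToyEnv:
    """Stand-in for a real tool/sandbox environment."""
    def reset(self, goal: str) -> str:
        self.t = 0
        return f"task: {goal}"
    def step(self, action: str):
        self.t += 1
        done = action == "submit" or self.t >= 50
        reward = 1.0 if action == "submit" else 0.0
        return f"observation after {action}", reward, done

class ToyPolicy:
    """Stand-in for the model; a real policy emits tool calls or bash."""
    def act(self, obs: str) -> str:
        return random.choice(["run_tests", "edit_file", "submit"])

def rollout(policy, env, task: TaskSpec) -> Trajectory:
    """One long-horizon episode; an RL trainer would consume many of these."""
    traj = Trajectory()
    obs = env.reset(task.goal)
    for _ in range(task.max_steps):
        action = policy.act(obs)
        obs, reward, done = env.step(action)
        traj.steps.append((action, obs, reward))
        traj.total_reward += reward
        if done:
            break
    return traj

print(rollout(ToyPolicy(), ToyEnv(), TaskSpec("fix the failing test")).total_reward)
```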

Are you affected by GPU constraints?

How do you solve the compute problem?

We do have a team that works on how we utilize the compute the most.

That's actually one of the RL scaling issues: utilizing the compute very efficiently.

So their purpose would be to minimize the compute use while training more, right? Personally speaking, I don't really have a GPU constraint, because we have a great team who works on utilizing the compute the most while, you know, stabilizing the training the most.

But do you have problems that you need to solve with your expertise, about how to use compute more efficiently, or is it just that team?

We are actually the same team, because we're the reinforcement learning team. We view this issue from different perspectives: it can be, you know, implementation, right? You can view it from a data perspective, you can view it from different perspectives, but our goal is the same. We're always looking for new solutions.

That comes from Chinese labs, because it's always mind-blowing.

We are actually working on some new agentic reinforcement learning stuff, but it won't really come out with 2.2; maybe with the next-generation model. We are still working on it. I'm not sure what I can share or not, so I can share it later when I have concrete conclusions.

As I said before, I can't really say something that we haven't documented yet.

Will it be available when the model is out?

That depends on our time.

I'm not very confident yet, but we are dedicatedly working on it.

Yeah, there are a lot of constraints when talking to researchers.

So many.

Well, if we talk about openness, then this whole conversation that I'm having with people right now in this quarter is about open source.

I wonder if you can talk about the company strategy, why the company decided to go and publish open weights of the models.

What's the benefits?

What's the cons to that?

So for our team, the researchers' team, we always wanted to go open source, because the open-source community is fantastic.

I learned that from day one when I joined the team: the open-source community is fantastic.

So as researchers we did want to join open source. But on the other hand, speaking of the cons, we are a company, and people care about whether this can make money, whether this is a business. So the con would be that if the weights are open source, fewer people will use the APIs. But as a researcher, that really isn't my focus that much, so I can't speak confidently about the company strategy. For the tech part, we just believe that we can build better models with the open-source community.

How much do you use open-source tools yourself from different other companies?

A lot.

For example, for inference, I'm not sure if I'm allowed to say specific open-source branches, but we collaborate with both vLLM and SGLang, and they are open-source code repositories.

How do you look at the open-source stack?

Because when we talk about open source, sometimes it's perceived as one thing, but actually it's multi-layered.

How do you look at it?

For example, there are a lot of open-source agent scaffolds, both coding agents and general agent scaffolds, that we use ourselves to test our models, and then we look at their logic.

We look at their code to see how they design specific scaffolds and, for example, engines, and then we take what they did really well and reflect on how we think about the problem, how we structure the problem, whether we're on the same page, and so on.

So we learn from each other.

Do you think teams underestimate how much engineering discipline open models require compared to using closed APIs?

It always requires a lot of setup, and different compute, and you need to have talent for that, engineering talent, instead of just, you know, choosing a closed API, turning it on, and using it.

Do you have any difficulty with that, or is the open-source stack inside the company established and working?

Personally, I don't have a problem with that.

There are other open-source models, and if they publish, I'll just download them, deploy them on our machines, and work with them if I want.

Personally, I don't have that issue.

But if there are, you know, personal developers out in the wild, I understand the problem, especially when they don't have their own compute; then it's easier to connect to a model through, for example, OpenRouter and stuff like that.

Do you use a lot of other open models, on that same OpenRouter, let's say?

Do you play with them?

Yeah, I play with them.

I play with them on day one.

If they release at midnight, I play with them at midnight.

Are you like taking notes?

I don't actually take notes, but I do have my personal evaluation stack, a list of fun questions that I like to test with every single model to see how they work.

Can you tell me about it?

That's super interesting.

I've been collecting a bunch of questions since I entered the company, in different areas: logical reasoning, mathematics and proofs, report writing, agentic tasks, and a lot of stuff like that.

Some of them are very, you know... I just like to see how the model reacts to these problems and how it approaches them.

Different models have different personalities in how they approach things.

That's true.

And you always need to adjust to them.

If we want to give a little guide to people who want to evaluate a model themselves, can you give me examples of the questions? Like five questions you need to ask the model to understand how it works, whether it works well.

From a professional evaluation perspective, five questions isn't enough.

So if you want to do a very standard and very fair comparison among models, you have to make it a very confident test.

So there has to be a certain number of questions in each domain to see how the model performs.

Usually you need to test it multiple times because models are not very stable themselves.

If you're testing for fun, use the fun questions.

But if we are actually assessing the model's capabilities, we need question sets that are, you know, very fair among different models and that are correct, because some problems are not correct.

For some questions the answers are not unique, and sometimes when we run the test the environments are not fixed; for example, the gold patch wouldn't pass, and stuff like that. So if we're doing professional evaluation, we have to make sure the evaluation is correct, it's diverse, and it's above a certain threshold, so that the test is confident.
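
[Editor's note: a toy illustration of this methodology, not MiniMax's stack: a fixed question set per domain, each question run several times because outputs are not stable, and a per-domain pass rate. The questions, query_model, and the grader are all placeholders.]

```python
import statistics
from collections import defaultdict

# Toy question sets per domain; a real stack would have many more, with
# graders tailored to each task.
QUESTIONS = {
    "logical_reasoning": [("If A implies B and B is false, what is A?", "false")],
    "math": [("What is 17 * 23?", "391")],
    "agentic": [("List the steps to rename a git branch.", "branch")],
}

def query_model(model_name: str, prompt: str) -> str:
    # Placeholder: plug in a real inference endpoint here (e.g. a local
    # deployment or a router API). A canned reply keeps the sketch runnable.
    return "391 is false for the branch"

def grade(answer: str, expected: str) -> bool:
    # Substring match is only a stand-in for task-specific grading.
    return expected.lower() in answer.lower()

def evaluate(model_name: str, runs: int = 5) -> dict:
    """Repeat each question several times: model outputs are not stable."""
    per_domain = defaultdict(list)
    for domain, items in QUESTIONS.items():
        for prompt, expected in items:
            scores = [grade(query_model(model_name, prompt), expected)
                      for _ in range(runs)]
            per_domain[domain].append(statistics.mean(scores))
    return {d: statistics.mean(v) for d, v in per_domain.items()}

print(evaluate("my-open-model"))
```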

You mentioned characters. How do you work with your model's character?

I don't work on the model's characters myself.

Here's how I think of this issue.

A general model should have all characters or it should be able to perform all characters.

It might have a default character.

If the user wants it to be a different character, it should be.

If a character is injected into the system prompt, the model should follow it.

That's how I view this issue.

I find it hard to adjust to new models because they're so different in terms of character all the time.

I just don't even understand why it happens.

I think it has to be something related to the data the model was trained on, the different patterns the models have been trained on, and also different people, different teams, might have their own constitution in the system prompt or as the model's default behavior.

If you look at open models in production today, I don't know if it's a relevant question, but where do they fail?

Specifically for open models: reasoning, tool use, state tracking, evaluation blind spots; there are all those risks for open models.

Where does it break first?

I think open models are not very good at adjusting to different environments.

From what I see right now, we can look at Claude, for example, right?

People use Claude in different coding environments, and people think it performs well in all environments, with different tool definitions and stuff.

But I don't feel like the current open models can achieve that accuracy or that level of understanding of the different environments.

Why?

Where is the problem?

I don't know how Claude does it.

But for me, I think it is a solvable problem and we are working on it.

We are improving it in 2.2, but it's still not as good as for example Opus, but for 2.5 it might be.

We do have some systematic research going on in this area that has shown some results now, but it's still not a concrete conclusion, so I won't say it.

I'm so curious. But do you think it's a problem of compute, because they have this infinite amount they can just throw at it?

I feel like compute is one side, but how we structure the problem and how we approach it is another side, and that's where we're more confident that we can solve the issue.

What can you tell me about M2.2, if it's launched by the time the interview is out?

Can you give me an overview?

Better coding, obviously, and better multilingual coding, and more stable than before.

It has better performance than 2.1 in different areas: it's better, more stabilized, with longer horizons, and stuff like that. We are testing it in different environments right now, and we believe it's better than before.

So different coding environments, right?

Even environments that we haven't seen before, even environments that are totally out of distribution, we see some very promising scores that are higher than 2.1.

I wonder how you stay updated on everything that happens, which is super hard because the pace is just insane.

You said when the models are out, you're playing with them.

Do you read research papers?

What are your other interests that help you cross-pollinate with what you do?

Can you tell me how you stay up to date and what inspires you?

There are different articles and different blogs going out every single day, in bunches, all this information.

How we deal with it is that we have an internal agent that tracks all the new articles, blogs, and papers, and then it dispatches them to different subjects, summarizes them, and analyzes them for the researchers.

So we have an internal "researcher," if I can call it that, which does some filtering by itself and then gives what is filtered to us, and we can improve this researcher if we think it doesn't do well. That's how we filter out a lot of information first. And then we play with new code repositories using coding agents, so that we can understand them more quickly and play with them more quickly.

So we're keeping up with, you know, all the improvements with agents and with our models, for our models.
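
[Editor's note: a purely illustrative sketch of what such a literature-tracking pipeline could look like; this is not MiniMax's internal tool, and every function body is a placeholder for real scrapers and an LLM summarizer.]

```python
SUBJECTS = ["reinforcement learning", "agents", "evaluation", "infra"]

def fetch_new_items() -> list:
    # Placeholder: plug in arXiv/RSS/blog scrapers; canned data keeps this runnable.
    return [{"title": "Scaling agentic RL", "subject_hint": "agents"}]

def classify(item: dict) -> str:
    # A real system would use an LLM or a classifier; here we trust the hint.
    return item.get("subject_hint", "other")

def summarize(item: dict) -> str:
    # A real system would call a summarization model.
    return "summary of: " + item["title"]

def daily_digest() -> dict:
    digest = {s: [] for s in SUBJECTS}
    for item in fetch_new_items():
        subject = classify(item)
        if subject in digest:          # the filtering step: drop off-topic items
            digest[subject].append(summarize(item))
    return digest

print(daily_digest())
```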

That's fascinating.

When you became a researcher, when you chose this path, what did you think you would be doing, and what are you actually doing? Is it close to what you thought?

That's a really good question.

When I joined the team, I thought I would be reading papers every day, because that's what I was doing during school, right, in a lab.

We would read papers, come up with ideas, implement them, run experiments.

If the experiment results are good, we run it at a larger scale.

I thought I was about to do that.

But then what I realized was that, after joining the company and working for a couple of months, you're already pretty much at the top of the area, or of the industry, and you have to come up with something that's really new, or you encounter problems that you just don't know how to solve.

It's not like you can read a lot of papers and then build up your thinking on those papers.

It's more like you need to really understand the problems from the fundamentals, and think from the fundamentals, so that you can find the right solution.

Another thing would be that engineering is very important.

I didn't know that during school, because during school, in labs, it's more toy-scale compared to companies.

It's not that scaled up.

But when you really scale up data, scale up compute, scale up people, you encounter engineering issues that you need to tackle very beautifully, and engineering is very important.

That's the second part that was different from what I imagined.

Pretty much these two.

When you work on the model currently, is it mostly that you're solving problems you see immediately from your hands-on work, or is it that the company says, oh, we have to achieve, let's say, Opus-level results? How do you set the goals?

We have a meta goal at the company level; for example, we want to improve AI's capabilities in improving productivity, because that's how people view it. So we have the company's mission, and as individual researchers in the team we have our own missions, our own goals that we set.

What is your goal currently, for the next generation?

I really want the model to work elegantly with experts. So it's more like better collaboration with experts, with developers.

That's my goal as well.

But that's maybe like two versions away.

I think we're launching one version about every month or month and a half, right?

For the longer horizon, we are definitely working on it.

But for me, for the goal that I set along that path, that's like a three-month-away thing.

But for the better collaboration thing, that's like a one- or two-month-away thing.

I wanted to ask you a little clarification question about something you mentioned while talking at AI Engineer: that the model doesn't settle on one action; it's constantly in a loop of asking more questions and trying things. How do you look at it? Is it continual learning? Is it part of it? What do we need to solve to have the model continuously doing this learning over longer and longer horizons?

That has some overlap with the defined concept of continual learning. By overlap I mean both conceptually and technically, I think, but I don't feel like they are exactly the same; the things that I talked about at the summit were not at the level of full continual learning. It's more like on the path to that.

How do you see it being solved? Any ideas?

We do think that's a different problem definition, or a different way of the model working with people, and we are working on that norm with our own defined question. But if I need to say how we approach it, I would say we approach it through experiments. That's a very interesting question on continual learning, and it's still very exploratory, right? That's definitely what we are going at, but it has different phases, or different stages. We might approach stage one first while exploring more stages later, and we haven't outlined the stages yet.

On outlining the stages: we do have our internal definitions, which I didn't prepare today.

I would say the first would be to be more stabilized in long-horizon tasks, what I said at the summit, right, and then the next thing would be optimization.

Can you repeat that? Because people don't know what you said.

So for example, we see a model receiving, you know, environment feedback in a new environment.

It needs to know what to explore and which parts of the environment to look at, because it's a partially observed environment.

It needs to know which actions to take to receive better information, and then react better, and then perform harder, more complex tasks in the environment.

That's like more of stage one, right?

That's pretty simple.

Basically, all agent models can do that to some extent.

Maybe not perfectly, but to some extent.

That's how we can actually solve it with our current algorithms.

But we do see different norms of how the model improves itself in the environment, where we don't have a concrete conclusion yet.

Maybe in 2.5 we will; that will be a different definition than what I said.

The model itself would be defining its own goal.

That's something that would be different.
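
[Editor's note: a toy sketch of the "stage one" behavior described above: an agent in a partially observed environment first spends actions gathering information, then acts on the task. Entirely illustrative, not MiniMax's code.]

```python
class PartialEnv:
    """Toy partially observed 'repo': file contents are hidden until probed."""
    def __init__(self):
        self.files = {"a.py": "bug here", "b.py": "fine"}
        self.done = False
    def reset(self) -> str:
        return "repo with files: " + ", ".join(self.files)
    def step(self, action: str) -> str:
        if action.startswith("read "):
            return self.files.get(action[5:], "no such file")
        self.done = True               # any non-read action ends the episode
        return "applied: " + action

def run_episode(env: PartialEnv) -> str:
    obs, memory = env.reset(), {}
    # Phase 1: explore to receive better information (partial observability).
    for name in ["a.py", "b.py"]:
        memory[name] = env.step(f"read {name}")
    # Phase 2: act on the harder task using what was learned.
    target = next((n for n, c in memory.items() if "bug" in c), "a.py")
    return env.step(f"fix {target}")

print(run_episode(PartialEnv()))
```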

Thank you so much.

My last question is about AGI.

Do you believe in AGI?

And if yes, how does it look to you?

Okay, that's a very large question.

People talk about AGI and ASI every day.

Actually, when I was interviewing with MiniMax, when I was interviewing with our CEO, I said the same thing.

He asked me the same thing, right?

What I said was that people talk about AGI, and people have different definitions of AGI, but we can only know the definition of AGI when we achieve it. It's still progressing so fast that the definition changes every day, and people have different opinions on it.

What I think is more important is that we actually work towards it, towards our own definitions of AGI, and as long as we figure it out, it becomes true.

That's what I said during the interview and that's still my view today.

The definition will become true when it becomes true.

When we see it, we know it's AGI.

Yes, exactly.

But we're not there yet.

No, there can still be better AI intelligence for sure.

Thank you.

One more last question.

What was the book that influenced you the most?

And it can be a recent book or a book from your childhood.

Let me just double check the name though.

Something like The Art of Creativity, or something like that, which I read during undergrad, so it's been a long time; I don't remember the exact name. Yeah, there is a book named something like The Art of Creativity.

How did it influence you?

It opened up how I think about my own mind a lot, and how I view the world and how I view problem solving. For me now, problem solving is more of a discovery. That's how I would summarize it in one quote.

Thank you so much. Thank you for your time. That was very interesting.

Thank you for having me.
