Latent Space
January 23, 2026

Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay 2

Captaining IMO Gold: Why General Models Win the Reasoning War by Latent Space

Author: Yi Tay

Date: January 2026

Quick Insight: Yi Tay explains how Google DeepMind traded specialized math solvers for general-purpose reasoning to secure IMO gold. This summary breaks down the shift toward on-policy RL and why data efficiency is the next frontier for AGI.

  • 💡 Why did DeepMind abandon specialized symbolic systems for a general model?
  • 💡 What is the "on-policy" advantage in human and machine learning?
  • 💡 How does AI coding change the junior researcher role?

Yi Tay returned to Google DeepMind to lead the Reasoning and AGI team in Singapore. He helped captain the effort that secured a gold medal at the International Mathematical Olympiad using Gemini.

The On-Policy Advantage

"Correct your own path instead of trying to imitate other people's path."
  • On-Policy Learning: Models train on their own generated trajectories rather than just mimicking static datasets. This allows for better generalization and error correction.
  • Montessori Intelligence: This mirrors unstructured human learning where discovery beats rote memorization. Models that fail forward develop more robust world models.
  • Learning Rate Updates: Humans often update their priors too slowly when proven wrong. High-performance researchers treat a single counter-example as a reason to pivot their entire mental model.

Generalist Supremacy

"If the model can't get to IMO gold then can we get to AGI?"
  • Tool Subsumption: Specialized symbolic systems were discarded in favor of a single end-to-end model. This proves that general reasoning can absorb domain-specific logic.
  • Inference Scaling: The breakthrough came from scaling compute at the moment of thinking rather than just pre-training. This shifts the bottleneck from data volume to inference-time logic.

The Sweet Lesson

"Ideas matter and there have been a lot of good years in the last 5 years."
  • Data Efficiency: We are approaching the limits of raw token volume. Future gains will come from extracting more signal per token through better algorithms.
  • Research Taste: High-stat talent is moving toward finding tricks that compound. Scale is the engine but architectural intuition is the steering wheel.

Actionable Takeaways:

  • 🌐 The Macro Shift: The transition from more data to better thinking via inference-time compute. Reasoning is becoming a post-training capability rather than a pre-training byproduct.
  • ⚡ The Tactical Edge: Use AI coding tools like Antigravity to automate bug fixes and data visualization. Treat the model as a passive aura that buffs the productivity of every senior engineer.
  • 🎯 The Bottom Line: AGI will not be a collection of narrow tools but a single model that reasons its way through any domain. The gap between closed labs and open source is widening as these reasoning tricks compound.

Podcast Link: Click here to listen

The thing I find most useful about these models in general is when I have big spreadsheets with a lot of results and I want to understand them as plots. The models can take a screenshot and make a plot of it. I hate writing this matplotlib stuff myself, it's so annoying.

There were so many moments this year where AI suddenly crossed that threshold. I think AI coding is one of them, as we just discussed. I think Nano Banana also got to that point. I usually make these images just for fun, to troll a friend or something like that, but Nano Banana really got so good.

Welcome back. How are you? Yeah, I'm good. I'm good. Great to be back. It's been one and a half years. Yeah, it's been one and a half. Feels like a long time.

So last time we talked, you were at Reka. Yeah. And then you rejoined GDM, working for Quoc again. Yeah. And more recently, you've started GDM Singapore. Yeah. Is it GDM Singapore? Gemini Singapore? I don't know if you've named the team. Oh, I think we have a Gemini team in Singapore. Yeah, a team in Singapore. It's called Reasoning and AGI.

Is it important to have AGI in the name? It was kind of a vibe thing that we put AGI in. Yeah, I think one reason why we work on these models is that we want to get to AGI, and it was a vibe thing that we added AGI to the job posting. Yeah, there's no formal name for the team yet, but it's basically the Gemini team in Singapore.

I mean, I think people are trying to triangulate: Amazon has an AGI team, you guys have an AGI team, and then Meta now has a superintelligence team. What are people signaling when they choose these names for their teams? Is it "oh, we have a plan," or is it just vibes?

You're trying to fish for hot takes. No, you officially have AGI in your job title. No, the name is not really a thing. It's just, you know, we want to signal the north star of bringing these models to AGI. Yeah. Yeah. No, I wasn't really fishing for politics. Okay.

So, you rejoined GDM. Yeah. And I listened back to the whole thing from last time; it was an amazing episode. You were talking about how it felt to leave Brain and see things from the outside, and now you're back in GDM. Yeah. I wonder, what are your general reflections on plugging back into the Google infrastructure?

Oh yeah. So coming back is very interesting, because when you return to Google everything, including your LDAP username, is all the same. It's like you play Pokémon, you leave it aside, and then you go back and click continue on your saved game. Yeah, you save the game and continue the game. It's like that.

Obviously, in the one and a half years I was away many things changed. Brain is now part of GDM and so on. So a lot has changed, but overall coming back has been pretty seamless. Obviously I love Google infrastructure and I think TPUs are great and stuff like that. Yeah. And I'm very glad to be back on Google infra. Yeah.

And was the intention always that you were going to work on Deep Think? No, not really. I missed research a lot, doing research, not super fundamental research, but close-to-the-model research, right? I really missed being at the frontier and trying to go beyond it, right?

So I really missed that a lot, and when I came back Deep Think wasn't a thing yet; I don't think there were any plans actually. It was just: I'm going to work on research and see what happens. Yeah, I'm sure there was some inclination that reasoning is the next frontier, and that's obviously been the most rewarding research path, especially this year.

Yeah, I think these days reasoning and RL basically go together. I spent a lot of my past life, I call it the past arc, working on architectures and pre-training, but now I've transitioned more into RL research. Not old-school RL, the games kind of RL; to be honest I had almost no RL background coming back. But RL is the main means of modeling these days, so it was pretty easy to jump back in. A lot of fundamental skills in research are general purpose and universal, and it's quite easy to innovate even in a tool set you're not super used to. So RL is basically the main modeling tool set we play around with.

Yeah, superficially I see some overlap with your UL2 and Flan-T5 work, the focus on objectives and on what you're trying to incentivize, so I would have guessed there was more overlap than you're saying right now, which is interesting. No, I understand, it's quite superficial; the shift is in the objective, and they do have some overlap, right? Yeah, I think it's mainly the on-policy versus off-policy-ness of how these things are designed that changes, and also the learning algorithm itself, right? Let's introduce this terminology for people who aren't that familiar with RL policies; I do think a lot of people are trying to understand what is working about this generation of RL research anyway. Jason had this interesting post, which I think you were co-signing, which is basically: you always want to be on-policy. Instead of mimicking other people's successful trajectories, take your own actions and learn from the reward given by the environment; basically, correct your own path instead of trying to imitate other people's path.

Yeah. Yeah. And first of all, he writes really well and I wish that more people wrote like him. But I don't know what's your reflection on that or your addition on top of that?

Yeah. So I think the simplest way to frame on-policy versus off-policy is: when you SFT something, that's off-policy. Basically you take some other, larger model's generated outputs, trajectories, whatever, and you train off somebody else's data. On-policy is the core idea of modern LLM RL, where the model generates, you reward the model based on its own generations, and then the model trains on its own generations. Yeah.

So it's a bit like self-distillation to some extent. The model generates its own output, you reward it, and it trains on its own output. So on-policy is basically this idea of the model training on its own outputs: letting the model generate its own trajectories, letting some reward or verifier score them, and then training the model on its own outputs. I think this generalizes better.
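
To make that distinction concrete, here is a minimal toy sketch (not DeepMind's training code; the tiny "model" and the reward function are hypothetical placeholders) contrasting an off-policy SFT step, which imitates a teacher's answer, with an on-policy REINFORCE-style step, where the model samples its own output, an external reward scores it, and the gradient is taken on that same sample:

```python
# Toy sketch: off-policy SFT vs. on-policy RL for a "model" that is just one
# softmax over a four-token vocabulary. Hypothetical, for illustration only.
import torch
import torch.nn.functional as F

vocab = ["2", "3", "4", "5"]
logits = torch.zeros(len(vocab), requires_grad=True)   # the entire "policy"
opt = torch.optim.SGD([logits], lr=0.5)

def reward(token: str) -> float:
    """Stand-in verifier: reward 1.0 if the sampled answer is "4" (say, for 2+2)."""
    return 1.0 if token == "4" else 0.0

# Off-policy (SFT): imitate a fixed trajectory produced by some other model or teacher.
teacher = torch.tensor([vocab.index("4")])
for _ in range(10):
    loss = F.cross_entropy(logits.unsqueeze(0), teacher)
    opt.zero_grad(); loss.backward(); opt.step()

# On-policy (RL): sample from the current model, score it, and update on that sample.
for _ in range(10):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                      # the model generates its own output
    r = reward(vocab[action])                   # external reward / verifier
    loss = -dist.log_prob(action) * r           # reinforce the model's own generation
    opt.zero_grad(); loss.backward(); opt.step()

print(F.softmax(logits, dim=-1))                # probability mass concentrates on "4"
```

The ordering mirrors the recipe described in the conversation: imitate first (pre-training and SFT), then switch to learning from rewards on the model's own outputs.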

I think there's still a lot of like science out still to be done about the gap between SFT and and RL itself. But I think basically on policy and off policy right and I think bring this analogy back to real like life. I mean we this on policiness is more like like humans we are more on policy because we go around the world we make mistakes and then we ah okay this is but like imitation learning is supposedly somebody else not first principle it just tells you what to do and they just copy. So I think yeah this philosophy bringing back philosophy to life is quite like powerful like when I like now I have a kid and everything like want my kid to try stuff and then you tell them like okay this is like where this went wrong where this went right and stuff rather than okay you just copy everything somebody else does.

Yeah, Montessori schooling is mostly that, right? Very unstructured learning: you discover your own path and we just give you a safe environment to do it. Yeah. What is the point at which you should transition from imitation to on-policy? We do bounce back and forth, with humans, right? Not models, right?

I would say in models it seems like there has mostly been a very concrete split: first you imitate, that's pre-training, and then you do RL at the end. Technically SFT is still imitation. But for humans it's also a little bit like this, right? Take sports: when you play sports you start off by imitating, hardcore imitating, but you cannot imitate forever. I don't know whether this is a good analogy, but watching a lot of tutorials is more like imitating, you try to learn certain movements, whereas the on-policy-ness is going into the game itself and trying to get a reward signal from it, right? So I think humans do need some form of imitation learning; everybody starts off by imitating. But then again, the human and model analogy, it's fun to have analogies, but we shouldn't take things super literally. I'm actually quite a serious taker of machine learning insights into human learning; that's what we learn from models now. Yeah, because I think machine learning is the most scientific way we have ever studied learning in general. That's true, that's where we had to invent curricula from scratch, and things like learning rates.

If your learning rate is too high, or your learning rate is too low... Like, wait, do humans even have a learning rate? So I do tell people to keep an idea of their own learning rate and to be wary of it being too low. For example, if you've been wrong once, you should ask: where else have I been wrong? And typically, you know what I mean, people usually update slower than they should when they have been wrong. Where does that come from? Stubbornness?

It could be stubbornness. I don't know, is that the right word for it? It could be that they're too Bayesian, when actually their prior assumptions are wrong and they need to completely throw out their previous assumptions, because one counterexample invalidates all prior experience. Your entire world model is wrong; throw it away. So being Bayesian about it is actually wrong. Let's say you live for 10 years under some assumptions and you get one example that breaks your narrative. Okay?

You shouldn't be like, "Okay, now I have a 2% update." No, actually it should be like, "Oh, something's really freaking changed. Everything I've assumed for the last 10 years is probably wrong. What else am I wrong about?" And update 20%, update 50%, not 2%. You know what I mean? That's the learning rate thing for me. So my direct example is the whole getting-into-AI thing. I was watching GANs for 10 years. Yeah. Has it been 10 years? 2012, 2013. Time flies. Yeah.

I was watching GANs and I was like, okay, this is cool, it's getting more detailed, not that impressive. Then all of a sudden Stable Diffusion came out and you could run it on your laptop, and that was my learning rate moment: my mental model of generative images did not include this. So I was like, okay, I am very wrong and I need to pivot everything, and that's how I started Latent Space. So does this mean your learning rate is high? Yes, I will nudge it up; I schedule my learning rate when a world model has been violated. Okay.
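
To put rough numbers on that intuition (toy figures of my own, not from the episode), a quick Bayes calculation shows why one observation that is far more likely under "my model is wrong" should move a long-held belief by tens of percentage points rather than 2%:

```python
# Toy Bayes update (illustrative numbers only): how much should one strong
# counterexample move a belief you've held for ten years?
prior = 0.95             # P(my world model is right), after years of confirmation
p_obs_if_right = 0.02    # the surprising event is very unlikely if the model is right
p_obs_if_wrong = 0.60    # but quite likely if the model is wrong

posterior = (p_obs_if_right * prior) / (
    p_obs_if_right * prior + p_obs_if_wrong * (1 - prior)
)
print(f"belief drops from {prior:.0%} to {posterior:.0%}")  # ~95% -> ~39%, not a 2% nudge
```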

I think it's a good strategy. This also gets a little bit at how fast people adopt new paradigms when they happen, or invalidate their understanding of things. As scientists, a lot of times, as the field progresses, we have to keep invalidating our own world models. Something could have been the way to do things all along, and suddenly something comes along and invalidates it. Yeah.

You can be very proud of your priors until they become your prison. Yeah, I know. That's actually very dangerous. Yes, it is. Okay, that was a bit of a tangent; I don't know how we got there. You did highlight Denny's LLM reasoning lectures, where he traced the intellectual history of reasoning in LLMs from chain of thought to RL fine-tuning. And the one part I was going to prompt you on a little was self-consistency, right? Which I think people roughly know. I think it's more crudely implemented at OpenAI than with you guys, where it's straight up: they run eight inferences and they judge, or whatever. But I do think that's also relevant to on-policy distillation, where literally you have eight different passes and they're all from the same model. So I'm checking my intuition there.

Basically, the stuff you're saying about why on-policy is important, and about using, say, an external verifier to improve your reasoning, you can also do that with parallel reasoning. Oh yeah. I mean, when we train reward models, they sample multiple times, so to some extent there's some form of that. Is self-consistency directly that? Self-consistency is a bit more nuanced; if you talk to Denny he will tell you it's not majority voting, for sure, it's a more nuanced version of that, but I think parallel thinking is definitely related to self-consistency. Yeah. OpenAI has also put out some interesting papers on majority voting versus other forms of multiple-output consensus, where at the highest level there's an actual LLM judge that decides this trajectory is more worthwhile or more valid, based on some internal consistency or just inspecting the chain of thought, which is very cool, that we can train models to do that.
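
For readers who want the mechanics, here is a rough sketch of the crudest member of this family, majority voting over parallel samples. Self-consistency proper marginalizes over sampled reasoning paths rather than just counting final strings, and the approaches described above replace the vote with a reward model or an LLM judge; the generate function below is a hypothetical stand-in for a real model call:

```python
# Rough sketch of majority voting over parallel samples, the crude cousin of
# self-consistency. `generate` is a placeholder for sampling a chain of thought
# from a model at temperature > 0 and parsing out the final answer.
from collections import Counter
import random

def generate(prompt: str) -> str:
    return random.choice(["72", "72", "72", "68", "81"])   # noisy, biased toward "72"

def vote(prompt: str, n: int = 8) -> str:
    answers = [generate(prompt) for _ in range(n)]          # n independent trajectories
    return Counter(answers).most_common(1)[0][0]            # keep the most consistent answer

print(vote("A train travels ... what distance?"))
```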

Yeah, for sure. Self-consistency is a big fundamental idea. I mean, chain of thought itself was also a big idea, and then self-consistency was another big fundamental idea in the modern literature. Yeah. Amazing. Okay, so let's bring it to one of the headlines of this podcast, which is diving into the IMO work. So this was in July. In July you guys announced... oh, this very nice photo here. This is the photo I was looking at. This is in London, I believe, where you had this room. Yeah, this room. Oh, you had to be at the photo taking to get the credit, is that right? No, no, I'm just kidding. The contributor list is bigger than this. Yeah. But they were saying, oh okay, you should go there in order to get a literal gold medal. No, no, no, to get the credit for being on the IMO effort. It's just a joke. Yeah.

But anyway, okay, could you tell the story of this IMO effort? Apparently it was done in one week. So let me clarify a lot of things. The IMO effort has been very longstanding. Thang and co have been working on this since last year, right? Last year, I was not back at Google at the time, they had the AlphaGeometry stuff and then AlphaProof and so on. So it's a very long-running effort, but this year we wanted to actually use Gemini as an end-to-end model, no second system, just a text-in, text-out model, and even that was not an intuitive thing. I covered the silver result from last year and I was like, okay, it's pretty close, it's one point off gold, just try harder and you'll get gold. That decision to abandon it I think was pretty bold. I don't know, I personally always believed, and in retrospect it's easy to say this, but it's a bit like: if the model can't get to IMO gold, then can we get to AGI? So at some point we have to use these models to try these olympiad competitions, and one of the goals this year was, okay, we're going to do an end-to-end text-in, text-out model. That's where my involvement came in. I was not involved in the whole IMO effort, only the model training part.

So I have to say that Thang did most of the IMO thing. I just trained the model with a bunch of others. What does that work involve? What's the surface area? So basically we prepared the model checkpoint for the actual IMO itself, right? That's also something that's easily overlooked about the IMO: usually when you chase benchmarks, it's something you can keep running and hill climbing until you get there. But the IMO was a live competition; some members of the team were in Australia for it, and the thing was unfolding live. Oh, it's very AlphaGo. You receive the problems, you punch them into your system, and then... Yeah. Yeah.

So some of the professors from Thang's team went to the IMO itself, the event, I don't even know whether the IMO counts as a conference, but there were people there in Australia, so it was a live thing, and there were people whose actual job was to run inference on the IMO problems, P1 through P6, as they came out. They also came out on different days, so it's different sets, day one, day two, something like that. The fun part is that I knew nothing about the IMO at all. I'm not a kid who took part in IMO, I'm too dumb for that. You're a piano player. Yeah, I have a piano. But all I knew was that, okay, we delivered the checkpoint and that checkpoint was used to do the IMO. But then there was somehow a week in London where everybody gathered. Everybody was flying to London, this photo was taken there, and you get to see how all the different parts come together, and also be in the rooms with the other co-captains. It felt a little bit like a hackathon.

So yeah, the training process of this IMO model itself was maybe a week or so, not the whole effort, basically. Yeah. I think the question is, I'm still not over the decision to throw away AlphaProof. Okay. Basically, I think it's very major, and I understand you have this goal of AGI, obviously at some point one model should do all of it, right? But if you had pointed a gun at me in 2024 and asked what you need to do IMO and IOI and ICPC and all the other stuff you guys did, I would have said you need an LLM reasoning system that knows how to operate a computer, knows how to write Lean and run a Lean verifier, and all of that. But basically you baked the Lean verifier into the chain of thought. Is that right?

So basically, it is not obvious that you can do that at all. Okay, so I think what you mean is that the system is in some way encoded in the model somehow. Yes. I mean, it's just whether at the end of the day you believe in connectionism, one model, lots of parameters. There's also tool use, right, there's also tool use and stuff like that. But to some extent, I think we should be able to get to a point where... in the past, when LLMs first started, the model could not even be a calculator. Now it can somewhat be a calculator, so technically a tool like a calculator is somewhat encoded in the parameters of the model. I think we will eventually get to a point where whether there are things that cannot be expressed in the parameters of the model is an open question. We don't know where that limit is, but I think we will keep pushing it. So whether it's something like a Lean system, or other things like a physics engine, we will continue to push that boundary. Yeah.

But I actually don't know whether there were a lot of debates about symbolic systems versus... yes, that's the word I was looking for. I don't really know whether there were; to me it was just, oh, let's train a model, someone told me to train the model and I trained the model. Basically there were people leading the overall IMO effort who decided this. I also think it's because these specialized systems are very one-off systems: you could create a chemistry engine, create a math engine, create whatever, right? But at the end of the day you want one model for everything. So I think this fits that direction a little more, where you have one model, and this model was also launched as Gemini Deep Think, as a general-purpose Gemini Deep Think, so it's basically unchanged, maybe with some config toned down a bit. Yeah.

So the inference-time config served to most people is different, and the full IMO inference config was only shipped to some mathematicians, just because of the inference cost, right? But it was good enough to be a general-purpose model. I think my take is that this intuition is what led to going toward one model, because with specialized systems there's no end, right? You can create endless specialized systems. The most I can see in the future is there'll be a model, and if there's something that really cannot be subsumed by the model, then you just use a tool or something, right? But my prediction is that most things can be subsumed by the model. Yeah, I mean, researchers are quite good at hill climbing. Wait, history would say you have a lot of evidence backing you up. Is this the model output here, right? Yeah, I think this is the model output. Yeah.

What do you see when you look at this? Obviously it looks like a well-written solution; it looks like something a real human mathematician would write. People did compare yours versus the OpenAI one, where OpenAI's is a lot more raw, or they had to clean up their versions. We don't have to talk about OpenAI, but what is interesting to you when you see this kind of output? I want to give a special disclaimer, which is that I know nothing about the math, right. So the wonderful thing about this era of LLMs is that you can be an AI researcher or engineer with no domain knowledge and still, in a sense, get a gold medal in a domain you don't know anything about. I can't parse this at all, it's foreign to me. But maybe a proof is a particular kind of chain of thought.

But I would say the other interesting thing some of your collaborators were talking about was: oh, this is the first example of reasoning in a non-verifiable domain, which to me... isn't a proof by definition verifiable? I just want to give you things to riff on, or debates that might be worth digging into. So I think that's good. Aside from proofs, there are a lot of domains that are non-verifiable, or not easy to verify. When people say non-verifiable, they often mean non-trivial to verify, or just not as easy as checking the answer to a math problem, because proofs are long form. That's also why it's non-trivial to verify them; you'd have to translate them into Lean and then do all kinds of things, right? So I think there's a lot of work to be done in these non-verifiable domains. Yeah, I'm getting into territory where I'm not sure what I can say and what I cannot say. Okay, so yeah, sure. Well said.

I think another thing that is an open topic of debate is how much domain-specific work or post-training was done, because you then went on to do the IOI and ICPC stuff as well, right? The same model. I was not directly involved in the ICPC, but I was related to some extent. That's all I can say. Yeah. Any other interesting call-outs, maybe just on the team? You called out Jonathan as a co-captain on this effort, and basically, how did the effort come together? So there were four captains for the IMO: two from London, Jonathan was from Mountain View, and I was from Singapore. So the four of us basically trained this model together. I'm also trying to be mindful of what I can say, but one interesting thing was that we were all in different time zones, and there's something very interesting about handing off the job; there's no fixed workflow for how captains work together. It's more like, oh, I'm going to board a plane now, I'll be AFK for 12 hours, so someone take over babysitting the run. Sometimes there are bugs and stuff; the job goes down sometimes. So it was very ad hoc, and it was really up to the captains how we decided to work together. It was also an interesting time because we were all flying. I think the London folks didn't have to fly, but I had to fly and Jonathan had to fly, and when you visit another office you have many meetings, so it was in and out of meetings.

And nobody really knew whether we would get gold at that point, because the IMO hadn't actually happened yet. Yeah, it was interesting and exciting, and then there was the whole process of getting verified by the IMO committee, but we're not going there. I had to learn a lot about how the IMO works. Apparently the gold cutoff is not even a fixed number, it's set on a curve, right? So there was a period where you just look at the score, and I was even watching the human participants and seeing what their scores were, because whether Gemini would get gold depended on how the humans did. So you're looking at it like, oh, if it's a certain percentage, do we get gold? To some extent you don't have any control over that, but you're curious, right? So I would say it was definitely more exciting, there's more adrenaline than just running a benchmark and getting a number, and it was a process that took some time. Yeah.

But overall, if you have specific questions you can ask, but this whole thing has been a highlight for me, this IMO effort. Yeah. I would say most people, if you had asked them maybe two years ago whether a model could get IMO gold, would have said impossible. Yeah. Then the silver from last year helped, right, but the fact that you can throw that system completely away, just take existing Gemini, scale up Deep Think, and run it for IMO gold, I think is also very non-consensus compared to last year. Yeah, definitely; to some extent I think researchers were also surprised. I wouldn't say surprised, it was more a pat-on-the-back kind of surprise, but we actually made a lot of progress, we as in collectively all the engineers and researchers working on Gemini. A lot of progress has been made; just look at how far we went in one year. And I also think, go back just five years, not two years, five, and imagine the outcome. Look at the state of AI now, the IMO and ICPC gold, and even things like Nano Banana. If you compared AI progress now with five years ago, I think people would think we had already reached some form of AGI. If you took these checkpoints and traveled back five years... someone should make a drama about this, but I think it's really quite impressive how quickly the field has moved. Yeah. The hard parts, you would say, were scaling inference? Hard in what sense, expensive, or hard as in the most brain power expended on the team? I saw some comments saying the hardest part was actually the inference optimization, or the very, very long-horizon inference that Deep Think needed compared to normal Gemini, stuff like that.

I didn't work on the inference-time scaling, so I wouldn't know. That is mostly it there. Oh, and then there was this: the code name was apparently IMO cat, which you named after your cat. Okay, it's not really official. I think I tweeted about it at some point, bring up the tweet. So IMO cat was not an official code name or anything; it's just that the config of the job was named something like IMO cat. You just need some kind of name. Yeah. I mean, I just like cats. Yeah.

That is mostly it, unless you want to bring up anything else. We have other researchy topics, but before I go into them, I did want to leave the floor open. What else should people know about the reasoning effort that's going on at GDM? Let me think of where to start. What do people need to know? It was really good. Yeah, that's what people need to know. Maybe an easy one to start with: a lot of people were focusing on academic benchmarks two years ago, last year maybe LMArena, and this year Pokémon is a very interesting reasoning, visual reasoning, and general long-horizon agent planning benchmark. You seem to focus on it a lot, and I think Gemini did very well on it, so it's something that's easy to talk about. I probably should say there's actually nothing specifically done for Pokémon, of course. Of course there's nothing specifically done for Pokémon. I think Logan had a tweet recently about the recent Gemini on Pokémon Crystal, and it's so much more efficient. I used to play a lot of Pokémon and I'm a big Pokémon fan in general, and like you say, it's a great long-horizon benchmark. It's good to check in once in a while on these benchmarks that almost never get contaminated, where people don't actually spend time hill climbing. It would be kind of silly: "What are you working on?" "Oh, I'm working on AIME, I'm working on this or that benchmark, I'm working on Pokémon maxing," or something like that. That's kind of funny.

We did interview the Claude Plays Pokémon guy, I think his name is David, and it showed serious flaws in Anthropic's screen understanding and vision capabilities. It literally couldn't tell: I'm trying to get past this wall, but you just keep running into it; it doesn't know where the wall is. So it doesn't have any spatial reasoning at all. I mean, some of that could be the harness, or also whether the model has access to game state information or it's purely visual. Yeah. Claude's implementation is very game-state heavy; they effectively dumped all the memory of what's going on in the emulator. Yeah, I see. I don't know whether I'm going off on a tangent, but I think solving Pokémon is going to become more about how fast you solve it. The thing I have not really seen so far is whether the model can complete the Pokédex.

Why is that? Is it that much more challenging? No, completing the Pokédex is so hard. You need to plan, you need to search up information; there are some things you will never know if you don't go online, so basically you need a little bit of deep research for this. The model will just never know that it needs to trade. Okay. If it's able to go online, post on forums, and find someone to say, hey, can I trade Pokémon with you to evolve? Some Pokémon need to be traded to evolve. Yes. But anyway, I have not seen a model able to complete the Pokédex; that's actually really hard for models. So I think that's an interesting one. Yeah. I wonder what the real-world analogy would be: if we had a model capable of doing that, what could we make it do that we cannot do today?

Oh, there's a lot of planning involved, real deep-research-style planning. There's also a lot of planning involved, and the main Pokémon game itself is very linear, right? The Pokédex involves a lot of backtracking and research. Yeah, a lot of research. So it's probably a different nature of task. Is that as interesting to you as, for example, what a lot of people in the AI-for-science world are trying to do, discovering things that you cannot look up? New knowledge. Yeah, new knowledge. Because basically what you're saying is we're not even there yet; we're at the place where models cannot consistently apply knowledge that they look up, right? Like, you give Gemini access to web search and you say, okay, go try to collect all the Pokémon in the Pokédex; you don't have high confidence that it will do it. I don't know if someone's actually tried. Probably not, right? No, I think the hard part is actually trying to synthesize the web knowledge and then apply it in the game itself, with all that visual state going on and so on. It probably will be solved at some point. It is challenging, but it's not super interesting: the task really is, can you look up the guide and then can you apply the guide? That's it. You know what's even more intelligent than that? Creating the guide, being the first to figure out how to create the guide. Oh yeah, but that is mostly an exhaustive search thing; the model just has to try and try, like the humans who wrote the guide did. Okay, so that's actually less interesting to you. Interesting. Okay, not super interesting, but it's fine; I just have not seen a model try to do it.

I think efficient search of a novel idea space is interesting. Obviously you can brute force anything, but we're not talking about brute forcing; we're talking about trying to create an AI scientist. New knowledge is actually an interesting thing that I think is going to be quite a big deal, being able to generate new knowledge. Google has done stuff there, though you're probably not that close to the teams doing AI scientist work. For example, if you freeze the model's knowledge at 2015, and then, even with a current model, assume there's somehow no leakage of information, and you ask the model what's the best ML method, okay, not 2015, say 2012 or something, it would just tell you SVMs are the best, that's the way machine learning works in general, right? Then the question is: can it invent the transformer? Even today's models might not be able to invent the transformer if you freeze their knowledge at a certain time. I mean, the model is a transformer, so just assume there's no leakage. So I think there are still a lot of open questions about whether the model can really innovate and generate really new knowledge. Yeah.

One related
