OpenAI’s o3 Model Is Fantastic. What Does That Mean for AGI and the Hardware Market?
ChatGPT creator OpenAI has just released a sneak preview of its newest models coming next year, o3 and o3-mini. This new generation of reasoning models builds on the just-released o1 and o1-mini in a big way. Initial ARC-AGI benchmark results had o3 scoring as high as 87%, higher than the average human score of 75%. Does this mean that the research team at OpenAI has achieved AGI with o3?
(Source: ARC)
The answer isn't clear-cut, but most people agree that o3 is probably not “AGI”. To understand why, one must peer into the actual tasks in the ARC-AGI challenge, the mechanics of how o3 works, and ARC’s definition of AGI.
Let’s start off with what the ARC-AGI challenge actually is. The benchmark was designed to be an indicator of a model’s ability to acquire new skills and adapt to new situations on the fly. Essentially, it benchmarks a model’s intelligence against a human’s by throwing situations at it that it has not been trained on and evaluating its ability to reason through the problem. To a human, all the tasks in the benchmark are quite easy; you can go try them out at this link right here. Each task is just a basic pattern-recognition problem, something you’d probably find on a toddler’s IQ test. That being said, humanity’s frontier AI models up until this point have scored quite poorly, and it’s quite surprising that a model like o1, which scores well above expert humans on difficult AI benchmarks like GPQA Diamond, can only muster 32% accuracy on the toddler-level tasks in the ARC-AGI challenge. Even that is leagues ahead of what our frontier models could accomplish previously. Take a look at how performance on the ARC-AGI challenge has changed over time.
(Source: ARC)
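To get a feel for what these models are being graded on: each ARC task boils down to a few demonstration input/output grid pairs plus a test grid, and the solver has to infer the transformation from the demonstrations alone. The toy example below is my own illustration, not an actual ARC task, and its hidden rule is simply “mirror each row left to right”:

```python
# Toy ARC-style task: demonstration input/output grid pairs plus a test input.
# The hidden rule in this made-up example is "mirror each row left to right".
task = {
    "train": [
        {"input":  [[1, 0, 0],
                    [2, 1, 0]],
         "output": [[0, 0, 1],
                    [0, 1, 2]]},
        {"input":  [[3, 3, 0],
                    [0, 0, 4]],
         "output": [[0, 3, 3],
                    [4, 0, 0]]},
    ],
    "test": {"input": [[5, 0, 6],
                       [0, 7, 0]]},
}

def guessed_rule(grid):
    """A human spots the mirroring instantly; the benchmark asks whether a
    model can infer rules like this from the demonstrations alone."""
    return [list(reversed(row)) for row in grid]

# Verify the guess against the demonstrations, then apply it to the test grid.
assert all(guessed_rule(p["input"]) == p["output"] for p in task["train"])
print(guessed_rule(task["test"]["input"]))  # [[6, 0, 5], [0, 7, 0]]
```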
As the chart shows, progress on the challenge has really taken off since the GPT-4 era, and models have made larger leaps with each new generation. Does this mean we are on the brink of AGI? Perhaps the next-gen model will be the one to conquer the ARC-AGI challenge and score 100%? Contrary to what one might expect, ARC, the organization behind the benchmark, isn’t waving the white flag anytime soon. In a post that also acknowledges the great achievement by the OpenAI research team, ARC co-founder and renowned AI researcher Francois Chollet voiced his opinion that although o3 can outperform the average human on the challenge, he is hesitant to call the model AGI, given its inability to produce symbolic reasoning and its lack of a self-play mechanism. To decode what he means, one must investigate how the CoT (Chain of Thought) reasoning mechanism works in the o-series models.
Chollet describes all pre-o-series GPT models as merely “memorize, fetch, and apply” programs. They would ingest information via training, recognize patterns within it, and then, when prompted with an input, simply apply the patterns they had recognized onto your data. Think of it like a library of programs, one of which could be the “Capitalize all inputs” program. If you give the model a prompt like “Capitalize the following: abcdefg”, it will fetch the “Capitalize all inputs” program, apply it to your input, and hopefully return “ABCDEFG”. It learned this “program” by training on swaths of human data, where it picked up the contexts in which letters should be capitalized.
(Source: THA)
In reality, these programs don’t exist in any symbolic way (barring early research in mechanistic interpretability); they are just a hand-wavy representation of a collection of seemingly random numbers in a neural network. Even so, it is easier to think of that generation of LLMs in this library-of-programs way. This paradigm held for all OpenAI models up to and including GPT-4o. The new o1 models changed it with Chain of Thought. Models like GPT-4o had enough programs to apply to virtually anything the average Joe would need; any regular question about math or physics could be answered easily thanks to the millions of math and physics problems seen during training. Each input had one program that would execute and map it to an output. This system works well for tasks the model has seen before (or something analogous to them), but fails miserably when presented with a task it has not trained on.
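To make the library-of-programs metaphor concrete, here is a deliberately crude sketch. It is purely an illustration of the metaphor, not how an LLM is actually implemented:

```python
# A crude sketch of "memorize, fetch, and apply": a lookup table of learned
# "programs" that a prompt is matched against and then applied to the input.
PROGRAM_LIBRARY = {
    "capitalize": lambda text: text.upper(),
    "reverse":    lambda text: text[::-1],
    "sum":        lambda text: str(sum(int(n) for n in text.split("+"))),
}

def fetch_program(prompt: str):
    """Pick the stored program whose trigger word appears in the prompt.
    Nothing in a real LLM is this explicit, but the failure mode is the same:
    no matching program means no sensible output."""
    for trigger, program in PROGRAM_LIBRARY.items():
        if trigger in prompt.lower():
            return program
    return None

program = fetch_program("Capitalize the following:")
print(program("abcdefg"))                            # ABCDEFG
print(fetch_program("Rotate this grid 90 degrees"))  # None -- a novel task falls through
```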
The model can try to find a program to apply to a novel task, but the program it finds will likely not transform the input into a desirable or correct output. For example, GPT-4o had never seen many of the ARC-AGI challenge tasks before, which is why it failed to find a suitable metaphorical program to apply to them. Where CoT differs is that it unlocks the ability for the model to use multiple of its mini-programs to accomplish the task. Let’s use a simple example: addition and multiplication. Given an input like “4 + 3 x 8 =”, GPT-4o would use its “Addition and Multiplication” program on the input, which would hopefully map it to an output of 28. A CoT model would break the problem into bite-sized chunks: it would first use its “PEMDAS” program to see that the multiplication needs to happen first, then use its “Multiplication” program to multiply 3 x 8 into 24, and then use its “Addition” program to add 4 + 24 and reach the final answer of 28. Here is another, more visual example:
(Source: IBM)
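In code, the difference might look something like the toy sketch below, which is again just an illustration of the metaphor rather than anything resembling o3’s internals:

```python
# The one-shot "program": map the whole expression straight to an answer the
# model has effectively memorized from training data.
MEMORIZED_ANSWERS = {"4 + 3 x 8": 28}

def monolithic_solver(expression: str) -> int:
    return MEMORIZED_ANSWERS[expression]   # fails on anything unfamiliar

# The chain-of-thought version: break the task into steps the model does have
# reliable mini-programs for, and apply them one at a time.
def cot_solver(expression: str) -> int:
    left, right = expression.split("+")        # "PEMDAS" program: find the structure
    a = int(left)
    b, c = (int(t) for t in right.split("x"))
    product = b * c                            # "Multiplication" program
    print(f"step: {b} x {c} = {product}")
    total = a + product                        # "Addition" program
    print(f"step: {a} + {product} = {total}")
    return total

print(monolithic_solver("4 + 3 x 8"))  # 28, but only because it was memorized
print(cot_solver("4 + 3 x 8"))         # 28, reached via intermediate steps
```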
The implementation of CoT liberates the model’s thought process and allows it to mix and match its programs. Now, when given an ARC-AGI challenge problem, it might not have seen a similar task in its training data, but it can break the task up into bite-sized chunks that it does have programs for and finish the task that way. Unfortunately, there is no magic sauce yet for teaching a model CoT; current methods primarily rely on human-created chains of thought. This involves a human taking a task and writing out their own step-by-step solution to the problem. By training the model on these human-generated step-by-step solutions, the model can hopefully gain the generalized ability to mix and match steps into its own chain of thought for a novel problem.
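A human-written chain-of-thought training example might be structured roughly like the record below. OpenAI has not published its actual data format, so the field names here are invented placeholders:

```python
# Hypothetical shape of a human-authored chain-of-thought training record.
# OpenAI's real data format is not public; these field names are invented.
cot_training_example = {
    "prompt": "A train travels 60 km in 45 minutes. What is its speed in km/h?",
    "reasoning_steps": [
        "Convert 45 minutes to hours: 45 / 60 = 0.75 hours.",
        "Speed is distance divided by time: 60 / 0.75 = 80.",
    ],
    "final_answer": "80 km/h",
}

# During supervised fine-tuning, the model is trained to reproduce the steps
# and the answer as one continuous target string:
target = "\n".join(cot_training_example["reasoning_steps"]
                   + [cot_training_example["final_answer"]])
print(target)
```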
The first problem Chollet takes up with o3’s contentious AGI title is that the mini-programs CoT models use are merely “natural language instructions…rather than executable symbolic programs”. What he means is that when a CoT model forms its chain and arranges the mini-programs it has learned, those mini-programs are not hard-coded programs but just verbal instructions to the general o3 model on how to handle that bite-sized section of the problem. This is true: as stated previously, these “programs” are just a good metaphor for how an LLM accomplishes certain tasks, and they only exist as numbers that are not human-interpretable. Instead of a hard-coded function executing on the bite-sized problem, the program is merely a guiding prompt for a larger model on how to handle it. In Chollet’s view, this creates a major theoretical flaw that disqualifies the model from being classified as “generally intelligent”. The reasoning process produced by CoT is not grounded in reality, because the programs are just language instructions being evaluated by a larger model. This means that, in theory, a CoT model could still suffer from the same issues as a vanilla model: when it encounters a novel task that breaks down into novel bite-sized chunks, it might still not have the right programs to apply to those chunks.
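The distinction Chollet is drawing can be caricatured like this (my own sketch, with a made-up call_model helper standing in for an LLM call): an executable symbolic program either works or fails loudly, while a natural-language “program” is just a prompt whose interpretation is left to the same fallible model:

```python
# An executable symbolic program: deterministic, verifiable, grounded.
def capitalize(text: str) -> str:
    return text.upper()

def call_model(prompt: str) -> str:
    """Stub standing in for an LLM call (hypothetical, not a real API).
    A capable model usually follows the instruction, but nothing enforces it."""
    return "ABCDEFG"   # canned response for this illustration

# A "program" inside a chain of thought is closer to this: a natural-language
# instruction handed back to the same model to interpret however it sees fit.
def llm_step(instruction: str, text: str) -> str:
    return call_model(f"{instruction}\n\n{text}")

print(capitalize("abcdefg"))                                 # always "ABCDEFG"
print(llm_step("Capitalize the following text", "abcdefg"))  # probably "ABCDEFG"
```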
This lack of grounding makes the model, in principle, as reliant on brute force as vanilla models like GPT-4, which Chollet seems to think is disqualifying for AGI. His second reason why o3 is not AGI is that it cannot “autonomously acquire the ability to generate and evaluate these programs (the way a system like AlphaZero can learn to play a board game on its own.) Instead, it is reliant on expert-labeled, human-generated CoT data.” This refers to the lack of a self-play mechanism, like the one that let AlphaZero achieve expert-level chess, Go, and shogi skills just by training against itself in simulated games.
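For contrast, a self-play loop looks schematically like the sketch below. It is a bare-bones caricature of the AlphaZero recipe using a deliberately trivial game; the real thing interleaves tree search and repeated policy updates, but the key property is the same: all of the training signal is generated by the system itself.

```python
import random
from collections import defaultdict

# A toy stand-in for self-play: both "players" are the same system, each picks
# a move from 0-9, and the move closer to a hidden optimum wins. The game is
# trivial on purpose; the point is that all of the training signal (who won)
# comes from games the system plays against itself, with no human labels.
OPTIMUM = 7

def self_play(games=20_000):
    wins, plays = defaultdict(int), defaultdict(int)
    for _ in range(games):
        a, b = random.randrange(10), random.randrange(10)
        winner = a if abs(a - OPTIMUM) < abs(b - OPTIMUM) else b
        plays[a] += 1
        plays[b] += 1
        wins[winner] += 1
    # "Policy improvement": prefer the move with the best self-play record.
    return max(range(10), key=lambda move: wins[move] / max(plays[move], 1))

print(self_play())  # 7 -- discovered without any human-provided examples
```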
Of course, humans satisfy both of Francois Chollet’s qualms about o3. We can verify our own chains of thought through testing in real life, like how you test each individual part of a plane before assembling it (unless you’re Boeing). We can also acquire new skills, like learning how to juggle a soccer ball, all on our own. Even though o3 scored 87% on the semi-private ARC-AGI evaluation, it cannot yet do those two things, which is why Chollet argues that it is not deserving of the AGI crown.
How are the picks and shovels of AI going to change, given the transition to CoT reasoning models?
o3 is an entirely different beast when it comes to inferencing. Apparently, on o3-low, the model cost $2,000 in compute to run 100 tasks, which works out to $20 per task. On o3-high, ARC did not release the cost figure, but did reveal that it was 172x more computationally expensive than o3-low. Making the reasonable assumption that dollar cost scales roughly linearly with compute, o3-high would also have been about 172x more fiscally expensive than o3-low for the same 100 tasks. That comes out to around $344,000 for 100 tasks, meaning that if you were an end user right now, you’d be looking at a roughly $3,440 bill to ask it a question. To put that into perspective, cost wasn’t even a metric anyone thought to benchmark before o3; the same test run on o1 probably wouldn’t break the $500 mark, and GPT-4o is about 6x cheaper than o1.
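The back-of-the-envelope math, under the assumption that dollar cost scales linearly with compute, works out like this:

```python
# Back-of-the-envelope cost math, assuming dollar cost scales linearly with
# compute (ARC published the cost for o3-low and the 172x compute multiplier).
tasks = 100
o3_low_total_usd = 2_000          # reported cost for 100 tasks on o3-low
compute_multiplier = 172          # o3-high used ~172x the compute of o3-low

o3_low_per_task = o3_low_total_usd / tasks                  # $20 per task
o3_high_total_usd = o3_low_total_usd * compute_multiplier   # ~$344,000
o3_high_per_task = o3_high_total_usd / tasks                # ~$3,440 per task

print(o3_low_per_task, o3_high_total_usd, o3_high_per_task)  # 20.0 344000 3440.0
```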
The reason for the sudden increase in cost is the self-consistency CoT mechanism o3 likely employs. This mechanism adds a new axis of scaling for models, one where performance scales roughly logarithmically with the compute spent. Self-consistency is essentially a process in which the model generates some number of independent chains of thought for a single prompt, thinking through the same problem over and over again, each time with a slightly different thought process that may or may not land on the same result. Say it generated 1,024 chains of thought, meaning it answered the prompt 1,024 times. Out of those 1,024 answers, it picks the mode, in other words the most common answer, and that is the answer the model spits out. This is where the self-consistency name comes from, and the intuition is loosely analogous to Monte Carlo sampling: we do not know exactly why things are the way they are, but if a model thinks through a problem differently and still ends up at the same result, that agreement is a signal of a likely correct answer. The entire process is very computationally intensive; running 1,024 chain-of-thought roll-outs for each task is the equivalent of solving the task 1,024 times, and it can get arbitrarily expensive, since you could run 5, 10, 5,000, or even a million roll-outs per task. Nobody outside OpenAI knows exactly how o3 works, but the evident scaling of performance with inference cost likely means that the model’s great performance has something to do with some variant of self-consistency.
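Here is a stripped-down sketch of the self-consistency idea. To be clear, OpenAI has not disclosed o3’s actual mechanism, and the fake answer distribution below is purely illustrative:

```python
import random
from collections import Counter

def sample_chain_of_thought(prompt: str) -> str:
    """Stand-in for one full CoT roll-out. A real model produces a different
    reasoning trace each run because decoding is stochastic; here we just fake
    an answer distribution where the correct answer is only the most likely
    outcome, not a guaranteed one."""
    return random.choices(["28", "35", "56"], weights=[0.6, 0.25, 0.15])[0]

def self_consistency(prompt: str, rollouts: int = 1024) -> str:
    # Every roll-out is a full inference pass over the prompt, which is
    # exactly where the enormous jump in inference compute comes from.
    answers = [sample_chain_of_thought(prompt) for _ in range(rollouts)]
    return Counter(answers).most_common(1)[0][0]   # majority vote (the mode)

print(self_consistency("What is 4 + 3 x 8?"))  # "28" with overwhelming probability
```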
With self-consistency roll-outs as a new way to scale performance, there will likely be a new prioritization of inference-time performance. While pretraining will still be of utmost importance to the quality and general knowledge of the model, compute spent after pretraining, including at inference time, will most definitely take up a larger slice of available resources in the future. This is because of the aforementioned chain-of-thought roll-outs, which involve inferencing a prompt to completion X times per task. Each roll-out involves the model taking the prompt, breaking it down into bite-sized pieces, then running inference over the context once per additional token generated. Just like vanilla LLM inferencing, the GPU has to stream the model weights and the growing context out of memory for every new token, append that token to the context, and do it all over again, ad nauseam. Except instead of doing this once per task, it does it X times. Here is a helpful image that represents the inferencing of just one prompt:
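In code, that loop (and how roll-outs multiply it) looks roughly like the sketch below. Real inference stacks use KV caching, batching, and other tricks, but the nesting is the point:

```python
# Simplified view of why roll-outs multiply memory traffic: every roll-out
# repeats the whole token-by-token loop, and every token means streaming the
# model's weights (and the growing context) through the GPU.

def generate(prompt_tokens, model_step, max_new_tokens=200):
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model_step(context)   # one full pass, bottlenecked by memory bandwidth
        context.append(next_token)
    return context

def self_consistency_inference(prompt_tokens, model_step, rollouts=1024):
    # The exact same generation loop, repeated once per roll-out.
    return [generate(prompt_tokens, model_step) for _ in range(rollouts)]

def dummy_model_step(context):
    """Stand-in for a forward pass; a real one reads the full set of model
    weights out of HBM for every single token it generates."""
    return len(context) % 100   # fake "next token"

outputs = self_consistency_inference([1, 2, 3], dummy_model_step, rollouts=4)
print(len(outputs), len(outputs[0]))  # 4 roll-outs, each 203 tokens long
```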
This process happens for every prompt in a vanilla LLM like GPT-4, but happens for each roll-out in a self-consistency model like o1 or o3. With the o-series models, we have essentially scaled the inferencing workload by X, the number of roll-outs. This is a huge win for anybody in the AI space whose business revolves around bandwidth. The primary example has to be memory suppliers. The speed at which the GPU can pull data from memory quickly becomes the bottleneck in this back-and-forth process. As such, memory suppliers like Micron and SK Hynix have already sold out of HBM (High Bandwidth Memory) for all of 2025. Nvidia and AMD are champing at the bit (no pun intended) to get their hands on all the HBM3E they can. In light of the release of o3, Nvidia announced the B300 GPU, a new Blackwell chip whose main upgrade is cutting-edge 12-Hi HBM3E. Inferencing workloads might scale thousands of times faster than many anticipated. Just a few weeks ago, big voices like Marc Andreessen and Sundar Pichai suggested that scaling laws had hit a ceiling and that the low-hanging fruit of AI had already been picked. Neither of them is supremely technical with regard to AI nowadays, but it paints a striking picture to see OpenAI smash expectations using a new axis of scaling just weeks after those remarks.
Needless to say, inference-time scaling seems here to stay, and with it, high-bandwidth memory suppliers are among the biggest winners. A pure-play interconnect chip player like Astera Labs will likely see business boom even higher as the GPU-to-GPU interconnect bottleneck gets worse. And lastly, who could forget Nvidia, which sells the entire stack and will suddenly see a ton more demand for its GPUs for inferencing.