GPT-5: Everything You Need to Know - by Alberto Romero
Excerpt
An in-depth analysis of the most anticipated next-generation AI model
My dear free readers, I've decided to remove the paywall of my 14,000-word GPT-5 article. This change is temporary (a couple of weeks to a month) so I urge you to take the opportunity.
(My dearer premium readers, if you didn't get your hands on this one, it's never too late!)
Important note: I published this originally in April and I haven't updated it. Some minor things are missing (they don't change the main arguments or the general thesis):
- CTO Mira Murati said in an interview in June that the next-gen GPT model would be ready in a year and a half. Some people believe that's GPT-6 but it could be GPT-5. OpenAI is busy with search (SearchGPT), reasoning (Q*/Strawberry), and other peripheral products like GPT-4o mini and could be delaying the next GPT to include all these improvements.
- There's no mention of Claude 3.5 Sonnet or Llama 3.1 (I mention Claude 3 and Llama 3).
Anyway, the most valuable sections are in PART 3, which has nothing to do with timing or model comparisons and everything to do with juicy speculation about the things OpenAI could be doing but hasn't shared official info about, including reasoning research and agents.
Hope you enjoy it!
This super long article—part review, part exploration—is about GPT-5. But it is about much more. It's about what we can expect from next-gen AI models. It's about the exciting new features that are appearing on the horizon (like reasoning and agents). It's about GPT-5 the technology and GPT-5 the product. It's about business pressures on OpenAI by its competition and the technical constraints its engineers are facing. It's about all those things—that's why it's 14,000 words long.
You're now wondering why you should spend the next hour reading this mini-book-sized post when you've already heard the leaks and rumors about GPT-5. Here's the answer: Scattered info is useless without context; the big picture becomes clear only when you have it all in one place. This is it.
Before we start, here's some quick background on OpenAI's success streak and why the immense anticipation of GPT-5 puts them under pressure. Four years ago, in 2020, GPT-3 shocked the tech industry. Companies like Google, Meta, and Microsoft hurried to challenge OpenAI's lead. They did (e.g. LaMDA, OPT, MT-NLG) but only a couple of years later. By early 2023, after the success of ChatGPT (which showered OpenAI in attention), they were ready to release GPT-4. Again, companies rushed after OpenAI. One year later, Google has Gemini 1.5, Anthropic has Claude 3, and Meta has Llama 3. OpenAI is about to announce GPT-5 but how far away are its competitors now?
The gap is closing and the race is at an impasse again, so everyone—customers, investors, competitors, and analysts—is looking at OpenAI, holding their breath to see whether it can pull off, a third time, a jump that pushes it one year into the future. That's the implicit promise of GPT-5; OpenAI's hope to remain influential in the battle with the most powerful tech companies in history. Imagine the disappointment for the AI world if expectations aren't met (which insiders like Bill Gates believe may happen).
That's the vibrant and expectant environment in which GPT-5 is brewing. One wrong step and everyone will jump down OpenAI's throat. But if GPT-5 exceeds expectations, it'll become a key piece in the AI puzzle for the next few years, not just for OpenAI and its rather green business model but also for the people paying for it—investors and users. If that happens, Gemini 1.5, Claude 3, and Llama 3 will fall back into discursive obscurity and OpenAI will breathe easy once again.
For the sake of clarity, the article is divided into three parts.
- First, I've written some meta stuff about GPT-5: Whether other companies will have an answer to GPT-5, doubts about the numbering (i.e. GPT-4.5 vs GPT-5), and something I've called "the GPT brand trap." You can skip this part if you just want to know about GPT-5 itself.
- Second, I've compiled a list of info, data points, predictions, leaks, hints, and other evidence revealing details about GPT-5. This section is focused on quotes from sources (adding my interpretation and analysis when ambiguous), to answer these two questions: When is GPT-5 coming and how good will it be?
- Third, I've explored—by following breadcrumbs—what we can expect from GPT-5 in the areas we still know nothing about officially (not even leaks): the scaling laws (data, compute, model size) and algorithmic breakthroughs (reasoning, agents, multimodality, etc.). This is all informed speculation, so it's the juiciest part.
Here's the exact outline in case you want to skim:
- Part 1: Some meta about GPT-5
- Part 2: Everything we know about GPT-5
- Part 3: Everything we don't know about GPT-5
- In closing
Part 1: Some meta about GPT-5
The GPT-5 class of models
Between March 2023 and January 2024, when you talked about state-of-the-art AI intelligence or ability across disciplines, you were talking about GPT-4. There was nothing else to compare it to. OpenAI's model was in a league of its own.
That's changed since February. Google Gemini (1.0 Ultra and 1.5 Pro) and Anthropic Claude 3 Opus are GPT-4-class models (so is the upcoming Meta Llama 3 405B, still training at the time of writing). Long-overdue contenders for that sought-after title, but here after all. Strengths and weaknesses vary depending on how you use them, but all three are in the same ballpark performance-wise.
This new reality—and the seemingly consensual opinion among early adopters that Claude 3 Opus, in particular, is better than GPT-4 (after the recent GPT-4 turbo upgrade, perhaps not anymore) or that Llama 3 405B evals are already looking strong for intermediate checkpoints—has cast doubt on OpenAI's leadership.
But we shouldn't forget there's a one-year gap between OpenAI and the rest; GPT-4 is an old model by AI-pace-of-progress standards. Admittedly, the newest GPT-4 turbo version isn't old at all (released on April 9th). It's hard to argue, however, that the modest iterative improvements that separate GPT-4 versions are comparable with an entirely new state-of-the-art model from Google, Anthropic, or Meta. GPT-4's skeleton is 1.5 years old; that's what counts against Gemini, Claude, and Llama, which surely leverage the most recent research at deeper levels (e.g. architectural changes) than GPT-4 can possibly adopt just by updating the fine-tuning.
The interesting question is this: Has OpenAI maintained its edge from the shadows while building GPT-5? Or have its competitors finally closed the gap?
One possibility is that Google, Anthropic, and Meta have given us everything they've got: Gemini 1.0/1.5, Claude 3, and Llama 3 are the best they can do for now. I don't think this is the case for either Google or Anthropic (I'll skip Meta's case here because they're in a rather unique situation that should be analyzed separately).1 Let's start with Google.
Google announced Gemini 1.5 a week after releasing Gemini Advanced (with the 1.0 Ultra backend). They have only given us a glimpse of what Gemini 1.5 is capable of; they announced the intermediate version, 1.5 Pro, which is already GPT-4-class, but I don't think that's the best they have. I believe Gemini 1.5 Ultra is ready. If they haven't launched it yet, it's because they've learned a lesson OpenAI has been exploiting since the early days: Timing the releases well is fundamental for success. The generative AI race is just too broadly broadcast to ignore that part.
Knowing there's a big gap between 1.0 Pro and 1.0 Ultra, it's reasonable to assume Gemini 1.5 Ultra will be significantly better than 1.5 Pro (Google has yet to improve the naming part, though). But how good will Gemini 1.5 Ultra be? GPT-5-level good? We don't know, but given 1.5 Pro's eval scores, it's possible.
The takeaway is that Gemini 1.0 being GPT-4-level isn't a coincidence—the consequence of having hit a wall or a sign of Google's limitations—but rather a predefined plan to tell the world that they, too, can create that kind of AI (let me remind you that the team that builds the models is not the team in charge of the marketing part that Google so often fails at).
Anthropic's case isn't so clear to me because they're more press-shy than Google and OpenAI, but I have no reason to exclude them given that Claude 3's performance is so subtly above GPT-4's that it's hard to believe it's a coincidence. Another key point in favor of Anthropic is that it was founded in 2021. How much time does a world-class AI startup need to start competing at the highest level? Partnerships, infrastructure, hardware, training times, etc. require time, and Anthropic was just settling down when OpenAI began training GPT-4. Claude 3 is Anthropic's first real effort, so I won't be surprised if Claude 4 comes sooner than expected and matches anything OpenAI may achieve with GPT-5.
The pattern I see is clear. For each new state-of-the-art generation of models (first GPT-3 level, then GPT-4 level, next GPT-5 level) the gap between the leader and the rest shrinks. The reason is evident: The top AI companies have learned how to build this technology reliably. Building best-in-class large language models (LLMs) is a solved problem. It's not OpenAI's secret anymore. They had an edge at the start because they figured out stuff others hadn't yet, but those others have caught up.
Even if companies are good at keeping trade secrets from spies and leakers, tech and innovation eventually converge on what's possible and affordable to do. The GPT-5 class of models may have some degree of heterogeneity (just as happens with the GPT-4 class) but the direction they're all going is the same.
If I am correct, this takes relevance away from GPT-5 itself—which is why I think this 14,000-word analysis should be read more broadly than as just a preview of GPT-5—and puts it into the whole class of models. That's a good thing.
GPT-5 or GPT-4.5?
There were rumors in early March that GPT-4.5 had been leaked (the announcement, not the weights). Search engines caught the news before OpenAI removed it. The web page said the "knowledge cut-off" (up to what point in time the model knows about the state of the world) was June 2024. This means the hypothetical GPT-4.5 would train until June and then go through the months-long process of safety testing, guardrailing, and red-teaming, delaying the release until the end of the year.
If this were true, does it mean GPT-5 isn't coming this year? Possibly, but not necessarily. The thing we need to remember is that these names—GPT-4, GPT-4.5, GPT-5 (or something else entirely)—are placeholders for some level of ability OpenAI considers sufficiently high to deserve a given release number. OpenAI is always improving its models, exploring new research avenues, doing training runs with different levels of compute, and evaluating model checkpoints. Building a new model isn't a trivial, straightforward process but requires tons of trial and error, tweaking details, and "YOLO runs" that may yield unexpectedly good results.
After all the experimenting, when they feel ready, they go and do the big training run. Once it reaches the "that's good enough" performance point, they release it under the most appropriate name. If they called GPT-4.5 GPT-5 or vice versa, we wouldn't notice. This step-by-step, checkpointed process also explains why Gemini 1.0/1.5 and Claude 3 can be so slightly above GPT-4 without it meaning there's a wall for LLMs.
This implies that all the sources I'll quote below talking about a "GPT-5 release" may actually be talking, without realizing it, about GPT-4.5 or some novel kind of thing with a different name. Perhaps the GPT-4.5 leak that puts the knowledge cut-off at June 2024 will be GPT-5 after a few more improvements (perhaps they tried to reach a GPT-4.5 level and couldn't quite get there and had to discard the release). These decisions change on the go depending on internal results and the moves of competitors (perhaps OpenAI didn't expect Claude 3 to be the public's preferred model in March and decided to discard the GPT-4.5 release for that reason).
Here's one strong reason to think there won't be a GPT-4.5 release: It makes no sense to do .5 releases when the competition is so close and scrutiny so intense (even if Sam Altman says he wants to double down on iterative deployment to avoid shocking the world and give us time to adapt and so on).
People will unconsciously treat every new big release as "the next model," whatever the number, and will test it against their expectations. If users feel it's not good enough, they will question why OpenAI didn't wait for the .0 release. If they feel it's very good, then OpenAI will wonder if they should've named it .0 instead, because now they'll have to make an even bigger jump to get an acceptable .0 model. Not everything is about what customers want, but generative AI is now more an industry than a scientific field. OpenAI should go for the GPT-5 model and make it good.
There are exceptions, though. OpenAI released a GPT-3.5 model, but if you think about it, it was a low-key change (later overshadowed by ChatGPT). They didn't make a fuss out of that one as they did for GPT-3 and GPT-4 or even DALL-E and Sora. Another example is Google's Gemini 1.5, announced a week after Gemini 1.0 Ultra. Google wanted to double down on its victory against GPT-4 by doing two consecutive releases above OpenAI's best model. It failed—Gemini 1.0 Ultra wasn't better than GPT-4 (people expected more, not a tricky demo) and Gemini 1.5 was pushed to the side by Sora, which OpenAI released a few hours later (Google still has a lot to learn from OpenAI's marketing tactics).2 Anyway, OpenAI needs a good reason to do a GPT-4.5 release.
The GPT brand trap
The last thing I want to mention in this section is the GPT trap: Contrary to the other companies, OpenAI has associated its products heavily with the GPT acronym, which is now both a technical term (as it was originally) and a brand with a kind of prestige and power that's hard to give up. A GPT, Generative Pre-trained Transformer, is a very specific type of neural network architecture that may or may not survive new research breakthroughs. Can a GPT escape the "autoregressive trap"? Can you imbue reasoning into a GPT or upgrade it into an agent? It's unclear.
My question is: Will OpenAI still call its models GPTs to maintain the powerful brand with which most people associate AI, or will they stay rigorous and switch to something else (Q* or whatever) once the technical meaning is exhausted by better things? If OpenAI sticks to the invaluable acronym (as the trademark registrations suggest), wouldn't they be sabotaging their own future by anchoring it in the past? OpenAI risks letting people falsely believe they're interacting with just another chatbot when they may have in their hands a powerful agent instead. Just a thought.
Part 2: Everything we know about GPT-5
When will OpenAI release GPT-5?
On March 18th, Lex Fridman interviewed Sam Altman. One of the details Altman revealed was about GPT-5's release date. Fridman asked, "So, when is GPT-5 coming out, again?" to which Altman responded, "I don't know; that's the honest answer."
I believe in his honesty to the degree that there are different possible interpretations of his ambiguous "I don't know." I think he knows exactly what he wants OpenAI to do, but the inherent uncertainty of life allows him the semantic space to say that, honestly, he doesn't know. To the extent that Altman knows what there is to know, he may not be saying more because, first, they're still deciding whether to release an intermediate GPT-4.5; second, they're measuring their distance from competitors; and third, he doesn't want to reveal the exact date and give competitors the option to overshadow the release somehow, as they do all the time to Google.
He then hesitated to answer whether GPT-5 is coming out this year at all, but added: "We will release an amazing new model this year; I don't know what we'll call it." I think this vagueness is resolved by my arguments above in the "GPT-5 or GPT-4.5?" section: the name is arbitrary. Altman also said they have "a lot of other important things to release first" (some things he could be referring to: a public Sora and Voice Engine, a standalone web/work AI agent, a better ChatGPT UI/UX, a search engine, a Q* reasoning/math model). So building GPT-5 is a priority but releasing it is not.
Altman also said OpenAI has missed the mark before on "not to have shock updates to the world" (e.g. the first GPT-4 version). This can shed light on the reasons for his ambiguity on GPT-5's release date. He added: "Maybe we should think about releasing GPT-5 in a different way." We could interpret this as a hand-waving comment, but I think it helps explain Altman's hesitancy to say something like "I know when we'll release GPT-5 but I won't tell you," which would be fair and understandable.
It may even explain the notable improvement in math reasoning of the latest GPT-4 turbo release (April 9th): Perhaps the way they're releasing GPT-5 differently, to not shock the world, is by testing its parts (e.g. new math/reasoning fine-tuning for GPT-4) in the wild before bringing them together into a cohesive whole for a much more powerful base model. That would be equal parts responsible and consistent with Altman's words.
Let's hear other sources. On March 19th, the day after the Fridman-Altman interview, Business Insider published a news article entitled "OpenAI is expected to release a 'materially better' GPT-5 for its chatbot mid-year, sources say," which squarely contradicts what Altman had said the day before. How can a non-OpenAI source know the date if Altman doesn't? How can GPT-5 be coming out mid-year if OpenAI still has so many things to release first? The info is incoherent. Here's what Business Insider wrote:
The generative AI company helmed by Sam Altman is on track to put out GPT-5 sometime mid-year, likely during summer, according to two people familiar with the company [identities confirmed by Business Insider]. … OpenAI is still training GPT-5, one of the people familiar said. After training is complete, it will be safety tested internally and further "red teamed"…
So GPT-5 was still training on March 19th (the only data point in the article that's a fact rather than a prediction). Let's take the generous estimate and say it's finished training already (April 2024) and OpenAI is already doing safety tests and red-teaming. How long will that take before they're ready to deploy? Let's take the generous estimate again and say "the same as GPT-4" (GPT-5 being presumably more complex, as we'll see in the next sections, makes this a safe lower bound). GPT-4 finished training in August 2022 and OpenAI announced it in March 2023. That's seven months of safety layering. But remember that Microsoft's Bing Chat already had GPT-4 under the hood. Bing Chat was announced in early February 2023. So half a year it is.
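The back-of-the-envelope timeline above can be sanity-checked with a few dates (a rough sketch; the dates are the ones cited in this article, and months are counted ignoring the day of month):

```python
from datetime import date

def months_between(start, end):
    # Whole calendar months between two dates (day of month ignored).
    return (end.year - start.year) * 12 + (end.month - start.month)

gpt4_training_done = date(2022, 8, 1)   # GPT-4 finished training
gpt4_announced = date(2023, 3, 14)      # public GPT-4 announcement
bing_chat_launch = date(2023, 2, 7)     # Bing Chat, with GPT-4 under the hood

print(months_between(gpt4_training_done, gpt4_announced))    # 7 months of safety layering
print(months_between(gpt4_training_done, bing_chat_launch))  # 6 months until first deployment
```

Adding that six-to-seven-month buffer to an April 2024 training-completion estimate is the arithmetic that lands the release window in October 2024.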
All in all, the most generous estimates put GPT-5's release half a year away from now, pushing the date not to summer 2024 (June seems to be a hot date for AI releases) but to October 2024—in the best case! That's one month before the elections. Surely OpenAI isn't that reckless given the precedents for AI-powered political propaganda.
Could the "GPT-5 going out sometime mid-year" be a mistake by Business Insider and refer to GPT-4.5 instead (or refer to nothing)? I already said I don't think OpenAI will replace the GPT-5 announcement with 4.5, but they may add this release as an intermediate low-key milestone while making it clear GPT-5 is coming soon (fighting Google and Anthropic before they release something else is a good reason to release a 4.5 version—as long as the GPT-5 model is on the way a few months later).
This view reconciles all the info we've analyzed so far: Altman's "I don't know when GPT-5 is coming out" and the "we have a lot of other important things to release first." It's also in line with the doubling down on iterative deployment and the threat that a "shocking" new model would pose to the elections. Speaking of the elections, the other candidate for the GPT-5 release date is around DevDay in November (my favored prediction). Last year, OpenAI held its first developer conference on November 6th, which this year is the day after the elections.
Given all this info (including the incoherent parts that make sense once we understand that "GPT-5" is an arbitrary name and that non-OpenAI sources may confuse the names of coming releases), my bet is this: GPT-4.5 (possibly something else that's also a sneak preview of GPT-5) is coming in summer and GPT-5 after the elections. OpenAI will release something new in the coming months but it won't be the biggest release Altman says is coming this year. (Recent events suggest an even earlier surprise is still possible.)3
How good will GPT-5 be?
This is the question everyone's waiting for. Let me say upfront that I don't have privileged information. That doesn't mean you won't get anything from this section. Its value is twofold: first, it's a compilation of sources you may have missed, and second, it's an analysis and interpretation of the info, which can shed some further light on what we can expect. (In the "algorithmic breakthroughs" section I've gone much more in-depth on what GPT-5 may integrate from cutting-edge research. There's no official info yet on that, just clues and breadcrumbs and my self-confidence that I can follow them reasonably well.)
Over the months, Altman has given hints of his confidence in GPT-5's improvement over existing AIs. In January, during the World Economic Forum at Davos, Altman spoke privately to Korean media Maeil Business Newspaper, among other news outlets, and said this (translated with Google): "GPT2 was very bad. GPT3 was pretty bad. GPT4 was pretty bad. But GPT5 will be good." A month ago he told Fridman that GPT-4 "kinda sucks" and that GPT-5 will be "smarter," not just in one category but across the board.
People close to OpenAI have also spoken in vague terms. Richard He, via Howie Xu, said: "Most GPT-4 limitations will get fixed in GPT-5," and an undisclosed source told Business Insider that "[GPT-5] is really good, like materially better." All this information is fine, but also trivial, vague, or even unreliable (can we trust Business Insider's sources at this point?).
However, there's one thing Altman told Fridman that I believe is the most important data point we have about GPT-5's intelligence. Here's what he said: "I expect that the delta between 5 and 4 will be the same as between 4 and 3." This claim carries substantially more signal than the others. If it sounds similarly cryptic, it's because what it says isn't about GPT-5's absolute intelligence level but about its relative intelligence level, which may be trickier to analyze. In particular: GPT-3→GPT-4 = GPT-4→GPT-5.
To interpret this "equation" (admittedly still ambiguous) we need the technical means to unpack it as well as know a lot about GPT-3 and GPT-4. That's what I've done for this section (also, unless some big leak happens, this is the best we'll get from Altman). The only assumption I need to make is that Altman knows what he's talking about—he understands what those deltas imply—and that he already knows the ballpark of GPT-5's intelligence, even if it's not finished yet (just like Zuck knows Llama 3 405B checkpoint performance). From that, I've come up with three interpretations (for the sake of clarity, I've used only the model numbers, without the "GPT"):
The first reading is that the 4-5 and 3-4 deltas refer to comparable jumps across benchmark evaluations, which means that 5 will be broadly smarter than 4 as 4 was broadly smarter than 3 (this one starts tricky because it's common knowledge that evals are broken, but let's set this aside). That's surely an outcome people would be happy with, knowing that as models get better, climbing the benchmarks becomes much harder. So hard, actually, that I wonder if it's even possible. Not because AI can't become that intelligent but because such intelligence would make our human measurement sticks too short, i.e. benchmarks would be too easy for GPT-5.
The graph above is a 4 vs 3.5 comparison (3 would score lower). In some areas, 4 doesn't improve much, but in others, it's so much better that it already risks making the scores meaningless for being too high. Even if we accepted that 5 wouldn't get better at literally everything, in those areas where it did, it'd surpass the limits of what the benchmarks can offer. This makes it impossible for 5 to achieve a delta from 4 the size of 3-4. At least if we use these benchmarks.
If we assume Altman is considering harder benchmarks (e.g. SWE-bench or ARC), where both GPT-3's and GPT-4's performance is so poor (GPT-4 on SWE-bench, GPT-3 on ARC, GPT-4 on ARC), then having GPT-5 show a similar delta would be underwhelming. If you take exams made for humans instead (e.g. the SAT, the bar, APs), you can't trust that GPT-5's training data hasn't been contaminated.
The second interpretation suggests the delta refers to the non-linear "exponential" scaling laws (increases in size, data, compute) instead of linear increases in performance. This implies that 5 continues the curves delineated before by 2, 3, and 4, whatever that yields performance-wise. For instance, if 3 has 175B parameters and 4 has 1.8T, then 5 will have around 18 trillion. But parameter count is just one factor in the scaling approach, so the delta may include everything else: how much computing power they use, how much training data they feed the model, etc. (I've explored GPT-5's relationship with the scaling laws in more depth in the next section.)
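The 18-trillion figure is just a geometric extrapolation: assume the next model scales parameter count by the same multiplicative factor as the last jump. A minimal sketch (note that the GPT-4 figure is a leaked estimate, not an official number):

```python
def next_in_trend(prev_params, curr_params):
    # Continue a geometric trend: the next model scales by the same
    # multiplicative factor as the previous jump.
    factor = curr_params / prev_params
    return curr_params * factor

gpt3 = 175e9   # GPT-3: 175B parameters (public figure)
gpt4 = 1.8e12  # GPT-4: ~1.8T parameters (leaked estimate)

print(f"{next_in_trend(gpt3, gpt4):.2e}")  # ~1.85e13, i.e. around 18 trillion
```

The jump factor here is roughly 10x, which is why "around 18 trillion" follows from 1.8T; if the real GPT-4 size differs, the extrapolation moves with it.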
This is a safer claim from Altman (OpenAI controls these variables) and a more sensible one (emergent capabilities require new benchmarks for which previous data is non-existent, making the 3→4 vs 4→5 comparison impossible). However, Altman says he expects that delta, which suggests he doesn't know for sure, and this (e.g. how many FLOPs it took to train GPT-5) he would know.
The third possibility is that Altman's delta refers to user perception, i.e. users will perceive 5 to be better than 4 to the same degree that they perceived 4 to be better than 3 (ask heavy users and you will know the answer is "a damn lot"). This is a bold claim because Altman can't possibly know what we'll think, but he may be talking from experience; that's what he felt from initial evaluations and he's simply sharing his anecdotal evaluation.
If this interpretation is correct, then we can conclude GPT-5 will be impressive—if it truly feels that way to the people most used to playing with its previous versions, who are also the people with the highest expectations and for whom the novelty of the tech has faded the most. If I'm feeling generous and had to bet on which interpretation is most correct, I'd go for this one.
If I'm not feeling generous, there's a fourth interpretation: Altman is just hyping his company's next product. OpenAI has delivered in the past, but the aggressive marketing tactics have always been there (e.g. releasing Sora hours after Google released Gemini 1.5). We can default to this one to be safe, but I believe there's some truth to the above three, especially the third one.
How OpenAIâs goals shape GPT-5
Before we go further into speculation territory, let me share what I believe to be the right framing to understand what GPT-5 can and can't be, i.e. how to tell informed speculation from delusion. This serves as a general perspective for understanding the entirety of OpenAI's approach to AI. I'll make it concrete for GPT-5 because that's our topic today.
OpenAI's stated goal is AGI, which is so vague as to be irrelevant to serious analysis. Besides AGI, OpenAI has two "unofficial goals" (instrumental goals, if you will), more concrete and immediate, that are the true bottlenecks moving forward (in a technical sense; product-wise there are other considerations, like "make something people want"). These two are augmenting capabilities and reducing costs. Whatever we may hypothesize about GPT-5 must obey the need to balance the two.
OpenAI can always augment capabilities mindlessly (as long as its researchers and engineers know how), but that could yield unacceptable costs on the Azure cloud, which would strain the Microsoft partnership (already not as exclusive as it used to be). OpenAI can't afford to become a cash drain. DeepMind was Google's money pit early on, but the excuse was "in the name of science." OpenAI is focused on business and products, so they have to bring in some juicy profits.
They can always decrease costs (in different ways, e.g. custom hardware, squeezing inference times, sparsity, optimizing infra, and applying training techniques like quantization), but doing it blindly would hinder capabilities (in spring 2023 they had to drop a project codenamed "Arrakis" that aimed to make ChatGPT more efficient through sparsity because it wasn't performing well). It's better to spend more money than to lose the trust of customers—or worse, investors.
So anyway, with these two opposing requirements—capabilities and costs—at the top of OpenAI's hierarchy of priorities (just below the always-nebulous AGI), we can narrow down what to expect from GPT-5 even if we lack official information—we know they care about both factors. The balance further tilts against OpenAI if we add the external circumstances limiting their options: a GPU shortage (not as extreme as it was in mid-2023 but still present), an internet data shortage, a data center shortage, and a desperate search for new algorithms.
There's a final factor that directly influences GPT-5 and somehow pushes OpenAI to make the most capable model they can: their special spot in the industry. OpenAI is the highest-profile AI startup, in the lead economically and technically, and we hold our breath every time they release something. All eyes are on them—competitors, users, investors, analysts, journalists, even governments—so they have to go big. GPT-5 has to kill expectations and shift the paradigm. Despite what Altman said about iterative deployment and not shocking the world, in a way they have to shock the world. Even if just a little.
So despite costs and some external constraints—compute, data, algorithms, elections, social repercussions—limiting how far they can go, the insatiable hunger for augmented capabilities and the need to shock the world just a little will push them to go as far as they can. Let's see how far that might be.
Part 3: Everything we donât know about GPT-5
GPT-5 and the ruling of the scaling laws
In 2020, OpenAI devised an empirical form of the scaling laws that has defined AI companies' roadmaps since. The main idea is that three factors are enough to define and even predict model performance: model size, number of training tokens, and compute/training FLOPs. (In 2022, DeepMind refined the laws and our understanding of how to train compute-efficient models into what's known as the "Chinchilla scaling laws": the largest models are heavily undertrained; you need to scale dataset size in the same proportion you scale model size to make the most of the available compute and achieve the most performant AI.)
The bottom line of the scaling laws (either OpenAI's original form or DeepMind's revised version) implies that as your budget grows, most of it should be allocated to scaling the models (size, data, compute). (Even if the specifics of the laws are disputed, their existence, whatever the constants happen to be, is beyond doubt at this point.)
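As a rough illustration of the Chinchilla recipe, a common rule of thumb derived from the paper is ~20 training tokens per parameter, with training compute approximated as C ≈ 6ND FLOPs for N parameters and D tokens. The constants are approximate and disputed, so treat this as a sketch:

```python
def chinchilla_budget(params):
    # Chinchilla rule of thumb: compute-optimal training uses roughly
    # 20 tokens per parameter; training compute is about 6 * N * D FLOPs
    # for N parameters and D tokens. Constants are approximate.
    tokens = 20 * params
    flops = 6 * params * tokens
    return tokens, flops

for n_params in (70e9, 175e9, 1.8e12):
    tokens, flops = chinchilla_budget(n_params)
    print(f"{n_params:.1e} params -> {tokens:.1e} tokens, {flops:.1e} FLOPs")
```

Under this rule, a 175B-parameter model like GPT-3 would have wanted ~3.5 trillion training tokens, roughly an order of magnitude more than the ~300B it was actually trained on, which is the "heavily undertrained" point above.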
Altman claimed in 2023 that "we're at the end of the era where it's gonna be these giant models, and we'll make them better in other ways." One of the many ways this approach shaped GPT-4—and will surely shape GPT-5—without giving up on scale was by making it a Mixture of Experts (MoE) instead of a large dense model, as GPT-3 and GPT-2 had been.
An MoE is a clever mix of smaller specialized models (experts) that are activated depending on the nature of the input (you can imagine it as a math expert for math questions, a creative expert for writing fiction, and so on), through a gated mechanism thatâs also a neural network that learns to allocate inputs to experts. At a fixed budget, an MoE architecture improves performance and inference times compared to its smaller dense counterpart because only a tiny subset of specialized parameters is active for any given query.
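A toy sketch of the routing idea, with NumPy standing in for a real transformer layer (the expert and gate weights are random here; in a trained MoE both are learned, and routing happens per token):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is a small linear layer; the gate is a learned linear map.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route the input to its top-k experts; only those experts do any compute."""
    logits = x @ gate_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over experts
    top = np.argsort(probs)[-top_k:]           # indices of the k best-scoring experts
    weights = probs[top] / probs[top].sum()    # renormalize over the chosen experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.standard_normal(d_model))
print(y.shape)  # (8,)
```

The efficiency win is visible in the loop: with `top_k = 2` of 4 experts, only half the expert parameters touch any given input, so capacity (total parameters) grows without a proportional growth in per-query compute.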
Does Altman's claim about "the end of the era of giant models" or the shift from dense to MoE contradict the scaling laws? Not at all. It is, if anything, a smarter application of the lessons of scale by leveraging other tricks like architecture optimization (I was mistaken to criticize OpenAI for making GPT-4 an MoE). Scale is still king in generative AI (especially in language and multimodal models) simply because it works. Can you make it work even better by improving the models in other aspects? That's great!
The only way to compete at the highest level is to approach AI innovation with a holistic view: It makes no sense to heavily research a better algorithm if more compute and data can close the performance gap for you. Neither does it make sense to waste millions on H100s when a simpler architecture or an optimization technique can save you half that money. If making GPT-5 10x larger works, fine. If making it a super-MoE works, fine.
Fridman asked Altman what the main challenges to creating GPT-5 are (compute or technical/algorithmic), and Altman said: "It's always all of these." He added that the thing OpenAI does really well is that "we multiply 200 medium-sized things together into one giant thing."4
Artificial intelligence has always been a field of trade-offs, but once generative AI jumped to the market and became an industry expected to return a profit, more trade-offs were added. OpenAI is juggling all of this. Right now, the preferred heuristic for finding the better route is following Richard Sutton's advice from the Bitter Lesson, which is an informal formulation of the scaling laws. Here's how I'd summarize OpenAI's holistic approach to dealing with these trade-offs in one sentence: Believe strongly in the scaling laws but hold that opinion loosely in the face of promising research.
GPT-5 is a product of this holistic view, so it'll take the most out of the scaling laws—and anything else, as long as it brings OpenAI closer to its goals. In which ways does scale define GPT-5? My bet is simple: In all of them. Increase model size, increase the training dataset, and increase compute/FLOPs. Let's do some rough numbers.
Model size
GPT-5 will also be an MoE (AI companies are mostly making MoEs now for good reason: high performance with efficient inference. Llama 3 is an interesting exception, probably because it's intended—especially the smaller versions—to be run locally, so GPU-poors can fit it in their limited memory). GPT-5 will be larger than GPT-4 in total parameter count, which means—in case OpenAI hasn't found a better architectural design than an MoE—that GPT-5 will have either more experts or larger ones than GPT-4, whatever yields the best mix of performance and efficiency (there are other ways to add parameters, but this makes the most sense to me).
How much larger GPT-5 will be is unknown. We could naively extrapolate the parameter-count growth trend—GPT, 2018 (117M); GPT-2, 2019 (1.5B); GPT-3, 2020 (175B); GPT-4, 2023 (1.8T, estimated)—but the jumps don't correspond to any well-defined curve (especially because GPT-4 is an MoE, so it's not an apples-to-apples comparison with the others). Another reason this naive extrapolation doesn't work is that how big it makes sense to go on a new model is contingent on the size of the training dataset and the number of GPUs you can train it on (remember the external constraints I mentioned earlier: data and hardware shortages).
I've found size estimates published elsewhere (e.g. 2-5T parameters) but I believe there's not enough info to make an accurate prediction (I've calculated mine anyway to give you something juicy even if it ends up not being super accurate).
Let's see why making informed size estimates is harder than it sounds. For instance, the above 2-5T number by Alan Thompson is based on the assumption that OpenAI is using twice the compute ("10,000 → 25,000 NVIDIA A100 GPUs with some H100s") and twice the training time ("~3 months → ~4-6 months") for GPT-5 compared to GPT-4.
GPT-5 was already training in November and the final training run was still ongoing a month ago, so double the training time makes sense, but the GPU count is off. By the time they started training GPT-5, and despite the H100 GPU shortage, OpenAI had access to the majority of Microsoft Azure Cloud's compute, i.e. "10k-40k H100s." So GPT-5 could be bigger than 2-5T by a factor of up to 3x (I've written down the details of my calculations below).
Dataset size
The Chinchilla scaling laws reveal that the largest models are severely undertrained, so it makes little sense to make GPT-5 larger than GPT-4 without more data to feed the additional parameters.
Even if GPT-5 were similar in size (which I'm not betting on, but it wouldn't violate the scaling laws and could be sensible under a new algorithmic paradigm), the Chinchilla laws suggest more data alone would also yield better performance (e.g. Llama 3's 8B-parameter model was trained on 15T tokens, which makes it heavily "overtrained," yet it was still learning when they stopped the training run).
GPT-4 (1.8T parameters) is estimated to have been trained on around 12-13 trillion tokens. If we conservatively assume GPT-5 is the same size as GPT-4, then OpenAI could still improve it by feeding it up to 100 trillion tokens—if they find a way to collect that many! If it's larger, well, then they need those succulent tokens.
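Back-of-envelope, using the estimated numbers above (1.8T parameters, ~13T tokens) and the ~20 tokens-per-parameter Chinchilla heuristic; every input here is a leaked estimate, not a confirmed figure:

```python
gpt4_params = 1.8e12      # estimated GPT-4 total parameter count
gpt4_tokens = 13e12       # estimated GPT-4 training tokens
tokens_per_param = 20     # Chinchilla rule of thumb

optimal = tokens_per_param * gpt4_params
print(f"Chinchilla-optimal tokens for 1.8T params: {optimal / 1e12:.0f}T")  # 36T
print(f"Undertrained by roughly {optimal / gpt4_tokens:.1f}x")              # 2.8x
```

Under this heuristic, a GPT-4-sized model "wants" roughly 36T tokens, so going toward 100T would push it well into overtrained, Llama-3-style territory, which, as noted above, still helps.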
One option for OpenAI was to use Whisper to transcribe YouTube videos (which they've been doing against YouTube's TOS). Another option was synthetic data, which is already a commonplace practice among AI companies and will be the norm once human-made internet data "runs out." I believe OpenAI is still squeezing the last remnants of accessible data and searching for new ways to ensure the high quality of synthetic data.
(They may have found an intriguing way to do the latter to improve performance without increasing the number of pre-training tokens. I've explored that part in the "reasoning" subsection of the "algorithmic breakthroughs" section.)
Compute
More GPUs allow for bigger models and more epochs on the same dataset, both of which yield better performance (up to some point they haven't found yet). To draw a rough conclusion from this entire superficial analysis, we should focus on the one thing we know for sure changed between the August 2022-March 2023 period (the span of GPT-4's training run) and now: OpenAI's access to Azure's thousands of H100s and the subsequent increase in available FLOPs to train the next models.
Perhaps OpenAI also found a way to optimize the MoE architecture further and fit more parameters at the same training/inference cost; perhaps they found a way to turn synthetic AI-generated data into high-quality, GPT-5-worthy tokens. But we can't be sure of either. Azure's H100s, however, entail a certain edge we shouldn't ignore. If there's an AI startup getting out of the GPU shortage, it's OpenAI. Compute is where costs play a role, but Microsoft is, for now, taking care of that part as long as GPT-5 yields great results (and isn't AGI yet).
My estimate for GPT-5's size
Let's say OpenAI has used not 25k A100s, as Thompson suggests, but 25k H100s to train GPT-5 (the average of Microsoft Cloud's "10k-40k H100s" reserved for OpenAI). Rounding the numbers, H100s are 2x-4x faster than A100s for training LLMs (at a similar cost). OpenAI could train a GPT-4-sized model in one month with this amount of compute. If GPT-5 is taking them 4-6 months, then the resulting estimate for its size is 7-11T parameters (assuming the same architecture and training data). That's more than twice Thompson's estimate. But does it even make sense to make it that large, or is it better to train a smaller model on more FLOPs? We don't know; OpenAI may have made another architectural or algorithmic breakthrough this year to improve performance without increasing size.
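The arithmetic behind that 7-11T range, spelled out. Every input is an estimate: the GPU counts, the H100-over-A100 speedup (I take the midpoint of 2x-4x), and the simplifying assumption that parameter count scales linearly with compute at fixed data and architecture:

```python
gpt4_params = 1.8e12        # estimated GPT-4 size
gpt4_months_on_a100s = 3    # GPT-4's estimated training time on ~25k A100s
h100_speedup = 3            # midpoint of the 2x-4x H100-over-A100 range

# On 25k H100s, a GPT-4-scale run would take ~3 / 3 = 1 month.
months_for_gpt4_run = gpt4_months_on_a100s / h100_speedup

# If GPT-5 trains for 4-6 months on that hardware, and parameters scale
# roughly linearly with compute (same data, same architecture):
for months in (4, 6):
    params = gpt4_params * months / months_for_gpt4_run
    print(f"{months} months -> ~{params / 1e12:.0f}T parameters")
# 4 months -> ~7T, 6 months -> ~11T
```

Picking the low end of the speedup (2x) or the high end (4x) stretches the range further, which is why the estimate is soft even before questioning the inputs.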
Let's now do the analysis assuming inference is the limiting factor (Altman said in 2023 that OpenAI is constrained GPU-wise in both training and inference but he'd prefer to 10x efficiency on the latter, which is a sign that inference costs will eventually surpass training costs). With 25k H100s, OpenAI has, for GPT-5 vs GPT-4, twice as many max FLOPs, larger inference batch sizes, and the ability to do inference at FP8 instead of FP16 (half precision). This entails a 2x-8x increase in performance at inference. GPT-5 could be as big as 10-15T parameters, an order of magnitude larger than GPT-4 (if the existing parallelism configurations that distribute the model weights across GPUs at inference time don't break at that size, which I don't know). OpenAI could also choose to make it one order of magnitude more efficient, which is synonymous with cheaper (or some weighted mix of the two).
Another possibility, one I think deserves consideration given that OpenAI keeps improving GPT-4, is that part of the newly available compute will be redirected to make GPT-4 more efficient/cheaper (or even free, replacing GPT-3.5 altogether; one can dream, right?). That way, OpenAI can capture revenue from undecided users who know ChatGPT exists but are unwilling to go paid or unaware that the jump between the 3.5 free version and the 4 paid version is huge. I won't comment more on the price of the service (I'm not sure whether GPT-5 will go on ChatGPT at all) because without the exact specs, it's impossible to tell (size/data/compute is first-order uncertainty but price is second-order uncertainty). It's just business-lens speculation: ChatGPT usage isn't growing and OpenAI should do something about that.5
Algorithmic breakthroughs in GPT-5
This is the juiciest section of all (yes, even more than the last one) and, as the laws of juiciness dictate, also the most speculative. Extrapolating the scaling laws from GPT-4 to GPT-5 is doable, if tricky. Trying to predict algorithmic advances, given how opaque the field is at the moment, is the greater challenge.
The best heuristics are following OpenAI-adjacent people, lurking in alpha places with a high signal-to-noise ratio, and reading papers coming out of top labs. I only do these partially, so excuse any outlandish claims. If you've made it this far, you're too deep into my delirium anyway. So thank you for that. Here's a hint of what we can expect (i.e. what OpenAI has been working on since GPT-4):
[Image: Sam Altman's "AI capabilities" slide]
This is, of course, Altman's marketing, but we can use this structured vision to take away valuable insights.6 Some of these capabilities are heavier on the behavioral side (e.g. reasoning, agents) while others are more on the consumer side (e.g. personalization). All of them require algorithmic breakthroughs.7 The question is, will GPT-5 be the materialization of this vision? Let's break it down and make an informed guess.
Multimodality
A couple of years ago multimodality was a dream. Today, it's a must. All the top AI companies (interested in AGI or not) are working hard on giving their models the ability to capture and generate various sensory modalities. AI people like to think there's no need to replicate all of the evolutionary traits that make us intelligent, but the multimodality of the brain isn't one they can afford to exclude. Two examples of these efforts: GPT-4 can take text and images and generate text, images, and audio. Gemini 1.5 can take text, images, audio, and video and generate text and images.
The obvious question is this: Where's multimodality going? What additional sensory skills will GPT-5 (and next-gen AI models in general) have? Naively, we may think humans have five and once those are integrated, we're done. That's not true; humans actually have a few more. Are all of those necessary for AI to be intelligent? Should we implement the modes animals have that we don't? These are interesting questions, but we're talking about GPT-5, so I've stuck to the immediate possibilities; those OpenAI has hinted at having solved.
Voice Engine suggests emotional, human-sounding synthetic audio is all but solved. It's already implemented in ChatGPT, so it'll be in GPT-5 (perhaps not from the onset). The not-solved-but-almost hottest area is video generation. OpenAI announced Sora in February but didn't release it. The Information reported that Google DeepMind's CEO, Demis Hassabis, said "It may be tough for Google to catch up to OpenAI's Sora." Given Gemini 1.5's capabilities, this isn't a confession of Google's inability to ship AI stuff but an acknowledgment of how impressive a feat Sora is. Will OpenAI put it in GPT-5? They're testing first impressions among artists and at TED; it's anyone's guess what would happen once anyone can create videos of anything.
The Verge reported that Adobe Premiere Pro will integrate AI video tools, possibly with OpenAI's Sora among them. I bet OpenAI will first release Sora as a standalone model and eventually merge it with GPT-5. It'd be a nod to the "not shock the world" promise, given how much more accustomed we are to text models than video models. They will roll out access to Sora gradually, as they've done before with GPT-4 Vision, and then give GPT-5 the ability to generate (and understand) video.
Robotics
Altman doesn't mention humanoid robots or embodiment in his "AI capabilities" slide, but the partnership with Figure (and the slick demo you shouldn't believe at all, even if it's real) says it all about OpenAI's future bets in the area. (Note that multimodality isn't just about eyes and ears but also haptics and proprioception, as well as motor systems, i.e. walking and dexterity. In a way, robotics is the common factor between multimodality and agents.)
One of my most confident takes that's less accepted in AI circles is that a body is a requisite to reach the intelligence level of a human, whether it's silicon-based or carbon-based. We tend to think that intelligence lies in our brains, but that's an intellectual disservice to the critical role our bodies (and the bodies of others) play in perception and cognition. Melanie Mitchell wrote a Science review on the topic of general intelligence and said this about embodiment and socialization:
Many who study biological intelligence are also skeptical that so-called "cognitive" aspects of intelligence can be separated from its other modes and captured in a disembodied machine. Psychologists have shown that important aspects of human intelligence are grounded in one's embodied physical and emotional experiences. Evidence also shows that individual intelligence is deeply reliant on one's participation in social and cultural environments. The abilities to understand, coordinate with, and learn from other people are likely much more important to a person's success in accomplishing goals than is an individual's "optimization power."
I bet that OpenAI is coming back to robotics (we'll see to what degree GPT-5 signals this shift). They gave up on it not out of philosophical conviction (even if some members of the company still say things like "video generation will lead to AGI by simulating everything," which suggests a body is unnecessary) but out of pragmatic considerations: not enough readily available data, simulations not rich enough to extrapolate results to the real world, real-world experiments too expensive and slow, Moravec's Paradox, etc.
Perhaps they're coming back to robotics by outsourcing the work to partners focused exclusively on that. A Figure 02 robot with GPT-5 inside, capable of agentic behavior and reasoning—and walking straight—would be a tremendous engineering feat and a wonder to witness.
Reasoning
This is a big one, possibly coming with GPT-5 in an unprecedented way. Altman told Fridman GPT-5 will be broadly smarter than previous models, which is a shorter way of saying it'll be much more capable of reasoning. If human intelligence stands out from animal intelligence in one thing, it is that we can reason about stuff. Reasoning, to give you a definition, is the ability to derive knowledge from existing knowledge by combining it with new information following logical rules, like deduction or induction, so that we get closer to the truth. It's how we build mental models of the world (a hot concept in AI right now) and how we develop plans to reach goals. In short, it's how we've built the wonders around us we call civilization.
Conscious reasoning is hard. To be precise, it feels hard to us. Rightfully so, because it's cognitively harder than most other things we do; multiplying 4-digit numbers in your head is an ability reserved for the most capable minds. If it's so hard, how can naive calculators do it instantly with larger numbers than we know how to name? This goes back to Moravec's Paradox (which I just mentioned in passing). Hans Moravec observed that AI can do stuff that seems hard to us, like high-number arithmetic, very easily, yet it struggles with the tasks that seem most mundane, like walking straight.
But then, if dumb devices can do god-level arithmetic instantly, why does AI struggle to reason through novel tasks or problems much more than humans do? Why is AI's ability to generalize so poor? Why does it show superb crystallized intelligence but terrible fluid intelligence? There's an ongoing debate on whether current state-of-the-art LLMs like GPT-4 or Claude 3 can reason at all. I believe the interesting data point is that they can't reason like we do—with the same depth, reliability, robustness, or generalizability—but only "in extremely limited ways," in Altman's words. (Scoring rather high on "reasoning" benchmarks like MMLU or BIG-bench isn't the same as being capable of human-like reasoning; it can be shortcut with memorization and pattern matching, not to mention tainted by data contamination.)
We could argue it's a "skill issue" or that "Sampling can prove the presence of knowledge, but not its absence," which are both fair and valid points, but they can't quite explain GPT-4's absolute failure with e.g. the ARC challenge, which humans can solve. Evolution may have provided us with unnecessary hurdles to reason because it's an ineffective optimization process, but there's plenty of empirical evidence that suggests AI is still behind us in ways Moravec didn't predict.8
All this is to introduce you to what I believe are deep technical issues underpinning AI's reasoning flaws. The biggest factor I see is that AI companies have focused too heavily on imitation learning, i.e. taking vast amounts of human-made data from the internet and feeding huge models with it so they learn by writing like we write and solving problems like we solve problems (that's what pure LLMs do). The rationale was that by feeding AI with human data created throughout centuries, it'd learn to reason like we do, but it's not working.
There are two important limitations to the imitation learning approach: First, the knowledge on the internet is mostly explicit knowledge (know-what), but tacit knowledge (know-how) can't be accurately transmitted with words, so we don't even try—what you find online is mostly the finished product of a complex iterative process (e.g. you read my articles but you're blissfully unaware of the dozens of drafts I had to go through). (I get back to the explicit-tacit distinction in the agents section.)
Second, imitation is only one of the many tools in the human kid's learning toolkit. Kids also experiment, do trial and error, and self-play—we enjoy several means to learn beyond imitation by interacting with the world through feedback loops that update knowledge and integration mechanisms that stack it on top of existing knowledge. LLMs lack these critical reasoning tools. However, they're not unheard of in AI: It's what DeepMind's AlphaGo Zero did to destroy AlphaGo 100-0—without any human data, just playing games against itself, leveraging a combination of deep reinforcement learning (RL) and search.
Besides this powerful loop mechanism of trial and error, both AlphaGo and AlphaGo Zero have an additional feature that, once again, not even the best LLMs (GPT-4, Claude 3, etc.) have today: the ability to ponder what to do next (which is a mundane way of saying they use a search algorithm to discern between bad, good, and better options against a goal by contrasting and integrating new information with prior knowledge). The ability to distribute computing power according to the complexity of the problem at hand is something humans do all the time (DeepMind has already tested this approach with interesting results). It's what Daniel Kahneman called system 2 thinking in his popular book Thinking, Fast and Slow. Yoshua Bengio and Yann LeCun have tried to give AI "system 2 thinking" abilities.
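A toy sketch of what "pondering" buys you: spend extra inference compute proposing and checking candidates instead of committing to the first guess. The `propose` and `verify` functions here are stand-ins I made up for illustration; a real system would sample from an LLM and score with a learned verifier or a search procedure, in the spirit of AlphaGo's tree search:

```python
import random

random.seed(0)  # deterministic for the example

def propose(problem: dict) -> int:
    """Stand-in for an LLM sampling one candidate answer (here: a random guess)."""
    return random.randint(0, 100)

def verify(problem: dict, answer: int) -> bool:
    """Stand-in for a learned verifier or an exact checker."""
    return answer == problem["target"]

def solve_with_pondering(problem: dict, budget: int):
    """System-2-flavored loop: keep proposing until the verifier accepts
    or the compute budget runs out. Harder problems get a bigger budget."""
    for attempt in range(1, budget + 1):
        candidate = propose(problem)
        if verify(problem, candidate):
            return candidate, attempt
    return None, budget

answer, attempts = solve_with_pondering({"target": 42}, budget=1000)
print(answer, attempts)
```

The point of the sketch is the budget knob: a single forward pass is `budget=1`, while "pondering for a minute" is a large budget, trading inference cost for reliability, exactly the trade Noam Brown describes below.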
I believe these two features—self-play/loops/trial and error, and system 2 thinking—are promising research avenues for starting to close the reasoning gap between AIs and humans. Interestingly, the very existence of AIs that have these abilities, like DeepMind's AlphaGo Zero—also AlphaZero and MuZero (which wasn't even given the rules of the games)—contrasts with the fact that the most recent AI systems, like GPT-4, lack them. The reason is that the real world (even just the linguistic world) is much harder to "solve" than a chessboard: a game of imperfect information, ill-defined rules and rewards, and an unconstrained action space with quasi-infinite degrees of freedom is the closest to an impossible challenge you will find in science.
I believe bridging this gap between reasoning game-player AIs and reasoning real-world AIs is what all the current reasoning projects are about (I believe Gemini has something of this already, but I don't think it's shown satisfactory results yet). Evidence leads me to think OpenAI has been focused particularly on leaving behind pure imitation learning by integrating the power of search and RL with LLMs. That's what the speculation about Q* suggests and what public clues from leading researchers quietly scream. Perhaps the key person to watch at OpenAI for hints on this is Noam Brown, an expert in AI reasoning who joined the company from Meta in June 2023. In his announcement tweet he said this:
For years I've researched AI self-play and reasoning in games like Poker and Diplomacy. I'll now investigate how to make these methods truly general. If successful, we may one day see LLMs that are 1,000x better than GPT-4. In 2016, AlphaGo beat Lee Sedol in a milestone for AI. But key to that was the AI's ability to "ponder" for ~1 minute before each move ... if we can discover a general version, the benefits could be huge. Yes, inference may be 1,000x slower and more costly, but what inference cost would we pay for a new cancer drug? Or for a proof of the Riemann Hypothesis?
I guess he just lays it all out once you have the background I provided above. More recently, in a tweet that has since been deleted, he said, "You don't get superhuman performance by doing better imitation learning on human data."
In a recent talk at Sequoia, Andrej Karpathy, who left OpenAI recently, said something along the same lines:
I think people still haven't really seen what's possible in the space ... I think we've done step one of AlphaGo. We've done the imitation learning part. There's step two of AlphaGo, which is the RL, and people haven't done that yet ... this is the part that actually made it work and made something superhuman. ... The model needs to practice itself ... it needs to figure out what works for it and what does not work for it [he suggests that our teaching ways aren't adapted to the psychology of AIs].
Brown's and Karpathy's remarks on the limits of imitation learning echo something DeepMind cofounder Shane Legg said on Dwarkesh Patel's podcast, again referencing AlphaGo:
To get real creativity you need to search through spaces of possibilities and find these sorts of hidden gems [he's talking about the famous move 37 in AlphaGo's second match against Lee Sedol] ... I think current language models ... don't really do that kind of thing. They really are mimicking the data ... the human ingenuity ... that's coming from the internet.
So to go beyond imitation learning you have to integrate it with search, self-play, reinforcement learning, etc. That's what people believe Q* is. That's what I believe Q* is. There are a few papers on how to introduce search abilities into LLMs or how to generalize self-play across games, but I haven't found conclusive evidence of what exactly OpenAI is using to add reasoning skills to GPT-5.
Will Q*/GPT-5 with reasoning be as impressive as the above suggests? Yann LeCun said we should "ignore the deluge of complete nonsense about Q*," claiming that all top AI labs are working on similar things (technology converges on what's possible, so that makes sense). He accused Altman of having "a long track record of self-delusion," a criticism of Altman's words—presumably about Q*—one day before he was fired in the boardroom drama: "[for the fourth time] I've gotten to be in the room when we pushed the veil of ignorance back and the frontier of discovery forward."
But LeCun may also be trying to defend Meta's work, or perhaps he's just bitter that OpenAI got Brown, who created Libratus (Poker) and CICERO (Diplomacy) at LeCun's FAIR lab. (In favor of LeCun's warning, we should also note that Karpathy says it's not done yet and Brown was merely hinting at his future work, not something that already exists.)
As far as real results go, and with the amount of background and evidence we now have on AI reasoning, this comment by Flowers, who's a half-reliable OpenAI leaker, suggests the latest GPT-4 turbo version is OpenAI's current state of the art on this. The Information reported that Q* can solve previously unseen math problems and, as it happens, the new GPT-4 turbo has improved the most on math/code problems (math tasks give the best early signals of reasoning ability). It also makes sense that OpenAI has chosen this low-key preview to assess Q* as a reasoning-focused model through GPT-4, to make an intermediate "non-shocking" public release before giving GPT-5 this kind of intelligence.
I bet that GPT-5 will be a pure LLM with notably enhanced reasoning abilities, borrowing them from a Q*-like RL model.9 Beyond that, OpenAI will keep further exploring how to bring together these two lines of research whose complete merging remains elusive.
Personalization
I'll keep this one short. Personalization is all about empowering the user with a more intimate relationship with the AI. Users can't make ChatGPT their customized assistant to the degree they may want to. System prompts, fine-tuning, RAG, and other techniques allow users to steer the chatbot toward their desired behavior, but that's insufficient in terms of both the knowledge the AI has of the user and the control the user has over the AI (and over the data it sends to the cloud to get a response from the servers). If you want the AI to know more about you, you need to provide more data, which in turn lowers your privacy. That's a key trade-off.
AI companies need to find a compromise that satisfies both them and their customers if they don't want those customers to take their chances with open source, even if that entails more effort (Llama 3 makes that shift more attractive than ever). Is there a satisfactory middle ground between power and privacy? I don't think so; if you go big, you go cloud. OpenAI isn't even trying to make personalization GPT-5's strength. For one reason: The model will be extremely large and compute-heavy, so forget about local processing and data privacy (most enterprises won't be comfortable sending OpenAI their data).
There's something else besides privacy and on-device processing that will unlock a new level of personalization (achieved by other companies already, Google and Magic in particular, although only Google has released publicly a model with this feature): several-million-token context windows.
There's a big jump in applicability when you go from asking ChatGPT a two-sentence question to being able to fill the prompt window with a 400-page PDF that contains a decade's worth of work so that ChatGPT can help you retrieve whatever may be hidden in there. Why wasn't this available already? Because doing inference on such long input prompts was expensive in a way that became quadratically more unaffordable with every additional word you added. That's known as the "quadratic attention bottleneck." However, it seems the code has been cracked; new research from Google and Meta suggests the quadratic bottleneck is no more.
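The bottleneck in one function: the attention score matrix compares every token with every other token, so its cost grows with the square of the context length. This is a simplified count that ignores the rest of the transformer, just to show the scaling:

```python
def attention_score_flops(n_tokens: int, d_head: int) -> int:
    """FLOPs for the QK^T score matrix alone: an n x n grid of d_head-dim
    dot products, so cost grows quadratically in context length."""
    return 2 * n_tokens * n_tokens * d_head

d = 128
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_score_flops(n, d):.1e} FLOPs")
# 10x the context => 100x the attention-score compute
```

That 100x-per-10x growth is why hundred-thousand-token windows were already painful and million-token windows needed new ideas (linear or otherwise sub-quadratic attention variants) rather than just more GPUs.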
Ask Your PDF is a great app once PDFs can be infinite in length, but there's something new that is now possible with million-token windows that wasn't with hundred-thousand-token windows: the "Ask My Life" category of apps. I'm not sure what GPT-5's context window size will be, but given that a young startup like Magic seems to have achieved great results with many-million-token windows—and given Altman's explicit reference to personalization as a must-have AI capability—OpenAI must, at least, match that bet.
Reliability
Reliability is the skeptic's favorite. I think LLMs being unreliable (e.g. hallucinations) is one of the main reasons why people don't see the value proposition of generative AI clearly enough to go paid, why growth has stalled and use has plateaued, and why some experts consider them a "fun distraction" but not productivity-enhancing (and when they are, it doesn't always go well). This isn't everyone's experience with LLMs, but it's sufficiently salient that companies shouldn't deny reliability is a problem they need to tackle (especially if they expect humanity to use this technology for help in high-stakes cases).
Reliability is key for any tech product, so why is it so hard to get right with these large AI models? A conceptualization I've found useful for understanding this point is that things like GPT-5 are neither inventions nor discoveries. They're best portrayed as discovered inventions. Not even the people most closely involved in building modern AI (much less users or investors) know how to interpret what's going on inside the models once you input a query and get an output. (Mechanistic interpretability is a hot research area aimed at this problem but still in its early days. Read Anthropic's work if you're interested in this.)
It is as if GPT-5 and its ilk were ancient devices left behind by an advanced civilization and we happened to find them serendipitously in our archaeological silicon digs. They're inventions we've discovered, and now we're trying to figure out what they are, how they work, and how we can make their behavior explainable and predictable. The unreliability we perceive is merely a downstream consequence of not understanding the artifacts well. That's why this flaw remains unsolved despite costing companies millions in customer churn and enterprise doubt.
OpenAI is trying to make GPT-5 more reliable and safe with heavy guardrailing (RLHF), testing, and red-teaming. This approach has shortcomings. If we accept, as I explained above, that AI's inability to reason is because "Sampling can prove the presence of knowledge, but not its absence," we can apply the same idea to safety testing: Sampling can prove the presence of safety cracks, but not their absence. This means that no matter how much testing OpenAI does, they won't ever be sure their model is perfectly reliable or perfectly safe against jailbreaks, adversarial attacks, or prompt injections.
Will OpenAI improve GPT-5's reliability, reduce hallucinations, and close external attack vectors? The GPT-3 → GPT-4 trajectory suggests they will. Will they solve these problems? Don't count on it.
Agents
This section is, in my opinion, the most interesting of the entire article. Everything I've written up to this point matters, in one way or another, for AI agents (with special emphasis on reasoning). The big question is this: Will GPT-5 have agentic capabilities or will it be, like the previous GPT versions, a standard language model that can do many things but not make plans and act on them to achieve goals? This question is relevant for three reasons I've broken down below: First, the importance of agency for intelligence can't be overstated. Second, we know a primitive version of this is somewhat possible. Third, OpenAI has been working on AI agents.
Many people believe agency (described as the ability to reason, plan, and act autonomously over time to reach some goal, using the available resources) is the missing link between LLMs and human-level AI. Agency, even more so than pure reasoning, is the hallmark of intelligence. As we saw above, reasoning is the first step to getting there, a key ability for any intelligent agent, but not enough. Planning and acting in the real world (for AIs, a simulated environment can work well as a first approximation) are skills all humans have. Early on, we start to interact with the world in a way that reveals a capacity for sequential reasoning targeted at predefined goals. At first it's unconscious and there's no reasoning involved (e.g. a crying toddler), but as we grow it becomes a complex, conscious process.
One way to explain why agency is a must for intelligence, and why reasoning in a vacuum isn't that useful, is through the difference between explicit and tacit/implicit knowledge. Let's imagine a powerful reasoning-capable AI that experiences and perceives the world passively (e.g. a physics expert AI). Reading all the books on the web would allow the AI to absorb and then create an unfathomable amount of explicit knowledge (know-what), the kind that can be formalized, transferred, and written down in papers and books. However, no matter how smart at physics the AI might be, it'd still lack the ability to take all those formulas and equations and apply them to, say, securing funding for a costly experiment to detect gravitational waves.
Why? Because that requires understanding the socioeconomic structures of the world and applying that knowledge in novel, uncertain situations with many moving parts. That kind of applied ability to generalize goes beyond what any book can cover. That's tacit knowledge (know-how): the kind you only learn by doing and by learning directly from those who already know how to do it.10 The bottom line is this: No AI can be usefully agentic and achieve goals in the world without the ability to acquire know-how/tacit knowledge first, however great it might be at pure reasoning.11
To acquire know-how, humans do stuff. But "doing" in a way that's useful for learning and understanding requires following action plans toward goals, mediated by feedback loops, experimentation, tool use, and a way to integrate all that with the existing pool of knowledge (this is what the kind of targeted reasoning beyond imitation learning that AlphaZero performs is for). So reasoning, for an agent, is a means to an end, not an end in itself (that's why it's useless in a vacuum). Reasoning provides new explicit knowledge that AI agents then use to plan and act to acquire the tacit knowledge required to achieve complex goals. That's the quintessence of intelligence; that is AI's ultimate form.
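The plan-act-observe-integrate loop described above can be sketched abstractly. In this toy sketch, `propose_plan`, `execute`, and `goal_reached` are hypothetical stand-ins for an LLM planner, tool execution, and a goal evaluator; no real agent framework or API is implied:

```python
def run_agent(goal, propose_plan, execute, goal_reached, max_steps=10):
    """Minimal agentic loop: reason about a plan, act, observe, integrate."""
    knowledge = []  # the growing pool of (plan, observation) experience
    for _ in range(max_steps):
        plan = propose_plan(goal, knowledge)   # reasoning: explicit knowledge in
        observation = execute(plan)            # acting: feedback from the world
        knowledge.append((plan, observation))  # integrating: know-how accumulates
        if goal_reached(goal, observation):
            break
    return knowledge

# Toy run: the "world" just echoes plan + 1; the goal is an observation >= 3.
trace = run_agent(
    goal=3,
    propose_plan=lambda g, k: len(k),      # naive planner: try the next step
    execute=lambda plan: plan + 1,
    goal_reached=lambda g, obs: obs >= g,
)
print(len(trace))  # → 3
```

The interesting part is precisely what the lambdas hide: a planner that improves from the accumulated `knowledge` is the open problem, not the loop itself.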
This kind of agentic intelligence contrasts with LLMs like GPT-4, Claude 3, Gemini 1.5, or Llama 3, which are bad at carrying out plans satisfactorily (early LLM-based agentic attempts like BabyAGI and AutoGPT, or failed autonomy experiments, are evidence of that). The current best AIs are sub-agentic or, to use a more or less official nomenclature, they're AI tools (Gwern has a good resource on the AI tool vs. AI agent dichotomy).
So, how do we go from AI tools to AI agents that can reason, plan, and act? Can OpenAI close the gap between GPT-4, an AI tool, and GPT-5, potentially an AI agent? To answer that question we need to walk backward from OpenAI's current focus and beliefs on agency and consider whether there's a path from there. In particular, OpenAI seems to be convinced that LLMs, or more generally token-prediction algorithms (TPAs), an overarching term that includes models for other modalities (e.g. DALL-E, Sora, or Voice Engine), are enough to achieve AI agents.
If we are to believe OpenAI's stance, we need to first answer this other question: Can AI agents emerge from TPAs, bypassing the need for tacit knowledge or even handcrafted reasoning features?12
The rationale behind these questions is that a great AI predictor/simulator (which is theoretically possible) must have developed, somehow, an internal world model to make accurate predictions. Such a predictor could bypass the need to acquire tacit knowledge just by having a deep understanding of how the world works. For instance, you don't learn to ride a bike from books; you have to ride it. But if you could somehow predict what's going to happen next with an arbitrarily high level of detail, that might be enough to nail it on your first ride and all subsequent rides. Humans can't do that, so we need practice. But could AI?13 Let's shed some light on this before going on to real examples of AI agents, including what OpenAI is working on.
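To make the "perfect prediction replaces practice" idea concrete, here is a toy sketch of planning by prediction: if `simulate` were a flawless world model, an agent could pick good actions by search alone, never having practiced. All names and numbers here are invented for illustration; the hard (and, I'll argue, possibly unattainable) part is the flawless `simulate` itself:

```python
def plan_by_prediction(state, actions, simulate, score):
    """Choose the action whose *predicted* outcome scores best."""
    return max(actions, key=lambda a: score(simulate(state, a)))

# Toy world: the state is a number, actions shift it, the goal is to reach 10.
best = plan_by_prediction(
    state=7,
    actions=[-1, +1, +2, +5],
    simulate=lambda s, a: s + a,           # a "perfect" world model (trivial here)
    score=lambda outcome: -abs(10 - outcome),
)
print(best)  # → 2
```

In this trivial world the simulator is exact, so no trial-and-error is needed. The argument against OpenAI's bet, in these terms, is that for the real world no learnable `simulate` can be anywhere near exact.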
Token-prediction algorithms (TPAs) are extremely powerful. So powerful that the entirety of modern generative AI is built on the premise that a sufficiently capable TPA can develop intelligence.14 GPT-4, Claude 3, Gemini 1.5, and Llama 3 are TPAs. Sora is a TPA (whose creators say it "will lead to AGI by simulating everything"). Voice Engine and Suno are TPAs. Even unlikely examples like Figure 01 ("video in, trajectories out") and Voyager (an AI Minecraft player that uses GPT-4) are essentially TPAs. But a pure TPA is perhaps not the best solution for everything. For instance, DeepMind's AlphaGo and AlphaZero aren't TPAs but, as I said in the "reasoning" section, a clever combination of reinforcement learning, search, and deep learning.
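For readers who want a concrete picture of what "token prediction" means, here is the simplest possible TPA: a bigram counting model. It shares only the interface with GPT-4-class models (context in, most likely next token out); in real models a massive transformer replaces the counting table, but the objective is the same:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which token follows which: the crudest next-token predictor."""
    tokens = corpus.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the continuation seen most often in training."""
    return counts[token].most_common(1)[0][0]

model = train_bigram("the cat sat on the mat and the cat ran")
print(predict_next(model, "the"))  # → cat
```

The Sutskever quote below is, in essence, the claim that scaling this objective far enough forces the model to learn not just these surface counts but the process that generated the text.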
Can an intelligent AI agent emerge out of a GPT-5 trained like GPT-4, as a TPA, or is it the case that to make GPT-5 an agent OpenAI needs to find a completely different function to optimize, or even a new architecture? Can a (much) better GPT-4 eventually develop agentic capabilities, or does an AI agent need to be something else entirely? Ilya Sutskever, the scientific mind behind OpenAI's earlier successes, has little doubt about the power of TPAs:
… When we train a large neural network to accurately predict the next word in lots of different text from the internet … we are learning a world model … it may look on the surface that we are just learning statistical correlations in text but it turns out that to "just learn" statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produces the text. This text is actually a projection of the world … this is what's being learned by accurately predicting the next word.
Bill Peebles, one of the Sora creators, went even further in a recent talk:
As we continue to scale this paradigm [TPAs], we think eventually it's going to have to model how humans think. The only way you can generate truly realistic video with truly realistic sequences of actions is if you have an internal model of how all objects, humans, etc., environments work.
You may not buy this view, but we can safely extrapolate from Sutskever's and Peebles' arguments that OpenAI is, internal debates aside, in agreement. If successful, this approach would debunk the idea that AIs need to capture tacit knowledge or specific reasoning mechanisms to plan and act to achieve goals and be intelligent. Perhaps it's just tokens all the way.
I don't buy OpenAI's view for one reason: They don't bypass the tacit knowledge challenge. They simply move it somewhere else. Now the problem is not learning to reason, plan, and act, but simulating worlds. They want to solve, quite literally, precognition. Peebles goes over this so casually that it seems unimportant. But isn't it even harder to create a perfect predictor/simulator than an entity that can plan and act in the world? Is it even possible to create an AI that can simulate "truly realistic sequences of actions," as Peebles claimed in his talk? I don't think so: I don't think we can build that, and I don't think we could assess such an ability anyway. Perhaps OpenAI's trust in and reliance on the Bitter Lesson goes too far (or perhaps I'm wrong; we'll see).
Anyway, AI companies' options are narrow nowadays (no one knows how to build plan/act systems, although Yann LeCun keeps trying), so they're approaching the agency challenge with transformer-based TPAs in the form of LLMs (including OpenAI), whether they like it or not, because it's the best technology they have at their disposal. Let's start with existing prototypes and then jump to what we know about OpenAI's efforts.
Besides the examples I shared above (e.g. BabyAGI, AutoGPT, Voyager, etc.), there are other LLM-based agentic attempts. The first one that grabbed my attention was pre-ChatGPT. In September 2022, Adept AI announced the first version of what they called the Action Transformer, "a large-scale transformer trained to use digital tools" by watching videos of people. They released a few demos, but little beyond that. A year ago, two co-founders left the company, which isn't a good sign at all (The Information reported that Adept is preparing the launch of an AI agent in the summer. We'll see how it goes). Another young startup that has recently joined the AI agents gold rush is Cognition AI, best known as the creator of Devin, "the first AI software engineer" (which now has an open-source cousin, OpenDevin). It went well at first, but then a review video entitled "Debunking Devin" came out and went viral by exposing Cognition's overhyping of Devin's abilities. The result? Cognition had to publicly acknowledge that Devin isn't good enough to "make money taking on messy Upwork tasks."
Those are purely software agents. There's another branch, admittedly even harder to accomplish: AI agent devices. The best-known examples are the Rabbit R1 and the Humane AI Pin. The reviews of the R1 are coming out, so we'll wait for them (around the same day this post is scheduled for publication). The reviews of the Humane AI Pin came out last week and they're absolutely devastating. In case you didn't read my "Weekly Top Picks #71," you can read The Verge's review here or watch Marques Brownlee's here.
Just know that the conclusion, taking into account all the above evidence, is that LLM-based AI agents aren't a thing yet. Can OpenAI do better?
We know very little about OpenAI's attempts at agents. We know that Andrej Karpathy was "building a kind of JARVIS" before he left OpenAI (why would he leave if he was working at the best AI company on the most promising future for AI?). Business Insider reported that GPT-5 will have the "ability to call AI agents being developed by OpenAI to perform tasks autonomously," which is as vague as it gets. The Information reported some new info earlier this week:
OpenAI is quietly designing computer-using agents that could take over a person's computer and operate different applications at the same time, such as transferring data from a document to a spreadsheet. Separately, OpenAI and Meta are working on a second class of agents that can handle complex web-based tasks such as creating an itinerary and booking travel accommodations based on it.
But even if these projects succeed, this isn't really what I described above as AI agents with human-like autonomous capabilities that can plan and act to reach goals. As The Information says, companies are using their marketing prowess to dilute the concept, turning "AI agents" into a "catch-all term," instead of backing off from their ambitions or rising up to the technical challenge. OpenAI's Ben Newhouse says they're building what "could be an industry-defining zero to one product that leverages the latest and greatest from our upcoming models." We'll see about that.
As a conclusion to this subsection on agents, I believe OpenAI isn't ready to make the final jump to AI agents with its biggest release just yet. A lot of work is left to be done. TPAs, despite being the only viable approach for now (until the reasoning challenges I described above are solved), won't be enough by themselves to achieve the sought-after agentic capabilities in a way that would make people consider using them for serious projects.
I bet GPT-5 will be a multimodal LLM like those we've seen before (an improved GPT-4, if you will). It'll probably be surrounded by systems that don't exist yet in GPT-4, including the ability to connect to an AI agent model to take autonomous actions on the internet and your device (but it'll be far from the true dream of a human-like AI agent). Whereas multimodality, reasoning, personalization, and reliability are features of a system (they will all be improved in GPT-5), an agent is an entirely different entity. GPT-5 doesn't need to be an agent to enjoy the power of agency. It will likely be a kind of primitive "AI agent manager," perhaps the first we collectively recognize as such.
OpenAI will integrate GPT-5 and AI agents at the product level to test the waters. They will also not release GPT-5 and the AI agent fleet at once (as a precedent, GPT-4 and GPT-4V were separated for a while). I assume OpenAI considers the agentic capabilities harder to control than "just" a better multimodal LLM, so they will roll out AI agents much more slowly. Let me repeat, with emphasis, the above quote by Newhouse to make it clear why I believe this is the case: "We're building what … could be an industry-defining zero to one product that leverages the latest and greatest from our upcoming models [emphasis mine]." A product (AI agents) that leverages the greatest from the upcoming models (GPT-5).
In closing
So that was it.
Congratulations, you just read 14,000 words on GPT-5 and surroundings!
Hope it helped you get a better understanding not just of GPT-5 itself (we'll get the full picture once it's out) but of how to think about these things: the many parts that have to move in harmony to make it possible, and the many considerations necessary to form a clearer picture of the future.
It was a fun experiment to dive this deep into a topic (if you like super long-form articles like this, I'll do more as time permits). Hope it was fun, interesting, and useful for you as well.