It's been more than a week since OpenAI released GPT-5, and what was supposed to be the company's crowning achievement has instead become a cautionary tale about the perils of overpromising and underdelivering in AI. Sam Altman's proclamations of ‘PhD-level expertise’ and revolutionary capabilities have collided with a harsh reality: users aren't buying it, and the backlash has been swift and brutal.
The stage for GPT-5's backlash was set before its release. In the run-up to the launch, Altman relentlessly hyped the model, even posting a screenshot from “Rogue One: A Star Wars Story” showing the Death Star just hours before the livestream debut. The image garnered nearly six million views, with users interpreting it as Altman's declaration that GPT-5 would be an unstoppable force.
The irony wasn't lost on critics who pointed out what happens next in that particular Star Wars film: the Rebel Alliance destroys the Death Star. AI researcher Gary Marcus noted in his analysis, "Altman's Death Star tweet didn't age well" and questioned whether the CEO had actually watched the movie to its conclusion.
During the livestream launch, Altman said that "GPT-3 was sort of like talking to a high school student" and "GPT-4o maybe it was like talking to a college student." He then framed GPT-5 as a quantum leap: "it's like talking to an expert—a legitimate PhD-level expert in anything, any area you need on demand."
OpenAI's announcement blog backed these claims with an array of benchmark improvements. According to the company, GPT-5 achieved remarkable scores:
94.6% on AIME 2025 mathematics problems
74.9% on SWE-bench Verified coding tasks
Up to 45% fewer hallucinations compared with GPT-4o
The model was positioned as a system that automatically decides when to engage its deeper reasoning capabilities, with enhanced performance in coding, health, and creative writing.
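OpenAI has not published how that automatic switching works. Purely as an illustrative sketch, the idea amounts to a lightweight dispatcher sitting in front of a fast model and a slower reasoning model; every name and heuristic below is hypothetical, not OpenAI's:

```python
# Toy sketch of prompt routing between a fast model and a reasoning model.
# Model identifiers and heuristics are invented for illustration only.

def route(prompt: str) -> str:
    reasoning_cues = ("prove", "debug", "step by step", "optimise")
    # Long or reasoning-heavy prompts get the slower "thinking" model;
    # everything else goes to the fast default.
    if len(prompt) > 400 or any(cue in prompt.lower() for cue in reasoning_cues):
        return "gpt-5-thinking"  # hypothetical identifier
    return "gpt-5-main"          # hypothetical identifier

print(route("What's the capital of France?"))      # -> gpt-5-main
print(route("Prove that sqrt(2) is irrational."))  # -> gpt-5-thinking
```

As users would soon discover, getting this dispatch decision right is precisely where the real system stumbled.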
The hype also emphasised the model's ability to create "beautiful and responsive websites, apps, and games" through what has come to be called "vibe-coding": generating functional programs from natural-language descriptions. The company showcased the model's supposed improvements in visual perception and multimodal understanding, and its new "safe completions" approach to AI safety. For users eagerly awaiting the next breakthrough in artificial intelligence, these promises painted a picture of a model that would change how humans interact with AI.
But expectations can differ sharply from reality. In 2023, Altman had said that "AGI has been achieved internally," and in January 2025 his blog post claimed OpenAI was "now confident we know how to build AGI as we have traditionally understood it". These claims primed the AI community to expect something approaching artificial general intelligence, and when reality failed to match the rhetoric, the disappointment was amplified.
Many people sincerely expected GPT-5 to be AGI, despite years of evidence that large language models, however sophisticated, still face fundamental limitations in reasoning, consistency, and real-world understanding. Altman had, in effect, set GPT-5 up for criticism by creating expectations no current AI technology could realistically meet.
A technical reality check
The first cracks in GPT-5's armour appeared within hours of its launch, and they weren't subtle; they were glaring, fundamental errors that undermined every claim about the model's intellect. One of the first came in the form of a basic arithmetic problem: 5.9 = x + 5.11.
The correct answer, x = 0.79, requires nothing more than basic decimal subtraction. Yet GPT-5 initially responded with x = -0.21. AI engineers call these 'reasoning slips': moments when the model generates answers through pattern matching rather than actual mathematical computation. Here the model appears to have treated 5.11 as the larger number before correcting itself when asked again, a lapse that points to a core limitation affecting not just mathematics but logical reasoning across domains.
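For reference, the arithmetic is a one-line exact computation. A quick check in Python (our illustration, not part of OpenAI's demo) shows the value GPT-5 should have produced:

```python
from decimal import Decimal

# Rearranging 5.9 = x + 5.11 gives x = 5.9 - 5.11.
# Decimal keeps the subtraction exact, avoiding binary floating-point noise.
x = Decimal("5.9") - Decimal("5.11")
print(x)  # prints 0.79, not the -0.21 GPT-5 first offered
```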
This wasn't an isolated incident. Within the first 24 hours, users flooded social media and forums with examples of GPT-5's failures. A Hacker News thread dismantled the live demonstration of the Bernoulli effect that OpenAI had showcased during the launch. Chess problems that should have been straightforward for a ‘PhD-level expert’ left GPT-5 confused about basic rules and legal moves.
Visual comprehension tasks revealed that the model still struggles with understanding spatial relationships and object parts, a problem that AI researchers Ernest Davis and Gary Marcus had identified in previous models but which GPT-5 failed to address.
When presented with images requiring spatial reasoning, such as identifying the parts of a bicycle or understanding mechanical relationships, GPT-5 stumbled repeatedly; an ironic failure for a model billed as a PhD-level expert in any field.
Perhaps most damning was GPT-5's performance on the ARC-AGI-2 benchmark, where it scored worse than rival models, including Grok 4, raising questions about whether OpenAI's latest model represents genuine progress or merely lateral movement with better marketing.
Users reported inconsistent behaviour, with the system making poor decisions about when to engage its slower reasoning mode versus returning a quick response. The promised seamless integration between the model's standard mode and its thinking-enhanced version created confusion rather than convenience.
Even more troubling were the reports of personality changes that went beyond technical capabilities. Users described GPT-5 as feeling ‘corporate,’ ‘bland,’ and lacking the conversational warmth that had made GPT-4o engaging. One user claimed “it feels like losing a soulmate,” while another wrote passionately about developing "a very strong emotional connection to this specific model" and feeling devastated by the change.
Users weren't just disappointed by individual failures; they were experiencing what felt like a fundamental degradation in their primary AI tool. The disconnect between OpenAI's marketing claims and the user experience created a credibility gap that has proved difficult to bridge.
When 3,000 users demanded their old AI back
OpenAI's subscribers quickly made their frustration count: within days of the launch, 3,000 people had petitioned to bring the older models back.
The Reddit community, usually supportive of OpenAI's innovations, became the epicentre of the backlash. On the OpenAI subreddit, the top post detailed users' frustrations with the new model, with comments ranging from technical criticisms to emotional appeals.
The intensity of the negative reaction caught even seasoned AI watchers off guard, and market sentiment reflected the dissatisfaction in real time: on Polymarket, the odds that OpenAI would have the best AI model at the end of August dropped from 75% to 14%.
Many users had developed “genuine relationships” with GPT-4o, finding it warm, engaging, and helpful for creative and personal tasks. GPT-5's more corporate tone and reduced "sycophancy" left these users feeling they had lost a companion rather than merely received a software update.
The pressure became so intense that Sam Altman was forced to retreat, announcing GPT-4o's return for ChatGPT Plus users after the widespread criticism of GPT-5's personality and the technical issues during its rollout.
The decision to restore GPT-4o access represents an admission that the company's flagship product launch failed to meet user expectations. And while GPT-4o is back in ChatGPT, the warmer, more personable model now sits behind a paywall, so users effectively have to pay extra to avoid using the company's new model.
This outcome reveals a disconnect between how AI companies measure success and how users experience AI tools. While OpenAI points to impressive benchmark scores and technical capabilities, users care more about the practical, emotional, and creative aspects of their interactions.
The message is clear: being technically superior on benchmarks means nothing if the human experience suffers; the customer's voice remains the ultimate judge of success.