OpenAI ‘Strawberry’ and What-Bench?

90% on MATH ≠ Human Reasoning

What Happened? Leakers within OpenAI claim that the company has a model that can score 90% on the tough MATH benchmark. That performance reportedly matches that of a ‘three-time IMO gold medallist’ working under time constraints, and far surpasses the 78% OpenAI achieved last May using ‘verifier’ models to automatically vet reasoning processes.

The Let’s Verify process was thought by some to be a possible precursor to Q*, and per Reuters the ‘Strawberry project was formerly known as Q*’, so maybe I was onto something. However, the leakers go on to suggest this system exhibits human-level reasoning, which cannot be proven simply by acing a benchmark.

  • According to Bloomberg, referring to the chart above, ‘OpenAI executives told employees that the company believes it is currently on the first level … but on the cusp of reaching the second, which it calls “Reasoners.”’

  • Level 2 Reasoners ‘refer to systems that can do basic problem-solving tasks as well as a human with a doctorate-level education who doesn’t have access to any tools.’ But we have heard before, from the likes of Anthropic’s CEO Dario Amodei and OpenAI CTO Mira Murati, that current models are as intelligent as smart high-schoolers, claims that don’t hold up under scrutiny.

  • Or is this a more fundamental breakthrough? Per the article, OpenAI thinks Strawberry ‘shows some new skills that rise to human-like reasoning’. ‘The document describes a project that uses Strawberry models with the aim of enabling the company’s AI to not just generate answers to queries but to plan ahead enough to navigate the internet autonomously and reliably to perform what OpenAI terms “deep research”.’

  • For me, the bar of scepticism should be high, based on past exaggerations from OpenAI leadership. Furthermore, Reuters points out that ‘[t]he source described the plan to Reuters as a work in progress.’

So What? ‘How Strawberry works is a tightly kept secret, even within OpenAI.’ So, clearly, ascertaining much more will be difficult. But what I can say is that 90% on MATH does not mean human reasoning, even though 99% of humans couldn’t score as highly. Rather than make the trite calculator analogy, I would point you either to the next piece in the newsletter or to this piece on the reasoning gap within the MATH benchmark itself.

Does It Change Everything? Rating = (for now)

Naming a Benchmark

What Happened? Nothing. Well, except that I have been working on a new benchmark, expanding on and refining the hundreds of private questions I use to test models, spanning logic, mathematics, common sense, coding, medicine and more. I am also drawing on independent PhDs to check the questions, to avoid benchmark shenanigans. These are not tokenizer-exploiting gotchas; these are questions that get to the heart of the fundamental reasoning vs memorization limitations of LLMs.

For a sample of 30, Claude 3.5 Sonnet gets 2/30, Gemini 1.5-Pro 3/30 and GPT-4o 1/30 - all worse than random choice. Given that current models are ‘smart high-schoolers’, a human should get about the same, right? No, in my unofficial tests so far it’s around 80-90% correct.
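
As a quick sanity check on the ‘worse than random choice’ point - a minimal, purely illustrative sketch, assuming every question is four-option multiple choice like the sample further down - blind guessing should land around 7 or 8 out of 30:

```python
import random

# Expected score of a random guesser on 30 four-option questions
# (assumes exactly one correct option per question).
N_QUESTIONS, N_OPTIONS, N_TRIALS = 30, 4, 10_000

total = 0
for _ in range(N_TRIALS):
    total += sum(random.randrange(N_OPTIONS) == 0 for _ in range(N_QUESTIONS))

print(total / N_TRIALS)  # ~7.5, so scores of 1-3/30 really are below chance
```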

  • I am keeping the vast bulk of the questions private to avoid contamination, but my strategy also includes functionalization (see the sketch after this list), which would alert me if a lab were gaming the benchmark by memorising answers.

  • But for my loyal and patient newsletter readers, here is a single sample; do let me know what you would pick:

  • Beth places four whole ice cubes in a fire at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the fire was five, how many whole ice cubes are in the fire at the end of the third minute? A) 5 B) 11 C) 0 D) 20
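
On functionalization: the idea is to keep a question’s wording fixed but re-sample its numbers on every evaluation run, so the ground truth is recomputed each time and a memorised answer key stops working. A minimal, purely illustrative sketch - the toy question and names below are placeholders, not items from the benchmark:

```python
import random

# Illustrative functionalized template: fixed wording, freshly sampled parameters,
# with the ground-truth answer recomputed from those parameters on each run.
def make_question(rng: random.Random):
    apples = rng.randint(4, 9)
    eaten = rng.randint(1, 3)
    question = (
        f"Tom has {apples} apples and eats {eaten} of them. "
        "How many apples does Tom have left?"
    )
    return question, apples - eaten

q, answer = make_question(random.Random())  # new numbers every evaluation run
print(q, "->", answer)
```

A large gap between a lab’s score on the published numbers and on freshly sampled ones would be the tell.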

So What? This might only interest those who have been following my channel and my rants about benchmarks, or those who want to understand whether we are on the cusp of AGI or far from it. Though I could see it being useful for labs and businesses.

Hoping to release ‘official’ scores soon (100Qs, w/ self-consistency, zero-shot), and I would love to make the benchmark famous (no, Mo Gawdat, ChatGPT is not yet ‘smarter than Einstein’). Not sure of the name though … DUMB-bench? A little harsh, though. LLM-IQ Test? Doesn’t roll off the tongue … feel free to suggest.
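
For the scoring itself, self-consistency just means sampling each model several times on the same question and taking the majority-vote answer. A minimal sketch of that vote step - assuming each sampled output can be parsed down to a single letter; the strings below are placeholders:

```python
from collections import Counter

def majority_vote(samples: list[str]) -> str | None:
    """Self-consistency on a multiple-choice item: parse the letter choice
    from each sampled completion and return the most common one."""
    letters = [s.strip().upper()[:1] for s in samples if s.strip()]
    letters = [l for l in letters if l in "ABCD"]
    return Counter(letters).most_common(1)[0][0] if letters else None

# e.g. five zero-shot samples of the same prompt (placeholder outputs)
print(majority_vote(["B) ...", "D) ...", "B) ...", "B) ...", "A) ..."]))  # -> B
```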

Does It Change Everything? Rating = won’t rate my own work!

Check out my Insiders video below that guest-stars an Andrej Karpathy comment!

For the full suite of exclusive videos, podcasts, and a Discord community of hundreds of truly top-flight professionals w/ networking (in-person + online) and GenAI best-practice-sharing across 30+ fields, I would love to invite you to our growing Patreon. I also have a new Coursera course.

Subscribe to Premium to read the rest.

A subscription gets you:

  • Exclusive posts, with hype-free analysis.
  • Sample Insider videos, hosted ad-free on YT, of the quality you have come to expect.
  • Access to an experimental SmartGPT 2.0 - see multiple answers to the same prompt w/ GPT-4 Turbo, for example, then have Claude 3 Opus review its work. Community-driven - so you can take the lead.
  • Support for balanced, nuanced AI commentary, with no middleman fees, for you or me. Would love one day to be in a position to have a small team of equally engaged independent researchers/journalists.