
o3 steals the spotlight, but Google shipped some serious goods

o3 is a beast, but with a possible asterisk

What Happened? OpenAI announced o3 and o3-mini. I won’t re-hash the video I made here, but suffice it to say it was a big night last night. The first half of 2025 will be dominated by coverage of those models. Time to give you some alpha though, for you loyal newsletter readers: an AI Insider tested o3-mini on the 10-question public set of Simple Bench and it got 3/10 (worse than Claude 3.5 Sonnet’s 5/10). Plus more alpha: o1 gets 36.7% on the full benchmark, down slightly from o1-preview (o1 is faster to run, so likely less thinking under the hood). Could this be a sign that the o-series is slightly over-optimised on technical benchmarks?

The lighter part of the bar shows where a model only reaches that score by picking the most common of its answers, i.e. the ‘consensus’ answer.
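For the curious, the ‘consensus’ scoring described above is just majority voting over repeated runs of the same question. A minimal sketch (the function name and sample answers are illustrative, not Simple Bench’s actual harness):

```python
from collections import Counter

def consensus_answer(answers):
    """Return the most common answer across repeated runs (majority vote)."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical example: five runs of one multiple-choice question.
# The model answers correctly only 3/5 times, but the consensus is right.
runs = ["B", "C", "B", "A", "B"]
print(consensus_answer(runs))  # prints "B"
```

So a model can ‘pass’ a question by consensus even when many individual runs get it wrong, which is why the lighter part of the bar is shown separately.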

  • o3 is obviously wildly good: a 99.9th-percentile mathematician and competitive coder that also knows dozens of languages, and so on. But one quiet question for the new year will be whether it is so costly to run ($350k to beat the full private ARC test set, for example) that it represents the upper limit of what consumers will realistically be able to use in early ‘25. Not at all the limit of scaling thinking time, but - like pre-training - a sign that further progress will get costlier.

  • A counter-argument is that Sam Altman said it would be available ‘shortly after o3-mini, which is out January’. It may come inside a $2,000 tier, or be heavily rate-limited in the $200 Pro tier. If it is not heavily rate-limited, that would be a bullish signal that OpenAI have plenty more inference gas in the tank, and that o4 will be announced in Q2 of 2025.

  • Costly or not, progress on technical benchmarks does not look set to abate. Noam Brown, one of the key figures behind the o-series, said to expect that ‘the trajectory will continue’. As I highlighted in the video, the o-series represents a reproducible technique for crushing any technical benchmark. Reliability may truly be the last hurdle to mass adoption of AI in the workplace.

So What? If a model has truly uncommon sense (allowing it to crush crazy-difficult tests) but occasionally lacks common sense, what are we to make of it? That will be a question for 2025, whatever your definition of AGI is. But all eyes must turn to reliability now - e.g. dependable task automation - because if that is solved, there seem to be few fundamental limits to the complexity of the tasks that models like o3 can then complete.

Does It Change Everything? Rating =

Speaking of Simple Bench, Weights & Biases will be sponsoring a new competition on a fuller 20-question set of Simple Bench, to see if any of you can prompt your way to 20/20. That’s 10 extra questions for you to analyze. Grateful to Weights & Biases for supporting this, and yes, it’s all run on Weave.

Google should not be forgotten, Veo 2 is amazing

What Happened? For 10 or 11 of the 12 days of OpenAI’s Shipmas, Google had announcements at roughly the same time that were as impressive as, if not more impressive than, OpenAI’s, especially if you factor in price and accessibility. A great example was Veo 2, Google’s stunning new text-to-video model. I was lucky enough to get early access, and it just beats Sora almost every time. It is slightly more ‘Disney-like’ in its graphics, but it adheres to prompts much better and displays better physics. Plus, hints of a superior agent in ‘Project Mariner’, Imagen 3 and much more…

A screengrab from Veo 2 video, linked below

  • Check out the difference between these two outputs from Sora and Veo 2 on this prompt: ‘A small pygmy hippo leaves three courtiers in Henry VIII’s dining hall aghast at the sudden intrusion.’

  • While Sora’s is arguably more visually spectacular, the physics are much more ‘off’ and the prompt adherence is way down.

  • Google also demoed live, interactive screen sharing in Google AI Studio, a new SOTA agent for web-browsing (according to the WebVoyager benchmark), their own answer to o1 in Gemini-thinking, Imagen 3 inside Gemini Advanced ($20/month), Whisk for unprecedented control over those visuals, and an upgrade to NotebookLM that allows you to interact with the hosts.

So What? If it wasn’t for o3, Google would have stolen OpenAI’s Christmas. As it is, they’ll have to settle for the best text-to-video model on the market, and a host of free offerings that are genuinely handy and impressive. Things move so slowly in AI, and then so fast.

Does It Change Everything? Rating =

To support hype-free journalism, and to get a full suite of exclusive AI Explained videos, explainers and a Discord community of hundreds of (edit: now 1000+) truly top-flight professionals w/ networking, I would love to invite you to our newly discounted $7.59/month Patreon tier.