Signal to Noise
OpenAI Fights ‘Simple’ with ‘Simple’ + Contradictory Leaks + Dev Day
Simple Bench vs SimpleQA
What Happened? OpenAI stole my benchmark name (just kidding). But really, they released SimpleQA, which does have a similar name to Simple Bench but is quite different: rather than comparing LLM and human reasoning, it tests models’ factual recall and whether their stated confidence matches their accuracy. TLDR: if a model says it’s 50-50, it’s 90% likely to be bull****ing. Coincidentally, the new Simple Bench technical note (v1) is out, which I think is far, far cooler.
Don’t listen to any model’s ‘stated confidence’ in its answer (see left chart vs actual accuracy)
Here’s a more badass-looking paper
SimpleQA asks questions like ‘Which Dutch player scored an open-play goal in the 2022 Netherlands vs Argentina game in the men’s FIFA World Cup?’. The top-performing model, both in absolute percentage correct (42.7%) and in percentage of attempted questions answered correctly (47.0%), was o1-preview. But the new Claude 3.5 Sonnet was not tested, and the older version wasn’t that far behind o1-preview.
Human performance on the full Simple Bench (you can try 10 public questions yourself) was found to be 83.7% among 9 participants, vs o1-preview’s 41.7%. I suspect human performance would be close to 0-5% on OpenAI’s SimpleQA, but of course that benchmark is more for model differentiation.
Reviewing the chart above from SimpleQA, you may have noticed that in the top-right panel, ‘answer frequency’ is much better calibrated with accuracy than ‘stated confidence’ is. To translate: do not ask models if they are confident; ask them the same question repeatedly and use the relative frequency of the answers as a gauge of confidence.
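The frequency-based approach above can be sketched in a few lines. This is a minimal illustration, not SimpleQA’s actual methodology: `sample_fn` stands in for a real LLM call (which you would make at non-zero temperature), and `fake_model` is a purely hypothetical stub for demonstration.

```python
import random
from collections import Counter

def frequency_confidence(sample_fn, question, n=10):
    """Ask the same question n times; return the most common answer
    and its relative frequency as a confidence estimate."""
    answers = [sample_fn(question) for _ in range(n)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n

# Hypothetical stand-in for an actual model call, just to show usage.
def fake_model(question):
    return random.choice(["Denzel Dumfries"] * 7 + ["Wout Weghorst"] * 3)

answer, confidence = frequency_confidence(fake_model, "Which Dutch player scored?")
```

The idea is simply that an answer the model reproduces 9 times out of 10 is more trustworthy than one it gives once, regardless of how confident the model claims to be.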
So What? A full Simple Bench video is coming soon, but my loyal newsletter readers deserve a proper preview, and what better timing than OpenAI’s release of a rival (in name only) benchmark.
Does It Change Everything? Rating = ⚂
Don’t trust messages like ‘it’s progressing well’ + London Dev Day
What Happened? OpenAI CFO Sarah Friar stated this week that, regarding internal models, it "would blow your mind to see what's coming". This comes after the Verge reported Orion/GPT-5 was done training in September, a date Sam Altman dismissed as fake news. But an exiting OpenAI employee has just said ‘that there isn’t actually a large gap between what capabilities exist in labs and what is publicly available to use’. And that’s all before we get to Google’s contradictory mood music.
OpenAI CFO Sarah Friar attended ‘research meetings’ which ‘blew her mind’, suggesting it was the combo of o1-style reasoning steps and a larger model that made the difference. But she is CFO, so let’s take this with a gallon of salt.
Sundar Pichai, CEO of Google, said on the 29th of October that ‘We’ve had two generations of Gemini model. We are working on the third generation, which is progressing well.’ But the Verge reported that they heard ‘that the model isn’t showing the performance gains the Demis Hassabis-led team had hoped for, though I would still expect some interesting new capabilities.’
Sam Altman at Dev Day replied to a question by Trenton, an AI Insider on Patreon, that ‘Without spoiling anything... I would expect rapid progress in image-based models’. More details may be coming on here soon … :)
But not much else was made known at Dev Day, at least not solidly. The theme is clear: public statements and leaks from companies have to be weighed very carefully against actual performance in actual releases. Conversely, as o1-preview showed us all, myself included, a lack of leaks and hints does not mean a big step forward isn’t imminent.
So What? Well, not much really, other than: don’t get whiplash from following ‘leaks’ too closely. I will try to sift through the noise for you as best I can, hence the newsletter name, and my new benchmark.
Does It Change Everything? Rating = ⚀
To support hype-free journalism, and to get a full suite of exclusive AI Explained videos, explainers and a Discord community of hundreds of (edit: now 1000+) truly top-flight professionals w/ networking, I would love to invite you to our newly discounted $7/month Patreon tier.