OpenAI’s Orion ‘Flagship LLM’
Few details, other than a ‘Connections Puzzle’
What Happened? We learnt that OpenAI’s next flagship LLM is code-named Orion, that the system at the centre of the Strawberry/Q-star saga will be used to generate synthetic training data for Orion, and that we could test Strawberry’s performance (possibly before the full release of Orion) in ‘the fall’.
Screenshot from The Information, linked
The system can apparently ‘solve complex word puzzles’ like NYT Connections. But GPT-4o can already solve many of the clues in such puzzles, for example, half of today’s Connections puzzle (finding sets of 4 related words in a group of 16 words).
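For anyone curious to replicate that check, here is a minimal sketch of how one might prompt GPT-4o on a Connections-style grid via the OpenAI Python SDK. The 16 example words and the prompt wording are my own illustration, not the puzzle mentioned above.

```python
# Minimal sketch: ask GPT-4o to group a 16-word Connections-style grid
# into four sets of four. The word list below is illustrative only,
# not the actual NYT puzzle referenced above.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

words = [
    "bass", "flounder", "salmon", "trout",    # fish
    "drum", "guitar", "piano", "violin",      # instruments
    "copper", "gold", "iron", "silver",       # metals
    "march", "may", "august", "june",         # months
]

prompt = (
    "Here are 16 words. Group them into four sets of four related words, "
    "and name the connection for each set:\n" + ", ".join(words)
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)

print(response.choices[0].message.content)
```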
The article also mentions that this system will likely be used within Orion, the ‘flagship LLM’ being developed now.
The sources are two people ‘familiar with the matter’, so the veracity is hard to pin down. But there is no scare-story, as with Reuters’ reporting on Q-star, so one suspects this is not a step-change, but rather an incorporation of the Let’s Verify Step by Step approach.
So What? Orion will likely be the ‘prove it’ moment for model scaling, yielding a step-change over GPT-4 or more evidence of diminishing returns. But the first public demo of a verifier by Nov-Dec? That could actually be interesting, new underlying model or not.
Does It Change Everything? Rating = ⚂
Simple Bench is born
What Happened? The ‘What-Bench’ I described in the last newsletter has finally got a name, Simple Bench, and a website, and it is attracting some positive early feedback. I am now working with a senior ML engineer (under an NDA) to both expand the benchmark and get it hosted, so we can have one-click assessments of new models via API.
Preview Website
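As a rough illustration of what ‘one-click assessments of new models via API’ could look like, here is a hedged sketch of a scoring loop. The questions file, question schema, and answer-extraction rule are hypothetical placeholders, not the actual Simple Bench harness, which remains private.

```python
# Hypothetical sketch of an automated benchmark run: load multiple-choice
# questions, query a model via the OpenAI API, and report accuracy.
# The file name, schema, and grading rule are placeholders, not the real
# Simple Bench pipeline.
import json
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(question: str, options: list[str]) -> str:
    """Ask the model a multiple-choice question and return its chosen letter."""
    letters = "ABCDEF"[: len(options)]
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}) {option}" for letter, option in zip(letters, options))
        + "\nAnswer with a single letter."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content or ""
    match = re.search(r"\b([A-F])\b", text.upper())
    return match.group(1) if match else ""

def run_benchmark(path: str = "questions.json") -> float:
    """Score the model over a JSON list of {question, options, answer} items."""
    with open(path) as f:
        items = json.load(f)
    correct = sum(ask(item["question"], item["options"]) == item["answer"] for item in items)
    return correct / len(items)

if __name__ == "__main__":
    print(f"Accuracy: {run_benchmark():.1%}")
```

Setting temperature to 0 keeps runs roughly repeatable; a hosted version would also need rate-limit handling and multiple passes per question to account for variance.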
Simple Bench tests basic reasoning - that ice cubes melt in a hot frying pan, that running to the top of a skyscraper takes longer than waving to a fan - the kind of thing humans find hilariously easy and models find stunningly tough.
Human performance, albeit from a self-selected sample of smarter folks interested in taking the test, has so far averaged 92%, but my instinct is that even a sufficiently motivated ‘man-on-the-street’ would get 80%+. The best LLM gets 27%.
And this is their home territory, don’t forget: English-language, text-based ‘reasoning’. But the benchmark is private, so there is no memorising answers, and it doesn’t rely on oversaturated basic calculations or age-old ML benchmarking. It has also been vetted, question by question, by a handful of doctoral-level researchers (2 questions didn’t make the cut).
So What? Modelling language isn’t the same as modelling reality. This is the gap Simple Bench aims to illuminate, testing spatio-temporal (space and time) scenarios, as well as situational judgement.
Does It Change Everything? Rating = tbc
Here is a recent Insiders video demonstrating some of the crucial details you need to bear in mind when testing frontier LLMs…
For the full suite of exclusive videos, podcasts, and a Discord community of hundreds of truly top-flight professionals w/ networking (in-person + online) and GenAI best-practice-sharing across 30+ fields, I would love to invite you to our growing Patreon.
Become a paying subscriber of Premium to get access to this post and other subscriber-only content.
A subscription gets you:
- Exclusive posts, with hype-free analysis.
- Sample Insider videos, hosted ad-free on YT, of the quality you have come to expect.
- Access to an experimental SmartGPT 2.0 - see multiple answers to the same prompt w/ GPT-4 Turbo, for example, then have Claude 3 Opus review its work. Community-driven - so you can take the lead.
- Support for balanced, nuanced AI commentary, with no middleman fees, for you or me. Would love one day to be in a position to have a small team of equally engaged independent researchers/journalists.