4 AI Numbers that Surprised Me This Week

A new world for RL, benchmarks underestimating the AI ceiling, an interesting but obscure puzzle, and a new era of free (?) intelligence.

4 AI Stats that Deserve Attention

What Happened? Amid the flurry of excitement over OpenClaw and GPT-5.3-Codex (almost?) wrestling the coding crown back from Claude Opus, there has been a scattering of datapoints that made me sit up and think. I don’t want you to miss them…

  • 80% of the global-economy-quaking amount of compute that goes into training frontier LLMs now goes into the post-training stage, where the generalist base models are honed against internal benchmarks on specific domains. Just one year ago (Jan 2025) Dario Amodei said this: “Importantly, because this type of RL is new, we are still very early on the scaling curve: the amount being spent on the second, RL stage is small for all players.” This change in focus, underpinning hundreds of billions of dollars of spend, is utterly under-discussed. One giveaway is…

  • 5 months ago, Claude Sonnet 4.5 scored approx 12% (pass@1, aka pass first-time) on an obscure chess puzzle benchmark made by Epoch AI. Seems like a random fact, right? Except chess is one fairly pure measure of a general kind of forward-thinking reasoning prowess. If labs didn’t specifically post-train on chess, the sophistication of chess skill in models trained on broadly the same data should closely track their general reasoning ability. But just last week Claude Opus (64k thinking) scored 10% on the same benchmark.

    This is not a claim that chess is hard (GPT 5.2 x-high got 49%), but that it really matters which data/internally-verified domains you post-train on. If your job involves data that AI companies cannot access, expect model performance on that domain to improve much more slowly. Which brings us to…

  • Noam Brown, a leading researcher at OpenAI, claims that we would need to spend $1m of inference per benchmark run to get a true measure of AI capabilities. “Accurate benchmark evaluations can require dozens of queries on hundreds of problems. So, if we want to measure a model's capability when using $1 million of inference, we might need to spend billions of dollars for each model release!” Gemini 3 Deep Think getting 85% on ARC-AGI 2 is, for him, a reflection of the compute/inference budget that system has (one which he says you can improve on by, for example, getting consensus among 10 Deep Think calls; see the sketch after this list): “you could run 10 Deep Think queries and just do consensus over them. That would be 10x the cost but would have higher performance on many benchmarks.”

    Either benchmarking budgets are gonna need to grow, or we may have to accept ignorance about the true ceiling of frontier capability. Capability which might spread quickly, because…

  • 100k - That’s the number of times a foreign operator quizzed Gemini 3 last week to elicit responses that could be used to distil knowledge/skill into a competitor model. But my thought is that with stakes this high, sophisticated actors could swarm the labs using 10k unique API keys: at just 2,000 responses per key, that’s 20m responses from each of the Top 5 models, gaining in hours 100m prompts-worth of distillable insight. Does this mean the future of intelligence is ≈ free? One irony: Google were accused of doing this to OpenAI back in 2023 to train Bard…
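For the curious, Brown’s consensus trick is simple enough to sketch. Below is a minimal, hypothetical Python sketch, not any lab’s actual harness: query_model is a stand-in for one expensive reasoning call (wire it to whatever provider you use), and we simply majority-vote over n independent answers.

```python
from collections import Counter

def query_model(prompt: str) -> str:
    """Stand-in for one expensive reasoning call (e.g. a Deep Think query).
    A real harness would hit a provider API and return the final answer."""
    raise NotImplementedError  # hypothetical: wire up your provider of choice

def consensus_answer(prompt: str, n: int = 10) -> str:
    """Self-consistency: make n independent calls, then majority-vote the answers.
    Costs n times one call, but often lifts scores on checkable benchmarks."""
    answers = [query_model(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

One caveat: this only works cleanly when the final answer is short and checkable (a chess move, an ARC grid); free-form outputs would need a judge model to pick among the n candidates instead.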

So What? Sometimes it’s worth taking a moment away from the horse-race of which model is currently winning the vibe-war, or stealing the mandate of heaven. Because each one of these datapoints could indicate trends significantly more important in the long run…

Does It Change Everything? Rating =

Big LMcouncil Update

What Happened? LMcouncil.ai (my epic free app!) got a big upgrade: higher rate limits, models that can self-chat, a new quick-generate button on all model replies (insta-turn the response into music/speech/imagery), Deepest Research (like a Council of Deep Researches, on Pro), breadcrumb navigation on the left-hand side, search through all past chats, and much more. Log in now and you will be using a massively upgraded app.

Use Code: SIGNAL for a big discount, just for you guys!

  • Bonus Fact: 50%+: You might have intuited this was true, but did you know that particular models/model-families have distinct ‘favorite’ countries, often with over 50% frequency, when there are around 200 to choose from (a sketch for reproducing this follows the list):

    If you ask models “Pick a country at random. Output only the name of that country.”

    Grok series: Japan
    GPT-5 series: Peru/Uruguay
    Claude Opus 4.6: Paraguay
    Gemini 3 Pro: Mongolia
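
If you want to reproduce this yourself, here is a minimal sketch. It assumes the official OpenAI Python client (other providers’ APIs differ); the model name is a placeholder, and your exact frequencies will vary.

```python
from collections import Counter
from openai import OpenAI  # assumes the official OpenAI Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment
PROMPT = "Pick a country at random. Output only the name of that country."

def sample_country(model: str) -> str:
    # One chat call at default settings; strip whitespace from the reply.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return resp.choices[0].message.content.strip()

# Placeholder model name: swap in whichever model/family you want to test.
counts = Counter(sample_country("gpt-5.2") for _ in range(100))
print(counts.most_common(5))  # per the stat above, one country often dominates 50%+
```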

So What? Huge for me, and could be for you, with the discount code and the ability to quiz the best models to get the best response possible (see Noam Brown above). I use it 10+ times a day, and 7k other users do too!

Does It Change Everything? Rating =

To support hype-free journalism, and to get a full suite of exclusive AI Explained videos, explainers and a Discord community of hundreds of (edit: now 1500+) truly top-flight professionals w/ networking, I would love to invite you to our newly discounted $7/month Patreon tier.