Kimi K2 Thinking, Simple Bench + AI Explained app

An early insight into this fascinating new model, and a new curated benchmarking hub/app.

Kimi K2 Thinking means business, as does LM Council

But business for who?

What Happened? Well, first of all, I made you an epic benchmarking page featuring the benchmarks I follow most, with scores independently verified by Epoch AI/Scale AI*. Simple Bench results will appear there before they're updated on simple-bench.com. Kimi K2 Thinking does really well, and it's for real, but if raw intelligence is your concern, don't expect this Chinese open-weights model to forge new records, its HLE score notwithstanding (more on that below). And yes, the new hub is connected to my new, free app for comparing models, lmcouncil.ai.

The hub, with almost all scores sourced from or independently checked by Epoch AI and Scale AI. *The exceptions are the Simple Bench scores (which come from me) and the Kimi K2 Thinking scores.

  • The reason Kimi K2 Thinking is a big deal, in my opinion, is that it shows the secret sauce of closed-source models might be more tangible, and more replicable, than we thought. Open-weight Chinese models are getting closer and closer to SOTA, and show no signs (ironically, given the state) of ceasing to be open. To me, this means we might simultaneously get AGI and a stock-market bubble bursting, as AI becomes a commodity. Sure, the richest will still get the best, since they can spend more on tokens, but near-SOTA models look set to be open.

  • Now, like everyone else, Moonshot AI are choosing which headlines to focus on. Yes, the website's claim that "K2 Thinking achieved a state-of-the-art score of 44.9%" on HLE is technically true, but more accurate would be "…in the division of the test in which tools are allowed, including web search, and elsewhere we note that web search can lead to leakage." Moonshot do try to control for this, but no single score (particularly on this benchmark) can establish that a model is 'the best'. As my hub shows, without tools it is not SOTA, though not that far off. Not bad, but knowledge benchmarks speak more to practical use than to raw intelligence.

So What? One model won't change the world on its own just yet, but as GPT-3.5 showed, it can wake the world up to what is already here. The steady improvements and persistent open approach of Chinese providers say to me that models will keep getting smarter, and the pressure will only intensify on the top dogs in the West to keep galloping, lest something go pop.

Does It Change Everything? Rating = 

So WTF is lmcouncil.ai then?

What Happened? As you know, I've loved comparing language models since well before I came up with Simple Bench, and even before SmartGPT in 2023. I've always believed that comparing the best of N answers, sourced from models trained on different data, will yield an edge. So I made lmcouncil.ai, a free, consumer-friendly way of comparing models in a group chat, featuring: an auto-updated model list (it picked up Kimi K2 Thinking within 20 minutes of launch), options for running polls, image-generation comparisons, one-click speech generation, music generation, transcription, rapid roleplays (see Roles), chat code previews (see FAQ), the benchmarking hub (try the best 4 models per benchmark instantly), recommended combinations, an OG SmartGPT feature (one model automatically reflecting on the others' answers), a nano banana portal, complete background customization and much more.

My custom background. Yours, I am sure, will be different!

  • This is my first 'app', and it will soon be available via the CLI, and possibly as a mobile app. Because I am the sole developer behind it (nobly assisted by AI, of course, and I'll have lots more to say on that), I can add new features very quickly, without needing permission, so do ask!

  • Making this has really helped ground my Patreon/AI Explained videos in more nuance about the cutting edge of AI for code, so I hope that shows in the coming months. I have some blockbuster videos planned.

  • Would love for you to check it out: lmcouncil.ai

So What? Well, I might leave this one blank, as that's up to you guys to decide! haha

Does It Change Everything? Rating =

To support hype-free journalism, and to get a full suite of exclusive AI Explained videos, explainers and a Discord community of hundreds of (edit: now 1000+) truly top-flight professionals w/ networking, I would love to invite you to our newly discounted $7/month Patreon tier.