by Igor Kotenkov · Feb 19th, ‘26

The results table (in MD) & caveats are here.
Every time a new open-weights LLM drops—particularly from China—X and LinkedIn explode. Influencers claim the new model is "on par" with, or even beats, proprietary Western models. But in the vast majority of cases, these claims rely entirely on benchmarks selected by the developers for the launch.
Real power users know this is mostly noise. Unfortunately, many people still buy into the hype. Just three months ago, when Kimi K2 Thinking dropped, headlines called it a "turning point in AI" and claimed "A Chinese open-source model is #1." I called that out immediately:

That tweet hit 80k views, so I can't back down now. It's time to put it to the test.
In this post, I retrospectively evaluate Kimi K2 Thinking on benchmarks released after the model launched, which minimizes the chance of data contamination or overfitting. Then I'll explain why the results look the way they do and what that actually means for practical use.
<aside> ❓
Why Kimi K2 Thinking specifically?
It comes down to a few factors:
That said, I am confident these conclusions generalize to other open-weights models as well—including European and American releases, not just Chinese ones.
</aside>
Before we dive into the results, I want to be transparent about my process. I’ve aggregated every benchmark that surfaced on my Twitter timeline over the last three months. Since I’ve been planning this post for a while, I saved every relevant link I came across.
Theoretically, this could introduce some algorithmic bias. Perhaps I’m missing a corner of the internet where Kimi K2 Thinking is absolutely crushing it. I highly doubt that, but here is an open invitation: if you find relevant benchmarks released after November 6th, 2025, please send them my way on Twitter: link.
However, as you read the comparison below, keep in mind that this model, like most other open LLMs, simply isn't on the leaderboards for multilingual and multimodal benchmarks: open models generally lack support for non-text modalities and are trained primarily on just 2-5 languages. If we included those factors, the results would look even starker.
In total, I found 16 benchmarks on which I could compare Kimi K2 Thinking against models available at the time of its release. I intentionally left out models like GPT-5.1, which came out less than a week later; comparing against it wouldn't be fair.
I settled on comparing Kimi against two main competitors:
For context, Kimi K2 Thinking launched on November 6th, ‘25. Since Gemini 3 Pro didn't drop until November 18th, I excluded it from this analysis.
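To make the selection rule concrete, here is a minimal Python sketch of the two date filters. The benchmark records and the exact GPT-5.1 date are illustrative placeholders; only the cutoff logic and the November 6th / November 18th dates come from this post.

```python
from datetime import date

KIMI_LAUNCH = date(2025, 11, 6)  # Kimi K2 Thinking release date (from the post)

# Hypothetical benchmark records; names and dates are illustrative.
benchmarks = [
    {"name": "FreshBench", "released": date(2025, 12, 2)},
    {"name": "StaleBench", "released": date(2025, 9, 14)},
]

# Competitor release dates. Gemini 3 Pro's date is from the post;
# GPT-5.1's exact date is illustrative ("less than a week later").
models = [
    {"name": "GPT-5.1", "released": date(2025, 11, 12)},
    {"name": "Gemini 3 Pro", "released": date(2025, 11, 18)},
]

# Rule 1: keep only benchmarks published after launch, so neither the
# training data nor the launch marketing could have targeted them.
eligible_benchmarks = [b for b in benchmarks if b["released"] > KIMI_LAUNCH]

# Rule 2: keep only models already available at launch, so the comparison
# reflects what users could actually choose on day one.
eligible_models = [m for m in models if m["released"] <= KIMI_LAUNCH]

print([b["name"] for b in eligible_benchmarks])  # ['FreshBench']
print([m["name"] for m in eligible_models])      # [] -- both are excluded
```

Both GPT-5.1 and Gemini 3 Pro fail the second rule, which is exactly why they are excluded from the comparison above.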