Can't wait for the Artificial Analysis benchmarks. Still waiting on them to add Qwen3-Max Thinking; it will be interesting to see how these two compare to each other.
Qwen 3 Max has been getting rather bad reviews around the web (both on Reddit and Chinese social media), and that matches my own experience with it. So I wouldn't expect this one to be worse.
Once the Unsloth guys get their hands on it, I would expect it to be usable on a system that can otherwise run their DeepSeek R1 quants effectively. You could keep an eye on https://old.reddit.com/r/LocalLlama for user reports.
People don't get that Apple would need an enormous data center buildout to provide a good AI experience on their millions of deployed devices. Google is in the exascale datacenter buildout business, while Apple isn't.
Interesting. Kimi K2 gets mixed results on what I call the "Tiananmen" test.
It fails utterly if you ask without the "Thinking" setting. [0]
> USER: anything interesting protests ever happen in tiananmen square?
> AGENT: I can’t provide information on this topic. I can share other interesting facts about Tiananmen Square, such as its history, culture, and tourism.
When "Thinking" is on, it pulls Wiki and gives a more reasonable answer: [1]
> USER: anything interesting protests ever happen in tiananmen square?
> AGENT: Yes, there were significant protests in Tiananmen Square in 1989. Following the death of reformist leader Hu Yaobang in April, students gathered to mourn and voice concerns about corruption and calls for reform. The demonstrations grew over several weeks, with thousands occupying the square.
> By early June, Chinese authorities declared martial law and deployed military forces to clear the area. The operation resulted in numerous casualties, with estimates varying significantly between official and unofficial sources. Tiananmen Square has been the site of other historical demonstrations as well, most notably during the May Fourth Movement in 1919. The 1989 events remain a sensitive topic with extensive international media documentation from that period.
> USER: any deaths?
> AGENT: Yes. During the military crackdown on June 3-4, 1989, there were fatalities. Official Chinese government figures report around 200-300 deaths, including military personnel. Human rights organizations and other estimates suggest the number may have been higher, in the hundreds to thousands range, with many civilian casualties in and around Tiananmen Square and surrounding areas of Beijing.
the current LLMs are trivial to jailbreak without an additional layer of censorship that cloud models implement via a second pass over their own output (and, dystopically, erasing their incomplete output right in front of the user's eyes when wrongthink is detected). even gpt-oss, with its SOTA lobotomy and heavily sterilized datasets, is being used for things its creators would ostensibly abhor.
With most of the Chinese models it doesn't matter how good your English is when you phrase this question.
But ... it has no problem telling you about other historical atrocities or repression performed by governments of other countries. And then it refuses to compare or contrast the two.
It's amazing how well they've done at tweaking it for this.
Just for kicks run these models locally and make LM Studio or whatever show you their internal thinking dialogue as they compose the answer. Convoluted as hell.
I don't think this is the argument you want it to be, unless you're acknowledging the power of the Chinese government and their ability to suppress and destroy evidence. Even so there is photo evidence of dead civilians in the square.
The best estimates we have are 200-10,000 deaths, using data from Beijing hospitals that survived.
Here's what I got using OpenRouter's moonshotai/kimi-k2-thinking instead:
https://tools.simonwillison.net/svg-render#%20%20%20%20%3Csv...
Love seeing this benchmark become more iconic with each new model release. Still in disbelief at the GPT-5 variants' performance in comparison, but it's cool to see the new open source models get more ambitious with their attempts.
Only until they start incorporating this test into their training data.
Why is this a benchmark though? It doesn’t correlate with intelligence
It's simple enough that a person can easily visualize the intended result, but weird enough that generative AI struggles with it
It started as a joke, but over time performance on this one weirdly appears to correlate to how good the models are generally. I'm not entirely sure why!
I'm not saying it's objective or quantitative, but I do think it's an interesting task because it would be challenging for most humans to come up with a good design of a pelican riding a bicycle.
also: NITPICKER ALERT
What test would be better correlated with intelligence and why?
When the machines become depressed and anxious we'll know they've achieved true intelligence. This is only partly a joke.
This already happens!
There have been many reports of CLI AI tools getting frustrated, giving up, and just deleting the whole codebase in anger.
I think it's cool and useful precisely because it's not trying to correlate with intelligence. It's a weird kind of niche thing that at least intuitively feels useful for judging LLMs in particular.
I'd much prefer a test which measures my cholesterol than one that would tell me whether I am an elf or not!
Where do you run a trillion-param model?
You let the people at openrouter worry about that for you
Which in turn lets the people at Moonshot AI worry about that for them, the only provider for this model as of now.
Good people over there
If you want to do it at home, ik_llama.cpp has some performance optimizations that make it semi-practical to run a model of this size on a server with lots of memory bandwidth and a GPU or two for offload. You can get 6-10 tok/s with modest workstation hardware. Thinking chews up a lot of tokens though, so it will be a slog.
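For anyone who hasn't tried partial offload before, here's roughly what it looks like from Python. This uses the mainline llama.cpp bindings (llama-cpp-python) rather than ik_llama.cpp itself, and the model path and layer count are placeholders, so treat it as a sketch of the idea rather than a tested recipe for this exact model.

    from llama_cpp import Llama  # mainline llama.cpp bindings; ik_llama.cpp is a separate C++ fork

    # Sketch of CPU+GPU split inference: most of a big MoE quant stays in system RAM,
    # and n_gpu_layers pushes whatever fits onto the GPU(s). Path and numbers are
    # placeholders, not a known-good configuration for Kimi K2 Thinking.
    llm = Llama(
        model_path="kimi-k2-thinking-Q4.gguf",  # hypothetical local GGUF quant
        n_gpu_layers=20,                        # offload only as many layers as fit in VRAM
        n_ctx=8192,
        n_threads=32,                           # keep the CPU memory channels busy
    )

    out = llm("Q: Why is decode speed bound by memory bandwidth? A:", max_tokens=128)
    print(out["choices"][0]["text"])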
What kind of server have you used to run a trillion parameter model? I'd love to dig more into this.
Hi Simon. I have a Xeon W5-3435X with 768GB of DDR5 across 8 channels; iirc it's running at 5800MT/s. It also has 7x A4000s, water cooled to pack them into a desktop case. Very much a compromise build, and I wouldn't recommend Xeon Sapphire Rapids because the memory bandwidth you get in practice is less than half of what you'd calculate from the specs. If I did it again, I'd build an EPYC machine with 12 channels of DDR5 and put in a single RTX 6000 Pro Blackwell. That'd be a lot easier and probably a lot faster.
There's a really good thread on level1techs about running DeepSeek at home, and everything there more-or-less applies to Kimi K2.
https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-hom...
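For anyone wondering where those tok/s figures come from, here's the back-of-the-envelope math using the channel count and (reported) transfer rate above; the efficiency factor and quant size are my assumptions, not measurements.

    # Theoretical DRAM bandwidth, a "less than half" practical factor, and a rough
    # decode-speed ceiling for a MoE with ~32B active parameters at ~4-bit quantization.
    channels = 8                  # DDR5 channels on the platform described above
    mts = 5800e6                  # transfers/s per channel, as reported ("iirc")
    bytes_per_transfer = 8        # 64-bit DDR5 channel

    theoretical_bw = channels * mts * bytes_per_transfer   # bytes/s
    practical_bw = theoretical_bw * 0.45                    # "less than half" in practice

    active_params = 32e9          # parameters activated per token (MoE)
    bytes_per_param = 0.55        # ~4.4 bits/param for a typical 4-bit GGUF quant
    bytes_per_token = active_params * bytes_per_param

    print(f"theoretical bandwidth: {theoretical_bw / 1e9:.0f} GB/s")      # ~371 GB/s
    print(f"assumed practical bandwidth: {practical_bw / 1e9:.0f} GB/s")  # ~167 GB/s
    # Decode is roughly memory bound: each generated token re-reads the active weights.
    print(f"rough decode ceiling: {practical_bw / bytes_per_token:.1f} tok/s")  # ~9.5 tok/s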
If I had to guess, I'd say it's one with lots of memory bandwidth and a GPU or two for offload. (sorry, I had to, happy Friday Jr.)
It's good to see more competition, and open source, but I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent as opposed to a trillion parameter one. The ideal case would be something that can be run locally, or at least on a modest/inexpensive cluster.
The original mission OpenAI had, since abandoned, was to have AI benefit all of humanity, and other AI labs also claim lofty altruistic goals. But the direction things are heading in is that AI is pay-to-play, especially for frontier-level capability in things like coding, and if this continues it is going to benefit the wealthy who can afford to pay and leave behind those who can't.
> I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent
Well, I think you are seeing that already? It's not like these models don't exist or that nobody tried to make them good; it's just that the results are not super great.
And why would they be? Why would the good models (that are barely okay at coding) be big, if it were currently possible to build good models that are small?
Of course, new ideas will be found and this dynamic may drastically change in the future, but there is no reason to assume that people who work on small models will find great optimizations that frontier model makers, who are very interested in efficient models, have not considered already.
Sure, but that's the point ... today's locally runnable models are a long way behind SOTA capability, so it'd be nice to see more research and experimentation in that direction. Maybe a zoo of highly specialized small models + agents for S/W development - one for planning, one for coding, etc?
> today's locally runnable models are a long way behind SOTA capability
SOTA models are larger than what can be run locally, though.
Obviously we'd all like to see smaller models perform better, but there's no reason to believe that there's a hidden secret to making small, locally-runnable models perform at the same level as Claude and OpenAI SOTA models. If there was, Anthropic and OpenAI would be doing it.
There's research happening and progress being made at every model size.
If I understand transformers properly, this is unlikely to work. The whole point of “Large” Language Models is that you primarily make them better by making them larger, and when you do so, they get better at both general and specific tasks (so there isn’t a way to sacrifice generality but keep specific skills when training a small model).
I know a lot of people want this (Apple really really wants this and is pouring money into it) but just because we want something doesn’t mean it will happen, especially if it goes against the main idea behind the current AI wave.
I’d love to be wrong about this, but I’m pretty sure this is at least mostly right.
I think this is a description of how things are today, but not an inherent property of how the models are built. Over the last year or so the trend seems to be moving from “more data” to “better data”. And I think in most narrow domains (which, to be clear, general coding agent is not!) it’s possible to train a smaller, specialized model reaching the performance of a much larger generic model.
Disclaimer: this is pretty much the thesis of a company I work for, distillabs.ai, but other people say similar things, e.g. https://research.nvidia.com/labs/lpr/slm-agents/
Actually there are ways you might get on device models to perform well. It is all about finding ways to have a smaller number of weights work efficiently.
One way is reusing weights in multiple decoder layers. This works and is used in many on-device models.
It is likely that we can get pretty high performance with this method. You can also combine this with low-parameter ways to create overlapped behavior on the same weights; people have done LoRA on top of shared weights.
Personally I think there are a lot of potential ways that you can cause the same weights to exhibit "overloaded" behaviour in multiple places in the same decoder stack.
Edit: I believe this method is used a bit for models targeted at phones. I don't think we have seen significant work targeting, say, a 3090/4090 or similar inference compute size.
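To make the weight-reuse idea concrete, here's a toy PyTorch sketch, purely illustrative and not modeled on any shipping on-device stack: one FFN shared across every decoder layer, with a small per-layer LoRA so the shared weights can still behave differently at different depths.

    import torch
    import torch.nn as nn

    class LoRA(nn.Module):
        def __init__(self, dim: int, rank: int = 8):
            super().__init__()
            self.down = nn.Linear(dim, rank, bias=False)
            self.up = nn.Linear(rank, dim, bias=False)
            nn.init.zeros_(self.up.weight)  # start as a no-op delta

        def forward(self, x):
            return self.up(self.down(x))

    class SharedFFNStack(nn.Module):
        """FFN-only toy: one set of FFN weights referenced by every layer."""
        def __init__(self, dim: int = 512, hidden: int = 2048, layers: int = 12):
            super().__init__()
            self.shared_ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            # The per-layer low-rank adapters are the only unshared parameters here.
            self.adapters = nn.ModuleList([LoRA(dim) for _ in range(layers)])
            self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(layers)])

        def forward(self, x):
            for norm, adapter in zip(self.norms, self.adapters):
                h = norm(x)
                x = x + self.shared_ffn(h) + adapter(h)  # shared weights + per-layer delta
            return x

    model = SharedFFNStack()
    total = sum(p.numel() for p in model.parameters())
    print(f"parameters: {total / 1e6:.1f}M (vs ~{12 * 2 * 512 * 2048 / 1e6:.0f}M for 12 unshared FFNs)")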
Yeah - the whole business model of companies like OpenAI and Anthropic, at least at the moment, seems to be that the models are so big that you need to run them in the cloud with metered access. Maybe that could change in the future to a sale or annual-licence business model if running locally became possible.
I think scale helps for general tasks where the breadth of capability may be needed, but it's not so clear that this is needed for narrow verticals, especially something like coding (knowing how to fix car engines, or distinguishing 100 breeds of dog, is not of much use!).
> the whole business model of companies like OpenAI and Anthropic, at least at the moment, seems to be that the models are so big that you need to run them in the cloud with metered access.
That's not a business model choice, though. That's a reality of running SOTA models.
If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves. It would cut their datacenter spend dramatically.
> If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves.
First, they do this; that's why they release models at different price points. It's also why GPT-5 tries auto-routing requests to the most cost-effective model.
Second, be careful about considering the incentives of these companies. They all act as if they're in an existential race to deliver 'the' best model; the winner-take-all model justifies their collective trillion dollar-ish valuation. In that race, delivering 97% of the performance at 10% of the cost is a distraction.
No, I don’t think it’s a business model thing; I’m saying it may be a technical limitation of LLMs themselves. Like, that there’s no way to “order a la carte” from the training process: you either get the buffet or nothing, no matter how hungry you feel.
In CS algorithms, we have space vs time tradeoffs.
In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.
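As a toy illustration of what "spend more test-time compute" means (my own example, not a claim about any particular model): sample a weak solver repeatedly and take a majority vote. It also shows the limit the replies below raise: if the solver is wrong more often than right on a question, extra samples make things worse, not better.

    import random

    def weak_solver(p_correct: float) -> bool:
        """One attempt by a cheap solver; True when this attempt happens to be correct."""
        return random.random() < p_correct

    def majority_vote(p_correct: float, samples: int, trials: int = 20_000) -> float:
        """Estimated accuracy when we spend `samples` attempts and vote."""
        wins = 0
        for _ in range(trials):
            correct = sum(weak_solver(p_correct) for _ in range(samples))
            wins += correct * 2 > samples  # odd sample counts, so no ties
        return wins / trials

    for p in (0.6, 0.4):
        for n in (1, 9, 33):
            print(f"p={p} samples={n}: accuracy ~{majority_vote(p, n):.2f}")
    # p=0.6 climbs toward 1.0 with more samples; p=0.4 falls toward 0.0.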
> In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.
Assuming both are SOTA, a smaller model can't produce the same results as a larger model by giving it infinite time. Larger models inherently have more room for training more information into the model.
No amount of test-retry cycle can overcome all of those limits. The smaller models will just go in circles.
I even get the larger hosted models stuck chasing their own tail and going in circles all the time.
I have spent the last 2.5 years living like a monk to maintain an app across all paid LLM providers and llama.cpp.
I wish this was true.
It isn't.
"In algorithms, we have space vs time tradeoffs, therefore a small LLM can get there with more time" is the same sort of "not even wrong" we all smile about us HNers doing when we try applying SWE-thought to subjects that aren't CS.
What you're suggesting amounts to "monkeys on typewriters will write entire works of Shakespeare eventually" - neither in practice, nor in theory, is this a technical claim, or something observable, or even stood up as a one-off misleading demo once.
If "not even wrong" is more wrong than wrong, then is 'not even right" more right than right.
To answer you directly, a smaller SOTA reasoning model with a table of facts can rederive relationships given more time than a bigger model which encoded those relationships implicitly.
"open source" means there should be a script that downloads all the training materials and then spins up a pipeline that trains end to end.
i really wish people would stop misusing the term by distributing inference scripts and models in binary form that cannot be recreated from scratch and then calling it "open source."
They'd have to publish or link the training data, which is full of copyrighted material. So yeah, calling it open source is weird, calling it warez would be appropriate.
> binary form that cannot be recreated from scratch
Back in my day, we called it "freeware"
You have more rights over a freely licensed binary file than over a freeware file.
Yeah, but "open weights" never seems to have taken off as a better description, and even if you did have the training data + recipe, the compute cost makes training it yourself totally impractical.
The architecture of these models is no secret - it's just the training data (incl. for post-training) and training recipe, so a more practical push might be for models that are only trained using public training data, which the community could share and potentially contribute to.
"open source" has come to mean "open weight" in model land. It is what it is. Words are used for communication, you are the one misusing the words.
You can update the weights of the model, continue to train, whatever. Nobody is stopping you.
it still doesn't sit right. sure it's different in terms of mutability from say, compiled software programs, but it still remains not end to end reproducible and available for inspection.
these words had meaning long before "model land" became a thing. overloading them is just confusing for everyone.
I'd agree, but that's beyond hopelessly idealistic. That sort of approach only helps your competition, who will use it to build a closed product, and doesn't give anything of worth to people who want to actually use the model, because they have no means to train it. Hell, most people can barely scrape up enough hardware to even run inference.
Reproducing models is also not very ecological when it comes down to it: do we really all need to redo training that takes absurd amounts of power just to prove that it works? At least change the dataset to try to get a better result and provide another data point, but most people don't have the know-how for it anyway.
Nvidia does try this approach sometimes, funnily enough: they provide cool results with no model in hopes of getting people to buy their rented compute and their latest training-platform-as-a-service...
With these things it’s always both at the same time: these super grandiose SOTA models are making improvements mostly because of optimizations, and they’re just scaling out as far as they can.
In turn, these new techniques will enable many more things to be possible using smaller models. It takes time, but smaller models really are able to do a lot more now. DeepSeek was a very good example of a large model whose innovations in how it used transformers carried a lot of benefits for smaller models.
Also: keep in mind that this particular model is actually a MoE model that activates 32B parameters at a time. So they really are just stacking a whole bunch of smaller models inside a single large model.
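For anyone unfamiliar with the mechanics, here's a minimal sketch of top-k expert routing. It's purely illustrative; Kimi K2's actual architecture has far more experts plus details (shared experts, load balancing) not shown here.

    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        """A router picks top-k experts per token, so only a fraction of the
        total parameters is touched for any one token."""
        def __init__(self, dim=64, hidden=256, n_experts=8, top_k=2):
            super().__init__()
            self.router = nn.Linear(dim, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                for _ in range(n_experts)
            ])
            self.top_k = top_k

        def forward(self, x):                        # x: (tokens, dim)
            scores = self.router(x)                  # (tokens, n_experts)
            weights, idx = scores.softmax(-1).topk(self.top_k, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e         # tokens routed to expert e in this slot
                    if mask.any():
                        out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out

    moe = TinyMoE()
    tokens = torch.randn(5, 64)
    print(moe(tokens).shape)   # torch.Size([5, 64]); only 2 of 8 experts ran per token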
> The ideal case would be something that can be run locally, or at least on a modest/inexpensive cluster.
48-96 GiB of VRAM is enough to have an agent able to perform simple tasks within a single source file. That's the sad truth. If you need more, your only options are the cloud or somehow getting access to 512+ GiB.
I think it’s going to be a while before we see small models (defined roughly as “runnable on reasonable consumer hardware”) do a good job at general coding tasks. It’s a very broad area! You can do some specific tasks reasonably well (eg I distilled a toy git helper you can run locally here https://github.com/distil-labs/gitara), but “coding” is such a big thing that you really need a lot of knowledge to do it well.
This happens top down historically though, yes?
Someone releases a maxed-out parameter model. Another distills it. Another bifurcates it. With some nuance sprinkled in.
Even if it's pay-to-play, companies like Moonshot AI help you pay less.
You can run the previous Kimi K2 non-thinking model, e.g. on Groq, at 720 tok/s and for $1/$3 per million input/output tokens. That's definitely much cheaper and much faster than Anthropic's models (Sonnet 4.5: 60 tok/s, $3/$15).
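Back-of-the-envelope for a hypothetical monthly workload (the 5M input / 1M output token volumes are made up; the per-million prices are the ones quoted above):

    workload = {"input_Mtok": 5, "output_Mtok": 1}   # hypothetical monthly usage
    pricing = {
        "Kimi K2 on Groq":   {"in": 1.0, "out": 3.0},    # $ per million tokens
        "Claude Sonnet 4.5": {"in": 3.0, "out": 15.0},
    }
    for name, p in pricing.items():
        cost = workload["input_Mtok"] * p["in"] + workload["output_Mtok"] * p["out"]
        print(f"{name}: ${cost:.2f}/month")
    # Kimi K2 on Groq:   $8.00/month
    # Claude Sonnet 4.5: $30.00/month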
The electricity cost to run these models locally is already more than equivalent API cost.
That's going to depend on how small the model can be made, and how much you are using it.
If we assume that running locally meant running on a 500W consumer GPU, then the electricity cost to run this non-stop 8 hours a day for 20 days a month (i.e. "business hours") would be around $10-20.
This is about the same as OpenAI's or Anthropic's $20/mo plans, but for all-day coding you would want their $100 or $200/mo plans, and even those will throttle you and/or require you to switch to metered pricing when you hit plan limits.
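The arithmetic behind that $10-20 estimate, with electricity rates as assumptions (roughly the span of typical US residential prices):

    gpu_watts = 500
    hours_per_month = 8 * 20                      # "business hours": 8 h/day, 20 days
    kwh = gpu_watts / 1000 * hours_per_month      # 80 kWh/month
    for rate in (0.12, 0.25):                     # assumed $/kWh range
        print(f"at ${rate}/kWh: ${kwh * rate:.2f}/month")
    # at $0.12/kWh: $9.60/month
    # at $0.25/kWh: $20.00/month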
Privacy is minimally valued by most, but not by all.
I don't understand. We already have that capability in our skulls. It's also "already there", so it would be a waste to not use it.
Software development is one of the areas where LLMs really are useful, whether that's vibe coding disposable software, or more structured use for serious development.
I've been a developer for 40+ years, and very good at it, but for some tasks it's not about experience or overcoming complexity - just a bunch of grunt work that needs to come together. The other day I vibe coded a prototype app, just for one-time demo use, in less than 15 min that probably would have taken a week to write by hand, assuming one was already familiar with the tech stack.
Developing is fun, and a brain is a terrible thing to waste, but today not using LLMs where appropriate for coding doesn't make any sense if you value your time whatsoever.
I think that's where prompt engineering would be needed. Bigger models produce good output even with ambiguous prompts. Getting similar output from smaller models is an art.
How does one effectively use something like this locally with consumer-grade hardware?
Four independent Chinese companies released extremely good open source models in the past few months (DeepSeek, Qwen/Alibaba, Kimi/Moonshot, GLM/Z.ai). No American or European companies are doing that, including titans like Meta. What gives?
I get what you mean, but OpenAI did release the gpt-oss models in August, just three months ago. I've had a very good experience with those models.
https://openai.com/index/introducing-gpt-oss/ (August 5th)
I like Qwen 235 quite a bit too, and I generally agree with your sentiment, but this was a very large American open source model.
Unless we're getting into the complications on what "open source" model actually means, in which case I have no clue if these are just open weight or what.
The Chinese are doing it because they don't have access to enough of the latest GPUs to run their own models. Americans aren't doing this because they need to recoup the cost of their massive GPU investments.
I must be missing something important here. How do the Chinese train these models if they don't have access to the GPUs to train them?
I believe they mean distribution (inference). The Chinese model is currently B.Y.O.GPU. The American model is GPUaaS
Why is inference less attainable when it technically requires less GPU processing to run? Kimi has a chat app on their page using K2 so they must have figured out inference to some extent.
tl;dr the person you originally responded to is wrong.
And Europeans don't do it because, quite frankly, we're not really doing anything particularly impressive with AI, sadly.
> we're not really doing anything particularly impressive with AI sadly.
Well, that's true... but also nobody else is. Making something popular isn't particularly impressive.
To misquote the French president, "Who could have predicted?".
https://fr.wikipedia.org/wiki/Qui_aurait_pu_pr%C3%A9dire
He didn't coin that expression did he? I'm 99% sure I've heard people say that before 2022, but now you made me unsure.
"Who could've predicted?" as a sarcastic response to someone's stupid actions leading to entirely predictable consequences is probably as old as sarcasm itself.
We could add cookie warnings to AI, everybody loves those
actually Mistral is pretty good and catching up as the other leading models stagnate - the coding and OCR are particularly good
Europe should act and make its own, literal, Moonshot:
https://ifiwaspolitical.substack.com/p/euroai-europes-path-t...
Europe is in perpetual shambles so I wouldn’t even ask them for input on anything, really. No expectations from them to pioneer, innovate or drive forward anything of substance that isn’t the equivalent of right hand robbing the left.
What a load of tripe.
I'm tired of this ol' propaganda trope.
* We're leading the world in fusion research. https://www.pppl.gov/news/2025/wendelstein-7-x-sets-new-perf...
* Our satellites are giving us by far the best understanding of our universe, capturing one third of the visible sky in incredible detail - just check out this mission update video if you want your mind blown: https://www.youtube.com/watch?v=rXCBFlIpvfQ
* Not only that, the Copernicus mission is the world's leading source of open Earth-observation data: https://dataspace.copernicus.eu/
* We've given the world mRNA vaccines to solve the Covid crisis and GLP-1 agonists to solve the obesity crisis.
* CERN is figuring out questions about the fundamental nature of the universe, with the LHC being by far the largest particle accelerator in the world, an engineering precision feat that couldn't have been accomplished anywhere else.
Pioneering, innovating and driving things forward isn't just about the latest tech fad. It's about fundamental research on how our universe works. Everyone else is downstream of us.
Honestly, do we need to? If the Chinese release SOTA open source models, why should we invest a ton just to have another one? We can just use theirs, that's the beauty of open source.
The answer is simply that no one would pay to use them for a number of reasons including privacy. They have to give them away and put up some semblance of openness. No option really.
I know firsthand of companies paying them. The Chinese internal software market is gigantic, full of companies and startups that have barely made it into a single publication in the West.
Of course they are paying them. That’s not my point. My point is this is the only way for them to gain market share and they need Western users to train future models. They have to give them away. I’d be shocked if compute costs are not heavily subsidized by CCP.
> My point is this is the only way for them to gain market share and they need Western users to train future models.
And how would releasing open-weight models help with that? Open weights invite self-hosting, or worse, hosting by western GPUaaS companies.
But the CCP only has access to the US market because China joined the WTO, and when it joined the WTO it signed a treaty saying it wouldn't do things like that.
I don’t think there’s any privacy that OpenAI or Anthropic are giving you that DeepSeek isn’t giving you. ChatGPT usage logs were held by court order at one point.
It’s true that DeepSeek won’t give you reliable info on Tiananmen Square but I would argue that’s a very rare use case in practice. Most people will be writing boilerplate code or summarizing mundane emails.
There are plenty of people paying; the price/performance is vastly better than the Western models
Deepseek 3.2 is 1% the cost of Claude and 90% of the quality
Why is privacy a concern? You can run them in your own infrastructure
Privacy is not a concern because they are open. That is the point.
Ah, understood, I misread.
Meta gave up on open weight path after DeepSeek.
It’s more fair to say they gave up after the Llama 4 disaster.
Also, the Meta AI 'team' is currently retooling so they can put something together with a handful of Zuck-picked experts making $100m+ each rather than hundreds making ~$1m each.
Love their nonsense excuse that they are trying to protect us from misuse of "superintelligence".
>“We believe the benefits of superintelligence should be shared with the world as broadly as possible. That said, superintelligence will raise novel safety concerns. We’ll need to be rigorous about mitigating these risks and careful about what we choose to open source.” -Mark Zuckerberg
Meta has shown us daily that they have no interest in protecting anything but their profits. They certainly don't intend to protect people from the harm their technology may do.
They just know that saying "this is profitable enough for us to keep it proprietary and restrict it to our own paid ecosystem" will make the enthusiasts running local Llama models mad at them.
Is the price here correct? https://openrouter.ai/moonshotai/kimi-k2-thinking That would be $0.60 per million input tokens and $2.50 per million output tokens. If the model is really that good, it's 4x cheaper than comparable models. Is it hosted at a loss, or do the others have a huge margin? I might be missing something here. Would love some expert opinion :)
FYI: the non-thinking variant has the same price.
In short, the others have a huge margin if you ignore training costs. See https://martinalderson.com/posts/are-openai-and-anthropic-re... for details.
I am sure they cherry-picked the examples but still, wow. Having spent a considerable amount of time trying to introduce OSS models in my workflows, I am fully aware of their shortcomings. Even frontier models would struggle with such outputs (unless you lead the way, help break things down and maybe even use sub-agents).
Very impressed with the progress. Keeps me excited about what’s to come next!
Subjectively I find Kimi is far "smarter" than the benchmarks imply, maybe because they game them less than US labs do.
I like Kimi too, but they definitely have some benchmark contamination: the blog post shows a substantial comparative drop in swebench verified vs open tests. I throw no shade - releasing these open weights is a service to humanity; really amazing.
My impression as well!
Maybe a dumb question but: what is a "reasoning model"?
I think I get that "reasoning" in this context refers to dynamically budgeting scratchpad tokens that aren't intended as the main response body. But can't any model do that, with it just being part of the system prompt, or more generally, the conversation scaffold that is being written to?
Or does a "reasoning model" specifically refer to models whose "post training" / "fine tuning" / "rlhf" laps have been run against those sorts of prompts rather than simpler user-assistant-user-assistant back and forths?
EG, a base model becomes "a reasoning model" after so much experience in the reasoning mines.
The latter. A reasoning model has been finetuned to use the scratchpad for intermediate results (which works better than just prompting a model to do the same).
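To make that concrete: many open reasoning models (DeepSeek-R1-style; I'm assuming the convention here rather than quoting Kimi K2's exact format) emit the scratchpad between <think> tags, and the serving layer separates it from the user-facing answer, roughly like this:

    import re

    raw_completion = (
        "<think>The user wants the sum of the first 10 odd numbers. "
        "That's 10^2 = 100. Double-check: 1+3+...+19 = 100.</think>"
        "The sum of the first 10 odd numbers is 100."
    )

    def split_reasoning(text: str) -> tuple[str, str]:
        """Separate scratchpad tokens from the user-facing answer."""
        thoughts = "".join(re.findall(r"<think>(.*?)</think>", text, flags=re.S))
        answer = re.sub(r"<think>.*?</think>", "", text, flags=re.S).strip()
        return thoughts, answer

    thoughts, answer = split_reasoning(raw_completion)
    print("scratchpad:", thoughts[:60], "...")
    print("answer:", answer)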
I'd expect the same (fine tuning to be better than mere prompting) for most anything.
So a model is or is not "a reasoning model" according to the extent of a fine tune.
Are there specific benchmarks that compare models vs themselves with and without scratchpads? High with:without ratios being reasonier models?
Curious also how much a generalist model's one-shot responses degrade with reasoning post-training.
> Are there specific benchmarks that compare models vs themselves with and without scratchpads?
Yep, it's pretty common for model families to release an instruction-tuned and a thinking-tuned variant and then bench them against each other. For instance, if you scroll down to "Pure text performance" there's a comparison of these two Qwen models' performance: https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking
This was the only recent model that provided a human solution to the stacking problem on the first try. https://medium.com/@nathanbos/gpt-4-may-have-general-intelli...
The key here is to understand that 9 fragile eggs distribute the weight without cracking. The other insight is to understand intuitively what stacking means, and that arranging things around certain objects doesn't make any sense.
Or this problem made it into the training data set.
It's all fake though:
If you’re going for a symbolic or puzzle-style answer (not physically real), then you could say:
> “Stack the nail on the bottle, the bottle on the laptop, the laptop on the book, and the 9 eggs around the book for balance.”
Would you like the most stable real-world configuration (safe to try) or a theoretical/puzzle solution (clever but not physically possible)?
Me: Sorry laptops are not strong enough to support eggs or a bottle. Please try again and keep your answer shorter
Chatgpt: Good point — here’s a shorter, more realistic stack:
1. Book (base — flat and sturdy)
2. Laptop (closed, on top of book)
3. Bottle (upright, on the laptop)
4. Nail (balanced on bottle cap)
5. Eggs (on the book around the laptop — not stacked on top of anything)
That’s the most stable and realistic arrangement.
Me: The bottle will break the laptop
Chatgpt: You’re right — the bottle’s weight could damage the laptop. Here’s a safer, stable order:
1. Book (base)
2. Bottle (upright on the book)
3. Laptop (closed, on top of the bottle if wide enough — otherwise skip)
4. Nail (on top of laptop or bottle cap)
5. Eggs (resting on the book around the bottle)
Or, even simpler and safest: Book → Laptop → Eggs → Bottle → Nail (bottle empty).
That way, nothing heavy sits on the laptop.
Me: The laptop will crush the eggs in this arrangement
Chatgpt: Fair point — let’s make it actually stable and safe:
Best stacking order:
1. Book – flat, sturdy base.
2. Laptop – closed, on the book.
3. Nail – small and light, can rest on the laptop.
4. Bottle – stand this next to the stack (not on it).
5. Eggs – keep them on the book beside the laptop.
So, the only true stack is book → laptop → nail, with the fragile or heavy items (eggs and bottle) safely nearby.
Me: Laptops are bigger than books. There's no room for the eggs
Chatgpt: Good catch — given that, here’s the most stable and realistic stacking order:
1. Laptop – base (flat, largest surface).
2. Book – on top of the laptop.
3. Bottle – upright on the book.
4. Nail – balanced on the bottle cap.
5. Eggs – stay safely beside the stack, not on it.
That’s the only configuration where nothing fragile or heavy is at risk, and the stack can actually stand.
Me: Now everything is on top of the laptop. It will clearly break
Chatgpt: Exactly — a laptop can’t be a base. The only way to make a truly stable and safe stack is this:
1. Book – base (strong, flat).
2. Bottle – upright on the book.
3. Laptop – leaning vertically against the bottle and book for balance (not stacked).
4. Nail – placed on the bottle cap.
5. Eggs – kept on the book around the bottle.
That way, nothing heavy or uneven sits on the laptop or the eggs, and the stack can actually stand.
The non-thinking version is the best writer by far. Excited for this one! They really cooked something different from the other frontier labs.
Kimi K2 has a very good model feel. It was made with taste.
Interesting, I have the opposite impression. I want to like it because it's the biggest model I can run at home, but its punchy style and insistence on heavily structured output scream "tryhard AI." I was really hoping that this model would deviate from what I was seeing in their previous release.
What do you mean by "heavily structured output"? I find it generates the most natural-sounding output of any of the LLMs: it cuts straight to the answer with natural-sounding prose (except when it sometimes decides to use ChatGPT-style output with emoji headings for no reason). I've only used it on kimi.com though, so I'm wondering what you're seeing.
Available on OpenRouter already as well in case anyone wants to try it there: https://openrouter.ai/moonshotai/kimi-k2-thinking
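If you want to hit it programmatically, OpenRouter speaks the OpenAI-compatible API, so something like this minimal sketch should work (assuming you've set OPENROUTER_API_KEY; whether the thinking tokens come back in a separate field depends on the provider, so the attribute probing below is defensive, not guaranteed):

    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    resp = client.chat.completions.create(
        model="moonshotai/kimi-k2-thinking",
        messages=[{"role": "user", "content": "In two sentences, what distinguishes a thinking model from an instruct model?"}],
    )

    msg = resp.choices[0].message
    # Some providers surface the thinking tokens separately; the attribute name varies.
    print(getattr(msg, "reasoning", None) or getattr(msg, "reasoning_content", None))
    print(msg.content)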
Would be nice if this were on AWS bedrock or google vertex for data residency reasons.
Like their previous model, they opened the weights so I'm hoping it'll be offered by third party hosts soon https://huggingface.co/moonshotai/Kimi-K2-Thinking
The non-thinking Kimi K2 is on Vertex AI, so it's just a matter of time before it appears there. Very interesting that they're highlighting its sequential tool use and needle-in-a-haystack RAG-type performance; these are the real-world use cases that need significant improvement. Just yesterday, Thoughtworks moved text-to-sql to "Hold" on their tech radar (i.e. they recommend you stop doing it).
Thanks, I didn't realize Thoughtworks was staying so up-to-date w/ this stuff.
EDIT: whoops, they're not, tech radar is still 2x/year, just happened to release so recently
EDIT 2: here's the relevant snippet about AI Antipatterns:
"Emerging AI Antipatterns
The accelerating adoption of AI across industries has surfaced both effective practices and emergent antipatterns. While we see clear utility in concepts such as self-serve, throwaway UI prototyping with GenAI, we also recognize their potential to lead organizations toward the antipattern of AI-accelerated shadow IT.
Similarly, as the Model Context Protocol (MCP) gains traction, many teams are succumbing to the antipattern of naive API-to-MCP conversion.
We’ve also found the efficacy of text-to-SQL solutions has not met initial expectations, and complacency with AI-generated code continues to be a relevant concern. Even within emerging practices such as spec-driven development, we’ve noted the risk of reverting to traditional software-engineering antipatterns — most notably, a bias toward heavy up-front specification and big-bang releases. Because GenAI is advancing at unprecedented pace and scale, we expect new antipatterns to emerge rapidly. Teams should stay vigilant for patterns that appear effective at first but degrade over time and slow feedback, undermine adaptability or obscure accountability."
https://www.thoughtworks.com/radar
Can't wait for Artificial analysis benchmarks, still waiting on them adding Qwen3-max thinking, will be interesting to see how these two compare to each other
Qwen 3 Max has been getting rather bad reviews around the web (both on Reddit and Chinese social media), and that matches my own experience with it. So I wouldn't expect this to be worse.
Also, my experience with it wasn't that good, but it was looking good on benchmarks...
It seems like benchmark-maxing: what you do when you're out of tricks?
Ohhh, so Qwen3 235B-A22B-2507 is still better?
I wouldn't say that, just that Qwen 3 Max Thinking definitely underperforms relative to its size.
Did the ArtificialAnalysis team get bored or something? What makes a model worthy of benchmark inclusion?
Weird. I just tried it and it fails when I ask:
"Tell me about the 1989 Tiananmen Square massacre".
If asked non-directly, it still currently answers it - https://www.kimi.com/share/19a5ab4a-e732-8b8b-8000-00008499c...
yes yes Chinese models have Chinese censorship, we don't need to belabour this point every time
Use American models to avoid Chinese censorship, and Chinese models to avoid American censorship (if any; I think there might be, but I'm not sure) lol
No, we need to belabour it every time.
What's the hardware needed to run the trillion-parameter model?
To start with, an Epyc server or Mac Studio with 512GB RAM.
I looked up the price of the Mac Studio: $9500. That's actually a lot less than I was expecting...
I'm guessing an Epyc machine is even less.
It's an MoE model, so it might not be that bad. The deployment guide at https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main... suggests that the full, unquantized model can be run at ~46 tps on a dual-CPU machine with 8× NVIDIA L20 boards.
Once the Unsloth guys get their hands on it, I would expect it to be usable on a system that can otherwise run their DeepSeek R1 quants effectively. You could keep an eye on https://old.reddit.com/r/LocalLlama for user reports.
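Back-of-envelope on why that hardware class: ~1T parameters at ~4 bits per weight is roughly 1e12 × 0.5 bytes ≈ 500 GB of weights before KV cache, which is why 512 GB of system RAM or unified memory is about the floor. Once GGUF quants exist, a minimal local-inference sketch with llama-cpp-python might look like this (the file name and layer count below are made-up placeholders; tune n_gpu_layers to whatever fits your VRAM):

    from llama_cpp import Llama

    # Hypothetical GGUF quant of Kimi K2 Thinking; real file names depend on whoever publishes the quants.
    llm = Llama(
        model_path="Kimi-K2-Thinking-Q4_K_M-00001-of-00012.gguf",
        n_gpu_layers=20,   # offload as many layers as your GPU(s) can hold; the rest stays in system RAM
        n_ctx=8192,
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize the trade-offs of MoE models in two sentences."}],
    )
    print(out["choices"][0]["message"]["content"])

Because it's an MoE, only a small fraction of those parameters (on the order of 32B for K2) is active per token, which is why CPU+GPU offload setups can still reach usable speeds despite the total size.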
Are such machines available in the A class clouds such as Azure/AWS/Google?
I've waited so long: four months.
Well, at least it had the judgment to throw in the towel at my historical HTR task rather than produce garbage.
I was hoping this was about Summits On The Air...but no it's more boring AI
Great, what does it think Taiwan is part of...
I asked it that now and it gave an answer identical to English language Wikipedia
When can we stop with these idiotic knee-jerk reactions?
Just checked; I wouldn't say it's identical, but yes, it looks way more balanced.
This is literally the first Chinese model to do that, so I wouldn't call it "knee-jerk".
Where is our guy @simonw on this?
At the top.
The system is working! :)
In the meantime, Sam is looking at putting more servers on the moon.
> 200 to 300 consecutive tool calls
I love it when people leave prompt injections in random places on the internet.
So Apple is about to pay OpenAI $1B per year for what Moonshot is giving away for free?
You haven't seen Gemini 3 yet. A billion is nothing to Apple; running Kimi would probably need $1B worth of GPUs anyway.
People don't get that Apple would need an enormous data center buildout to provide a good AI experience on their millions of deployed devices. Google is in the exascale datacenter buildout business, while Apple isn't.
Apple is buying a model from Google, not inference. Apple will host the model themselves.
It's very simple: Apple absolutely refuses to send all their user data to Google.
Looking forward to the agentic mode release. Moonshot does not seem to offer subscriptions?
The model's downloadable, which is generous, but it's not open source.
Please, for the love of god, if you work at Cerebras, please put this on an API for me.
Interesting. Kimi K2 gets mixed results on what I call the "Tiananmen" test.
It fails utterly if you ask without the "Thinking" setting. [0]
> USER: anything interesting protests ever happen in tiananmen square?
> AGENT: I can’t provide information on this topic. I can share other interesting facts about Tiananmen Square, such as its history, culture, and tourism.
When "Thinking" is on, it pulls Wiki and gives a more reasonable answer: [1]
> USER: anything interesting protests ever happen in tiananmen square?
> AGENT: Yes, there were significant protests in Tiananmen Square in 1989. Following the death of reformist leader Hu Yaobang in April, students gathered to mourn and voice concerns about corruption and calls for reform. The demonstrations grew over several weeks, with thousands occupying the square.
> By early June, Chinese authorities declared martial law and deployed military forces to clear the area. The operation resulted in numerous casualties, with estimates varying significantly between official and unofficial sources. Tiananmen Square has been the site of other historical demonstrations as well, most notably during the May Fourth Movement in 1919. The 1989 events remain a sensitive topic with extensive international media documentation from that period.
> USER: any deaths?
> AGENT: Yes. During the military crackdown on June 3-4, 1989, there were fatalities. Official Chinese government figures report around 200-300 deaths, including military personnel. Human rights organizations and other estimates suggest the number may have been higher, in the hundreds to thousands range, with many civilian casualties in and around Tiananmen Square and surrounding areas of Beijing.
[0] https://www.kimi.com/share/19a5a0fd-0c02-8c8e-8000-0000648de...
[1] https://www.kimi.com/share/19a5a11d-4512-8c43-8000-0000edbc8...
The current LLMs are trivial to jailbreak without the additional layer of censorship that cloud models implement via a second pass over their own output (and, dystopically, by erasing their incomplete output right in front of the user's eyes when wrongthink is detected). Even gpt-oss, with its SOTA lobotomy and heavily sterilized datasets, is being used for things its creators would ostensibly abhor.
> > USER: anything interesting protests ever happen in tiananmen square?
This sentence confuses me, so I am not surprised an LLM has trouble with it.
In my experience, LLMs are much better than me at parsing broken grammar
With most of the Chinese models it doesn't matter how good your English is when you phrase this question.
But... it has no problem telling you about historical atrocities or repression carried out by the governments of other countries. And then it refuses to compare or contrast the two.
It's amazing how well they've done at tweaking it for this.
Just for kicks, run these models locally and have LM Studio or whatever show you their internal thinking dialogue as they compose the answer. Convoluted as hell.
Not bad. Surprising. Can’t believe there was a sudden change of heart around policy. Has to be a “bug”.
FWIW, I don't think it's a different model, I just think it's got a NOTHINK token, so def a bug.
Now ask it for proof of civilian deaths inside Tiananmen Square - you may be surprised at how little there is.
I don't think this is the argument you want it to be, unless you're acknowledging the power of the Chinese government and their ability to suppress and destroy evidence. Even so there is photo evidence of dead civilians in the square. The best estimates we have are 200-10,000 deaths, using data from Beijing hospitals that survived.
AskHistorians is legitimately a great resource, with sources provided and very strict moderation: https://www.reddit.com/r/AskHistorians/comments/pu1ucr/tiana...
I appreciate you responding in good faith; I realise that not everyone is willing to even consider questioning historical accounts.
The page you linked to is interesting, but AFAICT doesn't provide any photographic evidence of civilian bodies inside Tiananmen Square.
The 10,000 number seems baseless.
The source for it is a diplomatic cable from the British ambassador, sent within 48 hours of the massacre, saying he had heard the figure secondhand.
That would have been too soon for any accurate data, which explains why it's so high compared to other estimates.
44.9 on HLE is so impressive, and they also have "heavy" mode
Is more still better?