While it's impressive that the output isn't completely undecipherable, my real-world queries for a Spring Boot project with the most popular libraries don't compare so favorably to their benchmarks against Qwen3 32B, which I also run regularly (a 4-bit quantized version of it). Explanation tasks break completely and often.
I used their recommended temperature, top_k, top_p, and the rest of the sampler settings.
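Passed explicitly as sampling parameters, roughly like the sketch below; the numbers there are placeholders, not the model card's actual recommended values:
```python
from vllm import SamplingParams

# Placeholder sampler settings -- substitute the model card's recommended
# temperature / top_p / top_k here.
params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    max_tokens=2048,
)
```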
Breaks as in the think block contains nonsense, or the output just stops early? I've had some thinking weirdness, but it doesn't seem to affect the final answer much.
Overall it still seems extremely good for its size and I wouldn't expect anything below 30B to behave like that. I mean, it flies with 100 tok/sec even on a 1650 :D
Breaks as in it contains words that work grammatically but don't make sense, mistakes the symbol | for a person, refers back to things that didn't exist in the request, etc. I use templates like this for explanation questions:
from
```
excerpt of text or code from some application or site
```
What is the meaning of excerpt?
It just doesn't seem to work at a usable level. Coding questions get code that runs, but it almost always misses so many things that figuring out what it missed and fixing it takes more time than writing the code by hand.
>Overall it still seems extremely good for its size and I wouldn't expect anything below 30B to behave like that. I mean, it flies with 100 tok/sec even on a 1650 :D
For its size, absolutely. I've not seen 1.5B models that even form sentences correctly most of the time, so this is miles ahead of most small models, just not at the levels the benchmarks would have you believe.
Interesting, I haven't seen it actually return nonsense yet (some incorrect things and getting into thinking loops, but always coherent). I'm running it on the latest llama.cpp with the bf16 GGUF. What are you using?
I'm running the Hugging Face .safetensors with vLLM, with as few startup parameters as possible. I thought it must not be passing the temperature correctly, but after setting temp to something else I got Chinese, so it should be passing it.
Overall, if you're memory constrained, it's probably still worth trying to fiddle around with it if you can get it to work. Speed-wise, if you have the memory, a 5090 can get ~50-100 tok/s for a single query with the 32B-AWQ, and way more if you run something that issues parallel requests, like open-webui.
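For reference, a rough vLLM sketch of the 32B-AWQ setup mentioned above; the repo id and the numbers are assumptions (point it at whatever AWQ checkpoint you actually use), and vLLM batches the prompts internally, which is where the parallel-throughput gain comes from:
```python
from vllm import LLM, SamplingParams

# Sketch only: model id and settings are assumptions, not a tested config.
llm = LLM(
    model="Qwen/Qwen3-32B-AWQ",   # assumed Hugging Face repo id
    quantization="awq",
    gpu_memory_utilization=0.90,  # leave a little headroom on a 32 GB card
)

params = SamplingParams(temperature=0.7, max_tokens=1024)  # placeholder values

# Several prompts at once: vLLM schedules them in parallel, so aggregate
# throughput is much higher than the single-query tok/s figure.
prompts = [
    "Explain dependency injection in Spring Boot.",
    "Write a JUnit 5 test for a simple REST controller.",
]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text[:200])
```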
Those benchmarks look incredible. Like, almost too good to be true; what am I missing?
Is this hosted online somewhere so I can try it out?
It's so tiny you can download and run it locally on CPU with llama.cpp. It seems weirdly good at some simple Python questions. Definitely better than I'd expect from any model of that size.
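Roughly like this if you go through the llama-cpp-python bindings instead of the llama.cpp CLI; the GGUF filename and the settings below are placeholders for whatever quant you download:
```python
from llama_cpp import Llama

# CPU-only sketch via llama-cpp-python; the model file is a placeholder.
llm = Llama(
    model_path="model-q4_k_m.gguf",  # point this at your downloaded GGUF
    n_ctx=4096,
    n_gpu_layers=0,  # keep everything on the CPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Write a Python function that reverses a string."}],
    temperature=0.7,  # placeholder
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```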
Many interesting open weights models are coming from China.