Congrats on the effort - the local-first / private space needs more performant AI, and AI in general needs more comparable and trustworthy benchmarks.
Notes:
- Ollama integration would be nice
- Is there anonymous federated score sharing? That way, users can approximate a model's performance before downloading it.
Can you tell me more about the "anonymous federated score sharing"? Maybe something we can think about more
I totally agree with Ollama integration and if there is interest we will try to upstream into llama.cpp
Contributed scores for the M3 Ultra 512 GB unified memory: https://www.localscore.ai/accelerator/404
Happy to test larger models that utilize the memory capacity if helpful.
That's very interesting. I guess it just can't compete with any of the Nvidia cards? I would think your results should show up if sorted by "generation"– maybe the leaderboard is cached...
Ty for pointing this out. The results are taken from the DB based on LocalScore; I will make some modifications to improve the sorting here.
I’m curious: does this fundamentally need to contain an actual model, or would it be okay if it generated a synthetic model itself, full of random weights? I’m picturing downloading just, say, a 20MB file instead of the multi-gigabyte one, and…
Hang on, why is https://blob.localscore.ai/localscore-0.9.2 380MB? I remember llamafile being only a few megabytes. From https://github.com/Mozilla-Ocho/llamafile/releases, looks like it steadily grew from adding support for GPUs on more platforms, up to 28.5MiB¹ in 0.8.12, and then rocketed up to 230MiB in 0.8.13:
> The llamafile executable size is increased from 30mb to 200mb by this release. This is caused by https://github.com/ggml-org/llama.cpp/issues/7156. We're already employing some workarounds to minimize the impact of upstream development contributions on binary size, and we're aiming to find more in the near future.
Ah, of course, CUDA. Honestly I might be more surprised that it’s only this big. That monstrosity will happily consume a dozen gigabytes of disk space.
llamafile-0.9.0 was still 231MiB, then llamafile-0.9.1 was 391MiB, now llamafile-0.9.2 is 293MiB. Fluctuating all over the place, but growing a lot. And localscore-0.9.2 is 363MiB. Why 70MiB extra on top of llamafile-0.9.2? I’m curious, but not curious enough to investigate concretely.
Well, this became a grumble about bloat, but I’d still like to know whether it would be feasible to ship a smaller localscore that would synthesise a suitable model, according to the size required, at runtime.
—⁂—
¹ Eww, GitHub is using the “MB” suffix for its file sizes, but they’re actually mebibytes (2²⁰ bytes, 1048576 bytes, MiB). I thought we’d basically settled on returning the M/mega- prefix to SI with its traditional 10⁶ definition, at least for file sizes, ten or fifteen years ago.
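For concreteness, here's the arithmetic behind that footnote as a tiny, self-contained Python snippet (nothing project-specific assumed):

```python
# Difference between mebibytes (MiB, 2**20 bytes) and SI megabytes (MB, 10**6 bytes)
MIB = 2**20   # 1,048,576 bytes
MB = 10**6    # 1,000,000 bytes

size_mib = 28.5                  # the 0.8.12 release size as GitHub labels it
size_bytes = size_mib * MIB      # actual byte count
size_mb = size_bytes / MB        # what it is in SI megabytes

print(f"{size_mib} MiB = {size_bytes:,.0f} bytes ≈ {size_mb:.1f} MB "
      f"({(MIB / MB - 1) * 100:.1f}% larger than the 'MB' label suggests)")
```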
LocalScore dev here
Llamafile could certainly be released without the GPU binaries included by default and it would slim down the size tremendously.
The extra 70MiB is because the CUDA binaries for LocalScore are built with cuBLAS and for more generations of NVIDIA architectures (sm60->sm120), whereas Llamafile is built with TinyBLAS and for just a few specific generations.
I think it's possible to randomize the weights for a standard set of layers, and that may be a possibility for the future.
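To make that idea concrete, here's a rough sketch of what a synthetic-checkpoint generator might look like. The layer names, shapes, and hyperparameters below are illustrative assumptions rather than LocalScore's actual test models, and a real version would still have to serialize everything into GGUF (plus tokenizer metadata) for llamafile/llama.cpp to load it:

```python
import numpy as np

# Illustrative hyperparameters for a tiny llama-style model (assumptions,
# not the actual LocalScore test models).
vocab_size, n_layers, d_model, d_ff = 32000, 4, 512, 1376
rng = np.random.default_rng(seed=0)

def random_tensor(*shape):
    # Small random float16 weights keep the file tiny while still exercising
    # the same matmul shapes a real checkpoint of this architecture would.
    return (rng.standard_normal(shape) * 0.02).astype(np.float16)

tensors = {"tok_embeddings.weight": random_tensor(vocab_size, d_model)}
for i in range(n_layers):
    tensors[f"layers.{i}.attention.wq.weight"] = random_tensor(d_model, d_model)
    tensors[f"layers.{i}.attention.wk.weight"] = random_tensor(d_model, d_model)
    tensors[f"layers.{i}.attention.wv.weight"] = random_tensor(d_model, d_model)
    tensors[f"layers.{i}.attention.wo.weight"] = random_tensor(d_model, d_model)
    tensors[f"layers.{i}.feed_forward.w1.weight"] = random_tensor(d_ff, d_model)
    tensors[f"layers.{i}.feed_forward.w2.weight"] = random_tensor(d_model, d_ff)
    tensors[f"layers.{i}.feed_forward.w3.weight"] = random_tensor(d_ff, d_model)
tensors["output.weight"] = random_tensor(vocab_size, d_model)

total_mib = sum(t.nbytes for t in tensors.values()) / 2**20
print(f"{len(tensors)} tensors, ~{total_mib:.1f} MiB of synthetic weights")
# A real generator would now write these tensors into a GGUF file so that
# llamafile / llama.cpp could load it like any other model.
```

Since the benchmark measures throughput rather than output quality, the weight values themselves shouldn't matter, only the tensor shapes and quantization.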
I've been waiting for something like this. Have you considered the following based on the benchmark data that's submitted beyond the GPU?
1. User selects a model, size, token output speed, and latency. The website generates a list of hardware components that should match the performance requirements.
2. User selects hardware components and the website generates a list of models that are performant on that hardware.
3. Monetize through affiliate links for the components to fund the project. Think PCPartPicker.
I know there's going to be some variability in the benchmarks due to the software stack, but it should give AI enthusiasts an educated perspective on what hardware is relevant for their use case. (A rough sketch of how ideas 1 and 2 could query the data follows below.)
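To illustrate ideas 1 and 2, here's a rough sketch of the kind of lookup the site (or an API consumer) could do over submitted results; the record fields and numbers are made up for illustration and aren't LocalScore's actual schema or data:

```python
from dataclasses import dataclass

# Hypothetical benchmark records; field names and values are illustrative only.
@dataclass
class Result:
    accelerator: str
    model: str
    gen_tok_per_s: float   # token output speed
    ttft_ms: float         # time to first token (latency)

RESULTS = [
    Result("Example GPU A", "Llama 8B Q4", 120.0, 180.0),
    Result("Example Mac B", "Llama 8B Q4", 45.0, 420.0),
    Result("Example GPU A", "Qwen 32B Q4", 35.0, 650.0),
]

def hardware_for(model: str, min_tok_per_s: float, max_ttft_ms: float) -> list[str]:
    """Idea 1: given a model and performance targets, list matching hardware."""
    return sorted({r.accelerator for r in RESULTS
                   if r.model == model
                   and r.gen_tok_per_s >= min_tok_per_s
                   and r.ttft_ms <= max_ttft_ms})

def models_for(accelerator: str, min_tok_per_s: float) -> list[str]:
    """Idea 2: given hardware, list models that run acceptably on it."""
    return sorted({r.model for r in RESULTS
                   if r.accelerator == accelerator
                   and r.gen_tok_per_s >= min_tok_per_s})

print(hardware_for("Llama 8B Q4", min_tok_per_s=60, max_ttft_ms=300))
print(models_for("Example GPU A", min_tok_per_s=30))
```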
Right now the main priority is just getting the data out, but in the future we may have some interest in this. Or perhaps we can open an API for others to build on this as well.
Awesome stuff, congrats on launching!
This is super cool. I finally just upgraded my desktop and one thing I’m curious to do with it is run local models. Of course the RAM is late, so I’ve been googling trying to get an idea of what I could expect, and there’s not much out there to compare to unless you’re running state-of-the-art stuff.
I’ll make sure to run and contribute my benchmark to this once my RAM comes in.
Congrats on launching!
Stoked to have this dataset out in the open. I submitted a bunch of tests for some models I'm experimenting with on my M4 Pro. Rather paltry scores compared to having a dedicated GPU but I'm excited that running a 24B model locally is actually feasible at this point.
This is great, congrats for launching!
A couple of ideas: I would like to benchmark a remote headless server, as well as different methods of running the LLM (vLLM vs. TGI vs. llama.cpp ...) on my local machine, and in that case llamafile is quite limiting. Connecting over an OpenAI-like API instead would be great!
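For what it's worth, here's a minimal sketch of what benchmarking over an OpenAI-compatible endpoint could look like. The base URL and model name are placeholders, and it only measures rough time-to-first-token and streamed-chunk throughput, not whatever LocalScore officially reports:

```python
import json, time, requests

# Placeholder endpoint/model; any OpenAI-compatible server (llama.cpp server,
# vLLM, a TGI shim, ...) should accept this shape of request.
BASE_URL = "http://localhost:8080/v1"
MODEL = "my-local-model"

def bench(prompt: str, max_tokens: int = 128) -> None:
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    with requests.post(
        f"{BASE_URL}/chat/completions",
        json={"model": MODEL,
              "messages": [{"role": "user", "content": prompt}],
              "max_tokens": max_tokens,
              "stream": True},
        stream=True,
        timeout=300,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Server-sent events arrive as lines of the form: data: {...}
            if not line or not line.startswith(b"data: "):
                continue
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":
                break
            chunk = json.loads(payload)
            if chunk["choices"][0].get("delta", {}).get("content"):
                if first_token_at is None:
                    first_token_at = time.perf_counter()
                n_chunks += 1
    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    gen_time = end - (first_token_at or end)
    if n_chunks and gen_time > 0:
        # Streamed chunks are a rough proxy for tokens; a careful benchmark
        # would tokenize the output instead.
        print(f"TTFT {ttft*1000:.0f} ms, ~{n_chunks/gen_time:.1f} chunks/s generated")
    else:
        print(f"TTFT {ttft*1000:.0f} ms (no streamed content received)")

bench("Write a haiku about benchmarks.")
```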
LocalScore dev here
Thank you! I think this is quite possible! If you don't mind starting a discussion on this, I would love to think aloud there:
https://github.com/cjpais/LocalScore/discussions
Interesting approach to making local recommendations more personalized and relevant. I'm curious about the cold start problem for new users and how the platform handles privacy. Partnering with local businesses to augment data could be a smart move. Will be watching to see how this develops!
This looks super useful, especially with so many folks experimenting with local LLMs now. Curious how well it handles edge devices. Will give it a try!
Why choose a combination of Llama and Qwen, when you could have used just Qwen models with a more permissive license?
It's kind of just an artifact of the development. Happy to switch the default models and sizes in the future, especially based on community feedback.
Really awesome project!
Clicking on a GPU gives a nice, simple visualization. I was thinking you could try to make that type of visual representation intuitively accessible right on the landing page.
cpubenchmark.net could be an example of a technique for drawing the site visitor into the paradigm.
I think you might be right. Definitely interested in this feedback and in creating charts and graphs that are the most useful for folks!