LLM Benchmark: Frontier models now statistically indistinguishable

4 points | by js4ever 17 hours ago

4 comments

Adrig 16 hours ago
I don't follow all these benchmarks closely, but I would love to have some idea of where models stand for specific use cases. Average intelligence is close across the mainstream models, but on writing, design, coding, and search, there are still some gaps.
Even if it's not a benchmark, a vibe test from a trusted professional with a use case close to mine would suffice.
Your point about ecosystem is true; I just switched my main provider from OpenAI to Anthropic because they continue to prove they have a good, concrete vision for AI.
anonzzzies 16 hours ago
Would be nice to include similar sized open (source/weights) ones.
- js4ever 9 hours ago
  Just tried Devstral 2 (123B, from Mistral); it scored 76% ... Disappointing
jaggs 14 hours ago
That's true until you try to use them for a real task. Then the differences become clear as day.