The big story here is the encoder-free part, which I still don't fully understand.
> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.
That's technically encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates: it's still a 35M layer, and I am curious whether that is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...
> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.
One side effect is that the separate .mmproj file (the multimodal projector) is no longer needed when using the model with llama.cpp etc.
Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.
In hindsight I may have been pedantic.
> quantization
12B means 12GB at 8 bits/param (basically lossless) and 6GB at 4 bits/param (the generally accepted 'pretty close' level). Not too bad?
But it's TBD how well the base model performs before thinking too much about quantization.
It actually works well because unlike encoders, the latent space is trained on that initial layer so it “knows” what to do with that sparse density. I’ve been using gemma4-12b with Flux2 and its ability to reason on visual input is pretty good. That said, each model is good in their own ways so YMMV but overall, it’s about as solid as Qwen just with a more advanced architecture.
I think the idea is that the model is seeing embeddings that map directly to underlying pixel data, rather than being fed semantically rich embeddings from an encoder model which itself had seen the raw pixel data.
Well, it's a really simple encoder, I guess.
> That's technically encoding
Isn't that just projecting the patches into the d_model-sized vectors that the model takes?
> I am assuming that involves quantization
A 12B model in 16GB seems very reasonable to me; int8 is top quality for running models.
The guide describes it as projection although there is apparently an extra step: "A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input."
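To make that concrete, here is a minimal NumPy sketch of what "one matrix multiplication plus a factorized X/Y lookup plus normalizations" could look like. It is not Gemma's actual code: the patch size, `d_model`, the order of the normalizations, and every name in it (`patchify`, `embed_image`, `w_proj`, `pos_x`, `pos_y`) are assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each row to zero mean and unit variance (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def patchify(image, patch):
    # Split an (H, W, C) image into a (num_patches, patch*patch*C) matrix.
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    return x, gh, gw

def embed_image(image, w_proj, pos_x, pos_y, patch):
    patches, gh, gw = patchify(image, patch)
    # The single matrix multiplication: raw pixels -> d_model-sized vectors.
    tokens = layer_norm(patches) @ w_proj
    # Factorized coordinates: one table indexed by column, one by row.
    ys, xs = np.divmod(np.arange(gh * gw), gw)
    tokens = tokens + pos_x[xs] + pos_y[ys]
    # Assumed placement of the final normalization.
    return layer_norm(tokens)

# Hypothetical sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
patch, d_model = 16, 64
image = rng.random((64, 48, 3))                      # H=64, W=48, RGB
w_proj = rng.normal(size=(patch * patch * 3, d_model)) * 0.02
pos_x = rng.normal(size=(48 // patch, d_model))      # one row per patch column
pos_y = rng.normal(size=(64 // patch, d_model))      # one row per patch row
tokens = embed_image(image, w_proj, pos_x, pos_y, patch)
print(tokens.shape)
```

The point of the factorization is that the two tables need rows + columns entries instead of rows × columns, and the resulting tokens go straight into the LLM with no encoder network in between.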
12B at int8 would take up 12GB of memory, or 75% of system memory, which technically fits within 16GB, but the OS will not like that.
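The arithmetic behind those numbers, as a rough sketch. It counts only the weights, taking 1GB as 10^9 bytes; KV cache, activations, and runtime overhead come on top and are not modeled here. The function name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(params_billion, bits_per_param):
    # Weights only: (billions of params) * (bits per param) / (8 bits per byte).
    return params_billion * bits_per_param / 8

SYSTEM_RAM_GB = 16  # the laptop size quoted in the announcement

for bits in (16, 8, 4):
    gb = weight_memory_gb(12, bits)
    print(f"{bits:>2}-bit: {gb:.1f} GB of weights, {gb / SYSTEM_RAM_GB:.1%} of {SYSTEM_RAM_GB} GB")
```

So int8 lands at exactly the 12GB / 75% figure, and 4-bit halves that, which is why the 16GB claim almost certainly assumes some quantization.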
What's Google's business case for releasing open models? Don't get me wrong, I am grateful and appreciative of these releases. I'm trying to understand how it fits into their bigger picture as a for-profit company. Are they not helping competitors build on the novel technology they have developed?
Is it simply goodwill and/or marketing? Or am I missing something strategic?
This won't replace commercially viable, revenue generating alternatives of their own devising, but it does enable development activity and initiate conversations with enterprises who start with this model but want to do slightly more.
That's my experience right now... my company is all in on a plethora of platform products. Also, Microsoft just yesterday said their goal was "Unmetered intelligence". There are a lot of things that can be enabled by small local models, and those things are part of stacks that can generate revenue in other layers.
If you're an AI lab, you definitely want research teams in this space - as this is where you can most easily iterate and make improvements which you'll then bake into larger, frontier models.
The question is: do you want to release your models, or use them purely for R&D?
Since everyone else is already releasing models of similar quality, it's hard to say you're shooting yourself in the foot if you join the chorus.
The added cannibalization of releasing them is effectively zero, so the reputational benefits are likely to be worth it.
It's to destroy possible footholds for competitors and prevent them from making money in segments that Google doesn't care too much about, but can trivially commoditize.
Android and Chrome need on-device AI capabilities. Google can't lock down those weights like it can with server-side ML.
So it's easier to just release those models as open source and make it official, since someone would inevitably hack the weights out anyway.
My guess is that it's testing for Apple’s Siri replacement and partnership, but that’s a total SWAG.
Neutering OpenAI and Anthropic would be my guess. Commoditized LLMs won't hurt Google nearly as much as it hurts the LLM-only companies, and so accelerating the inevitable just helps knock out potential future competition in areas where Google -does- make a lot of money now.
Google's MO has always been to release great products or services for free, position themselves high, and then abandon them or just find uses for Enterprise sales.
I'm pretty sure they are doing it because they get some research experience by shrinking and improving these models, and because they know that by doing this they get some good PR among the dev community.
Isn't Apple about to license some variation of this from Google for on-device AI? Maybe it’s their sales pitch to Apple and then they will lock it down.
Maybe they are hedging against a future where local models are just as good as cloud models? Or maybe they can go the Taalas route and start hardcoding Gemma on a chip and hardware manufacturers can use it for local private AI.
They're trying to capture the segment of the market that wants to control the model, with the intent of getting you to run them on Vertex.
Marketing + Pro Serv if I had to take a guess.
edge compute
Gemma overtakes and kills real open-source AI projects, pushing people who would support them towards enterprises like Google.
Is this Mac only? Or is that an Ollama issue that it only supports this release of models on Mac? It seems like every tag with the MLX badge is only supported on Mac[0], and that includes all of the tags in this release.
[0] https://ollama.com/library/gemma4/tags
MLX is Apple’s own machine learning framework, designed for Apple Silicon: https://opensource.apple.com/projects/mlx/
MLX is quite literally macOS-specific technology; for other platforms you want non-MLX.
I was sure "MLX" stood for "Metal-something-something" but somehow can't find any reference to that. Anyway, "Metal" is hardware-accelerated graphics on Apple platforms, FWIW.
This is a pretty good update. The demo video is a bit funny though - the tester asks to turn the release into bullet points. Okay, the model obliges. Then the tester says to draft an email with this content. BAM! The LLM turns the content from bullets into passages even though it was not asked to, undoing the last good thing that it did. I am not sure if it's email etiquette to not put bullets in an email.
> Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.
I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog).
What are the use cases for these small models? Is there anyone using models of this scale in their daily life who could share their experience?
I can’t help but wonder if this is the basis of the model they’ve helped tune for Apple.
How does it compare with e4b, aside from being larger?
There's a comparison of all the Gemma 4 models (+ Gemma 3 27B) on the Huggingface model card: https://huggingface.co/google/gemma-4-12B-it#benchmark-resul...
That's what I want to know too. A smarter E4B that's happy in opencode would be a good self-hosted model for me.
Wow, Google is becoming the new pre-Llama 4 Meta when it comes to releasing open-weights models.
I dunno, feels a bit unfair to companies that actually do FOSS releases (Gemma 4 being released under the Apache 2.0 license) to compare them to a company that has never done any FOSS releases, and has mostly done proprietary "available to download" releases.
Note that releasing a binary under the Apache 2.0 license does not by itself make it FOSS.
Agreed, though it's miles ahead of "proprietary", which is what Meta has been using for most model releases.
Ideally companies would share the fucking datasets and training code already, but no, no one wants to talk about the source of those or even share the ones they have as then who knows what comes out of Pandora's box...
Every other Google model I have tried felt very weak compared to Qwen models. I don't have a ton of use cases for multimodal though, so it's very possible this is a fantastic multimodal model.
IDK, this model release is a bit disappointing considering the community has been chomping at the bit for the 124ba4b model. There was some leaked info about it, but people suspect it was not released because it was too close to Gemini Flash in performance.