Google Released Gemma 4 for Free. Here Is Why That Makes Sense.
Gemma 4 dropped April 2, 2026 under Apache 2.0 with full commercial rights. This is what the architecture actually does and what Google is really after.

On April 2, 2026, Google released four AI models. You can download them, run them locally, fine-tune them on your own data, ship products built on top of them, and charge customers for those products. No usage fees. No per-token billing. No reporting requirements back to Google. Just a standard Apache 2.0 license and a Hugging Face page.
That is a strange move for one of the biggest technology companies on earth. Google has profitable API services. They have Google Cloud. They have Gemini. So why are they handing out a model built on the same research as Gemini 3, completely free, with full commercial rights?
There is a reason. Once you see it, the whole release makes sense.
What Gemma 4 Actually Is
When you call the Gemini API, your prompt leaves your machine, travels to Google's servers, gets processed on their accelerators, and the response comes back. You pay per token in and per token out. Your laptop is just a terminal. The actual computation happens somewhere else, on hardware you do not control.
Gemma works differently. You download the model weights once. Those weights contain everything the model learned during training, frozen into a file on your disk. After that, everything runs locally. Your CPU, your GPU, your RAM. No internet needed, no API call, no Google server involved.
Open weights are also why Llama from Meta became so widely used. Meta released their weights openly, and within days the community had tools like Ollama running a capable model on a MacBook with one terminal command. Gemma 4 works with the same ecosystem. If you already use Ollama, getting started is one line:
# 26B MoE — the sweet spot for most developers
ollama run gemma4
# 31B Dense — maximum quality, needs ~24GB VRAM
ollama run gemma4:31b
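Once a model is pulled, Ollama also serves it over a local HTTP API, so the same weights can be called from code. The following is a minimal sketch, assuming Ollama is running on its default port (11434) and that the `gemma4` tag from the commands above is available. The helper names `build_request` and `ask` are made up for this example.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "gemma4") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "gemma4") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("Explain Mixture of Experts in one sentence.")` only works while the Ollama server is running and the model has been downloaded. Nothing in the request leaves `localhost`.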
Running local AI is not a new idea. Llama has been running locally for over two years, and Mistral and Phi have done the same. What is new with Gemma 4 is the quality of what you can now run locally: the gap between cloud models and local models has narrowed significantly.
The Four Models and What Makes Them Interesting
Gemma 4 ships in four sizes: E2B, E4B, 26B, and 31B. The naming tells you something about the architecture inside each one.
E2B and E4B: Per-Layer Embeddings
The "E" stands for "effective" parameters. These two edge models use a technique called Per-Layer Embeddings (PLE). In a standard transformer, a token is converted into an embedding vector once, at the input, and every decoder layer then works on the hidden state that grows out of that single lookup. No layer gets token-specific information of its own. Think of it as one ID badge issued at the front door that has to serve every floor of a building, whether or not it carries the details that floor cares about.
PLE gives each decoder layer its own small secondary embedding for every token. Each layer gets a richer, more specific signal about what it is processing. The model does not need to be as wide or as deep to produce good output, because each layer starts with better information.
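The idea can be sketched as a toy forward pass. This is an illustration of the per-layer lookup only, not Gemma's implementation: the sizes, the names `token_embed`, `layer_embed`, and `layer_proj`, and the choice to add the projected vector to the hidden state are all assumptions, and attention and feed-forward blocks are left out.

```python
import random

random.seed(0)

# Toy sizes chosen for readability; they are not Gemma's real dimensions.
VOCAB, D_MODEL, D_PLE, N_LAYERS = 10, 8, 2, 3


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    # matrix is rows x cols, vector has length cols
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Shared input embedding table, looked up once per token: VOCAB x D_MODEL.
token_embed = rand_matrix(VOCAB, D_MODEL)
# One small embedding table per decoder layer: VOCAB x D_PLE each.
layer_embed = [rand_matrix(VOCAB, D_PLE) for _ in range(N_LAYERS)]
# Per-layer projection from the small vector up to model width: D_MODEL x D_PLE.
layer_proj = [rand_matrix(D_MODEL, D_PLE) for _ in range(N_LAYERS)]


def forward(token_id):
    hidden = list(token_embed[token_id])  # the single input lookup
    for layer in range(N_LAYERS):
        # Attention and feed-forward would run here in a real decoder layer.
        extra = matvec(layer_proj[layer], layer_embed[layer][token_id])
        hidden = [h + e for h, e in zip(hidden, extra)]
    return hidden


print("main table entries:", VOCAB * D_MODEL)
print("per-layer table entries:", N_LAYERS * VOCAB * D_PLE)
```

Each per-layer table is much narrower than the main one (`D_PLE` versus `D_MODEL`), which is the sense in which the extra signal is "small".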
The practical result: the E2B runs in under 1.5 GB of RAM, less than many smartphone apps take up. In that footprint the model understands text, images, and audio, works in over 140 languages, and runs completely offline.
26B: Mixture of Experts
In a traditional dense model, every parameter fires for every token. All of them, every time. That worked fine when models were smaller. As they scaled into the tens of billions of parameters, it became expensive. Tens of billions of operations per token, on every single forward pass.
Mixture of Experts solves this by splitting the model into many smaller specialist networks called experts, then adding a lightweight router that looks at each incoming token and decides which experts are actually needed. The rest sit idle.
The Gemma 4 26B has 128 such experts. For each token, only 8 of them activate. So while all 26 billion parameters sit in memory, only about 3.8 billion do actual work at any given moment. You get the knowledge of a 26B parameter model at the compute cost of roughly a 4B one.
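A minimal sketch of top-k routing with those numbers, 128 experts and 8 active per token, looks like this. It assumes the common scheme where the router's top-k logits are passed through a softmax to weight the chosen experts. The source does not describe Gemma 4's exact routing rule, and the scalar "experts" here are hypothetical stand-ins for small feed-forward networks.

```python
import math
import random

NUM_EXPERTS, TOP_K = 128, 8  # figures quoted in the text


def route(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_output(x, experts, router_logits):
    """Weighted sum of the chosen experts' outputs; the other experts never run."""
    return sum(w * experts[i](x) for i, w in route(router_logits))


random.seed(0)
# Hypothetical experts: each scalar function stands in for a feed-forward net.
experts = [lambda x, s=i: x * (s + 1) for i in range(NUM_EXPERTS)]
# Hypothetical router scores for one token.
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]

chosen = route(logits)
print(len(chosen), "of", NUM_EXPERTS, "experts active")
print(f"active parameter share: {3.8 / 26:.1%}")  # 14.6%
```

The active share of parameters (about 14.6%) is larger than 8/128 (6.25%), presumably because attention layers and embeddings are not part of the expert pool and run for every token.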
The benchmark that got people talking: on Arena AI, where real humans have blind conversations with different models and vote on which answer they prefer, the 26B scores 1441. The 31B dense model scores 1452. An 11-point gap. The 26B activates about 3.8 billion parameters per token against the dense model's 31 billion, roughly one-eighth the compute per inference step. That is the MoE argument made concrete.
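To put the 11-point gap in perspective, here is a rough calculation. It assumes Arena scores behave like standard Elo ratings on the usual 400-point logistic scale, and that compute per token scales with the number of active parameters. Both are simplifications.

```python
def expected_win_rate(score_a: float, score_b: float) -> float:
    """Standard Elo expectation that model A is preferred over model B."""
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / 400))


dense_31b, moe_26b = 1452, 1441   # Arena scores quoted in the text
active_moe_b, dense_b = 3.8, 31   # billions of parameters used per token

print(f"score gap: {dense_31b - moe_26b}")
print(f"31B preferred in about {expected_win_rate(dense_31b, moe_26b):.1%} of matchups")
print(f"compute ratio: {active_moe_b / dense_b:.3f}")
```

Under those assumptions the dense model would win about 51.6% of head-to-head votes, close to a coin flip, while the MoE model does about 0.123 of the work per token.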
31B Dense
No tricks. Every parameter fires for every token. This is the raw quality variant. It currently sits at #3 among all open models on Arena AI, which is a significant position for something you can run on a single consumer GPU. At Q4 quantization it fits in about 20-24 GB of VRAM, which puts it on a single RTX 4090 or an M-series Mac with enough unified memory.
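The VRAM figure follows from simple arithmetic on weight storage. The sketch below assumes exactly 4 or 16 bits per weight and counts weights only. Real Q4 formats typically store slightly more than 4 bits per weight, and the KV cache and activations need memory on top, which is how roughly 15.5 GB of weights becomes a 20-24 GB requirement in practice.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


for name, params in [("26B MoE", 26), ("31B dense", 31)]:
    q4 = weight_gb(params, 4)
    fp16 = weight_gb(params, 16)
    print(f"{name}: {q4:.1f} GB at 4-bit, {fp16:.1f} GB at 16-bit")
```

The same formula explains why the 26B MoE still needs all of its weights in memory (about 13 GB at 4-bit) even though only a fraction of them run per token.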
The License Is the Real Story
Previous Gemma versions used a custom Google license. It looked permissive, but it had enough ambiguous carve-outs that enterprise legal teams kept flagging it as a risk. Many companies just stayed away. The paperwork was not worth it.
Gemma 4 ships under Apache 2.0. This is a standard open-source license that has been around for two decades. No revenue limits, no user count thresholds, no reporting obligations. You can fine-tune the model on your own private data, package it into a product, sell that product, and compete with Google directly. Your obligations are light: include the license text in your distribution, preserve existing copyright and attribution notices, and mark any files you modified.
For developers working in healthcare, fintech, or government, places where data cannot leave the building, this changes the conversation completely. The model runs on your own hardware. Your data never moves. And the license is one your legal team has seen a thousand times.
What Google Is Actually After
Google has been watching the open-source AI ecosystem move fast and move away from them. Meta dropped Llama and developers immediately built entire toolchains around it. Mistral came out of nowhere. DeepSeek demonstrated what was possible with aggressive efficiency work.
When developers get comfortable with a model family, they write tooling around it. They learn its quirks. They build workflows that depend on it. That familiarity compounds. The switching cost gets real.
So if Google keeps everything locked behind the Gemini API, they stay competitive at the very top of the market but lose the developer ecosystem. Losing the developer ecosystem is how you become irrelevant to the next generation of builders.
Gemma solves that. Get developers building on Gemma. Make the license clean so there is no legal reason to pick anything else. Then when that startup's local prototype needs to go to production at scale, when they need to serve millions of requests per day, where do they go? Google Cloud. Their entire setup already runs on Gemma. Vertex AI is right there.
Open source is the top of the funnel. Cloud revenue is the conversion at the bottom.
Meta is running the exact same play with Llama. The difference is Meta's core business is advertising. For Google, Cloud is a direct revenue line. What this means in practice: two of the biggest technology companies in the world are now racing to give away the best AI they possibly can, because the developer who builds on your model today is the customer paying your cloud bill tomorrow.
What This Means If You Are Building Something
The cost of building a serious AI product has dropped sharply. You can build and test locally on Gemma, validate whether the thing actually works, and only move to paid cloud infrastructure when you have real revenue to justify it. That is a different risk profile than it was two years ago.
Enterprises and regulated industries also get something they actually needed. A model with a clean open license that runs entirely on your own hardware is a completely different legal conversation than "your data goes to our servers and you pay us per token."
Google is not giving Gemma 4 away because they are generous. They are doing it because a world where developers build on Google's open models is better for Google than a world where developers build on someone else's. The Trojan horse is already inside the walls. What you build with it is up to you.
Build. Ship. Repeat. — arbindbuilds.com