Run Gemma 4 E2B Locally with Ollama: Setup, API, and Real Usage
How to pull and run Google's Gemma 4 E2B model locally with Ollama, expose it as an OpenAI-compatible endpoint, and wire it into real workflows without touching a cloud API.

Most AI setups start with "sign up, get API key, add credit card." Gemma 4 E2B running through Ollama starts with one shell command. That's the actual pitch.
Google dropped Gemma 4 in April 2026 and the E2B variant is the one worth paying attention to for local dev. The "E" stands for effective parameters — 2.3B effective, 5.1B total with embeddings. It runs on 8GB RAM at 4-bit quantization and still handles multimodal input, 128K context, and native function calling. That's a lot of model for hardware you already own.
Here's how to get it running, expose it as a local API, and actually use it.
Install Ollama
Head to ollama.com/download and grab the installer for your OS.
On Linux, the one-liner works:
curl -fsSL https://ollama.com/install.sh | sh
On Mac, unpack the zip and move it to Applications. The server runs in the background automatically after install.
Check it's running:
ollama --version
If you're on an Apple Silicon Mac, Ollama v0.19+ automatically uses MLX for inference. You don't have to configure anything. On NVIDIA, it uses CUDA. CPU fallback works too, just slower.
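If you'd rather check the version from a script than eyeball it, here is a minimal sketch. It assumes the server is on the default port and answers GET /api/version with a JSON body containing a "version" field. The helper names (parse_version, meets_minimum, server_version) are mine, not part of Ollama.

```python
import json
import re
import urllib.error
import urllib.request


def parse_version(text):
    """Turn '0.19.2' or 'v0.19.0-rc1' into a tuple of ints such as (0, 19, 2)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if not match:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version, minimum):
    """True when `version` is at least `minimum`, comparing numeric parts only."""
    have, want = parse_version(version), parse_version(minimum)
    width = max(len(have), len(want))
    # Pad the shorter tuple with zeros so '0.19' compares equal to '0.19.0'.
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return have >= want


def server_version(base_url="http://localhost:11434", timeout=2.0):
    """Return the version the local server reports, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp)["version"]
    except (urllib.error.URLError, OSError, KeyError, ValueError):
        return None
```

Calling server_version() and passing the result to meets_minimum(version, "0.19") tells you whether the MLX note above applies to your install. A None result usually means the background server isn't running.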
Pull Gemma 4 E2B
ollama pull gemma4:e2b
That's roughly a 2.5GB download at Q4 quantization. Ollama stores models in ~/.ollama/models and manages everything from there.
Want to verify it's there?
ollama list
You should see gemma4:e2b in the output.
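The same check can be done over HTTP, which is handy in setup scripts. This is a sketch that assumes the default port and the GET /api/tags endpoint, which lists locally installed models. The sample payload is illustrative and trimmed to the one field used here.

```python
import json
import urllib.request


def model_names(tags_payload):
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def has_model(tags_payload, wanted):
    """True when `wanted` (for example 'gemma4:e2b') is installed locally."""
    return wanted in model_names(tags_payload)


def fetch_tags(base_url="http://localhost:11434", timeout=5.0):
    """Fetch the installed-model list from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# Illustrative payload only; a real response carries more fields per model.
sample = {"models": [{"name": "gemma4:e2b"}, {"name": "llama3.2:latest"}]}
print(has_model(sample, "gemma4:e2b"))
```

With the server running, has_model(fetch_tags(), "gemma4:e2b") does the real lookup.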
Run It
The quickest test is the CLI:
ollama run gemma4:e2b
That drops you into an interactive prompt. Type anything. Ctrl+D to exit.
For programmatic use, Ollama exposes a REST endpoint at http://localhost:11434. The /api/chat endpoint handles multi-turn conversations:
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:e2b",
  "messages": [{"role": "user", "content": "Explain what a closure is in Python"}],
  "stream": false
}'
Streaming is the default for /api/chat, so drop "stream": false (or set it to true) if you want token-by-token output instead of waiting for the full response.
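With streaming on, /api/chat sends back newline-delimited JSON: one object per line, each carrying a fragment in message.content, with "done": true on the last one. Here's a rough sketch of consuming that from Python using only the standard library. The function names are my own, and the sample lines are hand-written to show the shape rather than captured output.

```python
import json
import urllib.request


def iter_chat_chunks(lines):
    """Yield content fragments from an iterable of NDJSON lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        fragment = event.get("message", {}).get("content", "")
        if fragment:
            yield fragment
        if event.get("done"):
            break


def stream_chat(prompt, model="gemma4:e2b", base_url="http://localhost:11434"):
    """POST to /api/chat with streaming on and yield text as it arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    # The HTTP response object iterates line by line, which matches NDJSON.
    with urllib.request.urlopen(request) as resp:
        yield from iter_chat_chunks(resp)


# Hand-written sample lines showing the assumed response shape.
sample_lines = [
    b'{"message": {"role": "assistant", "content": "A clo"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "sure"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print("".join(iter_chat_chunks(sample_lines)))
```

Against a live server, looping over stream_chat("Explain what a closure is in Python") and printing each fragment with end="" gives the typewriter effect.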
The OpenAI-Compatible API
This is where things get useful. Ollama also serves an OpenAI-compatible endpoint at /v1. That means any tool built for the OpenAI SDK works against your local model with one config change.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but ignored
)

response = client.chat.completions.create(
    model="gemma4:e2b",
    messages=[
        {"role": "system", "content": "You are a code reviewer."},
        {"role": "user", "content": "Review this Python function for edge cases: def divide(a, b): return a / b"},
    ],
)

print(response.choices[0].message.content)
The api_key value doesn't matter because Ollama doesn't check it. It just needs to be non-empty for the SDK.
Gemma 4 adds native system prompt support, which older Gemma models didn't have cleanly. The system role works as expected here. No more prompt-engineering workarounds to get it to follow instructions.
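If you don't want the SDK dependency, the same call is a plain HTTP POST. Below is a minimal sketch of what I'd expect the request to look like, assuming the standard OpenAI chat-completions path under /v1 and a Bearer header carrying the placeholder key. build_chat_request and send are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(messages, model="gemma4:e2b",
                       base_url="http://localhost:11434/v1", api_key="ollama"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder, not validated locally
        },
    )


def send(request, timeout=120.0):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


request = build_chat_request([
    {"role": "system", "content": "You are a code reviewer."},
    {"role": "user", "content": "Review: def divide(a, b): return a / b"},
])
print(request.get_method(), request.full_url)
```

Swapping base_url and api_key is the one config change: the message format, including the system role, stays the same. Call send(request) once the server is up.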
Using It from Node.js
Same pattern, different SDK:
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",
});

const stream = await client.chat.completions.create({
  model: "gemma4:e2b",
  messages: [{ role: "user", content: "Write a SQL query to get the top 10 users by signup date" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
The streaming endpoint works exactly like OpenAI's — chunks come back with delta.content on each event.
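For a sense of what those chunks look like on the wire, here is a small Python sketch (Python to match the earlier SDK example) that parses the stream by hand. It assumes the /v1 endpoint follows OpenAI's server-sent-events convention: lines prefixed with "data:", each holding a JSON chunk, and a final "data: [DONE]" marker. The sample events are hand-written.

```python
import json


def iter_sse_deltas(lines):
    """Yield delta.content strings from OpenAI-style SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Hand-written sample events showing the assumed stream shape.
sample_events = [
    b'data: {"choices": [{"delta": {"role": "assistant", "content": "SEL"}}]}\n',
    b"\n",
    b'data: {"choices": [{"delta": {"content": "ECT"}}]}\n',
    b"\n",
    b"data: [DONE]\n",
]
print("".join(iter_sse_deltas(sample_events)))
```

The SDKs do this parsing for you. Doing it manually is mostly useful when debugging a proxy or a client that doesn't use the official library.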
Hardware Reality Check
Gemma 4 E2B at 4-bit quantization fits comfortably in 8GB unified memory. On a MacBook Pro M3 with 16GB, inference runs at around 40-60 tokens/second in my testing. That's fast enough for interactive use.
If you're on a 16GB+ machine, the E4B variant (about 5.5GB download) is worth the extra size. Noticeably better at reasoning and code tasks.
The 26B MoE model is a different animal. It activates only 4B parameters per inference but still needs 18GB+ RAM to load the full model. Fast once loaded, but the memory floor is real.
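The back-of-envelope math behind these numbers is simple: weights take roughly parameters × bits per weight ÷ 8 bytes. The sketch below applies it to the two parameter counts mentioned in this post. It counts the weights alone and ignores quantization metadata, the KV cache, and runtime overhead, so treat the results as a floor rather than a measurement.

```python
def weight_gigabytes(total_params, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


e2b = weight_gigabytes(5.1e9, 4)  # 5.1B total parameters, from the intro
moe = weight_gigabytes(26e9, 4)   # 26B total parameters for the MoE model

print(f"E2B weights at 4-bit: ~{e2b:.2f} GB")
print(f"26B MoE weights at 4-bit: ~{moe:.1f} GB")
```

That works out to about 2.55 GB for E2B, in line with the roughly 2.5GB download, and about 13 GB for the 26B model. The MoE design cuts compute per token, not the amount of memory needed to hold every expert, which is why the floor stays high.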
For most dev workloads — code review, document parsing, local chatbots, building prototypes — E2B or E4B is the right call.
One Gotcha: Tool Calling
Gemma 4 supports function calling, but early GGUF builds had broken implementations. If you're hitting consistent tool call failures, make sure you pulled the model from the official tag (ollama pull gemma4:e2b), not a community re-upload, and that your Ollama version is current.
ollama --version # should be 0.19 or later as of April 2026, matching the MLX note above
The official Ollama library page for gemma4 lists the exact supported capabilities — check there before assuming the model is broken. Usually it's a stale binary.
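To rule out your own code as the culprit, it helps to test with the smallest possible tool. This sketch assumes /api/chat accepts a "tools" list in the OpenAI function-schema shape and returns requested calls under message.tool_calls. get_weather is a made-up stand-in tool, and the registry and helper names are mine.

```python
import json


def get_weather(city):
    """Hypothetical local tool; a real one would call a weather service."""
    return f"Sunny in {city}"


TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def build_tool_payload(prompt, model="gemma4:e2b"):
    """Request body for POST /api/chat with the tool schema attached."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": TOOLS,
        "stream": False,
    }


def run_tool_calls(message):
    """Execute each tool call in an assistant message; return tool-role replies."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):  # OpenAI-style endpoints send a JSON string
            args = json.loads(args)
        handler = REGISTRY.get(fn["name"])
        if handler is None:
            results.append({"role": "tool", "content": f"unknown tool: {fn['name']}"})
            continue
        results.append({"role": "tool", "content": str(handler(**args))})
    return results
```

If the model's reply to build_tool_payload("What's the weather in Lisbon?") never contains tool_calls, or the arguments don't match the schema, that points at the model build or the Ollama version rather than your dispatch code.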
What It's Actually Good For
I've been running E2B for about two weeks and the honest summary is: code review and refactoring are where it earns its keep. I paste a function, ask it to find edge cases or suggest a cleaner structure, and the output is consistently useful. Not "wow, this changed my life" useful. Just a reliable second set of eyes that costs nothing per query.
Long-document Q&A with the 128K context works well. Paste an entire spec or the extracted text of a PDF and ask specific questions. It tracks context across the document better than I expected at this model size.
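One practical detail: Ollama typically loads a model with a much smaller context window than its maximum, so long documents can get silently truncated unless you raise num_ctx in the request options. The sketch below sizes the window from a crude four-characters-per-token estimate. That ratio, the 131072-token reading of "128K", and the helper names are all my assumptions, and a larger num_ctx costs more memory.

```python
def rough_token_count(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers vary by language and content."""
    return len(text) // chars_per_token + 1


def build_doc_qa_payload(document, question, model="gemma4:e2b",
                         context_limit=131072, reply_budget=1024):
    """Build an /api/chat body whose num_ctx covers the document plus a reply."""
    needed = (rough_token_count(document)
              + rough_token_count(question)
              + reply_budget)
    if needed > context_limit:
        raise ValueError(
            f"estimated {needed} tokens exceeds the {context_limit}-token window"
        )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer using only the document provided."},
            {"role": "user", "content": f"{document}\n\nQuestion: {question}"},
        ],
        "options": {"num_ctx": needed},  # raise the context window for this request
        "stream": False,
    }


payload = build_doc_qa_payload("a" * 4000, "What?")
print(payload["options"])
```

POST the resulting payload to /api/chat as in the earlier curl example. If answers seem to ignore the end of a long document, an undersized context window is the first thing I'd check.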
Multilingual output is solid too. If you're building anything that needs to handle multiple languages locally without sending data to a third-party API, this is a real option now.
Where it falls short: complex multi-step agentic workflows. I tried chaining tool calls across several steps and it broke down. The cloud 31B via Ollama's hosted tier handles it better. E2B is an edge model, not a reasoning frontier.
The setup takes under 10 minutes. If you've been paying for API tokens on tasks that don't actually need cloud-scale intelligence, this is worth an afternoon.