Run Gemma 4 E2B Locally with Ollama: Setup, API, and Real Usage

Most AI setups start with "sign up, get API key, add credit card." Gemma 4 E2B running through Ollama starts with one shell command. That's the actual pitch.

Google dropped Gemma 4 in April 2026 and the E2B variant is the one worth paying attention to for local dev. The "E" stands for effective parameters — 2.3B effective, 5.1B total with embeddings. It runs on 8GB RAM at 4-bit quantization and still handles multimodal input, 128K context, and native function calling. That's a lot of model for hardware you already own.

Here's how to get it running, expose it as a local API, and actually use it.

Install Ollama

Head to ollama.com/download and grab the installer for your OS.

On Linux, the one-liner works:

curl -fsSL https://ollama.com/install.sh | sh

On Mac, unpack the zip and move it to Applications. The server runs in the background automatically after install.

Check that the install worked:

ollama --version

If you're on an Apple Silicon Mac, Ollama v0.19+ automatically uses MLX for inference. You don't have to configure anything. On NVIDIA, it uses CUDA. CPU fallback works too, just slower.

Pull Gemma 4 E2B

ollama pull gemma4:e2b

That's roughly a 2.5GB download at Q4 quantization. Ollama stores models in ~/.ollama/models and manages everything from there.

Want to verify it's there?

ollama list

You should see gemma4:e2b in the output.
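
The same check can be scripted, which is handy in a setup script or CI job. Here's a minimal sketch, assuming the server is on the default port and that GET /api/tags returns a JSON object with a "models" list whose entries carry a "name" field. The helper names (model_names, has_model, fetch_tags) are mine, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address used in this post


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m.get("name", "") for m in tags_response.get("models", [])]


def has_model(tags_response: dict, wanted: str) -> bool:
    """True if `wanted` (e.g. "gemma4:e2b") is among the pulled models."""
    return wanted in model_names(tags_response)


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Ask the local Ollama server which models it has (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With the server up, has_model(fetch_tags(), "gemma4:e2b") should come back True once the pull has finished.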

Run It

The quickest test is the CLI:

ollama run gemma4:e2b

That drops you into an interactive prompt. Type anything. Ctrl+D to exit.

For programmatic use, Ollama exposes a REST endpoint at http://localhost:11434. The /api/chat endpoint handles multi-turn conversations:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:e2b",
  "messages": [{"role": "user", "content": "Explain what a closure is in Python"}],
  "stream": false
}'

Streaming is the default for /api/chat, so drop "stream": false (or set it to true) if you want token-by-token output instead of waiting for the full response.
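
When streaming is on, the native endpoint sends one JSON object per line, each carrying a piece of the reply in message.content, with "done": true on the last one. Here's a rough sketch of consuming that from Python using only the standard library. The function names are my own, and it assumes the default port and the gemma4:e2b tag from above.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/chat response.

    Each non-empty line is one JSON object; the final one has "done": true.
    """
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        piece = event.get("message", {}).get("content", "")
        if piece:
            yield piece
        if event.get("done"):
            break


def stream_chat(prompt: str, model: str = "gemma4:e2b") -> Iterator[str]:
    """POST to the local server and yield text as it arrives (needs Ollama running)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # An HTTP response object iterates line by line, which matches the format.
        yield from iter_content(resp)
```

Usage is a plain loop: for piece in stream_chat("Explain what a closure is in Python"): print(piece, end="", flush=True).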

The OpenAI-Compatible API

This is where things get useful. Ollama also serves an OpenAI-compatible endpoint at /v1. That means any tool built for the OpenAI SDK works against your local model with one config change.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but ignored
)

response = client.chat.completions.create(
    model="gemma4:e2b",
    messages=[
        {"role": "system", "content": "You are a code reviewer."},
        {"role": "user", "content": "Review this Python function for edge cases: def divide(a, b): return a / b"}
    ]
)

print(response.choices[0].message.content)

The api_key value doesn't matter — Ollama has no auth. Just needs to be non-empty for the SDK.

Gemma 4 adds native system prompt support, which older Gemma models didn't have cleanly. The system role works as expected here. No more prompt-engineering workarounds to get it to follow instructions.

Using It from Node.js

Same pattern, different SDK:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",
});

const stream = await client.chat.completions.create({
  model: "gemma4:e2b",
  messages: [{ role: "user", content: "Write a SQL query to get the top 10 users by signup date" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

The streaming endpoint works exactly like OpenAI's — chunks come back with delta.content on each event.
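
For completeness, here's the Python counterpart of that streaming loop. It's a sketch, assuming the openai package (v1 or later) is installed. collect_deltas and stream_sql_answer are illustrative names of mine. The guard for chunks with an empty choices list is defensive, since I haven't confirmed whether Ollama ever sends one.

```python
from typing import Iterable


def collect_deltas(chunks: Iterable) -> str:
    """Join the delta.content pieces of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # defensive: skip chunks that carry no choices
            continue
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def stream_sql_answer() -> str:
    """Same request as the Node.js example; needs `pip install openai` and a running server."""
    from openai import OpenAI  # imported lazily so collect_deltas works without the SDK

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    stream = client.chat.completions.create(
        model="gemma4:e2b",
        messages=[{
            "role": "user",
            "content": "Write a SQL query to get the top 10 users by signup date",
        }],
        stream=True,
    )
    return collect_deltas(stream)
```

If you want live output instead of the joined string, print each piece inside the loop rather than appending it.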

Hardware Reality Check

Gemma 4 E2B at 4-bit quantization fits comfortably in 8GB unified memory. On a MacBook Pro M3 with 16GB, inference runs at around 40-60 tokens/second in my testing. That's fast enough for interactive use.

If you're on a 16GB+ machine, the E4B variant (about 5.5GB download) is worth the extra size. Noticeably better at reasoning and code tasks.

The 26B MoE model is a different animal. It activates only 4B parameters per token but still needs 18GB+ RAM to load the full model. Fast once loaded, but the memory floor is real.
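
The memory floor follows from simple arithmetic: every parameter has to sit in memory, active or not. Here's a back-of-the-envelope sketch, assuming a flat 4 bits per weight. Real quantized files mix precisions, and the KV cache and runtime overhead come on top, so treat these as lower bounds rather than measurements.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Parameter counts are the ones quoted in this post.
for name, params in [("E2B (5.1B total)", 5.1), ("26B MoE", 26.0)]:
    print(f"{name}: ~{weight_size_gb(params, 4):.2f} GB at 4-bit")
```

That works out to about 2.55 GB for E2B, in line with the roughly 2.5GB download, and about 13 GB for the 26B model before any overhead.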

For most dev workloads — code review, document parsing, local chatbots, building prototypes — E2B or E4B is the right call.

One Gotcha: Tool Calling

Function calling is supported by Gemma 4, but early GGUF builds had broken implementations. If you're hitting consistent tool call failures, make sure you pulled the model from the official tag (ollama pull gemma4:e2b) rather than a community re-upload, and that your Ollama version is current.

ollama --version  # should be 0.6.x or later as of April 2026

The official Ollama library page for gemma4 lists the exact supported capabilities — check there before assuming the model is broken. Usually it's a stale binary.
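
To tell a broken build apart from a bug in your own plumbing, it helps to have a minimal round trip you trust. Here's a sketch using the OpenAI-style tools format through the /v1 endpoint. get_weather is a made-up tool for illustration, the other names are mine, and I'm assuming Ollama passes tools through this endpoint for this model.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool for illustration; not part of Ollama or Gemma."""
    return f"Sunny in {city}"


TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call; arguments arrive as a JSON string in the OpenAI format."""
    if name not in REGISTRY:
        return f"error: unknown tool {name!r}"
    try:
        args = json.loads(arguments_json or "{}")
        return str(REGISTRY[name](**args))
    except (json.JSONDecodeError, TypeError) as exc:
        return f"error: bad arguments ({exc})"


def ask_with_tools(question: str) -> list[str]:
    """Send one request with tools attached; needs `pip install openai` and a running server."""
    from openai import OpenAI  # lazy import so run_tool_call works without the SDK

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    response = client.chat.completions.create(
        model="gemma4:e2b",
        messages=[{"role": "user", "content": question}],
        tools=TOOLS,
    )
    calls = response.choices[0].message.tool_calls or []
    return [run_tool_call(c.function.name, c.function.arguments) for c in calls]
```

If ask_with_tools("What's the weather in Oslo?") returns an empty list or error strings every time, that points at the model build or the Ollama version rather than your code.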

What It's Actually Good For

I've been running E2B for about two weeks and the honest summary is: code review and refactoring are where it earns its keep. I paste a function, ask it to find edge cases or suggest a cleaner structure, and the output is consistently useful. Not "wow, this changed my life" useful. Just a reliable second set of eyes that costs nothing per query.

Long document Q&A with the 128K context works well. Paste an entire spec or PDF content and ask specific questions — it tracks context across the document better than I expected at this model size.
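
One thing to watch with long inputs: Ollama doesn't necessarily load a model with its full context window. The window for a request can be set through options.num_ctx on the native API. Here's a sketch, assuming the default on your install is well below 128K (it has varied across versions). The 32768 is an arbitrary illustrative value, and the function names and system prompt wording are mine.

```python
import json
import urllib.request


def build_doc_qa_payload(document: str, question: str,
                         model: str = "gemma4:e2b", num_ctx: int = 32768) -> dict:
    """Build an /api/chat request that asks one question about one document."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer using only the document provided."},
            {"role": "user", "content": f"Document:\n{document}\n\nQuestion: {question}"},
        ],
        "options": {"num_ctx": num_ctx},  # context window in tokens for this request
        "stream": False,
    }


def ask(payload: dict, base_url: str = "http://localhost:11434") -> str:
    """Send the payload and return the reply text (needs a running server)."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

A larger num_ctx costs more memory for the KV cache, so raise it only as far as the document needs.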

Multilingual output is solid too. If you're building anything that needs to handle multiple languages locally without sending data to a third-party API, this is a real option now.

Where it falls short: complex multi-step agentic workflows. I tried chaining tool calls across several steps and it broke down. The cloud 31B via Ollama's hosted tier handles it better. E2B is an edge model, not a frontier reasoning model.

The setup takes under 10 minutes. If you've been paying for API tokens on tasks that don't actually need cloud-scale intelligence, this is worth an afternoon.

Arbind Singh
