Arbind Singh · April 20, 2026 · 5 min read

Run Gemma 4 E2B Locally with Ollama: Setup, API, and Real Usage

How to pull and run Google's Gemma 4 E2B model locally with Ollama, expose it as an OpenAI-compatible endpoint, and wire it into real workflows without touching a cloud API.

Most AI setups start with "sign up, get API key, add credit card." Gemma 4 E2B running through Ollama starts with one shell command. That's the actual pitch.

Google dropped Gemma 4 in April 2026 and the E2B variant is the one worth paying attention to for local dev. The "E" stands for effective parameters — 2.3B effective, 5.1B total with embeddings. It runs on 8GB RAM at 4-bit quantization and still handles multimodal input, 128K context, and native function calling. That's a lot of model for hardware you already own.

Here's how to get it running, expose it as a local API, and actually use it.

Install Ollama

Head to ollama.com/download and grab the installer for your OS.

On Linux, the one-liner works:

curl -fsSL https://ollama.com/install.sh | sh

On Mac, unpack the zip and move it to Applications. The server runs in the background automatically after install.

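Check that the install worked. This prints the installed version and warns if the CLI can't reach the background server: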

ollama --version

If you're on an Apple Silicon Mac, Ollama v0.19+ automatically uses MLX for inference. You don't have to configure anything. On NVIDIA, it uses CUDA. CPU fallback works too, just slower.

Pull Gemma 4 E2B

ollama pull gemma4:e2b

That's roughly a 2.5GB download at Q4 quantization. Ollama stores models in ~/.ollama/models and manages everything from there.

Want to verify it's there?

ollama list

You should see gemma4:e2b in the output.
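If you'd rather check from a script, the same list is available over HTTP. This is a minimal sketch, assuming the default address http://localhost:11434 and Ollama's GET /api/tags endpoint, which returns the locally installed models. The helper names are made up for this example and are not part of any SDK.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default local address


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]


def is_model_installed(name: str, base_url: str = OLLAMA_URL) -> bool:
    """Return True if `name` appears in the local model list."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return name in model_names(json.load(resp))


# Usage (needs a running Ollama server):
#   is_model_installed("gemma4:e2b")
```

Splitting the parsing from the network call keeps the parsing testable without a server.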

Run It

The quickest test is the CLI:

ollama run gemma4:e2b

That drops you into an interactive prompt. Type anything. Press Ctrl+D or type /bye to exit.

For programmatic use, Ollama exposes a REST endpoint at http://localhost:11434. The /api/chat endpoint handles multi-turn conversations:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:e2b",
  "messages": [{"role": "user", "content": "Explain what a closure is in Python"}],
  "stream": false
}'

Streaming is the default on /api/chat, so "stream": false is what makes it wait for the full response. Drop that field, or set "stream": true, if you want token-by-token output instead.
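When streaming is on, /api/chat sends back one JSON object per line, and the last one carries "done": true. Here is a rough standard-library sketch of reading that stream from Python. The function names are my own, and the field names (message.content, done) are taken from Ollama's documented response shape, so double-check them against your version.

```python
import json
from typing import Iterable, Iterator
from urllib.request import Request, urlopen


def iter_chat_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/chat response.

    Each non-empty line is one JSON object; the final one has "done": true.
    """
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        event = json.loads(raw)
        yield event.get("message", {}).get("content", "")
        if event.get("done"):
            break


def stream_chat(prompt: str, model: str = "gemma4:e2b",
                url: str = "http://localhost:11434/api/chat") -> Iterator[str]:
    """POST a single-turn chat and yield the reply as it arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = Request(url, data=body, headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        yield from iter_chat_tokens(resp)  # the response object iterates line by line


# Usage (needs a running Ollama server):
#   for piece in stream_chat("Explain what a closure is in Python"):
#       print(piece, end="", flush=True)
```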

The OpenAI-Compatible API

This is where things get useful. Ollama also serves an OpenAI-compatible endpoint at /v1. That means any tool built for the OpenAI SDK works against your local model with one config change.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but ignored
)

response = client.chat.completions.create(
    model="gemma4:e2b",
    messages=[
        {"role": "system", "content": "You are a code reviewer."},
        {"role": "user", "content": "Review this Python function for edge cases: def divide(a, b): return a / b"}
    ]
)

print(response.choices[0].message.content)

The api_key value doesn't matter because a local Ollama server does no authentication. It just needs to be non-empty to satisfy the SDK.

Gemma 4 adds native system prompt support, which older Gemma models didn't have cleanly. The system role works as expected here. No more prompt-engineering workarounds to get it to follow instructions.
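One thing worth spelling out: the chat endpoint is stateless, so a multi-turn conversation means resending the system prompt and the full history on every call. A small sketch of that bookkeeping follows. The Conversation class is hypothetical, written for this post, and it only builds the arguments you would pass to client.chat.completions.create().

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Keeps a system prompt plus the running message history."""
    system: str
    model: str = "gemma4:e2b"
    history: list[dict] = field(default_factory=list)

    def request(self, user_text: str) -> dict:
        """Record a user turn and return kwargs for chat.completions.create()."""
        self.history.append({"role": "user", "content": user_text})
        return {
            "model": self.model,
            "messages": [{"role": "system", "content": self.system}, *self.history],
        }

    def record_reply(self, assistant_text: str) -> None:
        """Store the model's answer so the next turn includes it."""
        self.history.append({"role": "assistant", "content": assistant_text})


# Usage with the client from the example above:
#   convo = Conversation(system="You are a code reviewer.")
#   reply = client.chat.completions.create(**convo.request("Review divide(a, b)"))
#   convo.record_reply(reply.choices[0].message.content)
```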

Using It from Node.js

Same pattern, different SDK:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",
});

const stream = await client.chat.completions.create({
  model: "gemma4:e2b",
  messages: [{ role: "user", content: "Write a SQL query to get the top 10 users by signup date" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

The streaming endpoint works exactly like OpenAI's — chunks come back with delta.content on each event.
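For the curious, this is roughly what the SDK loop is doing under the hood, sketched in Python to match the earlier examples. It assumes the /v1 endpoint uses OpenAI's server-sent-events framing ("data: {...}" lines ending with "data: [DONE]"), which is what "works exactly like OpenAI's" implies. The function name is my own.

```python
import json
from typing import Iterable, Iterator


def iter_sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield delta.content strings from OpenAI-style streaming lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and anything that isn't a data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if choices:
            text = choices[0].get("delta", {}).get("content")
            if text:  # the first and last chunks often carry no text
                yield text
```

The empty-content guard mirrors the `|| ""` fallback in the Node.js loop.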

Hardware Reality Check

Gemma 4 E2B at 4-bit quantization fits comfortably in 8GB unified memory. On a MacBook Pro M3 with 16GB, inference runs at around 40-60 tokens/second in my testing. That's fast enough for interactive use.

If you're on a 16GB+ machine, the E4B variant (about 5.5GB download) is worth the extra size. Noticeably better at reasoning and code tasks.

The 26B MoE model is a different animal. It activates only 4B parameters per inference but still needs 18GB+ RAM to load the full model. Fast once loaded, but the memory floor is real.

For most dev workloads — code review, document parsing, local chatbots, building prototypes — E2B or E4B is the right call.
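A quick back-of-envelope check on those sizes: weights take roughly parameters × bits per weight ÷ 8 bytes. The sketch below applies that to the parameter counts quoted in this post. It covers weights only, so it ignores the KV cache and runtime overhead, and real Q4 files usually average slightly more than 4 bits per weight. Treat the numbers as a floor, not a measurement.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# E2B: 5.1B total parameters at 4 bits per weight
print(f"E2B weights:     {weight_size_gb(5.1, 4):.2f} GB")

# 26B MoE: every expert has to be loaded, even if only ~4B are active per token
print(f"26B MoE weights: {weight_size_gb(26, 4):.2f} GB")
```

The first figure lines up with the roughly 2.5GB download, and the second shows why the 26B model's memory floor sits well above what its active parameter count suggests.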

One Gotcha: Tool Calling

Gemma 4 supports function calling, but early GGUF builds had broken implementations. If tool calls fail consistently, make sure you pulled gemma4:e2b from the official Ollama library tag rather than a community re-upload, and that your Ollama version is current.

ollama --version  # should be a current release as of April 2026 (0.19+ if you rely on the MLX path mentioned earlier)
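If you want to automate that check, keep in mind that version strings compare per component, not as text, so 0.19 is newer than 0.6. A small sketch follows. The minimum version here is an assumption borrowed from the MLX note earlier in this post, not an official requirement for Gemma 4, so adjust it to whatever the model's library page says.

```python
import re

MIN_VERSION = (0, 19, 0)  # assumption: taken from the MLX note in this post


def parse_version(text: str) -> tuple[int, ...]:
    """Pull the first dotted version number out of a string like the CLI output."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    parts = [int(p) for p in match.group().split(".")]
    parts += [0] * (3 - len(parts))  # pad "0.19" to (0, 19, 0)
    return tuple(parts)


def is_current(text: str, minimum: tuple[int, ...] = MIN_VERSION) -> bool:
    """True if the version found in `text` is at least `minimum`."""
    return parse_version(text) >= minimum
```

Feed it the output of ollama --version, or the "version" field from GET /api/version.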

The official Ollama library page for gemma4 lists the exact supported capabilities — check there before assuming the model is broken. Usually it's a stale binary.
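To separate "the model is broken" from "my request is wrong", it helps to test with the smallest possible tool. The sketch below builds an /api/chat body with one tool and dispatches whatever tool_calls come back. The add tool and the helper names are hypothetical. The request and response shapes follow Ollama's documented tool format as I understand it, where arguments arrive as an object on the native API and as a JSON string on OpenAI-style endpoints.

```python
import json


def add(a: float, b: float) -> float:
    """Hypothetical tool for the example: add two numbers."""
    return a + b


TOOLS = {"add": add}

ADD_SCHEMA = {
    "type": "function",
    "function": {
        "name": "add",
        "description": "Add two numbers",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
            "required": ["a", "b"],
        },
    },
}


def build_tool_request(prompt: str, model: str = "gemma4:e2b") -> dict:
    """Body for POST /api/chat that offers the model a single tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [ADD_SCHEMA],
        "stream": False,
    }


def run_tool_calls(message: dict) -> list[dict]:
    """Execute each tool call in an assistant message; return tool-role messages."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        name, args = fn["name"], fn.get("arguments") or {}
        if isinstance(args, str):  # OpenAI-style endpoints send a JSON string
            args = json.loads(args)
        if name not in TOOLS:
            results.append({"role": "tool", "content": f"unknown tool: {name}"})
            continue
        results.append({"role": "tool", "content": str(TOOLS[name](**args))})
    return results
```

If the assistant message comes back with no tool_calls at all for a prompt like "What is 2 + 3? Use the add tool.", that points at the model build or the Ollama version rather than your dispatch code.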

What It's Actually Good For

I've been running E2B for about two weeks and the honest summary is: code review and refactoring are where it earns its keep. I paste a function, ask it to find edge cases or suggest a cleaner structure, and the output is consistently useful. Not "wow, this changed my life" useful. Just a reliable second set of eyes that costs nothing per query.

Long-document Q&A with the 128K context works well. Paste in an entire spec or the text of a PDF and ask specific questions. It tracks context across the document better than I expected at this model size.
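One practical note for long documents: Ollama doesn't necessarily allocate the model's full context window by default. The window is set per request through the num_ctx option, and bigger values cost more memory. Here is a hedged sketch that sizes num_ctx to the document. The 4-characters-per-token ratio is a crude assumption, 131,072 is my reading of "128K", and the helper names are invented for this example.

```python
MAX_CTX = 131_072  # assumption: 128K tokens, the context length quoted in this post


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the chars-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


def build_doc_qa_request(document: str, question: str,
                         model: str = "gemma4:e2b",
                         reply_budget: int = 1024) -> dict:
    """Build an /api/chat body whose num_ctx is large enough for the document."""
    needed = estimate_tokens(document) + estimate_tokens(question) + reply_budget
    if needed > MAX_CTX:
        raise ValueError(f"about {needed} tokens needed, over the {MAX_CTX} limit")
    num_ctx = -(-needed // 1024) * 1024  # round up to a multiple of 1024
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer using only the document provided."},
            {"role": "user", "content": f"{document}\n\nQuestion: {question}"},
        ],
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
```

Sizing the window to the input, rather than always asking for the maximum, keeps memory use down on an 8GB machine.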

Multilingual output is solid too. If you're building anything that needs to handle multiple languages locally without sending data to a third-party API, this is a real option now.

Where it falls short: complex multi-step agentic workflows. I tried chaining tool calls across several steps and it broke down. The 31B model on Ollama's hosted cloud tier handles that better. E2B is an edge model, not a frontier reasoning model.


The setup takes under 10 minutes. If you've been paying for API tokens on tasks that don't actually need cloud-scale intelligence, this is worth an afternoon.
