Bringing Gemma 4 On-Device to Memex
Memex is local-first — all data stays on your device. But until now, the AI features still required a cloud LLM provider. With Gemma 4, we've added a fully offline option: zero cloud dependency, multimodal inference (text, image, audio), and function calling — all running natively on Android via Google's LiteRT-LM runtime.
This post covers the architecture, the key technical challenges, and an honest look at where on-device inference stands today for production use.
Full implementation: github.com/memex-lab/memex/pull/4
Why on-device matters
Memex handles personal records — text notes, photos, voice memos. For users who care deeply about privacy, even sending prompts to a trusted cloud provider is a compromise. On-device inference eliminates that entirely: the model runs on the phone, the data never leaves, and the app works without internet.
Gemma 4 E2B/E4B was the right fit: it supports multimodal input, function calling for our agent tool system, and runs on consumer Android hardware. We've tested with E4B (~3.7GB) on mid-range devices.
Architecture: Kotlin API via Platform Channels
We initially evaluated flutter_gemma, a third-party Flutter plugin. It proved unstable — crashes severe enough to occasionally reboot the device. Google's own Edge Gallery app, which calls the LiteRT-LM Kotlin API directly, ran the same model without issues on the same hardware.
The lesson was clear: for on-device LLM inference, call the native API directly. We built a thin Platform Channel bridge between Dart and Kotlin:
- Kotlin side: MethodChannel for lifecycle control (init, close, infer, cancel), with a reverse callback channel to stream tokens back to Dart
- Dart side: GemmaLocalClient implements the same LLMClient interface as our cloud providers — the agent system doesn't know or care whether it's talking to GPT-4 or a local model
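The boundary can be sketched like this (the names here are illustrative, not taken from the PR; the real interface lives on the Dart side, but the shape is the same):

```kotlin
// Illustrative sketch: local and cloud clients implement one interface, so
// the agent layer never branches on which backend it is talking to.
interface LLMClient {
    fun complete(prompt: String): String
}

// Stand-ins for the real cloud and on-device clients.
class CloudStub : LLMClient {
    override fun complete(prompt: String) = "cloud:$prompt"
}

class LocalStub : LLMClient {
    override fun complete(prompt: String) = "local:$prompt"
}

// The agent code only ever sees LLMClient.
fun runAgentStep(client: LLMClient, prompt: String): String = client.complete(prompt)
```

Swapping backends is then a constructor-injection decision, not a code change in the agents.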
The critical design pattern is Engine singleton, Conversation per-request. Loading a multi-GB model into GPU memory takes seconds. Creating a Conversation for each inference call takes milliseconds. The Engine lives for the app's lifetime; Conversations are created and disposed per request.
```kotlin
// Engine: created once, expensive
val engine = Engine(EngineConfig(
    modelPath = modelPath,
    backend = Backend.GPU(),
    maxNumTokens = 10000,
    cacheDir = context.cacheDir.absolutePath,
))

// Conversation: created per request, cheap
engine.createConversation(config).use { conversation ->
    conversation.sendMessageAsync(contents)
        .collect { message -> /* stream to Dart */ }
}
```

Serializing multi-agent access
Memex runs multiple agents in parallel — card generation, knowledge extraction, asset analysis — all potentially calling the LLM simultaneously. LiteRT-LM enforces a hard constraint: one active Conversation per Engine. Violating this causes native crashes.
We serialize all access behind a Dart-side global mutex. The lock is acquired before Engine initialization and held until the inference stream closes. This is important because different request types may need different Engine configurations — image analysis requires a CPU vision backend, audio needs an audio backend. Without the lock covering initialization, concurrent backend switches crash the native layer.
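The rule can be sketched in plain Kotlin with a single fair permit (the real lock lives on the Dart side; InferenceGate and withExclusiveEngine are illustrative names, not LiteRT-LM API):

```kotlin
import java.util.concurrent.Semaphore

// Sketch: one fair permit covers both engine (re)configuration and the
// inference itself, so a concurrent backend switch can never interleave
// with an in-flight request.
object InferenceGate {
    private val permit = Semaphore(1, true) // fair: serve requests in arrival order

    fun <T> withExclusiveEngine(backend: String, infer: (String) -> T): T {
        permit.acquire()
        try {
            // (re)initialize the engine for `backend` here if it changed — elided
            return infer(backend)
        } finally {
            permit.release()
        }
    }
}
```

The key point is the scope of the critical section: acquiring before initialization, not just before inference, is what prevents the backend-switch crashes.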
Multimodal: what we discovered
Multimodal support works, but comes with undocumented constraints we discovered through testing:
Images:
- Only JPEG and PNG are accepted — WebP is silently rejected
- The runtime limits input to 2520 image patches; larger images segfault during prefill. We cap the longest side at 896px.
- On MediaTek chipsets, the GPU vision backend crashes during decode, so we run vision on the CPU backend while the GPU still handles text inference.
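The 896px cap reduces to a pure function (capLongestSide is an illustrative name; the actual resize then goes through Android's Bitmap APIs):

```kotlin
import kotlin.math.roundToInt

// Sketch: scale so the longest side never exceeds maxSide, preserving
// aspect ratio. Images already under the cap pass through unchanged.
fun capLongestSide(width: Int, height: Int, maxSide: Int = 896): Pair<Int, Int> {
    val longest = maxOf(width, height)
    if (longest <= maxSide) return width to height
    val scale = maxSide.toDouble() / longest
    return (width * scale).roundToInt() to (height * scale).roundToInt()
}
```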
Audio:
- Only WAV/PCM is supported — M4A, AAC, and MP3 all fail at the decoder level
- We transcode on the Kotlin side using Android's MediaExtractor + MediaCodec, resampling to 16kHz mono 16-bit PCM
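The conversion step after decoding can be sketched in plain Kotlin (toMono16k is an illustrative name; on device the decode itself goes through MediaExtractor + MediaCodec, and this covers only the channel and sample-rate conversion that follows):

```kotlin
// Sketch: downmix interleaved 16-bit PCM to mono by averaging channels per
// frame, then linearly resample to 16 kHz.
fun toMono16k(samples: ShortArray, channels: Int, srcRate: Int): ShortArray {
    val frames = samples.size / channels
    val mono = ShortArray(frames) { f ->
        var acc = 0
        for (c in 0 until channels) acc += samples[f * channels + c]
        (acc / channels).toShort()
    }
    if (srcRate == 16_000) return mono
    // Linear-interpolation resample; fine for speech, crude for music.
    val outLen = (frames.toLong() * 16_000 / srcRate).toInt()
    return ShortArray(outLen) { i ->
        val pos = i.toDouble() * srcRate / 16_000
        val i0 = pos.toInt().coerceAtMost(frames - 1)
        val i1 = (i0 + 1).coerceAtMost(frames - 1)
        val frac = pos - i0
        ((1 - frac) * mono[i0] + frac * mono[i1]).toInt().toShort()
    }
}
```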
Thinking mode:
- Gemma 4's thinking mode improves reasoning for text-only tasks, but crashes when combined with vision input on some devices
- We auto-detect multimodal content and disable thinking for those requests
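A minimal sketch of that guard, with illustrative Part and RequestConfig types standing in for the real request model:

```kotlin
// Sketch: a request is a list of parts; thinking stays enabled only when
// every part is text, since thinking plus vision crashes on some devices.
sealed class Part {
    data class Text(val text: String) : Part()
    data class Image(val path: String) : Part()
    data class Audio(val path: String) : Part()
}

data class RequestConfig(val enableThinking: Boolean)

fun configFor(parts: List<Part>, wantThinking: Boolean): RequestConfig {
    val multimodal = parts.any { it !is Part.Text }
    return RequestConfig(enableThinking = wantThinking && !multimodal)
}
```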
Where it works, where it doesn't
After running Gemma 4 E4B inside Memex's full multi-agent pipeline, here's an honest assessment:
Works well:
- Image description — reliably identifies scene content, reads text in images, recognizes UI elements
- Audio transcription — Mandarin Chinese recognition is usable for short voice notes
- Unstructured text — summaries, insights, and narrative generation are reasonable for a model this size
Needs improvement:
- Function calling frequently produces malformed JSON — missing quotes, wrong nesting. We catch parse errors at the Kotlin layer and return the raw text so the agent can retry.
- Structured fields like IDs and file paths are often hallucinated. We never trust model output for these — always validate against ground truth from agent state.
- Thermal throttling is real. Sustained inference generates significant heat, triggering system-level CPU/GPU frequency reduction.
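The two output fallbacks can be sketched together. A real implementation parses with a proper JSON library, so the brace check here is only a stand-in, and all names are ours:

```kotlin
// Sketch: tool output either parses (with hallucinated IDs filtered against
// ground truth from agent state) or falls back to raw text for a retry.
sealed class ToolCallResult {
    data class Parsed(val json: String, val validIds: List<String>) : ToolCallResult()
    data class RawText(val text: String) : ToolCallResult() // agent retries
}

fun handleToolOutput(
    raw: String,
    knownIds: Set<String>,
    extractIds: (String) -> List<String>,
): ToolCallResult {
    // Stand-in validity check; use a real JSON parser in practice.
    val looksLikeJson = raw.trim().let { it.startsWith("{") && it.endsWith("}") }
    if (!looksLikeJson) return ToolCallResult.RawText(raw)
    val ids = extractIds(raw).filter { it in knownIds } // drop hallucinated IDs
    return ToolCallResult.Parsed(raw, ids)
}
```

Nothing downstream ever acts on an ID the agent state cannot confirm.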
Performance
Tested on Redmi Pad (Dimensity 8100) with Gemma 4 E4B (~3.7GB):
- Text inference: ~15–20 tokens/sec (GPU)
- Image analysis: 5–8 seconds per image (CPU vision)
- Audio transcription: ~0.3x realtime (CPU audio)
- Engine initialization: ~8–10 seconds first load, cached after
For a fully offline use case, this is acceptable.
What we learned
Call native APIs directly. Third-party Flutter wrappers for on-device LLM add an abstraction layer that hides bugs and makes debugging nearly impossible. The native Kotlin API is more work upfront but dramatically more stable.
Serialize everything. Engine initialization and inference must both be behind a single global lock. On-device models don't handle concurrency — design for sequential access from the start.
Build fallbacks for structured output. Small on-device models will hallucinate field values that cloud models get right. Any field that needs to be correct — IDs, paths, references — must be validated against ground truth.
Test each modality independently. Image, audio, and thinking mode each have their own undocumented constraints and device-specific behaviors. Combining them before understanding each one individually leads to hard-to-diagnose crashes.
On-device LLM inference is real and usable today for specific tasks. It's not a replacement for cloud models yet — structured output and sustained performance need to improve. But for privacy-sensitive use cases where data must never leave the device, it's a meaningful step forward.
The full implementation is open source: github.com/memex-lab/memex