Lessons from Running Gemma 4 On-Device in a Multi-Agent Flutter App

We spent a little over a week getting Gemma 4 E4B to run stably inside a Flutter app with multi-agent concurrency. Most of that week went not to the integration itself but to fighting undocumented constraints in LiteRT-LM: WebP silently rejected, a 2520 image-patch ceiling that segfaults the prefill, GPU vision decode that crashes on MediaTek, thinking mode plus vision crashing on some devices, and a one-active-Conversation-per-Engine rule that forces serial access from a parallel agent system.

This is a writeup of what we hit and what we'd tell ourselves at the start. The full implementation is open source: github.com/memex-lab/memex/pull/4.

Context: this runs inside Memex, an open-source local-first AI journal app — but everything below applies to any Flutter app integrating an on-device LLM via LiteRT-LM.

Why on-device at all

The app handles personal records — text, photos, voice memos. For users who care about privacy, even sending prompts to a trusted cloud provider is a compromise. On-device inference eliminates that: the model runs on the phone, the data never leaves, and the app works offline.

Gemma 4 E2B/E4B was the right fit: multimodal input, function calling, and small enough to run on consumer Android hardware. We tested with E4B (~3.7GB) on mid-range devices.

Architecture: Kotlin API via Platform Channels

We initially evaluated flutter_gemma, a third-party Flutter plugin. It proved unstable — crashes severe enough to occasionally reboot the device. Google's own Edge Gallery app, which calls the LiteRT-LM Kotlin API directly, ran the same model without issues on the same hardware.

The lesson was clear: for on-device LLM inference, call the native API directly. We built a thin Platform Channel bridge between Dart and Kotlin:

  • Kotlin side: MethodChannel for lifecycle control (init, close, infer, cancel), with a reverse callback channel to stream tokens back to Dart
  • Dart side: GemmaLocalClient implements the same LLMClient interface as our cloud providers — the agent system doesn't know or care whether it's talking to GPT-4 or a local model
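
A minimal sketch of how the Kotlin side of such a bridge might route the four lifecycle calls. `BridgeCall`, `LlmBridge`, and the string replies are hypothetical stand-ins so the example runs without Flutter. In a real app the dispatch would sit inside a MethodChannel call handler, and `emitToken` would forward tokens over the reverse channel.

```kotlin
// Hypothetical stand-in for a Flutter method call: a method name plus arguments.
data class BridgeCall(val method: String, val args: Map<String, Any?> = emptyMap())

// Routes lifecycle calls; `emitToken` plays the role of the reverse callback channel.
class LlmBridge(private val emitToken: (String) -> Unit) {
    private var initialized = false

    fun handle(call: BridgeCall): String = when (call.method) {
        "init" -> { initialized = true; "ok" }
        "infer" -> {
            if (!initialized) {
                "error: not initialized"
            } else {
                // Placeholder for real inference: echo the prompt word by word.
                val prompt = call.args["prompt"] as? String ?: ""
                prompt.split(" ").filter { it.isNotEmpty() }.forEach(emitToken)
                "done"
            }
        }
        "cancel" -> "cancelled"
        "close" -> { initialized = false; "closed" }
        else -> "error: not implemented"
    }
}
```

Keeping the routing in one `when` makes it easy to see that every method the Dart side can call has exactly one native entry point.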

The critical design pattern is a singleton Engine with one Conversation per request. Loading a multi-GB model into GPU memory takes seconds, while creating a Conversation for an inference call takes milliseconds. So the Engine lives for the app's lifetime, and Conversations are created and disposed per request.

// Engine: created once, expensive
val engine = Engine(EngineConfig(
    modelPath = modelPath,
    backend = Backend.GPU(),
    maxNumTokens = 10000,
    cacheDir = context.cacheDir.absolutePath,
))

// Conversation: created per request, cheap
engine.createConversation(config).use { conversation ->
    conversation.sendMessageAsync(contents)
        .collect { message -> /* stream to Dart */ }
}

Serializing multi-agent access

Memex runs multiple agents in parallel — card generation, knowledge extraction, asset analysis — all potentially calling the LLM simultaneously. LiteRT-LM enforces a hard constraint: one active Conversation per Engine. Violating this causes native crashes.

We serialize all access behind a Dart-side global mutex. The lock is acquired before Engine initialization and held until the inference stream closes. This is important because different request types may need different Engine configurations — image analysis requires a CPU vision backend, audio needs an audio backend. Without the lock covering initialization, concurrent backend switches crash the native layer.
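
The same idea can be sketched on the JVM. This is an illustrative Kotlin analog, not the Dart mutex from the app: `InferenceGate`, `VisionMode`, and `initCount` are hypothetical names, and the "engine" is only a counter. What it shows is the shape of the lock: the backend switch and the inference call sit inside one critical section.

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

// Hypothetical backend selection; a real app would switch engine configuration per request type.
enum class VisionMode { TEXT_ONLY, CPU_VISION }

// Hypothetical gate: one lock covers both (re)initialization and the whole inference call.
class InferenceGate {
    private val lock = ReentrantLock()
    private var current: VisionMode? = null

    // Counts how many times the stand-in engine was (re)configured.
    var initCount = 0
        private set

    fun <T> withEngine(mode: VisionMode, infer: () -> T): T = lock.withLock {
        if (current != mode) {
            // Stand-in for tearing down and re-creating the engine with a different backend.
            current = mode
            initCount++
        }
        // Still inside the lock: no other request can switch backends mid-inference.
        infer()
    }
}
```

With a streaming API, `infer` would need to block (or the lock be released in a completion callback) until the stream closes; releasing earlier would reopen the race the lock is meant to prevent.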

Multimodal: what we discovered

Multimodal support works, but comes with undocumented constraints we discovered through testing:

Images:

  • Only JPEG and PNG are accepted — WebP is silently rejected
  • There is a ceiling of 2520 image patches; images that exceed it segfault during prefill. We cap the longest side at 896px.
  • On MediaTek chipsets, the GPU vision backend crashes during decode. CPU backend is stable for vision while GPU handles text inference.
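
A rough sketch of the two image pre-checks described above: sniffing the container format from its magic bytes, and computing target dimensions with the longest side capped at 896px. The function names are hypothetical, and the actual decode and resize step (for example with Android's Bitmap APIs) is left out. How 896px maps onto the patch ceiling depends on the model's patch size, which this sketch does not assume.

```kotlin
// Formats the post reports as accepted (JPEG, PNG) versus rejected (WebP), told apart by magic bytes.
enum class ImageFormat { JPEG, PNG, WEBP, UNKNOWN }

fun sniffFormat(bytes: ByteArray): ImageFormat {
    fun at(i: Int) = bytes[i].toInt() and 0xFF
    return when {
        bytes.size >= 3 && at(0) == 0xFF && at(1) == 0xD8 && at(2) == 0xFF -> ImageFormat.JPEG
        bytes.size >= 8 && at(0) == 0x89 && at(1) == 0x50 && at(2) == 0x4E && at(3) == 0x47 -> ImageFormat.PNG
        bytes.size >= 12 &&
            String(bytes, 0, 4, Charsets.US_ASCII) == "RIFF" &&
            String(bytes, 8, 4, Charsets.US_ASCII) == "WEBP" -> ImageFormat.WEBP
        else -> ImageFormat.UNKNOWN
    }
}

// Scales (width, height) so the longest side is at most maxSide, preserving aspect ratio.
fun fitWithin(width: Int, height: Int, maxSide: Int = 896): Pair<Int, Int> {
    require(width > 0 && height > 0 && maxSide > 0)
    val longest = maxOf(width, height)
    if (longest <= maxSide) return width to height
    val w = maxOf(1, (width.toLong() * maxSide / longest).toInt())
    val h = maxOf(1, (height.toLong() * maxSide / longest).toInt())
    return w to h
}
```

Checking the bytes rather than the file extension matters because a silently rejected WebP can easily arrive with a `.jpg` name.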

Audio:

  • Only WAV/PCM is supported — M4A, AAC, and MP3 all fail at the decoder level
  • We transcode on the Kotlin side using Android's MediaExtractor + MediaCodec, resampling to 16kHz mono 16-bit PCM
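
Once MediaCodec has produced raw samples, the last step is usually wrapping them in a WAV container. A minimal sketch, assuming the decoded data is already 16kHz mono 16-bit little-endian PCM; `pcmToWav` is a hypothetical helper, and whether the runtime wants the header or bare PCM is something to verify against your own build.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Wraps raw little-endian PCM samples in a canonical 44-byte RIFF/WAVE header.
fun pcmToWav(
    pcm: ByteArray,
    sampleRate: Int = 16_000,
    channels: Int = 1,
    bitsPerSample: Int = 16,
): ByteArray {
    val blockAlign = channels * bitsPerSample / 8   // bytes per sample frame
    val byteRate = sampleRate * blockAlign          // bytes per second of audio
    val out = ByteBuffer.allocate(44 + pcm.size).order(ByteOrder.LITTLE_ENDIAN)
    out.put("RIFF".toByteArray(Charsets.US_ASCII))
    out.putInt(36 + pcm.size)                       // RIFF chunk size: everything after this field
    out.put("WAVE".toByteArray(Charsets.US_ASCII))
    out.put("fmt ".toByteArray(Charsets.US_ASCII))
    out.putInt(16)                                  // fmt chunk size for plain PCM
    out.putShort(1.toShort())                       // audio format 1 = uncompressed PCM
    out.putShort(channels.toShort())
    out.putInt(sampleRate)
    out.putInt(byteRate)
    out.putShort(blockAlign.toShort())
    out.putShort(bitsPerSample.toShort())
    out.put("data".toByteArray(Charsets.US_ASCII))
    out.putInt(pcm.size)
    out.put(pcm)
    return out.array()
}
```

At 16kHz mono 16-bit, one second of audio is 32,000 bytes of PCM, which gives a quick sanity check on transcoder output size.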

Thinking mode:

  • Gemma 4's thinking mode improves reasoning for text-only tasks, but crashes when combined with vision input on some devices
  • We auto-detect multimodal content and disable thinking for those requests
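
The auto-detection can be as simple as a predicate over the request parts. A sketch under stated assumptions: the `Part` hierarchy and `shouldEnableThinking` are hypothetical names, and treating audio the same as images is a conservative guess, since the crash described above was observed with vision input.

```kotlin
// Hypothetical content model; the real request types live in the app's client code.
sealed interface Part
class TextPart(val text: String) : Part
class ImagePart(val bytes: ByteArray) : Part
class AudioPart(val bytes: ByteArray) : Part

// Thinking stays on only when it was requested and every part is plain text.
fun shouldEnableThinking(parts: List<Part>, requested: Boolean = true): Boolean =
    requested && parts.all { it is TextPart }
```

Making the decision in one place, right before the request is built, avoids relying on each agent to remember the restriction.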

Four things we'd tell ourselves at the start

1. Call native APIs directly and skip the Flutter wrappers. Third-party Flutter plugins for on-device LLM add an abstraction layer that hides bugs and makes debugging nearly impossible. flutter_gemma caused crashes severe enough to occasionally reboot the device. The native Kotlin API is more work upfront but dramatically more stable. If you're shipping on-device LLM in Flutter today, write the Platform Channel yourself.

2. Serialize everything, including init. Engine initialization and inference must both sit behind a single global lock. On-device models don't handle concurrency, and switching backends mid-flight (vision CPU ↔ text GPU) under concurrent load crashes the native layer. Design for sequential access from the start.

3. Never trust structured output for anything that must be correct. Function-call JSON from Gemma 4 E4B is malformed often enough that you need a parse-error fallback. IDs, file paths, and references are routinely hallucinated. Validate every structured field against ground truth from agent state; don't round-trip these values through the model.

4. Test each modality in isolation before combining. Image, audio, and thinking mode each have their own undocumented constraints and device-specific behaviors. Combining them before understanding each one individually leads to hard-to-diagnose crashes. The thinking-mode-plus-vision segfault took us a week to isolate because we changed multiple inputs at once.

Where it works, where it doesn't

After running Gemma 4 E4B inside a full multi-agent pipeline, an honest assessment:

Works well:

  • Image description — reliably identifies scene content, reads text in images, recognizes UI elements
  • Audio transcription — Mandarin recognition is usable for short voice notes
  • Unstructured text — summaries, insights, and narrative generation are reasonable for a model this size

Needs improvement:

  • Function calling produces malformed JSON frequently — missing quotes, wrong nesting. We catch parse errors at the Kotlin layer and return raw text so the agent can retry.
  • Structured fields like IDs and file paths are often hallucinated. We never trust model output for these.
  • Thermal throttling is real. Sustained inference generates significant heat, triggering system-level CPU/GPU frequency reduction.
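
For the hallucinated IDs and paths mentioned above, the check against ground truth can be a plain membership test. A minimal sketch, assuming the agent keeps a set of the IDs it actually created; `Checked`, `Accepted`, `Rejected`, and `checkId` are hypothetical names for illustration.

```kotlin
// Outcome of checking a model-produced reference against what the agent already knows.
sealed interface Checked
data class Accepted(val id: String) : Checked
data class Rejected(val raw: String, val reason: String) : Checked

// Accept an ID only if it exactly matches one the agent state already holds.
fun checkId(modelOutput: String, knownIds: Set<String>): Checked {
    // Tolerate stray whitespace and surrounding quotes, nothing more.
    val candidate = modelOutput.trim().trim('"')
    return if (candidate in knownIds) Accepted(candidate)
    else Rejected(modelOutput, "not present in agent state")
}
```

A rejected value can trigger a retry or fall back to the ID the agent already holds, so a hallucinated reference never reaches storage.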

Performance

Tested on Redmi Pad (Dimensity 8100) with Gemma 4 E4B (~3.7GB):

  • Text inference: ~15–20 tokens/sec (GPU)
  • Image analysis: 5–8 seconds per image (CPU vision)
  • Audio transcription: ~0.3x realtime (CPU audio)
  • Engine initialization: ~8–10 seconds first load, cached after

For a fully offline use case, this is acceptable.

Bottom line

On-device LLM inference is real and usable today for specific tasks. It is not yet a replacement for cloud models — structured output and sustained performance need to improve. But for use cases where data must never leave the device, it's a meaningful step forward, and the constraints above are tractable once you know to design around them.

Full source: github.com/memex-lab/memex. Happy to answer questions in the comments — particularly interested in how other people are handling the structured-output-from-small-models problem, since that's still the biggest gap for us.