Lessons from Running Gemma 4 On-Device in a Multi-Agent Flutter App
We spent a little over a week getting Gemma 4 E4B to run stably inside a Flutter app with multi-agent concurrency. Most of that week went not to the integration itself but to fighting undocumented constraints in LiteRT-LM: WebP silently rejected, a 2520 image-patch ceiling that segfaults the prefill, GPU vision decode that crashes on MediaTek, thinking-mode-plus-vision that crashes on some devices, and a one-active-Conversation-per-Engine rule that forces serial access from a parallel agent system.
This is a writeup of what we hit and what we'd tell ourselves at the start. The full implementation is open source: github.com/memex-lab/memex/pull/4.
Context: this runs inside Memex, an open-source local-first AI journal app — but everything below applies to any Flutter app integrating an on-device LLM via LiteRT-LM.
Why on-device at all
The app handles personal records — text, photos, voice memos. For users who care about privacy, even sending prompts to a trusted cloud provider is a compromise. On-device inference eliminates that: the model runs on the phone, the data never leaves, and the app works offline.
Gemma 4 E2B/E4B was the right fit: multimodal input, function calling, and small enough to run on consumer Android hardware. We tested with E4B (~3.7GB) on mid-range devices.
Architecture: Kotlin API via Platform Channels
We initially evaluated flutter_gemma, a third-party Flutter plugin. It proved unstable — crashes severe enough to occasionally reboot the device. Google's own Edge Gallery app, which calls the LiteRT-LM Kotlin API directly, ran the same model without issues on the same hardware.
The lesson was clear: for on-device LLM inference, call the native API directly. We built a thin Platform Channel bridge between Dart and Kotlin:
- Kotlin side: MethodChannel for lifecycle control (init, close, infer, cancel), with a reverse callback channel to stream tokens back to Dart
- Dart side: GemmaLocalClient implements the same LLMClient interface as our cloud providers, so the agent system doesn't know or care whether it's talking to GPT-4 or a local model
The critical design pattern is Engine singleton, Conversation per-request. Loading a multi-GB model into GPU memory takes seconds. Creating a Conversation for each inference call takes milliseconds. The Engine lives for the app's lifetime; Conversations are created and disposed per request.
    // Engine: created once, expensive
    val engine = Engine(EngineConfig(
        modelPath = modelPath,
        backend = Backend.GPU(),
        maxNumTokens = 10000,
        cacheDir = context.cacheDir.absolutePath,
    ))

    // Conversation: created per request, cheap
    engine.createConversation(config).use { conversation ->
        conversation.sendMessageAsync(contents)
            .collect { message -> /* stream to Dart */ }
    }

Serializing multi-agent access
Memex runs multiple agents in parallel — card generation, knowledge extraction, asset analysis — all potentially calling the LLM simultaneously. LiteRT-LM enforces a hard constraint: one active Conversation per Engine. Violating this causes native crashes.
We serialize all access behind a Dart-side global mutex. The lock is acquired before Engine initialization and held until the inference stream closes. Covering initialization matters because different request types may need different Engine configurations: image analysis requires a CPU vision backend, and audio needs an audio backend. If the lock did not cover initialization, concurrent backend switches would crash the native layer.
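Our lock lives on the Dart side, but the same idea can be sketched in Kotlin for anyone who prefers to enforce it next to the native call. This is a minimal sketch, not our implementation: InferenceGate, Profile, and the load/close callbacks are hypothetical names, and the engine type is left generic so nothing here depends on LiteRT-LM. It shows the one property that matters: backend switching and inference happen under the same lock.

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

// Hypothetical backend profiles; illustrative names, not LiteRT-LM types.
enum class Profile { TEXT_GPU, VISION_CPU, AUDIO_CPU }

/**
 * Serializes engine (re)configuration and inference behind one lock.
 * `load` stands in for the expensive native engine setup, `close` for teardown.
 */
class InferenceGate<E>(
    private val load: (Profile) -> E,
    private val close: (E) -> Unit = {},
) {
    private val lock = ReentrantLock()
    private var current: Pair<Profile, E>? = null

    /** Runs `infer` with an engine for `profile`; callers queue up one at a time. */
    fun <R> withEngine(profile: Profile, infer: (E) -> R): R = lock.withLock {
        val held = current
        val engine = if (held != null && held.first == profile) {
            held.second // same backend as the last request: reuse it
        } else {
            current = null // forget the old engine before tearing it down
            held?.let { close(it.second) }
            load(profile).also { current = profile to it }
        }
        // `infer` must fully consume the response before returning,
        // so the lock is held until the stream is finished.
        infer(engine)
    }
}
```

ReentrantLock is tied to the calling thread, so this sketch assumes blocking callers. With coroutines, a suspending mutex such as the one in kotlinx.coroutines would be the closer equivalent of what we do in Dart.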
Multimodal: what we discovered
Multimodal support works, but comes with undocumented constraints we discovered through testing:
Images:
- Only JPEG and PNG are accepted — WebP is silently rejected
- Images that exceed a 2520 image-patch limit segfault during prefill, so we cap the longest side at 896px.
- On MediaTek chipsets, the GPU vision backend crashes during decode. CPU backend is stable for vision while GPU handles text inference.
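The first two image constraints can be checked before anything reaches the native layer. The following is a rough sketch with hypothetical helper names (sniffFormat, cappedSize); only the JPEG/PNG rule and the 896px cap come from our testing. It identifies the format from the file's leading magic bytes and computes the target dimensions for a downscale.

```kotlin
import kotlin.math.roundToInt

// Hypothetical helpers; only the 896px cap and the JPEG/PNG rule come from the text.
const val MAX_LONGEST_SIDE = 896

enum class ImageFormat { JPEG, PNG, WEBP, UNKNOWN }

/** Identifies the image container from its leading magic bytes. */
fun sniffFormat(header: ByteArray): ImageFormat {
    fun at(i: Int) = header[i].toInt() and 0xFF
    return when {
        // JPEG starts with FF D8 FF
        header.size >= 3 && at(0) == 0xFF && at(1) == 0xD8 && at(2) == 0xFF -> ImageFormat.JPEG
        // PNG starts with 89 'P' 'N' 'G'
        header.size >= 8 && at(0) == 0x89 && at(1) == 0x50 && at(2) == 0x4E && at(3) == 0x47 -> ImageFormat.PNG
        // WebP is a RIFF container whose form type at offset 8 is "WEBP"
        header.size >= 12 &&
            header.decodeToString(0, 4) == "RIFF" &&
            header.decodeToString(8, 12) == "WEBP" -> ImageFormat.WEBP
        else -> ImageFormat.UNKNOWN
    }
}

/** Scales (width, height) down so the longest side is at most maxSide, keeping aspect ratio. */
fun cappedSize(width: Int, height: Int, maxSide: Int = MAX_LONGEST_SIDE): Pair<Int, Int> {
    require(width > 0 && height > 0) { "dimensions must be positive" }
    val longest = maxOf(width, height)
    if (longest <= maxSide) return width to height // already small enough
    val scale = maxSide.toDouble() / longest
    return maxOf(1, (width * scale).roundToInt()) to maxOf(1, (height * scale).roundToInt())
}
```

For example, cappedSize(4000, 3000) yields (896, 672), while an 800x600 image is left alone. Anything that is not JPEG or PNG would be re-encoded before being handed to the model, and the actual resize would go through the platform's bitmap APIs.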
Audio:
- Only WAV/PCM is supported — M4A, AAC, and MP3 all fail at the decoder level
- We transcode on the Kotlin side using Android's MediaExtractor + MediaCodec, resampling to 16kHz mono 16-bit PCM
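The MediaExtractor and MediaCodec decode loop is too long to show here, but the last step is small. Assuming the decoder has already produced 16kHz mono 16-bit little-endian samples, and assuming the model is given a WAV file rather than raw PCM, a sketch of wrapping those samples in the standard 44-byte RIFF/WAVE header looks like this. The function name wrapPcmAsWav is hypothetical.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/** Wraps raw little-endian PCM samples in a canonical 44-byte RIFF/WAVE header. */
fun wrapPcmAsWav(
    pcm: ByteArray,
    sampleRate: Int = 16_000,
    channels: Int = 1,
    bitsPerSample: Int = 16,
): ByteArray {
    val blockAlign = channels * bitsPerSample / 8 // bytes per sample frame
    val byteRate = sampleRate * blockAlign        // bytes per second of audio
    val out = ByteBuffer.allocate(44 + pcm.size).order(ByteOrder.LITTLE_ENDIAN)
    out.put("RIFF".toByteArray(Charsets.US_ASCII))
    out.putInt(36 + pcm.size)                     // RIFF chunk size = total size - 8
    out.put("WAVE".toByteArray(Charsets.US_ASCII))
    out.put("fmt ".toByteArray(Charsets.US_ASCII))
    out.putInt(16)                                // fmt chunk size for PCM
    out.putShort(1.toShort())                     // audio format 1 = linear PCM
    out.putShort(channels.toShort())
    out.putInt(sampleRate)
    out.putInt(byteRate)
    out.putShort(blockAlign.toShort())
    out.putShort(bitsPerSample.toShort())
    out.put("data".toByteArray(Charsets.US_ASCII))
    out.putInt(pcm.size)                          // data chunk size in bytes
    out.put(pcm)
    return out.array()
}
```

One second of audio at these defaults is 32,000 bytes of samples, so the resulting file is 32,044 bytes.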
Thinking mode:
- Gemma 4's thinking mode improves reasoning for text-only tasks, but crashes when combined with vision input on some devices
- We auto-detect multimodal content and disable thinking for those requests
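The auto-detection is a simple check on the request contents. A minimal sketch, using a hypothetical request model (Part, GenerationOptions, and resolveOptions are illustrative names, not LiteRT-LM classes):

```kotlin
// Hypothetical request model; these types are illustrative, not LiteRT-LM classes.
sealed interface Part {
    class Text(val text: String) : Part
    class Image(val bytes: ByteArray) : Part
    class Audio(val bytes: ByteArray) : Part
}

data class GenerationOptions(val thinking: Boolean)

/** Enables thinking only when the caller asked for it and every part is plain text. */
fun resolveOptions(parts: List<Part>, thinkingRequested: Boolean): GenerationOptions {
    val textOnly = parts.all { it is Part.Text }
    return GenerationOptions(thinking = thinkingRequested && textOnly)
}
```

Making the decision in one place means no agent can accidentally request thinking together with an image.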
Four things we'd tell ourselves at the start
1. Call native APIs directly and skip the Flutter wrappers. Third-party Flutter plugins for on-device LLM add an abstraction layer that hides bugs and makes debugging nearly impossible. flutter_gemma caused crashes severe enough to occasionally reboot the device. The native Kotlin API is more work upfront but dramatically more stable. If you're shipping on-device LLM in Flutter today, write the Platform Channel yourself.
2. Serialize everything, including init. Engine initialization and inference must both be behind a single global lock. On-device models don't handle concurrency, and switching backends mid-flight (vision CPU ↔ text GPU) under concurrent load crashes the native layer. Design for sequential access from the start.
3. Never trust structured output for anything that must be correct. Function-call JSON from Gemma 4 E4B is malformed often enough that you need a parse-error fallback. IDs, file paths, and references are routinely hallucinated. Validate every structured field against ground truth from agent state; don't round-trip them through the model.
4. Test each modality in isolation before combining. Image, audio, and thinking mode each have their own undocumented constraints and device-specific behaviors. Combining them before understanding each one individually leads to hard-to-diagnose crashes. The thinking-mode-plus-vision segfault took us a week to isolate because we changed multiple inputs at once.
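Lesson 3 is the easiest to turn into code. The sketch below assumes a tool call has already been parsed into a name and string arguments, and that the agent can list the IDs or paths it actually holds. ToolCall, Checked, and checkAgainstState are hypothetical names for illustration, not part of our codebase or of LiteRT-LM.

```kotlin
// Hypothetical types: the shape of a parsed tool call and of agent state is assumed.
data class ToolCall(val name: String, val args: Map<String, String>)

sealed interface Checked {
    data class Ok(val call: ToolCall) : Checked
    data class Rejected(val reason: String) : Checked
}

/**
 * Accepts a call only if every argument named in `refFields` holds a value
 * the agent already knows; references the model invented are rejected.
 */
fun checkAgainstState(call: ToolCall, refFields: Set<String>, known: Set<String>): Checked {
    for (field in refFields) {
        val value = call.args[field] ?: return Checked.Rejected("missing $field")
        if (value !in known) return Checked.Rejected("unknown $field: $value")
    }
    return Checked.Ok(call)
}
```

A rejected call can be handled the same way as a parse error: hand the reason back to the agent and let it retry, rather than acting on a reference that does not exist.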
Where it works, where it doesn't
After running Gemma 4 E4B inside a full multi-agent pipeline, an honest assessment:
Works well:
- Image description — reliably identifies scene content, reads text in images, recognizes UI elements
- Audio transcription — Mandarin recognition is usable for short voice notes
- Unstructured text — summaries, insights, and narrative generation are reasonable for a model this size
Needs improvement:
- Function calling produces malformed JSON frequently — missing quotes, wrong nesting. We catch parse errors at the Kotlin layer and return raw text so the agent can retry.
- Structured fields like IDs and file paths are often hallucinated. We never trust model output for these.
- Thermal throttling is real. Sustained inference generates significant heat, triggering system-level CPU/GPU frequency reduction.
Performance
Tested on Redmi Pad (Dimensity 8100) with Gemma 4 E4B (~3.7GB):
- Text inference: ~15–20 tokens/sec (GPU)
- Image analysis: 5–8 seconds per image (CPU vision)
- Audio transcription: ~0.3x realtime (CPU audio)
- Engine initialization: ~8–10 seconds first load, cached after
For a fully offline use case, this is acceptable.
Bottom line
On-device LLM inference is real and usable today for specific tasks. It is not yet a replacement for cloud models — structured output and sustained performance need to improve. But for use cases where data must never leave the device, it's a meaningful step forward, and the constraints above are tractable once you know to design around them.
Full source: github.com/memex-lab/memex. Happy to answer questions in the comments — particularly interested in how other people are handling the structured-output-from-small-models problem, since that's still the biggest gap for us.