Bringing Gemma 4 On-Device to Memex
Memex is local-first — all data stays on your device. But until now, the AI features still required a cloud LLM provider. With Gemma 4, we've added a fully offline option: zero cloud dependency, multimodal inference (text, image, audio), and function calling — all running natively on Android via Google's LiteRT-LM runtime.
This post covers the architecture, the key technical challenges, and an honest look at where on-device inference stands today for production use.
Full implementation: github.com/memex-lab/memex/pull/4
Why on-device matters
Memex handles personal records — text notes, photos, voice memos. For users who care deeply about privacy, even sending prompts to a trusted cloud provider is a compromise. On-device inference eliminates that entirely: the model runs on the phone, the data never leaves, and the app works without internet.
Gemma 4 E2B/E4B was the right fit: it supports multimodal input, function calling for our agent tool system, and runs on consumer Android hardware. We've tested with E4B (~3.7GB) on mid-range devices.
Architecture: Kotlin API via Platform Channels
We initially evaluated flutter_gemma, a third-party Flutter plugin. It proved unstable — crashes severe enough to occasionally reboot the device. Google's own Edge Gallery app, which calls the LiteRT-LM Kotlin API directly, ran the same model without issues on the same hardware.
The lesson was clear: for on-device LLM inference, call the native API directly. We built a thin Platform Channel bridge between Dart and Kotlin:
- Kotlin side: MethodChannel for lifecycle control (init, close, infer, cancel), with a reverse callback channel to stream tokens back to Dart
- Dart side: GemmaLocalClient implements the same LLMClient interface as our cloud providers — the agent system doesn't know or care whether it's talking to GPT-4 or a local model
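The boundary can be sketched like this (the names here are illustrative, not taken from the PR; the real interface lives on the Dart side, but the shape is the same):

```kotlin
// Illustrative sketch: local and cloud clients implement one interface, so
// the agent layer never branches on which backend it is talking to.
interface LLMClient {
    fun complete(prompt: String): String
}

// Stand-ins for the real cloud and on-device clients.
class CloudStub : LLMClient {
    override fun complete(prompt: String) = "cloud:$prompt"
}

class LocalStub : LLMClient {
    override fun complete(prompt: String) = "local:$prompt"
}

// The agent code only ever sees LLMClient.
fun runAgentStep(client: LLMClient, prompt: String): String = client.complete(prompt)
```

Swapping backends is then a constructor-injection decision, not a code change in the agents.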
The critical design pattern is Engine singleton, Conversation per-request. Loading a multi-GB model into GPU memory takes seconds. Creating a Conversation for each inference call takes milliseconds. The Engine lives for the app's lifetime; Conversations are created and disposed per request.
```kotlin
// Engine: created once, expensive
val engine = Engine(EngineConfig(
    modelPath = modelPath,
    backend = Backend.GPU(),
    maxNumTokens = 10000,
    cacheDir = context.cacheDir.absolutePath,
))

// Conversation: created per request, cheap
engine.createConversation(config).use { conversation ->
    conversation.sendMessageAsync(contents)
        .collect { message -> /* stream to Dart */ }
}
```

Serializing multi-agent access
Memex runs multiple agents in parallel — card generation, knowledge extraction, asset analysis — all potentially calling the LLM simultaneously. LiteRT-LM enforces a hard constraint: one active Conversation per Engine. Violating this causes native crashes.
We serialize all access behind a Dart-side global mutex. The lock is acquired before Engine initialization and held until the inference stream closes. This is important because different request types may need different Engine configurations — image analysis requires a CPU vision backend, audio needs an audio backend. Without the lock covering initialization, concurrent backend switches crash the native layer.
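The rule can be sketched in plain Kotlin with a single fair permit (the real lock lives on the Dart side; InferenceGate and withExclusiveEngine are illustrative names, not LiteRT-LM API):

```kotlin
import java.util.concurrent.Semaphore

// Sketch: one fair permit covers both engine (re)configuration and the
// inference itself, so a concurrent backend switch can never interleave
// with an in-flight request.
object InferenceGate {
    private val permit = Semaphore(1, true) // fair: serve requests in arrival order

    fun <T> withExclusiveEngine(backend: String, infer: (String) -> T): T {
        permit.acquire()
        try {
            // (re)initialize the engine for `backend` here if it changed — elided
            return infer(backend)
        } finally {
            permit.release()
        }
    }
}
```

The key point is the scope of the critical section: acquiring before initialization, not just before inference, is what prevents the backend-switch crashes.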
Multimodal: what we discovered
Multimodal support works, but comes with undocumented constraints we discovered through testing:
Images:
- Only JPEG and PNG are accepted — WebP is silently rejected
- The runtime limits input to 2520 image patches; larger images segfault during prefill. We cap the longest side at 896px.
- On MediaTek chipsets, the GPU vision backend crashes during decode, so we run vision on the CPU backend while the GPU still handles text inference.
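The 896px cap reduces to a pure function (capLongestSide is an illustrative name; the actual resize then goes through Android's Bitmap APIs):

```kotlin
import kotlin.math.roundToInt

// Sketch: scale so the longest side never exceeds maxSide, preserving
// aspect ratio. Images already under the cap pass through unchanged.
fun capLongestSide(width: Int, height: Int, maxSide: Int = 896): Pair<Int, Int> {
    val longest = maxOf(width, height)
    if (longest <= maxSide) return width to height
    val scale = maxSide.toDouble() / longest
    return (width * scale).roundToInt() to (height * scale).roundToInt()
}
```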
Audio:
- Only WAV/PCM is supported — M4A, AAC, and MP3 all fail at the decoder level
- We transcode on the Kotlin side using Android's MediaExtractor + MediaCodec, resampling to 16kHz mono 16-bit PCM
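The conversion step after decoding can be sketched in plain Kotlin (toMono16k is an illustrative name; on device the decode itself goes through MediaExtractor + MediaCodec, and this covers only the channel and sample-rate conversion that follows):

```kotlin
// Sketch: downmix interleaved 16-bit PCM to mono by averaging channels per
// frame, then linearly resample to 16 kHz.
fun toMono16k(samples: ShortArray, channels: Int, srcRate: Int): ShortArray {
    val frames = samples.size / channels
    val mono = ShortArray(frames) { f ->
        var acc = 0
        for (c in 0 until channels) acc += samples[f * channels + c]
        (acc / channels).toShort()
    }
    if (srcRate == 16_000) return mono
    // Linear-interpolation resample; fine for speech, crude for music.
    val outLen = (frames.toLong() * 16_000 / srcRate).toInt()
    return ShortArray(outLen) { i ->
        val pos = i.toDouble() * srcRate / 16_000
        val i0 = pos.toInt().coerceAtMost(frames - 1)
        val i1 = (i0 + 1).coerceAtMost(frames - 1)
        val frac = pos - i0
        ((1 - frac) * mono[i0] + frac * mono[i1]).toInt().toShort()
    }
}
```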
Thinking mode:
- Gemma 4's thinking mode improves reasoning for text-only tasks, but crashes when combined with vision input on some devices
- We auto-detect multimodal content and disable thinking for those requests
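A minimal sketch of that guard, with illustrative Part and RequestConfig types standing in for the real request model:

```kotlin
// Sketch: a request is a list of parts; thinking stays enabled only when
// every part is text, since thinking plus vision crashes on some devices.
sealed class Part {
    data class Text(val text: String) : Part()
    data class Image(val path: String) : Part()
    data class Audio(val path: String) : Part()
}

data class RequestConfig(val enableThinking: Boolean)

fun configFor(parts: List<Part>, wantThinking: Boolean): RequestConfig {
    val multimodal = parts.any { it !is Part.Text }
    return RequestConfig(enableThinking = wantThinking && !multimodal)
}
```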
Where it works, where it doesn't
After running Gemma 4 E4B inside Memex's full multi-agent pipeline, here's an honest assessment:
Works well:
- Image description — reliably identifies scene content, reads text in images, recognizes UI elements
- Audio transcription — Mandarin Chinese recognition is usable for short voice notes
- Unstructured text — summaries, insights, and narrative generation are reasonable for a model this size
Needs improvement:
- Function calling frequently produces malformed JSON — missing quotes, wrong nesting. We catch parse errors at the Kotlin layer and return the raw text so the agent can retry.
- Structured fields like IDs and file paths are often hallucinated. We never trust model output for these — always validate against ground truth from agent state.
- Thermal throttling is real. Sustained inference generates significant heat, triggering system-level CPU/GPU frequency reduction.
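The two output fallbacks can be sketched together. A real implementation parses with a proper JSON library, so the brace check here is only a stand-in, and all names are ours:

```kotlin
// Sketch: tool output either parses (with hallucinated IDs filtered against
// ground truth from agent state) or falls back to raw text for a retry.
sealed class ToolCallResult {
    data class Parsed(val json: String, val validIds: List<String>) : ToolCallResult()
    data class RawText(val text: String) : ToolCallResult() // agent retries
}

fun handleToolOutput(
    raw: String,
    knownIds: Set<String>,
    extractIds: (String) -> List<String>,
): ToolCallResult {
    // Stand-in validity check; use a real JSON parser in practice.
    val looksLikeJson = raw.trim().let { it.startsWith("{") && it.endsWith("}") }
    if (!looksLikeJson) return ToolCallResult.RawText(raw)
    val ids = extractIds(raw).filter { it in knownIds } // drop hallucinated IDs
    return ToolCallResult.Parsed(raw, ids)
}
```

Nothing downstream ever acts on an ID the agent state cannot confirm.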
Performance
Tested on Redmi Pad (Dimensity 8100) with Gemma 4 E4B (~3.7GB):
- Text inference: ~15–20 tokens/sec (GPU)
- Image analysis: 5–8 seconds per image (CPU vision)
- Audio transcription: ~0.3x realtime (CPU audio)
- Engine initialization: ~8–10 seconds first load, cached after
For a fully offline use case, this is acceptable.
What we learned
Call native APIs directly. Third-party Flutter wrappers for on-device LLM add an abstraction layer that hides bugs and makes debugging nearly impossible. The native Kotlin API is more work upfront but dramatically more stable.
Serialize everything. Engine initialization and inference must both be behind a single global lock. On-device models don't handle concurrency — design for sequential access from the start.
Build fallbacks for structured output. Small on-device models will hallucinate field values that cloud models get right. Any field that needs to be correct — IDs, paths, references — must be validated against ground truth.
Test each modality independently. Image, audio, and thinking mode each have their own undocumented constraints and device-specific behaviors. Combining them before understanding each one individually leads to hard-to-diagnose crashes.
On-device LLM inference is real and usable today for specific tasks. It's not a replacement for cloud models yet — structured output and sustained performance need to improve. But for privacy-sensitive use cases where data must never leave the device, it's a meaningful step forward.
The full implementation is open source: github.com/memex-lab/memex