What It Actually Takes to Run AI Agents on a Phone

The core logic of an AI agent is a simple loop: observe, think, act, repeat. Most conversations about agents stop there. But between that elegant loop and a product people actually use every day, there's a vast gap filled with decisions that don't get talked about enough.

This post is about those decisions — the engineering choices behind Memex, why we made them, and what we learned.

The model layer isn't just an API call

OpenAI's format has become the industry standard, but "standard" is generous. In practice, every provider has its own quirks. Streaming chunk formats differ. Token counting methods vary. Error responses are inconsistent. Authentication ranges from simple API keys to OAuth flows to cloud-specific credential systems.

The obvious approach is to pick one provider and move on. We chose the harder path: support them all. The reasoning was simple — Memex is a local-first app where users bring their own LLM. We can't dictate which provider they use. A user in China might use Kimi or Qwen. Someone privacy-conscious might run Ollama locally. A developer might want Claude for its reasoning quality.

So we built a unified client abstraction that normalizes streaming, token stats, error handling, and retry logic across twelve providers. The tradeoff is maintenance cost — every time a provider changes its API, we adapt. But the payoff is that Memex genuinely works with whatever model you prefer, and it gets better as models improve without us shipping an update.
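To make the normalization concrete, here is a minimal sketch in Python (the actual library is in Dart; the chunk shapes follow the providers' public streaming formats, and everything else is illustrative). Two differently shaped provider streams are mapped onto one provider-agnostic chunk type:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class StreamChunk:
    """Provider-agnostic streaming event the rest of the agent consumes."""
    text: str
    input_tokens: int = 0
    output_tokens: int = 0


def normalize_openai(events: Iterable[dict]) -> Iterator[StreamChunk]:
    # OpenAI-style chunks carry text under choices[0].delta.content
    for e in events:
        delta = e.get("choices", [{}])[0].get("delta", {})
        usage = e.get("usage") or {}
        yield StreamChunk(
            text=delta.get("content") or "",
            input_tokens=usage.get("prompt_tokens", 0),
            output_tokens=usage.get("completion_tokens", 0),
        )


def normalize_anthropic(events: Iterable[dict]) -> Iterator[StreamChunk]:
    # Anthropic-style chunks carry text under delta.text, usage separately
    for e in events:
        if e.get("type") == "content_block_delta":
            yield StreamChunk(text=e["delta"].get("text", ""))
        elif e.get("type") == "message_delta":
            usage = e.get("usage", {})
            yield StreamChunk(text="", output_tokens=usage.get("output_tokens", 0))
```

Everything downstream — retry logic, cost accounting, UI rendering — only ever sees `StreamChunk`, which is what keeps twelve providers from leaking twelve formats through the codebase.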

Since no mature agent framework existed for mobile, we extracted this layer into dart_agent_core — a standalone open-source library that anyone building Flutter agents can use.

Phones don't have a terminal

This sounds obvious, but the implications are deep. Coding agents like Devin operate in a Linux environment with Bash, grep, find, sed — decades of mature tooling. A phone has none of that.

We had to build an equivalent toolbox from scratch: file search, read, write, list, move, delete — all scoped to the user's local workspace. But the harder problem isn't the tools themselves. It's the input.

A coding agent receives structured text (code). Memex receives a chaotic mix of photos with EXIF metadata, voice recordings that need transcription, and text fragments of varying length and intent. Before the agent can reason about anything, it needs to unify these heterogeneous inputs into a common format. We chose Markdown — it's human-readable, future-proof, and portable.
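A sketch of that unification step, in Python for illustration (the field names are hypothetical, not Memex's actual schema — the point is that every input kind collapses into one Markdown fragment):

```python
def record_to_markdown(item: dict) -> str:
    """Render one captured record as a Markdown fragment.

    `kind`, `exif`, `transcript` etc. are illustrative field names."""
    kind = item["kind"]
    if kind == "photo":
        exif = item.get("exif", {})
        meta = ", ".join(f"{k}: {v}" for k, v in sorted(exif.items()))
        lines = [f"![photo]({item['path']})"]
        if meta:
            lines.append(f"> EXIF: {meta}")
        return "\n".join(lines)
    if kind == "voice":
        # audio is assumed to have been transcribed upstream
        return f"> Transcript: {item['transcript']}"
    return item["text"]  # plain text passes through unchanged
```

Once everything is Markdown, the agent's file tools and the knowledge base can treat a photo, a voice memo, and a typed note identically.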

One deliberate choice: we don't give the agent web search or HTTP tools. Memex is designed to be inward-facing — an agent that deeply understands your local data, not one that reaches out to the internet for answers. Mining patterns from your own fragmented records is a fundamentally harder problem than calling a search API, and it's the problem we chose to solve.

Why RAG doesn't work for personal records

RAG (retrieval-augmented generation) is the default approach for giving LLMs access to external knowledge. It works well when the data is structured — documentation, codebases, knowledge bases with clear boundaries.

Personal records are none of those things. They're messy, fragmented, and deeply context-dependent. A note that says "feeling better today" means nothing without knowing what happened yesterday. A photo of a receipt only matters in the context of a trip. Simple vector similarity search misses these connections entirely.

Our approach is different. Instead of retrieving and injecting context at query time, we have the agent continuously organize records into a structured knowledge base as they come in. The agent acts as a file manager — classifying, linking, and maintaining a living directory of your knowledge using the P.A.R.A. methodology.

This required solving two sub-problems:

Granularity control. An agent with unrestricted file access will try to read everything at once, overflow its context window, and produce garbage. We enforce strict limits on how much data the agent can read or write in a single operation. This forces it to work incrementally, which produces better reasoning.
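The mechanism can be sketched as a capped read tool (Python for illustration; the cap value and return fields are assumptions, not Memex's real limits):

```python
MAX_READ_CHARS = 4_000  # illustrative per-operation cap


def read_tool(store: dict, path: str, offset: int = 0) -> dict:
    """Return at most MAX_READ_CHARS of a file, forcing incremental reads."""
    data = store[path]
    chunk = data[offset:offset + MAX_READ_CHARS]
    next_offset = offset + len(chunk)
    return {
        "content": chunk,
        "truncated": next_offset < len(data),
        # the agent must issue another call with this offset to continue
        "next_offset": next_offset,
    }
```

Because the tool itself refuses to return more, the agent has no way to dump an entire workspace into its context window in one step; it has to plan a sequence of reads.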

Adversarial validation. When the agent proposes a knowledge structure — say, creating a new category or merging two files — a separate validation layer checks the proposal against predefined rules. Does this category already exist? Is the file too large? Does the structure follow the methodology? If not, the proposal is rejected and the agent retries. This back-and-forth is slower than letting the agent do whatever it wants, but the results are dramatically more reliable.
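The reject-and-retry shape of that loop, sketched in Python (the rule set and limits here are illustrative stand-ins for the real ones):

```python
MAX_FILE_CHARS = 20_000  # illustrative limit


def validate(proposal: dict, kb: dict) -> list[str]:
    """Check a structure proposal against predefined rules; return violations."""
    errors = []
    if proposal["action"] == "create_category" and proposal["name"] in kb["categories"]:
        errors.append("category already exists")
    if proposal.get("size", 0) > MAX_FILE_CHARS:
        errors.append("file too large")
    return errors


def propose_until_valid(agent, kb: dict, max_attempts: int = 3) -> dict:
    """Validator feedback flows back to the agent until a proposal passes."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        proposal = agent(feedback)
        feedback = validate(proposal, kb)
        if not feedback:
            return proposal
    raise RuntimeError(f"rejected after {max_attempts} attempts: {feedback}")
```

The key design choice is that the validator is deterministic code, not another LLM call, so the rules it enforces cannot themselves hallucinate.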

The agent needs guardrails, not just capabilities

An agent that can read your files, write to your knowledge base, and generate insights about your life is powerful. It's also dangerous. Not in a sci-fi way — in an "it accidentally overwrote your journal" way.

The instinct is to focus on what the agent can do. We spent just as much time on what it shouldn't do. Every tool invocation passes through a permission layer. Memory access has explicit boundaries — the agent sees what it needs for the current task, nothing more. Before any write operation, the scope is validated against the user's workspace rules.
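A minimal sketch of that gate, in Python (the workspace root and tool names are hypothetical; the real checks live in Dart and cover more than paths):

```python
import posixpath

WORKSPACE = "/workspace"  # illustrative root


def in_workspace(path: str, root: str = WORKSPACE) -> bool:
    """Reject any path that resolves outside the user's workspace."""
    full = posixpath.normpath(posixpath.join(root, path))
    return full == root or full.startswith(root + "/")


def invoke(tool_fns: dict, allowed: set, name: str, args: dict):
    """Gate every tool call: allow-list first, then path scope."""
    if name not in allowed:
        raise PermissionError(f"tool '{name}' is not permitted for this task")
    if "path" in args and not in_workspace(args["path"]):
        raise PermissionError(f"path '{args['path']}' escapes the workspace")
    return tool_fns[name](**args)
```

Note that the check runs on the normalized path, so a model-generated `../` traversal fails the same way an absolute path outside the workspace does.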

The key insight: guardrails aren't about distrusting the model. They're about building a system that stays correct even when the model makes mistakes — because it will. The question isn't whether the agent will hallucinate or misinterpret; it's whether the system catches it before it affects your data.

Showing, not telling — generative UI

Most agent products return text. When Memex discovers that you've been running more consistently this month, it shouldn't describe that in words. It should show you a trend chart.

We built a template-first approach. There's a library of pre-built UI components — trend charts, radar maps, progress rings, timelines, composition breakdowns, highlight quotes, gallery layouts. The agent's output is matched against these templates first.

When no template fits — which happens with novel insight types — the agent generates structured data that gets rendered dynamically. The dispatch logic between template matching and dynamic rendering was one of the trickier parts to get right. Too aggressive on templates and the output feels rigid. Too much dynamic generation and the quality drops.
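The dispatch logic reduces to something like the following Python sketch (template names and required fields are invented for illustration):

```python
# Each template declares the fields it needs to render; names are illustrative.
TEMPLATES = {
    "trend_chart": {"title", "series"},
    "progress_ring": {"title", "value", "goal"},
}


def dispatch(insight: dict) -> tuple[str, dict]:
    """Prefer a pre-built template; fall back to dynamic rendering."""
    wanted = insight.get("template")
    required = TEMPLATES.get(wanted)
    if required is not None and required <= insight.keys():
        return ("template:" + wanted, insight)
    # unknown template, or required fields missing: render from structured data
    return ("dynamic", insight)
```

Tuning that boundary — when a near-miss on required fields should still snap to a template versus drop to dynamic rendering — is where the rigid-vs-sloppy tradeoff described above actually lives.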

The result is that Memex's insights feel like a designed product, not a chatbot response. That distinction matters more than you'd think for something people use daily.

Mobile kills your process. Deal with it.

On a server, your agent process runs until you stop it. On a phone, the OS can kill your app at any moment — to free memory, because the user switched away, because the phone locked. There's no graceful shutdown signal.

This means every step of the agent loop needs to be resumable. We checkpoint progress at each stage: before an LLM call, after tool execution, when a card is generated. If the app is killed mid-task, reopening it picks up from the last checkpoint, not from the beginning.
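The resume behavior can be sketched like this (Python for illustration; `store` stands in for durable on-device storage, and the step model is simplified):

```python
def run_with_checkpoints(task_id: str, steps, store: dict):
    """Run `steps` in order, persisting progress after each one.

    If the process dies mid-task, calling this again with the same
    task_id resumes from the last completed step instead of step zero."""
    state = store.get(task_id) or {"next": 0, "results": []}
    for i in range(state["next"], len(steps)):
        state["results"].append(steps[i]())
        state["next"] = i + 1
        # snapshot after every step; a copy stands in for a durable write
        store[task_id] = dict(state, results=list(state["results"]))
    return state["results"]
```

The important property is that the checkpoint write happens after every step, not at the end — a kill between steps loses at most the in-flight step, never completed work.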

This is invisible when it works — which is the point. But it fundamentally shapes how the agent loop is architected. You can't design for the happy path and bolt on persistence later. It has to be in the bones of the system.

You need to see what the agent is doing

An agent making LLM calls on your behalf is spending your money. Without visibility, costs spiral and bugs hide behind opaque behavior.

We built observability into the core: every task records how many LLM calls were made, how many tokens were consumed, and what the estimated cost is. When the agent enters a loop — calling the same tool repeatedly or producing malformed output — automatic detection kicks in and stops the cycle before it burns through your API budget.
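A sketch of both pieces together — the usage ledger and a simple repeated-call detector (Python for illustration; the window size and per-token price are placeholders, not real figures):

```python
from collections import deque


class TaskMonitor:
    """Per-task usage ledger plus a repeated-tool-call loop detector."""

    def __init__(self, window: int = 4, usd_per_1k_tokens: float = 0.002):
        self.llm_calls = 0
        self.tokens = 0
        self.rate = usd_per_1k_tokens  # placeholder price, not a real one
        self.recent = deque(maxlen=window)

    def record_llm(self, input_tokens: int, output_tokens: int) -> None:
        self.llm_calls += 1
        self.tokens += input_tokens + output_tokens

    @property
    def estimated_cost(self) -> float:
        return self.tokens / 1000 * self.rate

    def record_tool(self, name: str, args: dict) -> None:
        self.recent.append((name, tuple(sorted(args.items()))))
        # the last `window` calls being identical is treated as a loop
        if len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1:
            raise RuntimeError(f"loop detected: '{name}' repeated {self.recent.maxlen} times")
```

A detector this simple misses more elaborate cycles (A, B, A, B, ...), but it catches the most common failure mode — the same tool called with the same arguments over and over — before it burns budget.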

The deeper motivation is trust. Memex handles intimate personal data. Users need to understand what the AI is doing with their records, not just see the output. Full transparency of the agent's reasoning process — every tool call, every decision point — is available on-device. No telemetry, no cloud logging. The observability is for you, not for us.

The loop is the easy part

We started this post by saying an agent is a simple loop. That's true. But the loop runs inside a context — a phone with limited memory, a user with messy data, a model that sometimes hallucinates, a process that can be killed at any time, and personal records that demand the highest standard of care.

The real engineering isn't the loop. It's everything that makes the loop safe, reliable, transparent, and useful in that context. That's what we've been building, and we're far from done.

Memex is open source under GPL-3.0. If these kinds of problems interest you — come build with us.