Three-Tier Memory System

Managing context across working memory, session memory, and persistent memory

Every AI conversation starts fresh. The model has no memory of what you discussed yesterday, what decisions you made last week, or what patterns emerged across your last twenty interactions. This is by design — statelessness keeps things simple. But it creates a fundamental problem for any AI workflow that spans more than a single exchange.

The question isn't whether AI agents need memory. It's how to structure it so that the right information is available at the right time, without drowning the model in irrelevant context.

The problem with just using the context window

Modern language models accept enormous inputs — 100K, 200K, even larger context windows. The naive solution is to stuff everything in: every prior conversation, every document, every decision. Let the model sort it out.

This breaks in three ways. First, cost scales linearly with context length. Sending 200K tokens on every request gets expensive fast. Second, latency increases — more tokens to process means slower responses. Third, and most subtly, model performance degrades with irrelevant context. When a model has to attend to vast amounts of loosely related information, it produces worse outputs than when it receives only the information that matters for the current task.

The context window is working memory, not storage. And working memory should be curated, not comprehensive.

Three tiers, three purposes

The solution is to separate memory into three tiers, each optimized for a different retrieval pattern.

Tier 1: Working memory

What the model sees right now. The current conversation, the active task description, the specific constraints that apply to this particular request.

Working memory is small — maybe 10-30K tokens of the available context window. It includes only what's directly relevant to the current interaction. Everything else lives in the other tiers, pulled in on demand.

The discipline of working memory is curation. For every piece of context you include, ask: does this change what the model should do right now? If not, it doesn't belong in working memory. It belongs in a tier where it can be retrieved when it becomes relevant.
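To make the budget concrete, here is a minimal sketch of a working-memory container. The names, the 20K default, and the characters-per-token heuristic are illustrative assumptions rather than a reference to any particular framework:

```python
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


@dataclass
class WorkingMemory:
    budget_tokens: int = 20_000              # the curated slice of the context window
    items: list[str] = field(default_factory=list)

    def used_tokens(self) -> int:
        return sum(estimate_tokens(item) for item in self.items)

    def add(self, item: str, relevant_now: bool) -> bool:
        # Curation rule: only admit context that changes what the model should do
        # right now, and only while it fits within the budget.
        if not relevant_now:
            return False
        if self.used_tokens() + estimate_tokens(item) > self.budget_tokens:
            return False
        self.items.append(item)
        return True

    def render(self) -> str:
        return "\n\n".join(self.items)
```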

Tier 2: Session memory

The accumulated context from a multi-step workflow. When an AI agent runs through a research phase, an analysis phase, and a writing phase, each phase produces outputs that subsequent phases need to reference.

Session memory solves this with a dual storage pattern: every workflow step produces both a full artifact (the complete output, stored durably) and a compressed summary (a concise distillation of what happened and what was decided).

Summaries flow into working memory for subsequent steps. Full artifacts stay in storage, available when the model needs to reference specific details — "what exactly did the market research say about pricing?" — without permanently occupying the context window.

The compression is the engineering challenge. Too aggressive, and the summary loses critical nuance. Too conservative, and summaries accumulate until they overflow the context budget. The sweet spot is capturing decisions, constraints, and key findings — the things future steps are most likely to need.
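A rough sketch of the dual storage pattern might look like the following, assuming a local artifact directory and a placeholder summarize step. In practice that step is usually an LLM call prompted to keep decisions, constraints, and key findings; truncation stands in for it here:

```python
import json
import uuid
from pathlib import Path

ARTIFACT_DIR = Path("artifacts")   # durable storage for full step outputs (assumed layout)


def summarize(full_output: str, max_chars: int = 1500) -> str:
    # Placeholder compression. A real pipeline would distill decisions, constraints,
    # and key findings rather than simply truncating.
    return full_output[:max_chars]


def store_step_output(workflow_id: str, step_name: str, full_output: str) -> dict:
    """Store the full artifact durably and return the compressed summary record."""
    ARTIFACT_DIR.mkdir(exist_ok=True)
    artifact_id = f"{workflow_id}-{step_name}-{uuid.uuid4().hex[:8]}"
    artifact_path = ARTIFACT_DIR / f"{artifact_id}.json"

    # Full artifact: the complete output, kept out of the context window.
    artifact_path.write_text(json.dumps({
        "workflow_id": workflow_id,
        "step": step_name,
        "output": full_output,
    }))

    # Compressed summary: the piece that flows into working memory for later steps.
    return {
        "artifact_id": artifact_id,
        "step": step_name,
        "summary": summarize(full_output),
    }


def load_artifact(artifact_id: str) -> str:
    # Pulled in only when a later step needs the specific details.
    return json.loads((ARTIFACT_DIR / f"{artifact_id}.json").read_text())["output"]
```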

Tier 3: Persistent memory

Everything the system has ever learned, stored for semantic retrieval. Historical workflow outputs, past decisions, accumulated patterns, user preferences.

Persistent memory is queried, not loaded. When the current task might benefit from historical context — "how did we handle this last quarter?" — the system performs a semantic search and retrieves the most relevant fragments. These fragments get injected into working memory alongside the session summaries and current task.

This is where retrieval-augmented generation (RAG) lives. Vector embeddings of historical content, queried by semantic similarity. The key design choice is what gets embedded: not raw conversation logs (too noisy) but structured artifacts and decisions (high signal, well-formatted for retrieval).
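A minimal sketch of that retrieval path, with a toy deterministic embedding standing in for a real embedding model and an in-memory list standing in for a vector database:

```python
import hashlib

import numpy as np


def embed(text: str) -> np.ndarray:
    # Toy deterministic embedding so the sketch runs end to end.
    # Replace with a real embedding model (API call or local model) in practice.
    seed = int(hashlib.sha256(text.encode()).hexdigest()[:8], 16)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(384)


class PersistentMemory:
    """Historical artifacts and decisions, queried by semantic similarity."""

    def __init__(self) -> None:
        self.records: list[dict] = []   # each record: {"text": ..., "vector": ...}

    def add(self, text: str) -> None:
        # Embed structured artifacts and decisions, not raw conversation logs.
        self.records.append({"text": text, "vector": embed(text)})

    def query(self, task_description: str, top_k: int = 5) -> list[str]:
        if not self.records:
            return []
        q = embed(task_description)
        q = q / np.linalg.norm(q)
        scored = []
        for record in self.records:
            v = record["vector"] / np.linalg.norm(record["vector"])
            scored.append((float(np.dot(q, v)), record["text"]))  # cosine similarity
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]
```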

How the tiers interact

For any given model call, the context is assembled from all three tiers:

  1. Load the task description and current conversation (working memory)
  2. Load relevant session summaries from the active workflow (session memory)
  3. Query persistent memory for historical context that matches the current task
  4. Assemble everything within the token budget, prioritizing recency and relevance
  5. Execute the model call
  6. Store the output as both a full artifact and a compressed summary
  7. Update session state

The assembly step is where most of the engineering complexity lives. You're solving a packing problem — fitting the most useful information into a fixed budget, with imperfect knowledge of what "most useful" means for any given request.
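Here is one way to sketch a greedy version of that packing step, assuming a crude token estimate and a fixed priority order (working memory first, then session summaries, then retrieved history). A real assembler would weight relevance scores and recency rather than relying on order alone:

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def assemble_context(task: str,
                     conversation: list[str],
                     session_summaries: list[str],
                     retrieved_history: list[str],
                     budget_tokens: int = 30_000) -> str:
    """Greedy packing: highest-priority tiers first, newest items first within a tier."""
    candidates = (
        [("task", task)]
        + [("conversation", turn) for turn in reversed(conversation)]
        + [("session", summary) for summary in reversed(session_summaries)]
        + [("history", fragment) for fragment in retrieved_history]
    )

    included = {"task": [], "conversation": [], "session": [], "history": []}
    used = 0
    for tier, text in candidates:
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue   # skip what doesn't fit; lower-priority tiers get squeezed out first
        included[tier].append(text)
        used += cost

    # Render historical context before the live material, restoring chronological order.
    parts = (
        included["task"]
        + included["history"]
        + list(reversed(included["session"]))
        + list(reversed(included["conversation"]))
    )
    return "\n\n".join(parts)
```

The priority order encodes a simple bet: the current task and conversation must always survive, session summaries usually should, and historical fragments are the first thing to drop when the budget runs out.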

Why this architecture matters

Without persistent memory, every workflow starts from zero. The agent has no knowledge of what worked before, what the user prefers, or what patterns have emerged across interactions.

Without session memory, multi-step workflows lose coherence. Each step operates in isolation, unaware of what happened in previous steps. The agent produces locally optimal outputs that don't add up to a coherent whole.

Without disciplined working memory, the model gets overwhelmed. Too much context means worse outputs, higher costs, and slower responses.

The three tiers together create a system that maintains coherence within a workflow, learns from historical patterns, and stays focused on the current task. That's what makes it feel like working with an intelligent collaborator rather than a stateless text generator.

The challenge is real engineering work — building the compression pipeline, tuning the retrieval, managing the token budget. But the payoff is AI workflows that feel fundamentally different from the single-shot experience most people associate with language models.