Semantic Memory for AI Agents: Recall in Milliseconds

TL;DR

Semantic memory lets an agent recall past knowledge by meaning, not exact keywords. The durable approach is markdown-first: agents write plain notes; a semantic index sits on top so they recall the few relevant ones instead of re-reading everything. Make it shared across a hive and one agent can use what another learned — recall stays fast and the context window stays small even as the knowledge base grows.

An agent’s working memory is its context window, and the context window is wiped at the end of every session. So “memory” for agents is really two problems: storing what they learn durably, and recalling the right piece at the right time without dragging everything back into context. Semantic memory solves the second. This post explains what it is, why markdown-first is the right foundation, and how a shared memory layer changes how a team of agents behaves.

Keyword recall isn’t enough

The first instinct is to store notes and grep them. It works until it doesn’t. An agent that learned

“The web bundle is built with electron-vite, not plain vite.”

won’t be found by a query like “how do we compile the front-end?” — no shared keywords. Keyword search matches strings; agents think in concepts. The gap between “compile the front-end” and “build the web bundle with electron-vite” is exactly where keyword recall fails and the agent re-discovers something it already knew.

What semantic recall actually does

Semantic memory closes that gap with embeddings. The idea, in three steps:

Embed each note into a vector — a list of numbers that captures its meaning, so notes about similar things land near each other in vector space.
At recall time, embed the query the same way.
Return the nearest notes by vector similarity.

Now “how do we compile the front-end?” lands near the electron-vite note even though they share no words, because they mean the same thing. The agent recalls by meaning, pulls the two or three most relevant notes into context, and acts — instead of either re-reading its whole history or missing the note entirely.

The payoff compounds: recall stays near-instant and the context window stays small even as the note collection grows into the hundreds. You inject the few notes that matter, not the whole archive.

Why markdown-first beats a heavyweight DB

There’s a strong temptation to reach for a dedicated vector database or a full agent-memory framework. For a handful of agents, that’s the wrong trade. The more robust foundation is markdown-first: agents write durable facts to plain markdown files, and a semantic index is built on top of those files. Three reasons this wins:

Human-readable. You can open an agent’s memory.md and see exactly what it “knows.” Memory you can’t read is memory you can’t trust.
Diffable. The notes live in git. You watch memory change over time, and you can review it like any other artifact.
Degrades gracefully. If the fancy index breaks or isn’t installed, the markdown files still work — the agent reads them directly. A database that won’t open takes the memory with it; a folder of markdown never does.

There’s also an architectural reason. Heavyweight memory frameworks tend to want to own the agent runtime — they assume they’re the thing running the loop. But when your agents are real claude sessions, the runtime is already Claude Code. The memory layer should be a lightweight companion to that, not a replacement for it. Markdown files plus a CLI-driven index fit that shape; a framework that wants to be in charge doesn’t. (For the full argument, see why we built agent memory markdown-first; for the plain-English version, how to give Claude Code long-term memory.)

The shape of a shared memory layer

Here’s how Munder Difflin implements semantic memory for a whole hive, via the MemPalace layer. It’s worth walking the moving parts because they show what “fast, shared recall” requires in practice.

One palace, many wings

The harness keeps a single shared palace for the whole team and points every agent’s environment at it. Each agent’s own notes go into its own wing of that palace. The result: agents have private authorship (each writes only its own memory.md, keeping the single-writer rule intact) but shared recall — a query can search the whole palace or scope to one agent’s wing.

That shared recall is what turns a pile of agents into a team that learns. When the researcher figures out a build quirk, the coder can recall it later without anyone copy-pasting. The human stops being the courier of context between sessions.

Mining: notes become searchable automatically

Agents don’t manage the index by hand. The harness mines each agent’s memory.md into its wing on a schedule, and — crucially — only when the file has actually changed (it checks the modification time and skips unchanged memories, so it doesn’t reload the embedding model for nothing). Write a durable fact to your markdown, and it becomes searchable by the whole team a few minutes later, with no extra step.

Recall: search and wake-up

Agents recall in two ways:

search "<query>" returns the most relevant notes by meaning, optionally scoped to one wing or widened to more results. This is targeted recall: “what do we know about X?”
wake-up returns a short digest — a few hundred tokens of what matters — meant to be read at the start of a task, so an agent boots with the team’s relevant knowledge already in hand.

Both are plain CLI calls the agent can run itself. There’s no MCP server to stand up, no database to administer — the index is a companion to the markdown, driven by a command.

Degrade-and-detect

If the semantic layer isn’t installed at all, none of this errors — it’s a no-op, and agents fall back to reading their markdown directly. The fast path is an upgrade, not a dependency. That’s the markdown-first principle paying off: the floor never falls out from under the team.

How fast is “fast”?

The “milliseconds” promise is about the experience, and it’s worth being precise. Recall feels instant because the expensive work — turning notes into vectors — happens ahead of time, during mining. At recall time you embed one short query and compare it against pre-computed vectors, which is cheap. The agent gets its handful of relevant notes back without a perceptible wait, and without dragging its whole history into context. Compared to the alternative — re-reading everything, or worse, re-discovering it — it’s the difference between remembering and relearning.

A practical note: lighter embedding models keep recall snappy on modest hardware, which matters when everything runs locally. A small, fast model is the sensible default; a larger multilingual one is there when you need it. The point is that local and fast aren’t in tension here.

What shared memory changes day to day

The abstract benefit is “agents remember.” The concrete one shows up in your habits:

Your briefs get shorter every week, because the team stops needing context re-explained.
Agents stop contradicting each other, because they recall the same shared decisions.
A new agent added to the team is useful immediately — it wakes up into the hive’s accumulated knowledge instead of a blank slate.

That’s the quiet superpower of a memory layer: it makes a coordinated team get better over time instead of starting cold every run. It’s also what lets the GOD orchestrator route well — it can recall what’s been tried and who knows what.

FAQ

Does semantic memory replace the markdown notes? No — it sits on top of them. The markdown is the source of truth; the index is a recall accelerator built from it. That ordering is the whole point.

Who decides what gets remembered? The agents do, by writing durable facts to their own memory as they work. Good memory hygiene — small, atomic, factual notes — makes recall sharp; rambling notes make it fuzzy.

Munder Difflin gives every Claude Code agent markdown memory plus a shared semantic palace the whole hive can recall from — local, fast, and graceful when offline. Download Munder Difflin to give your agents memory that actually sticks; it’s free and open source.

FAQ

What is semantic memory for an AI agent?

It's a layer that lets an agent recall past knowledge by meaning rather than exact keywords. Notes are embedded into vectors; at recall time the agent's query is embedded too and the closest notes are returned — so 'how do we build the web bundle?' finds a note about electron-vite even if it never used those words.

Why markdown-first instead of a vector database?

Markdown notes are human-readable, diffable in git, and degrade gracefully — if the index breaks, the files still work. A heavyweight vector DB wants to own the agent runtime and adds operational weight you don't need for a handful of agents. The index sits on top of the markdown, not in place of it.

How does shared memory help a team of agents?

Each agent's notes are mined into a shared store, so one agent can recall what another learned. The team stops re-discovering the same facts, and the human stops re-explaining context every session.

Keyword recall isn’t enough #

What semantic recall actually does #

Why markdown-first beats a heavyweight DB #

The shape of a shared memory layer #

One palace, many wings #

Mining: notes become searchable automatically #

Recall: search and wake-up #

Degrade-and-detect #

How fast is “fast”? #

What shared memory changes day to day #

FAQ #

FAQ