MemPalace Benchmarks Deep Dive: How 96.6% Recall Actually Works

Breaking down the LongMemEval R@5 benchmark methodology and what 96.6% recall means in practice for your AI agents.

The Benchmark: LongMemEval R@5

When MemPalace claims 96.6% recall, that number comes from a specific benchmark: LongMemEval R@5. This measures whether the correct memory appears in the top 5 results returned by the memory system, across a standardized set of conversational memory retrieval tasks.
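The R@5 metric itself is straightforward: a query counts as a hit if the ground-truth memory appears anywhere in the system's top 5 results, and recall is the fraction of hits over all queries. A minimal sketch (function and variable names are illustrative, not MemPalace's API):

```python
def recall_at_k(results: list[list[str]], ground_truth: list[str], k: int = 5) -> float:
    """Fraction of queries whose ground-truth memory ID appears in the top-k results."""
    hits = sum(
        1 for retrieved, truth in zip(results, ground_truth)
        if truth in retrieved[:k]
    )
    return hits / len(ground_truth)

# Toy example: 2 of 3 queries have the right memory in the top 5.
retrieved = [
    ["m7", "m2", "m9", "m1", "m4"],   # truth m2 -> hit
    ["m3", "m8", "m5", "m6", "m0"],   # truth m1 -> miss
    ["m1", "m6", "m2", "m3", "m4"],   # truth m1 -> hit
]
truths = ["m2", "m1", "m1"]
print(recall_at_k(retrieved, truths))  # -> 0.666... (2 of 3 hits)
```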

LongMemEval was designed to test real-world memory retrieval scenarios: multi-turn conversations where the agent needs to recall facts, preferences, and context from earlier exchanges. It's not a toy benchmark. It simulates the actual usage pattern of AI agents that maintain long-term memory across sessions.

What Makes 96.6% Remarkable

The number alone is impressive, but context makes it extraordinary. MemPalace achieves this recall rate with zero API calls during retrieval. Every other competitive system (Mem0, Zep, LangChain Memory) relies on external embedding APIs or vector database queries. MemPalace runs entirely on your local machine.

The secret is the palace hierarchy. Instead of computing similarity across every stored memory (which is what vector search does), MemPalace navigates the spatial hierarchy: Wing → Hall → Room → Closet → Drawer. This narrows the search space dramatically before any comparison happens.
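The navigation idea can be sketched as a greedy descent over the hierarchy: at each level, pick the child whose summary best matches the query, and only compare memories once you reach a leaf. This is purely illustrative (a toy overlap heuristic standing in for MemPalace's actual internals):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    summary: str                                        # short label used to pick a branch
    children: list["Node"] = field(default_factory=list)
    memories: list[str] = field(default_factory=list)   # only leaves hold memories

def navigate(node: Node, query_terms: set[str]) -> list[str]:
    """Descend Wing -> Hall -> Room -> ..., picking the child whose summary
    overlaps the query most; only the final leaf's memories get compared."""
    while node.children:
        node = max(node.children,
                   key=lambda c: len(query_terms & set(c.summary.split())))
    return node.memories

palace = Node("palace", "", [
    Node("work", "work benchmark projects deadlines", [
        Node("mempalace", "mempalace benchmark recall",
             memories=["ran LongMemEval, got 96.6%"]),
        Node("standup", "standup meeting notes",
             memories=["demo on Friday"]),
    ]),
    Node("personal", "personal travel food", [
        Node("travel", "travel tokyo trip",
             memories=["flight lands Tuesday"]),
    ]),
])

print(navigate(palace, {"benchmark", "recall"}))
# -> ['ran LongMemEval, got 96.6%']
```

The point of the descent is that each level eliminates entire subtrees, so the number of memories actually compared stays small no matter how large the store grows.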

Methodology Breakdown

The evaluation uses 500 memory retrieval queries across 50 simulated conversation histories, each containing 20-100 stored memories. For each query, the system returns its top 5 candidate memories, and the evaluator checks whether the ground-truth memory is among them.
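Concretely, 96.6% recall over 500 queries means the ground-truth memory landed in the top 5 for 483 of them:

```python
queries = 500
recall = 0.966

hits = round(queries * recall)   # queries with the right memory in the top 5
misses = queries - hits          # queries where it was missed

print(hits, misses)  # -> 483 17
```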

| System | R@5 | Startup Tokens | API Calls |
|---|---|---|---|
| MemPalace | 96.6% | 170 | 0 |
| MemPalace + Haiku Rerank | 100% | 170 + rerank | 1 |
| Mem0 | ~85% | 2,000-5,000 | 2+ |
| Zep | ~80% | 3,000+ | 1+ |

The 170-Token Advantage

At startup, MemPalace injects only 170 tokens into the agent's context window. This is the palace “map”: a compressed index of Wings and Halls that tells the system where to look. Compare that to the 2,000-5,000 tokens of retrieved context Mem0 injects on every conversation turn.

Why does this matter? Because context window space is the scarcest resource in AI agent design. Every token used for memory is a token not available for reasoning, tool use, or user interaction. MemPalace leaves 97% of the context window free for actual work.
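The arithmetic behind that claim is easy to check. Assuming an 8,192-token context window (an illustrative figure; real windows vary by model), the difference between a 170-token map and thousands of tokens of injected context is stark:

```python
WINDOW = 8_192  # illustrative context size, not a MemPalace requirement

def free_fraction(memory_tokens: int, window: int = WINDOW) -> float:
    """Fraction of the context window left for reasoning, tools, and the user."""
    return 1 - memory_tokens / window

print(f"MemPalace map (170 tok):    {free_fraction(170):.1%} free")    # -> 97.9% free
print(f"Mem0 low end (2,000 tok):   {free_fraction(2_000):.1%} free")  # -> 75.6% free
print(f"Mem0 high end (5,000 tok):  {free_fraction(5_000):.1%} free")  # -> 39.0% free
```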

When to Use Haiku Rerank

The optional Haiku reranking step bumps recall from 96.6% to 100%, but at the cost of one API call per retrieval. For most use cases, the base 96.6% is more than sufficient. Enable reranking when:

  • High-stakes recall — medical, legal, or financial agents where missing a memory has real consequences
  • Large memory stores — 1,000+ memories, where the hierarchy alone may not narrow the search enough
  • Ambiguous queries — when user intent is vague and multiple memories could be relevant
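Those criteria collapse into a small predicate. This is purely illustrative (MemPalace's actual configuration surface may look nothing like this); the 1,000-memory threshold is the figure cited above:

```python
def should_rerank(high_stakes: bool, memory_count: int, query_ambiguous: bool,
                  large_store_threshold: int = 1_000) -> bool:
    """Heuristic: pay the one-API-call cost only when any criterion applies."""
    return high_stakes or memory_count >= large_store_threshold or query_ambiguous

print(should_rerank(False, 250, False))    # -> False: base 96.6% recall is sufficient
print(should_rerank(False, 5_000, False))  # -> True: large store justifies the rerank call
```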

Reproducing the Benchmark

The benchmark suite is included in the MemPalace repository. You can run it yourself:

```bash
git clone https://github.com/milla-jovovich/mempalace
cd mempalace
pip install -e ".[dev]"
python benchmarks/longmemeval.py --report
```

The benchmark takes about 10 minutes on a modern laptop. Results are written to benchmarks/results/ with per-query breakdowns.