
89% Fewer Tokens

Teaching an LLM When Not to Think

Every time you ask a Large Language Model a question it has answered before, it solves it from scratch. Every matrix multiplication. Every token. Every watt. Over and over again.

We built a system that stops doing that. It remembers.

This post is a walkthrough of RLRL-LLM — the Rich Learning Paradigm applied to Large Language Models. The idea is deceptively simple: build a graph of everything the LLM has ever reasoned through, and next time a similar question arrives, walk the graph instead of waking the neural network.

The result: 89.1% fewer tokens generated, zero hallucinations on known paths, and evidence that reasoning patterns transfer across completely unrelated domains.

The Problem: Expensive Amnesia

Modern LLMs are stateless. They have no long-term memory of their own reasoning. Ask GPT or DeepSeek to prove the Pythagorean theorem today, and it will execute billions of floating-point operations to produce the proof. Ask it again tomorrow — same operations, same energy, same result. Every query is a cold start.

This isn't just wasteful — it's architecturally fragile. Without memory, there's no mechanism to catch a reasoning error once it's been committed. Every inference pass is independent, so the same hallucination can recur indefinitely.

What if the model could recognize "I've solved this before" and simply recall the answer — the way you recall that 7 × 8 = 56 without re-deriving multiplication?

The Architecture: Two Systems, One Brain

DAPSA (Dual Active-Passive System Architecture) is inspired by Daniel Kahneman's dual-process theory: System 1 is fast, automatic recall. System 2 is slow, deliberate reasoning. In our implementation:

System 1 — The Passive Manifold. A topological graph stored in memory. Each node is a reasoning step; each edge is a logical dependency. When a query matches a known path, the system walks the stored path and returns the answer with zero LLM compute.
System 2 — The Active Manifold. The actual LLM (DeepSeek-R1 running locally via Ollama). It only activates when the system encounters something genuinely new — a question whose embedding doesn't match any existing fossil.
The Gatekeeper — Consonance Checking. A MiniLM-L6-v2 embedding model (384 dimensions) compares every incoming query against the graph using HNSW approximate nearest-neighbor search. If the cosine distance is below the threshold, System 1 handles it. If not, System 2 wakes up.
Analogy: Think of it like a library with a librarian. When you ask a question, the librarian first checks the card catalogue (System 1). If the book is on the shelf, she hands it to you in seconds. If it's not, she calls the author and commissions a new chapter (System 2) — then files it for next time.
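
The gatekeeper's decision reduces to a nearest-neighbor lookup plus a threshold test. Below is a minimal sketch of that gate: `encode` is a toy hash-based stand-in for the MiniLM-L6-v2 encoder, a linear scan stands in for HNSW, and the threshold value is illustrative rather than the system's tuned θc.

```python
import math

DIM = 8          # toy dimension; the real encoder produces 384-d vectors
THETA_C = 0.3    # consonance threshold (illustrative value, not the tuned one)

def encode(text: str) -> list[float]:
    """Toy hash-based stand-in for a sentence-embedding model."""
    vec = [0.0] * DIM
    for i, ch in enumerate(text.lower()):
        vec[i % DIM] += ord(ch)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine_distance(a: list[float], b: list[float]) -> float:
    # Vectors are pre-normalized, so cosine distance is 1 minus the dot product.
    return 1.0 - sum(x * y for x, y in zip(a, b))

def gate(query: str, fossils: list[dict]) -> str:
    """Route to System 1 on a consonant match, otherwise wake System 2."""
    q = encode(query)
    best = min(fossils, key=lambda f: cosine_distance(q, f["vec"]), default=None)
    if best is not None and cosine_distance(q, best["vec"]) < THETA_C:
        return "system1"
    return "system2"

fossils = [{"id": "pythagoras", "vec": encode("prove the pythagorean theorem")}]
print(gate("prove the pythagorean theorem", fossils))  # exact repeat -> system1
print(gate("anything", []))                            # empty graph  -> system2
```

In the real system the linear `min` scan is replaced by an HNSW index, which is what makes the lookup cheap even as the fossil graph grows.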

How It Works: From Question to Answer

Here's the actual inference pipeline, step by step:

  1. Encode the query. The incoming question is embedded into a 384-dimensional vector using MiniLM-L6-v2 (running locally via ONNX — no API calls).
  2. Search the graph. HNSW indexing retrieves the nearest fossil in O(log n) time. If the cosine distance is below the consonance threshold θc, the query is routed to System 1.
  3. System 1 path: Walk the topological graph from the matched node to its terminal state. Return the cached reasoning chain. Total LLM compute: zero tokens.
  4. System 2 path: Wake the LLM. Generate a full reasoning chain. Parse the chain into discrete steps. Score each step. Fossilize high-confidence paths into the graph for future recall.
  5. Self-heal. If a fossilized path leads to a wrong answer, the Autonomous Repair Agent (ARA) propagates negative rewards backward through the graph, weakening or pruning the faulty path.
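
Steps 1 through 4 can be sketched as a single routing function. Everything here is illustrative rather than the project's actual API: a linear scan stands in for HNSW, and `encode` / `llm` are toy stubs you would replace with the real encoder and model.

```python
from dataclasses import dataclass

@dataclass
class Fossil:
    vec: tuple     # embedding of the original question
    steps: list    # cached reasoning chain

def distance(a, b):
    """Cosine distance over pre-normalized vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def answer(query, graph, encode, llm_generate, theta_c=0.3):
    vec = encode(query)                                                  # 1. encode
    best = min(graph, key=lambda f: distance(vec, f.vec), default=None)  # 2. search
    if best is not None and distance(vec, best.vec) < theta_c:
        return best.steps, 0                                             # 3. System 1: zero tokens
    steps, tokens = llm_generate(query)                                  # 4. System 2: wake the LLM
    graph.append(Fossil(vec, steps))                                     #    fossilize for next time
    return steps, tokens

# Toy stubs: a two-topic "encoder" and an "LLM" that always spends 120 tokens.
encode = lambda q: (1.0, 0.0) if "triangle" in q else (0.0, 1.0)
llm = lambda q: (["decompose", "apply theorem", "conclude"], 120)

graph = []
_, first = answer("right triangle proof", graph, encode, llm)   # cold: LLM runs
_, second = answer("right triangle proof", graph, encode, llm)  # warm: graph hit
print(first, second)  # 120 0
```

The second call never touches the stub LLM: the fossil created by the first call intercepts it, which is exactly the behavior the echo test below measures.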

The Results

Echo Test: Can It Remember?

The first test was simple: feed the system 20 reasoning chains, then ask it the same 20 questions again. Can the graph intercept them all without waking the LLM?

20/20 queries intercepted: 100% recall from the graph
89.1% tokens saved: 407,552 tokens bypassed
49,992 active tokens used: only for novel queries

That 89.1% figure isn't theoretical. It's the ratio of tokens the LLM didn't have to generate because the graph already had the answer. On repeated queries, the system does zero neural compute.
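
The headline figure follows directly from the two token counts reported above:

```python
bypassed = 407_552    # tokens the graph answered without waking the LLM
generated = 49_992    # tokens System 2 actually produced (novel queries)

savings = bypassed / (bypassed + generated)
print(f"{savings:.1%}")  # → 89.1%
```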

Self-Healing: Can It Fix Itself?

A memory that can't correct itself is just a cache with extra steps. The Autonomous Repair Agent (ARA) monitors the graph for logical inconsistencies. When a fossilized path leads to an incorrect terminal state, ARA walks backward through the reasoning chain and applies a discounted penalty to every node along the way.

Configuration           Zero-Shot   ARA Peak Healing   Tokens Saved
Baseline (Run 8)        35.1%       59.8%              20.8%
Fully Tuned (Run 9)     27.5%       82.4%              89.1%
Multi-Domain (Run 11)   37.4%       68.9%              96.8%

At peak, ARA healed 82.4% of detected blind spots without any human intervention. The multi-domain configuration pushed token savings to 96.8% — meaning the LLM only had to think for 3.2% of the total workload.
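
ARA's backward pass can be sketched as a discounted penalty walk from the faulty terminal node toward the root. The discount factor, penalty size, and pruning floor below are assumptions for illustration, not the system's tuned hyperparameters.

```python
GAMMA = 0.5          # discount factor (assumed)
PRUNE_BELOW = 0.1    # confidence floor below which a node is pruned (assumed)

def repair(chain, penalty=1.0, gamma=GAMMA):
    """Weaken every node on a faulty path, hitting the terminal state hardest.

    `chain` is a list of {'confidence': float} nodes ordered root -> terminal.
    Returns the surviving chain after pruning.
    """
    for depth, node in enumerate(reversed(chain)):
        node["confidence"] = max(node["confidence"] - penalty * gamma ** depth, 0.0)
    return [n for n in chain if n["confidence"] >= PRUNE_BELOW]

chain = [{"confidence": 0.9}, {"confidence": 0.9}, {"confidence": 0.9}]
healed = repair(chain)
print([round(n["confidence"], 2) for n in healed])  # terminal node pruned
```

The node that produced the wrong answer takes the full penalty and is pruned outright; earlier steps, which may still be sound, are merely weakened and can be reinforced again by future correct walks.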

Cross-Domain Transfer: The Eureka Effect

This is the result we didn't expect.

The graph was populated exclusively with mathematical reasoning chains — algebra, calculus, set theory. Then we asked it 20 computer science questions it had never seen: algorithm complexity, data structures, graph traversals. Zero training on CS content.

It got 34.3% right on the first attempt. That number alone isn't impressive. What's impressive is how: every single correct answer was produced by System 1 walking a mathematical fossil to solve a computer science problem. The graph recognized the structural topology of the reasoning, independent of the words.

We expanded the test. Three source domains (Math, Logic, Science) against two quarantined target domains (Computer Science, Machine Learning). The system never saw a single CS or ML training example.

99.4% interception transfer: 170 of 171 Eurekas were cross-domain
3 source domains: Math · Logic · Science
2 target domains: CS · ML (zero-shot)

170 out of 171 successful interceptions used reasoning fossils from a completely different domain. The continuous embedding space organized knowledge by causal structure, not by vocabulary. A proof-by-contradiction fossil originally built for a math problem can navigate a CS problem that shares the same logical shape.

The Stack: What's Actually Running

Everything runs locally on a single machine. No cloud APIs. No GPU required.

Why local? If the goal is to reduce LLM compute, it defeats the purpose to call a cloud API for embeddings. Every component — the encoder, the graph, the LLM — runs on the same machine. The system is fully self-contained.

What This Means

LLMs are powerful but wasteful. They re-derive known answers from scratch, they can't self-correct, and they forget everything between sessions. DAPSA addresses all three:

No redundant computation. Known reasoning paths are recalled from the graph. The LLM only activates for genuinely novel problems.
Self-correcting memory. ARA detects and heals faulty paths automatically. The graph gets more reliable over time, not less.
Structural generalization. The embedding space organizes by causal topology, enabling zero-shot transfer across unrelated domains.

This isn't a replacement for LLMs. It's a memory layer that makes them dramatically more efficient. The model still does the hard thinking — but only once per reasoning pattern.

Why this matters now: The recent Anthropic Mythos incident — a black-box model autonomously chaining zero-day exploits, prompting an emergency meeting between the US Treasury, the Fed, and major bank CEOs — illustrates why auditability isn't optional. In DAPSA, every decision is a traceable graph walk. There are no hidden policy adjustments. The "Refined Policy Adjustment" that made Mythos so dangerous is, in our system, an explicit, verifiable Recursive Meta Hierarchy you can inspect node by node.

Limitations and Next Steps

We want to be transparent about what this experiment does not yet prove: every result above comes from our own internal test sets, not from independent evaluation.

The architecture is open. The next experiment will put DAPSA through public benchmarks and publish the results, with no cherry-picking and no curation.