Every time you ask a Large Language Model a question it has answered before, it solves it from scratch. Every matrix multiplication. Every token. Every watt. Over and over again.
We built a system that stops doing that. It remembers.
This post is a walkthrough of RLRL-LLM — the Rich Learning Paradigm applied to Large Language Models. The idea is deceptively simple: build a graph of everything the LLM has ever reasoned through, and next time a similar question arrives, walk the graph instead of waking the neural network.
The result: 89.1% fewer tokens generated, zero hallucinations on known paths, and evidence that reasoning patterns transfer across completely unrelated domains.
The Problem: Expensive Amnesia
Modern LLMs are stateless. They have no long-term memory of their own reasoning. Ask GPT or DeepSeek to prove the Pythagorean theorem today, and it will execute billions of floating-point operations to produce the proof. Ask it again tomorrow — same operations, same energy, same result. Every query is a cold start.
This isn't just wasteful — it's architecturally fragile. Without memory, there's no mechanism to catch a reasoning error once it's been committed. Every inference pass is independent, so the same hallucination can recur indefinitely.
What if the model could recognize "I've solved this before" and simply recall the answer — the way you recall that 7 × 8 = 56 without re-deriving multiplication?
The Architecture: Two Systems, One Brain
DAPSA (Dual Active-Passive System Architecture) is inspired by Daniel Kahneman's dual-process theory: System 1 is fast, automatic recall; System 2 is slow, deliberate reasoning. In our implementation, System 1 is the reasoning graph (pure recall, zero neural compute) and System 2 is the LLM itself (full reasoning, invoked only when the graph has no match).
Analogy: Think of it like a library with a librarian. When you ask a question, the librarian first checks the card catalogue (System 1). If the book is on the shelf, she hands it to you in seconds. If it's not, she calls the author and commissions a new chapter (System 2) — then files it for next time.
How It Works: From Question to Answer
Here's the actual inference pipeline, step by step:
- Encode the query. The incoming question is embedded into a 384-dimensional vector using MiniLM-L6-v2 (running locally via ONNX — no API calls).
- Search the graph. HNSW indexing retrieves the nearest fossil — a previously stored, high-confidence reasoning chain — in O(log n) time. If the cosine distance is below the consonance threshold θc, the query is routed to System 1.
- System 1 path: Walk the topological graph from the matched node to its terminal state. Return the cached reasoning chain. Total LLM compute: zero tokens.
- System 2 path: Wake the LLM. Generate a full reasoning chain. Parse the chain into discrete steps. Score each step. Fossilize high-confidence paths into the graph for future recall.
- Self-heal. If a fossilized path leads to a wrong answer, the Autonomous Repair Agent (ARA) propagates negative rewards backward through the graph, weakening or pruning the faulty path.
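The routing decision in steps 1–3 can be sketched as follows. This is an illustrative Python sketch, not the actual C# implementation: the function names, the fossil structure, and the threshold value are all assumptions, and a linear scan stands in for the real HNSW index.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def route(query_vec, fossils, consonance_threshold=0.25):
    """Return ('system1', fossil) on a graph hit, ('system2', None) otherwise.
    In the real pipeline the nearest fossil comes from an HNSW index in
    O(log n); a linear scan keeps this sketch self-contained."""
    best = min(fossils, key=lambda f: cosine_distance(query_vec, f["vec"]))
    if cosine_distance(query_vec, best["vec"]) < consonance_threshold:
        return "system1", best   # walk the cached chain: zero LLM tokens
    return "system2", None       # wake the LLM, then fossilize the result

# A near-duplicate of a stored question routes to System 1:
fossils = [{"id": "pythagoras", "vec": [1.0, 0.0, 0.0]}]
path, hit = route([0.99, 0.05, 0.0], fossils)   # -> ("system1", ...)
```

The threshold θc is the single knob that trades recall aggressiveness against the risk of a false match: lower it and more queries fall through to System 2.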
The Results
Echo Test: Can It Remember?
The first test was simple: feed the system 20 reasoning chains, then ask it the same 20 questions again. Can the graph intercept them all without waking the LLM?
That 89.1% figure isn't theoretical. It's the ratio of tokens the LLM didn't have to generate because the graph already had the answer. On repeated queries, the system does zero neural compute.
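Concretely, the metric is just a ratio of token counts. The absolute numbers below are invented for illustration; only the resulting ratio mirrors the reported figure.

```python
# Hypothetical token counts for the 20-query echo test (illustrative only).
baseline_tokens  = 12_000  # tokens System 2 would generate with no memory
generated_tokens = 1_308   # tokens actually generated after graph interception

savings = 1 - generated_tokens / baseline_tokens
print(f"{savings:.1%}")  # 89.1%
```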
Self-Healing: Can It Fix Itself?
A memory that can't correct itself is just a cache with extra steps. The Autonomous Repair Agent (ARA) monitors the graph for logical inconsistencies. When a fossilized path leads to an incorrect terminal state, ARA walks backward through the reasoning chain and applies a discounted penalty to every node along the way.
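A minimal sketch of that backward pass, assuming a simple additive Q-value update with geometric decay — the function name, penalty magnitude, decay rate, and prune threshold are all assumptions, not DAPSA's actual parameters:

```python
def propagate_penalty(chain_q, penalty=-1.0, decay=0.5, prune_below=0.1):
    """chain_q: Q-values from first step to the (wrong) terminal state.
    Walk backward from the terminal node, discounting the penalty each hop.
    Returns (updated Q-values, indices of nodes that fell below the
    prune threshold)."""
    updated = list(chain_q)
    p = penalty
    for i in range(len(updated) - 1, -1, -1):  # terminal -> root
        updated[i] += p
        p *= decay  # earlier steps carry less blame
    pruned = [i for i, q in enumerate(updated) if q < prune_below]
    return updated, pruned
```

On a three-step chain with Q-values [0.9, 0.8, 0.6], a penalty of -1.0 with decay 0.5 yields [0.65, 0.3, -0.4]: only the faulty terminal node drops below the prune threshold and is removed, while earlier, still-plausible steps are merely weakened.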
| Configuration | Zero-Shot ARA | Peak Healing | Tokens Saved |
|---|---|---|---|
| Baseline (Run 8) | 35.1% | 59.8% | 20.8% |
| Fully Tuned (Run 9) | 27.5% | 82.4% | 89.1% |
| Multi-Domain (Run 11) | 37.4% | 68.9% | 96.8% |
At peak, ARA healed 82.4% of detected blind spots without any human intervention. The multi-domain configuration pushed token savings to 96.8% — meaning the LLM only had to think for 3.2% of the total workload.
Cross-Domain Transfer: The Eureka Effect
This is the result we didn't expect.
The graph was populated exclusively with mathematical reasoning chains — algebra, calculus, set theory. Then we asked it 20 computer science questions it had never seen: algorithm complexity, data structures, graph traversals. Zero training on CS content.
It got 34.3% right on the first attempt. That number alone isn't impressive. What's impressive is how: every single correct answer was produced by System 1 walking a mathematical fossil to solve a computer science problem. The graph recognized the structural topology of the reasoning, independent of the words.
We expanded the test. Three source domains (Math, Logic, Science) against two quarantined target domains (Computer Science, Machine Learning). The system never saw a single CS or ML training example.
170 out of 171 successful interceptions used reasoning fossils from a completely different domain. The continuous embedding space organized knowledge by causal structure, not by vocabulary. A proof-by-contradiction fossil originally built for a math problem can navigate a CS problem that shares the same logical shape.
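A toy illustration of why this can happen, with synthetic three-dimensional vectors standing in for the real 384-dimensional embeddings — the domain labels and numbers are invented, but the mechanism is the one described above: nearest-neighbor search in a space that encodes reasoning shape rather than vocabulary.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Fossils built from math problems only (vectors are synthetic):
fossils = {
    "math/proof-by-contradiction": [0.9, 0.1, 0.4],
    "math/induction":              [0.2, 0.9, 0.1],
}

# A CS question the system has never trained on, e.g.
# "show that no comparison sort beats O(n log n)":
cs_query = [0.85, 0.15, 0.35]

best = max(fossils, key=lambda k: cosine_sim(cs_query, fossils[k]))
# The CS query lands nearest the contradiction fossil: System 1 walks a
# mathematical reasoning chain to answer a computer-science question.
```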
The Stack: What's Actually Running
Everything runs locally on a single machine. No cloud APIs. No GPU required.
- Runtime: C# on .NET 10, ~3,500 lines of code.
- LLM: DeepSeek-R1:14b via Ollama (local). Recently extended with a provider abstraction to support GLM-Z1:9b and future models.
- Embeddings: MiniLM-L6-v2 (384-dim), running locally via ONNX Runtime. No API calls.
- Graph Storage: LiteDB for persistence, HNSW for approximate nearest-neighbor search, Neo4j optional for visualization.
- Self-Healing: ARA with backward Q-value propagation and configurable decay.
- Hardware: Apple Silicon (M4 Pro). Total power draw during inference: under 5 watts.
Why local? If the goal is to reduce LLM compute, it defeats the purpose to call a cloud API for embeddings. Every component — the encoder, the graph, the LLM — runs on the same machine. The system is fully self-contained.
What This Means
LLMs are powerful but wasteful. They re-derive known answers from scratch, they can't self-correct, and they forget everything between sessions. DAPSA addresses all three: the graph recalls instead of re-deriving, ARA heals faulty paths, and fossils persist across sessions.
This isn't a replacement for LLMs. It's a memory layer that makes them dramatically more efficient. The model still does the hard thinking — but only once per reasoning pattern.
Why this matters now: The recent Anthropic Mythos incident — a black-box model autonomously chaining zero-day exploits, prompting an emergency meeting between the US Treasury, the Fed, and major bank CEOs — illustrates why auditability isn't optional. In DAPSA, every decision is a traceable graph walk. There are no hidden policy adjustments. The "Refined Policy Adjustment" that made Mythos so dangerous is, in our system, an explicit, verifiable Recursive Meta Hierarchy you can inspect node by node.
Limitations and Next Steps
We're being transparent about what this experiment doesn't prove yet:
- Scale. The current evaluation uses a curated problem set. Public benchmarks (MMLU, GSM8K, ARC) are next.
- Adversarial robustness. Consonance checking relies on cosine similarity — adversarial paraphrases could potentially evade the gatekeeper. We're investigating dissonance stress testing.
- Multi-model validation. The provider abstraction is in place (DeepSeek-R1 and GLM-Z1 both supported), but systematic comparison across model families hasn't been completed.
The architecture is open. The next experiment will put DAPSA through public benchmarks and publish the results — no cherry-picking, no curation.