rolandpg · Jun 9, 2026
@@ -57,15 +57,28 @@ jobs:
         # install, not at runtime; CI builds in ephemeral runners with
         # no persistent state. Re-evaluate when GitHub's images ship a
         # patched pip.
+        #
+        # CVE-2023-36464 / GHSA-4vvm-4w3v-6mr8: medium-severity
+        # infinite-loop DoS in PyPDF2 3.0.1, introduced transitively by
+        # Maigret. PyPDF2 has no patched release under that package name
+        # (upstream recommends migrating to pypdf>=3.9.0), and ZettelForge's
+        # AGE-120 username collector does not parse attacker-supplied PDFs or
+        # invoke Maigret report generation. Accepted for AGE-120 because the
+        # GOV-009 blocking threshold is HIGH/CRITICAL and the collector
+        # lazy-imports Maigret and fails closed.
         pip-audit --strict \
           --ignore-vuln=CVE-2026-3219 \
-          --ignore-vuln=PYSEC-2026-196
+          --ignore-vuln=PYSEC-2026-196 \
+          --ignore-vuln=CVE-2023-36464
 
   test:
     runs-on: ubuntu-latest
     needs: lint
     strategy:
       fail-fast: false
+      # The fastembed model download is shared across Python versions. Running
+      # these jobs in parallel can double-hit HuggingFace and trigger 429s.
+      max-parallel: 1
       matrix:
         python-version: ['3.12', '3.13']
 
@@ -77,6 +90,14 @@ jobs:
       with:
         python-version: ${{ matrix.python-version }}
 
+    - name: Cache fastembed model
+      uses: actions/cache@0057852bfaa89a56745cba8c7296529d2fc39830
+      with:
+        path: |
+          ~/.cache/fastembed
+          ~/.cache/huggingface
+        key: fastembed-nomic-embed-text-v1.5-Q-${{ runner.os }}
+
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip

@@ -28,6 +28,79 @@ ZettelForge was evaluated across five benchmark suites. The system runs with zer
 
 ---
 
+## 0. Performance session 2026-06-09 (v2.8.0-dev, branch perf/cti-memory-40)
+
+All numbers below are same-machine (DGX Spark GB10), same-day, with a
+deterministic config: enrichment disabled (`ZETTELFORGE_ENRICHMENT_ENABLED=false`),
+keyword judge, heuristic answer extraction (no synthesis LLM installed). The clean
+baseline was measured first on unmodified v2.7.0 source after repairing the
+bit-rotted harnesses (dead `disable_enrichment` kwarg, removed `remember_chunked`
+API). Raw logs: `benchmarks/results/session_2026-06-09/`.
+
+| Metric | v2.7.0 baseline | optimized | delta |
+|--------|-----------------|-----------|-------|
+| LoCoMo accuracy (keyword judge) | 7.0% | 11.0% | +57% relative |
+| LoCoMo p50 / p95 latency | 336ms / 387ms | 170ms / 193ms | -49% / -50% |
+| LoCoMo ingest (272 sessions) | 262.5s (1.0/s) | 33.8s (8.0/s) | 7.8x |
+| CTI retrieval accuracy | 75.0% | 75.0% | held |
+| CTI p50 latency (idle machine) | 79ms | 39ms | -51% |
+| recall p95 (profiled, 60 calls) | 258ms | 93ms | -64% |
+| recall mean (profiled) | 117.6ms | 54.8ms | -53% |
+
+Note on LoCoMo baselines: the previously published 22% (v2.1.1) used a local
+synthesis LLM (qwen2.5:3b) that is not installed on this host. Both columns
+above use the same deterministic heuristic-extraction path, so the comparison
+is like-for-like. Latency includes harness overhead (keyword-boost scan and
+synthesis fallback), not just `recall()`.
+
+### What changed
+
+1. **Scoped knowledge graph reads.** `_recall_inner` traversed the
+   process-global JSONL KG (109MB on this host, mixing every store) while
+   writes went to the per-store SQLite KG. Isolated stores saw up to ~2000
+   phantom note IDs per entity query and never saw their own graph. Recall
+   now reads the store's KG via `StoreGraphSource`.
+2. **MemSAD gate vectorized.** The write-time anomaly gate accounted for 93%
+   of `remember()` latency at 50 references (~1.1s per ingest): O(n^2)
+   pure-Python cosines plus n^2 n-gram recounts on every ingest. Vectorized
+   numpy pairwise scoring, a content-hash-keyed counter cache, and a bounded
+   reference fetch (`get_recent_notes_by_domain`) cut warm `evaluate()` to
+   ~3.4ms; characterization tests pin the scores to the original math at 1e-9.
+3. **Rerank policy.** Cross-encoder reranking is the dominant read cost and is
+   worth +15pp CTI accuracy (75% vs 60% without it). Grid-tuned bounds are
+   8 candidates at 256 chars/doc; accuracy holds from 50x512 down to 8x128
+   (candidates x chars) and collapses below 8 candidates. `rerank_model` is
+   configurable; the model grid search kept ms-marco-MiniLM-L-6-v2.
+4. **ONNX thread pinning.** The 20-core default oversubscribed small batches;
+   pinning to 8 threads cut rerank from 23.7ms to 11.5ms and query embedding from 5.9ms to 4.5ms.
+5. **Embedding LRU cache.** Embeddings are cached by (model, sha256(text)); this
+   is the first integration of the previously dormant `cache.py`.
+6. **Entity fan-out gate.** Query entities whose KG out-degree exceeds
+   `retrieval.entity_max_fanout` (default 25) are skipped by the graph and
+   entity-augmentation stages, because conversational speaker names map to
+   every session and would otherwise flood blended recall.
+7. **Enrichment off-switch.** `ZETTELFORGE_ENRICHMENT_ENABLED` restores
+   deterministic benchmark ingestion; `remember_chunked()` is also restored.
+
+### Chunked-ingestion configuration (recorded, not default)
+
+`LOCOMO_CHUNK_SIZE=800` stores each session as ~800-char chunks
+(MemPalace granularity, no 4000-char truncation): 13.0% accuracy at
+p50 347ms / p95 418ms on a ~1400-note store. Against the v2.7.0 baseline
+at effectively the same latency (p50 336ms), that is +86% relative accuracy;
+against the default optimized config, it trades 2x latency for +2pp. The
+default stays full-session ingestion (11.0% at p50 170ms).
+
+### Negative result (recorded)
+
+Free-text person extraction (capitalized tokens in running text) dropped
+LoCoMo accuracy from 11% to 5% by reshuffling supersession chains at ingest,
+with no single-hop or multi-hop gain. It was reverted the same day and is
+regression-locked in `tests/test_conversational_entities.py`. Conversational
+NER should come via the RFC-001 LLM path, not regex.
+
+---
+
 ## 1. CTI Retrieval Benchmark (Domain Benchmark)
 
 **Date:** 2026-04-10 | **Corpus:** 8 real-world-style CTI reports | **Queries:** 20
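
The embedding cache in item 5 of the results doc keys vectors on (model, sha256(text)). The sketch below shows one way such a cache can behave; `EmbeddingLRU`, `get_or_compute`, and the `embed_fn` callback are hypothetical names for illustration, not the API of ZettelForge's `cache.py`.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List, Tuple

Vector = List[float]


class EmbeddingLRU:
    """Hypothetical LRU cache for embeddings keyed by (model, sha256(text))."""

    def __init__(self, maxsize: int = 1024) -> None:
        self.maxsize = maxsize
        self._data: "OrderedDict[Tuple[str, str], Vector]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, text: str) -> Tuple[str, str]:
        # Hashing keeps keys small for long notes; including the model name
        # keeps vectors from different embedding models apart.
        return model, hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, text: str, embed_fn: Callable[[str], Vector]) -> Vector:
        k = self.key(model, text)
        if k in self._data:
            self.hits += 1
            self._data.move_to_end(k)  # mark as most recently used
            return self._data[k]
        self.misses += 1
        vec = embed_fn(text)
        self._data[k] = vec
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used entry
        return vec
```

Keying on the model name as well as the text hash means switching embedding models can never return a stale vector of the wrong dimensionality.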

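The entity fan-out gate in item 6 of the results doc amounts to a degree filter applied before graph expansion. A minimal sketch, assuming the KG is available as a plain adjacency mapping; `gate_entities` and the dict-of-sets graph are illustrative, not ZettelForge's internal types, and the threshold mirrors the documented `retrieval.entity_max_fanout` default of 25.

```python
from typing import Dict, Iterable, List, Set


def gate_entities(
    query_entities: Iterable[str],
    out_edges: Dict[str, Set[str]],
    max_fanout: int = 25,  # mirrors the documented retrieval.entity_max_fanout default
) -> List[str]:
    """Keep only query entities whose KG out-degree is at most max_fanout.

    High-degree hubs (e.g. a conversational speaker linked to every session)
    are dropped so they cannot flood graph and entity-augmentation recall.
    Entities absent from the graph have degree 0 and are kept.
    """
    kept = []
    for entity in query_entities:
        degree = len(out_edges.get(entity, set()))
        if degree <= max_fanout:
            kept.append(entity)
    return kept


# Hypothetical graph: a speaker name linked to 40 sessions, a CVE linked to 2 notes.
graph = {
    "speaker:alice": {f"session_{i}" for i in range(40)},
    "cve:CVE-2023-36464": {"note_1", "note_2"},
}
print(gate_entities(["speaker:alice", "cve:CVE-2023-36464"], graph))
# ['cve:CVE-2023-36464']
```
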
@@ -22,6 +22,8 @@
 from typing import List, Dict, Tuple
 
 os.environ["ZETTELFORGE_BACKEND"] = "jsonl"
+# Deterministic ingestion: no background LLM enrichment during benchmarks.
+os.environ.setdefault("ZETTELFORGE_ENRICHMENT_ENABLED", "false")
 
 from zettelforge import MemoryManager
 

@@ -1,78 +1,78 @@
 {
   "meta": {
-    "date": "2026-04-10T08:05:55.405026",
+    "date": "2026-06-09T13:56:15.802128",
     "reports": 8,
     "queries": 20
   },
   "full_session": {
     "strategy": "full_session",
     "notes": 8,
-    "ingest_time_s": 69.1,
+    "ingest_time_s": 3.4,
     "accuracy": 75.0,
-    "avg_score": 0.875,
-    "p50_latency_ms": 620.0,
-    "p95_latency_ms": 2732.0,
+    "avg_score": 0.85,
+    "p50_latency_ms": 39.0,
+    "p95_latency_ms": 159.0,
     "by_category": {
       "tool-attribution": {
         "accuracy": 40.0,
         "avg_score": 0.7,
-        "p50_latency_ms": 1343.0
+        "p50_latency_ms": 42.0
       },
       "cve-linkage": {
         "accuracy": 75.0,
-        "avg_score": 0.875,
-        "p50_latency_ms": 794.0
+        "avg_score": 0.75,
+        "p50_latency_ms": 38.0
       },
       "attribution": {
         "accuracy": 100.0,
         "avg_score": 1.0,
-        "p50_latency_ms": 611.0
+        "p50_latency_ms": 59.0
       },
       "temporal": {
         "accuracy": 66.7,
         "avg_score": 0.833,
-        "p50_latency_ms": 569.0
+        "p50_latency_ms": 41.0
       },
       "multi-hop": {
         "accuracy": 100.0,
         "avg_score": 1.0,
-        "p50_latency_ms": 644.0
+        "p50_latency_ms": 38.0
       }
     }
   },
   "chunked_800": {
     "strategy": "chunked_800",
     "notes": 8,
-    "ingest_time_s": 56.5,
+    "ingest_time_s": 0.1,
     "accuracy": 75.0,
-    "avg_score": 0.875,
-    "p50_latency_ms": 706.0,
-    "p95_latency_ms": 2729.0,
+    "avg_score": 0.85,
+    "p50_latency_ms": 52.0,
+    "p95_latency_ms": 59.0,
     "by_category": {
       "tool-attribution": {
         "accuracy": 40.0,
         "avg_score": 0.7,
-        "p50_latency_ms": 1299.0
+        "p50_latency_ms": 50.0
       },
       "cve-linkage": {
         "accuracy": 75.0,
-        "avg_score": 0.875,
-        "p50_latency_ms": 795.0
+        "avg_score": 0.75,
+        "p50_latency_ms": 52.0
       },
       "attribution": {
         "accuracy": 100.0,
         "avg_score": 1.0,
-        "p50_latency_ms": 535.0
+        "p50_latency_ms": 52.0
       },
       "temporal": {
         "accuracy": 66.7,
         "avg_score": 0.833,
-        "p50_latency_ms": 772.0
+        "p50_latency_ms": 54.0
       },
       "multi-hop": {
         "accuracy": 100.0,
         "avg_score": 1.0,
-        "p50_latency_ms": 741.0
+        "p50_latency_ms": 33.0
       }
     }
   }
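
Because the results file keeps the same structure across runs, per-category latency ratios between the 2026-04-10 and 2026-06-09 versions can be computed mechanically. A minimal sketch; `p50_speedups` is a hypothetical helper, and the inline dicts copy only a few values from the diff above instead of loading the files.

```python
from typing import Dict


def p50_speedups(old: dict, new: dict, strategy: str) -> Dict[str, float]:
    """Return old/new p50 latency ratios (overall and per category) for one strategy."""
    old_cats = old[strategy]["by_category"]
    new_cats = new[strategy]["by_category"]
    ratios = {"overall": old[strategy]["p50_latency_ms"] / new[strategy]["p50_latency_ms"]}
    for name, stats in old_cats.items():
        if name in new_cats:  # skip categories missing from the newer run
            ratios[name] = stats["p50_latency_ms"] / new_cats[name]["p50_latency_ms"]
    return ratios


# Subset of the before/after full_session values from the results diff above.
old = {"full_session": {"p50_latency_ms": 620.0, "by_category": {
    "tool-attribution": {"p50_latency_ms": 1343.0},
    "attribution": {"p50_latency_ms": 611.0}}}}
new = {"full_session": {"p50_latency_ms": 39.0, "by_category": {
    "tool-attribution": {"p50_latency_ms": 42.0},
    "attribution": {"p50_latency_ms": 59.0}}}}

for name, ratio in p50_speedups(old, new, "full_session").items():
    print(f"{name:<18} {ratio:5.1f}x")
```

In practice `old` and `new` would come from `json.load` on each version of the file. Treat the ratios with care: the 2026-04-10 run predates the same-day v2.7.0 baseline in section 0 of the results doc (CTI p50 79ms), so much of the 620ms-to-39ms drop reflects whatever differed between the two runs; section 0's 79ms-to-39ms is the like-for-like figure.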

@@ -0,0 +1,62 @@
+#!/usr/bin/env python3
+"""Instrument note-lookup volume per recall stage.
+
+Counts store.get_note_by_id calls (total vs unique ids) and graph result
+sizes per query to locate the redundant-lookup source the profiler exposed
+(~476 lookups/query on an 8-note corpus).
+
+Usage:
+  python benchmarks/instrument_lookups.py
+"""
+import os
+import tempfile
+
+os.environ.setdefault('ZETTELFORGE_ENRICHMENT_ENABLED', 'false')
+
+from cti_retrieval_benchmark import CTI_QUERIES, CTI_REPORTS
+
+from zettelforge import MemoryManager
+from zettelforge.graph_retriever import GraphRetriever
+
+
+def main() -> None:
+    tmpdir = tempfile.mkdtemp(prefix='instr_lookups_')
+    mm = MemoryManager(jsonl_path=f'{tmpdir}/notes.jsonl', lance_path=f'{tmpdir}/vectordb')
+    for report in CTI_REPORTS:
+        mm.remember(report['content'], source_type='threat_report', source_ref=report['id'], domain='cti')
+
+    # Wrap get_note_by_id with a counter
+    calls = {'total': 0, 'ids': []}
+    orig = mm.store.get_note_by_id
+
+    def counting(nid):
+        calls['total'] += 1
+        calls['ids'].append(nid)
+        return orig(nid)
+
+    mm.store.get_note_by_id = counting
+
+    # Wrap graph retrieval to report result sizes
+    orig_retrieve = GraphRetriever.retrieve_note_ids
+    graph_sizes = []
+
+    def counting_retrieve(self, query_entities, max_depth=2):
+        res = orig_retrieve(self, query_entities, max_depth=max_depth)
+        graph_sizes.append(len(res))
+        return res
+
+    GraphRetriever.retrieve_note_ids = counting_retrieve
+
+    print(f'{"query":<48} {"lookups":>8} {"unique":>7} {"graph_n":>8}')
+    for qa in CTI_QUERIES:
+        calls['total'] = 0
+        calls['ids'] = []
+        graph_sizes.clear()
+        mm.recall(qa['question'], k=10, exclude_superseded=False)
+        uniq = len(set(calls['ids']))
+        gsz = graph_sizes[0] if graph_sizes else 0
+        print(f'{qa["question"][:46]:<48} {calls["total"]:>8} {uniq:>7} {gsz:>8}')
+
+
+if __name__ == '__main__':
+    main()
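
The script above monkeypatches `get_note_by_id` and `GraphRetriever.retrieve_note_ids` and leaves the patches in place, which is fine for a one-shot script. If the same instrumentation is reused in a longer-lived process or a test, a context manager that restores the original attribute is safer. A minimal stdlib sketch; `count_calls` and the `Store` stand-in are hypothetical, not part of ZettelForge.

```python
import functools
from collections import Counter
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def count_calls(owner: Any, name: str) -> Iterator[Counter]:
    """Temporarily wrap owner.<name>, counting calls keyed by first positional arg.

    Intended for instance attributes such as a store's bound lookup method,
    where the first positional arg is the note id. The original attribute
    state is restored on exit, even if the body raises.
    """
    original = getattr(owner, name)
    had_own_attr = name in vars(owner)  # False when the method lives on the class
    counts: Counter = Counter()

    @functools.wraps(original)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        counts[args[0] if args else None] += 1
        return original(*args, **kwargs)

    setattr(owner, name, wrapper)
    try:
        yield counts
    finally:
        if had_own_attr:
            setattr(owner, name, original)
        else:
            delattr(owner, name)  # remove the shadowing attribute; class method is visible again


class Store:  # stand-in for a note store
    def get_note_by_id(self, nid: str) -> dict:
        return {"id": nid}


store = Store()
with count_calls(store, "get_note_by_id") as counts:
    for nid in ["n1", "n2", "n1"]:
        store.get_note_by_id(nid)
print(sum(counts.values()), len(counts))  # 3 2  (total lookups, unique ids)
```

The total-versus-unique split is the same signal the script prints per query (`lookups` vs `unique`).
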
@@ -32,6 +32,10 @@
 from typing import List, Dict, Optional, Tuple
 from datetime import datetime
 
+# Must be set before any zettelforge import resolves the config singleton:
+# benchmark ingestion is deterministic, no background LLM enrichment.
+os.environ.setdefault("ZETTELFORGE_ENRICHMENT_ENABLED", "false")
+
 from zettelforge import MemoryManager
 
 
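`os.environ.setdefault` only supplies a default: an explicit `ZETTELFORGE_ENRICHMENT_ENABLED=true` in the caller's environment still wins, and the value has to be in place before anything that resolves configuration once is imported. A stdlib sketch of both behaviours; `parse_flag` and the memoized `get_config` are hypothetical stand-ins for ZettelForge's config singleton, whose parsing rules and default this diff does not show (the sketch assumes "unset means enabled").

```python
import functools
from typing import Dict, Mapping

VAR = "ZETTELFORGE_ENRICHMENT_ENABLED"


def parse_flag(env: Mapping[str, str]) -> bool:
    # Assumption: unset means enabled, and only common truthy strings enable it.
    return env.get(VAR, "true").strip().lower() in {"1", "true", "yes", "on"}


# 1. setdefault only fills a missing value; an explicit caller setting wins.
#    os.environ is a MutableMapping, so os.environ.setdefault behaves the same.
unset: Dict[str, str] = {}
unset.setdefault(VAR, "false")
explicit: Dict[str, str] = {VAR: "true"}
explicit.setdefault(VAR, "false")
print(parse_flag(unset), parse_flag(explicit))  # False True

# 2. A config resolved once (memoized, like a singleton built at import time)
#    never sees environment changes made after its first use.
fake_env: Dict[str, str] = {}


@functools.lru_cache(maxsize=None)
def get_config() -> bool:
    return parse_flag(fake_env)


first = get_config()          # resolved while the variable is unset, so enabled
fake_env[VAR] = "false"       # too late: the cached config is already built
print(first, get_config())    # True True
```
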
@@ -127,18 +131,44 @@ def ingest_conversations(mm: MemoryManager, turns: List[Dict], batch_sessions: b
                 sessions[key] = {"date": turn["date"], "lines": [], "sample_id": turn["sample_id"], "session": turn["session"]}
             sessions[key]["lines"].append(f"{turn['speaker']}: {turn['text']}")
 
+        # LOCOMO_CHUNK_SIZE > 0 stores each session as ~chunk-size pieces
+        # (MemPalace-style granularity) with the [date] header repeated per
+        # chunk, and avoids the 4000-char truncation that drops session tails.
+        chunk_size = int(os.environ.get("LOCOMO_CHUNK_SIZE", "0"))
+
         for key, session in sessions.items():
-            content = f"[{session['date']}] Conversation session {session['session']}:\n" + "\n".join(session["lines"])
-            # Truncate very long sessions to avoid overwhelming the embedding
-            if len(content) > 4000:
-                content = content[:4000]
+            header = f"[{session['date']}] Conversation session {session['session']}:"
+            source_ref = f"locomo:{session['sample_id']}:session_{session['session']}"
+            if chunk_size > 0:
+                pieces: List[str] = []
+                current: List[str] = []
+                current_len = 0
+                for line in session["lines"]:
+                    if current and current_len + len(line) + 1 > chunk_size:
+                        pieces.append("\n".join(current))
+                        current = []
+                        current_len = 0
+                    current.append(line)
+                    current_len += len(line) + 1
+                if current:
+                    pieces.append("\n".join(current))
+                contents = [f"{header}\n{piece}" for piece in pieces]
+            else:
+                content = f"{header}\n" + "\n".join(session["lines"])
+                # Truncate very long sessions to avoid overwhelming the embedding
+                if len(content) > 4000:
+                    content = content[:4000]
+                contents = [content]
+
             try:
-                mm.remember(
-                    content=content,
-                    source_type="dialogue",
-                    source_ref=f"locomo:{session['sample_id']}:session_{session['session']}",
-                    domain="locomo",
-                )
+                for i, content in enumerate(contents):
+                    ref = source_ref if len(contents) == 1 else f"{source_ref}#c{i}"
+                    mm.remember(
+                        content=content,
+                        source_type="dialogue",
+                        source_ref=ref,
+                        domain="locomo",
+                    )
                 ingested += 1
             except RuntimeError as e:
                 errors += 1
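
The greedy packing in the `LOCOMO_CHUNK_SIZE` branch can be pulled out into a pure function, which makes its edge cases easy to test: lines are never split, so a single line longer than `chunk_size` becomes its own oversized chunk, and the `[date]` header is not counted toward the budget. `pack_lines` is a hypothetical refactor mirroring the loop above, not code from this PR.

```python
from typing import List


def pack_lines(lines: List[str], chunk_size: int) -> List[str]:
    """Greedily pack whole lines into newline-joined chunks of ~chunk_size chars.

    Each line costs len(line) + 1 (for its newline), matching the benchmark
    loop. A line is never split, so one longer than chunk_size forms its own chunk.
    """
    pieces: List[str] = []
    current: List[str] = []
    current_len = 0
    for line in lines:
        # Close the running chunk if adding this line would exceed the budget.
        if current and current_len + len(line) + 1 > chunk_size:
            pieces.append("\n".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line) + 1
    if current:
        pieces.append("\n".join(current))
    return pieces


header = "[date] Conversation session 1:"  # placeholder header for illustration
lines = ["A: " + "x" * 7, "B: " + "y" * 7, "A: " + "z" * 30]
for i, piece in enumerate(pack_lines(lines, chunk_size=24)):
    # Each stored chunk repeats the header, as the benchmark does with #c{i} refs.
    print(f"#c{i}:", repr(f"{header}\n{piece}"))
```
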
@@ -443,7 +473,6 @@ def run_benchmark(
     mm = MemoryManager(
         jsonl_path=f"{tmpdir}/notes.jsonl",
         lance_path=f"{tmpdir}/vectordb",
-        disable_enrichment=True,
     )
 
     # Ingest