27 commits
75ab721
feat(osint): add passive executor and resolver scoping
rolandpg Jun 9, 2026
41d36ca
Potential fix for pull request finding
rolandpg Jun 9, 2026
f7fa81f
Persist existing-node updates in OSINT resolver
Copilot Jun 9, 2026
b399716
fix(osint): enforce tuple validation and persist updates
rolandpg Jun 9, 2026
a6b7fd0
feat(config): add enrichment off-switch for deterministic benchmarks
rolandpg Jun 9, 2026
9318096
feat(memory): restore remember_chunked for chunked ingestion
rolandpg Jun 9, 2026
9fceb3d
fix(recall): graph stage reads the per-store KG, not the global JSONL KG
rolandpg Jun 9, 2026
810f26b
perf(recall): bound cross-encoder rerank cost with config policy
rolandpg Jun 9, 2026
9ff6cf2
perf(embedding): LRU+TTL cache for repeated embedding requests
rolandpg Jun 9, 2026
6bae64c
perf(defense): vectorize MemSAD gate, cache n-gram counters, bound re…
rolandpg Jun 9, 2026
74054dd
perf(recall): adopt benchmark-tuned rerank defaults (256 chars, 8 can…
rolandpg Jun 9, 2026
b2f4d61
perf(onnx): pin intra-op threads for small-batch inference, add reran…
rolandpg Jun 9, 2026
b3d2c5e
feat(entities): free-text person extraction for conversational recall
rolandpg Jun 9, 2026
48659d8
feat(recall): IDF-style fan-out gate for query entities
rolandpg Jun 9, 2026
0784138
revert(entities): free-text person extraction regressed LoCoMo, dialo…
rolandpg Jun 9, 2026
8fb59af
chore(types): annotate StoreGraphSource and embedding cache helpers
rolandpg Jun 9, 2026
8a8f5ab
test(benchmarks): add profiling and measurement harnesses from perf s…
rolandpg Jun 9, 2026
db919bb
docs(benchmarks): record 2026-06-09 performance session results
rolandpg Jun 9, 2026
98d592b
feat(osint): port flowsint observable-model gaps + enforce AGE-118 ex…
rolandpg Jun 15, 2026
91a6dfd
feat(osint): implement live OSINT enrichers into graph (AGE-120)
rolandpg Jun 15, 2026
7bbf6cb
chore(osint): update AGE-120 audit evidence
rolandpg Jun 16, 2026
74a9c0f
Merge origin/master into AGE-120 OSINT branch
rolandpg Jun 16, 2026
2c5e0be
Merge remote-tracking branch 'origin/master' into feat/age-120-osint-…
rolandpg Jun 16, 2026
76d2bcb
style: format memory defense
rolandpg Jun 16, 2026
2ddebcc
Address AGE-120 review feedback
rolandpg Jun 16, 2026
fe433fe
ci: cache fastembed model across test matrix
rolandpg Jun 16, 2026
8302d15
Merge origin/master into AGE-120 OSINT branch
rolandpg Jun 19, 2026
23 changes: 22 additions & 1 deletion .github/workflows/ci.yml
@@ -57,15 +57,28 @@ jobs:
           # install, not at runtime; CI builds in ephemeral runners with
           # no persistent state. Re-evaluate when GitHub's images ship a
           # patched pip.
+          #
+          # CVE-2023-36464 / GHSA-4vvm-4w3v-6mr8: medium-severity
+          # infinite-loop DoS in PyPDF2 3.0.1, introduced transitively by
+          # Maigret. PyPDF2 has no patched release under that package name
+          # (upstream recommends migrating to pypdf>=3.9.0), and ZettelForge's
+          # AGE-120 username collector does not parse attacker-supplied PDFs or
+          # invoke Maigret report generation. Accepted for AGE-120 because the
+          # GOV-009 blocking threshold is HIGH/CRITICAL and the collector
+          # lazy-imports/fails closed.
           pip-audit --strict \
             --ignore-vuln=CVE-2026-3219 \
-            --ignore-vuln=PYSEC-2026-196
+            --ignore-vuln=PYSEC-2026-196 \
+            --ignore-vuln=CVE-2023-36464

   test:
     runs-on: ubuntu-latest
     needs: lint
     strategy:
       fail-fast: false
+      # The fastembed model download is shared across Python versions. Running
+      # these jobs in parallel can double-hit HuggingFace and trigger 429s.
+      max-parallel: 1
       matrix:
         python-version: ['3.12', '3.13']

@@ -77,6 +90,14 @@ jobs:
         with:
           python-version: ${{ matrix.python-version }}

+      - name: Cache fastembed model
+        uses: actions/cache@0057852bfaa89a56745cba8c7296529d2fc39830
+        with:
+          path: |
+            ~/.cache/fastembed
+            ~/.cache/huggingface
+          key: fastembed-nomic-embed-text-v1.5-Q-${{ runner.os }}
+
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
73 changes: 73 additions & 0 deletions benchmarks/BENCHMARK_REPORT.md
@@ -28,6 +28,79 @@ ZettelForge was evaluated across five benchmark suites. The system runs with zer

---

## 0. Performance session 2026-06-09 (v2.8.0-dev, branch perf/cti-memory-40)

All numbers below are same-machine (DGX Spark GB10), same-day, deterministic
config: enrichment disabled (`ZETTELFORGE_ENRICHMENT_ENABLED=false`), keyword
judge, heuristic answer extraction (no synthesis LLM installed). The clean
baseline was measured first on unmodified v2.7.0 source after repairing the
rotted harnesses (dead `disable_enrichment` kwarg, removed `remember_chunked`
API). Raw logs: `benchmarks/results/session_2026-06-09/`.

| Metric | v2.7.0 baseline | optimized | delta |
|--------|-----------------|-----------|-------|
| LoCoMo accuracy (keyword judge) | 7.0% | 11.0% | +57% relative |
| LoCoMo p50 / p95 latency | 336ms / 387ms | 170ms / 193ms | -49% / -50% |
| LoCoMo ingest (272 sessions) | 262.5s (1.0/s) | 33.8s (8.0/s) | 7.8x |
| CTI retrieval accuracy | 75.0% | 75.0% | held |
| CTI p50 latency (idle machine) | 79ms | 39ms | -51% |
| recall p95 (profiled, 60 calls) | 258ms | 93ms | -64% |
| recall mean (profiled) | 117.6ms | 54.8ms | -53% |

Note on LoCoMo baselines: the published 22% (v2.1.1) used a local synthesis
LLM (qwen2.5:3b) that is not installed on this host; both columns above use
the same deterministic heuristic-extraction path, so the comparison is
apples to apples. Latency includes harness overhead (keyword boost scan and
synthesis fallback), not just `recall()`.
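
The delta column in the table above can be reproduced from the two measured columns. The sketch below assumes percentage deltas are computed as `(optimized - baseline) / baseline` and the ingest figure as a wall-clock ratio; the helper names `relative_delta_pct` and `speedup` are illustrative and are not part of the benchmark harness.

```python
def relative_delta_pct(baseline: float, optimized: float) -> float:
    """Signed change of `optimized` relative to `baseline`, in percent."""
    return (optimized - baseline) / baseline * 100.0


def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """How many times faster the optimized run is (wall-clock ratio)."""
    return baseline_seconds / optimized_seconds


# (baseline, optimized) pairs copied from the table above.
measurements = {
    "LoCoMo accuracy (%)": (7.0, 11.0),
    "LoCoMo p50 latency (ms)": (336.0, 170.0),
    "LoCoMo p95 latency (ms)": (387.0, 193.0),
    "CTI p50 latency (ms)": (79.0, 39.0),
    "recall p95, profiled (ms)": (258.0, 93.0),
    "recall mean, profiled (ms)": (117.6, 54.8),
}

for name, (baseline, optimized) in measurements.items():
    print(f"{name}: {relative_delta_pct(baseline, optimized):+.0f}%")

# Ingest is reported as a ratio of wall-clock times rather than a percentage.
print(f"LoCoMo ingest speedup: {speedup(262.5, 33.8):.1f}x")
```

Figures written as "pp" elsewhere in this section are absolute percentage-point differences (for example 75% vs 60%), not relative changes of this kind.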

### What changed

1. **Scoped knowledge graph reads.** `_recall_inner` traversed the
process-global JSONL KG (109MB on this host, mixing every store) while
writes went to the per-store SQLite KG. Isolated stores saw up to ~2000
phantom note IDs per entity query and never saw their own graph. Recall
now reads the store's KG via `StoreGraphSource`.
2. **MemSAD gate vectorized.** The write-time anomaly gate was 93% of
`remember()` latency at 50 references (~1.1s/ingest): O(n^2) pure-Python
cosines plus n^2 n-gram recounts per ingest. NumPy pairwise scoring, a
content-hash counter cache, and a bounded reference fetch
(`get_recent_notes_by_domain`) brought warm `evaluate()` to ~3.4ms, with
scores pinned to the original math within 1e-9 by characterization tests.
3. **Rerank policy.** Cross-encoder rerank is the dominant read cost and is
worth +15pp CTI accuracy (75% vs 60% without it). Grid-tuned bounds:
8 candidates, 256 chars/doc (accuracy holds from 50x512 down to 8x128;
collapses below 8 candidates). `rerank_model` is configurable; the
model grid kept ms-marco-MiniLM-L-6-v2.
4. **ONNX thread pinning.** The 20-core default oversubscribed small batches:
pinning to 8 threads cut rerank from 23.7ms to 11.5ms and query embedding
from 5.9ms to 4.5ms.
5. **Embedding LRU cache** keyed by (model, sha256(text)): the first
integration of the dormant `cache.py`.
6. **Entity fan-out gate.** Query entities whose KG out-degree exceeds
`retrieval.entity_max_fanout` (default 25) are skipped by graph and
entity-augmentation stages (conversational speaker names map to every
session and flood blended recall).
7. **Enrichment off-switch** (`ZETTELFORGE_ENRICHMENT_ENABLED`) restoring
deterministic benchmark ingestion; `remember_chunked()` restored.
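
To illustrate the key scheme named in item 5, here is a minimal sketch of an LRU cache keyed by `(model, sha256(text))`. It is not the code in `cache.py`: the class name `EmbeddingLRUCache`, its methods, and the `fake_embed` stand-in are assumptions made for the example, and the TTL mentioned in the commit title is left out.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List, Tuple

Vector = List[float]


class EmbeddingLRUCache:
    """Hypothetical LRU cache for embeddings; not the project's cache.py."""

    def __init__(self, max_entries: int = 1024) -> None:
        self._max_entries = max_entries
        self._entries: "OrderedDict[Tuple[str, str], Vector]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, text: str) -> Tuple[str, str]:
        # Hashing the text keeps keys small and fixed-size for long inputs.
        return (model, hashlib.sha256(text.encode("utf-8")).hexdigest())

    def get_or_compute(
        self, model: str, text: str, embed: Callable[[str], Vector]
    ) -> Vector:
        key = self._key(model, text)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        vector = embed(text)
        self._entries[key] = vector
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return vector


def fake_embed(text: str) -> Vector:
    """Stand-in for a real embedding call, used only for this example."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


cache = EmbeddingLRUCache(max_entries=2)
first = cache.get_or_compute("model-a", "APT28 uses X-Agent", fake_embed)
again = cache.get_or_compute("model-a", "APT28 uses X-Agent", fake_embed)
other_model = cache.get_or_compute("model-b", "APT28 uses X-Agent", fake_embed)
print(cache.hits, cache.misses)
```

Including the model name in the key means the same text embedded by two different models never shares an entry, which is why the third call above counts as a miss.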

### Chunked-ingestion configuration (recorded, not default)

`LOCOMO_CHUNK_SIZE=800` stores each session as ~800-char chunks
(MemPalace granularity, no 4000-char truncation): 13.0% accuracy at
p50 347ms / p95 418ms on a ~1400-note store. Compared to the v2.7.0
baseline at effectively the same latency (336ms), that is +86%
relative accuracy; compared to the default optimized config it trades
2x latency for +2pp. Default stays full-session (11.0% at 170ms).
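
The line packing behind this configuration can be sketched as a standalone function. It restates the loop added to `ingest_conversations` in `benchmarks/locomo_benchmark.py`; the function name `pack_lines` and the sample dialogue are illustrative only.

```python
from typing import List


def pack_lines(lines: List[str], chunk_size: int) -> List[str]:
    """Greedily pack whole dialogue lines into pieces of about chunk_size chars."""
    pieces: List[str] = []
    current: List[str] = []
    current_len = 0
    for line in lines:
        # The +1 accounts for the newline that joins lines within a piece.
        if current and current_len + len(line) + 1 > chunk_size:
            pieces.append("\n".join(current))
            current = []
            current_len = 0
        current.append(line)
        current_len += len(line) + 1
    if current:
        pieces.append("\n".join(current))
    return pieces


# Made-up sample session; the header is repeated on every chunk.
header = "[2023-05-08] Conversation session 1:"
lines = ["A: " + "x" * 10, "B: " + "y" * 10, "A: " + "z" * 10]
chunks = [f"{header}\n{piece}" for piece in pack_lines(lines, chunk_size=30)]
print(len(chunks))
```

Lines are never split, so a single line longer than `chunk_size` becomes its own oversized piece, and the repeated header is not counted against `chunk_size`.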

### Negative result (recorded)

Free-text person extraction (capitalized tokens in running text) dropped
LoCoMo from 11% to 5% by reshuffling supersession chains at ingest, with no
single-hop or multi-hop gain. Reverted same day; regression-locked in
`tests/test_conversational_entities.py`. Conversational NER should come via
the RFC-001 LLM path, not regex.

---

## 1. CTI Retrieval Benchmark (Domain Benchmark)

**Date:** 2026-04-10 | **Corpus:** 8 real-world-style CTI reports | **Queries:** 20
2 changes: 2 additions & 0 deletions benchmarks/cti_retrieval_benchmark.py
@@ -22,6 +22,8 @@
 from typing import List, Dict, Tuple

 os.environ["ZETTELFORGE_BACKEND"] = "jsonl"
+# Deterministic ingestion: no background LLM enrichment during benchmarks.
+os.environ.setdefault("ZETTELFORGE_ENRICHMENT_ENABLED", "false")

 from zettelforge import MemoryManager

42 changes: 21 additions & 21 deletions benchmarks/cti_retrieval_results.json
@@ -1,78 +1,78 @@
 {
   "meta": {
-    "date": "2026-04-10T08:05:55.405026",
+    "date": "2026-06-09T13:56:15.802128",
     "reports": 8,
     "queries": 20
   },
   "full_session": {
     "strategy": "full_session",
     "notes": 8,
-    "ingest_time_s": 69.1,
+    "ingest_time_s": 3.4,
     "accuracy": 75.0,
-    "avg_score": 0.875,
-    "p50_latency_ms": 620.0,
-    "p95_latency_ms": 2732.0,
+    "avg_score": 0.85,
+    "p50_latency_ms": 39.0,
+    "p95_latency_ms": 159.0,
     "by_category": {
       "tool-attribution": {
         "accuracy": 40.0,
         "avg_score": 0.7,
-        "p50_latency_ms": 1343.0
+        "p50_latency_ms": 42.0
       },
       "cve-linkage": {
         "accuracy": 75.0,
-        "avg_score": 0.875,
-        "p50_latency_ms": 794.0
+        "avg_score": 0.75,
+        "p50_latency_ms": 38.0
       },
       "attribution": {
         "accuracy": 100.0,
         "avg_score": 1.0,
-        "p50_latency_ms": 611.0
+        "p50_latency_ms": 59.0
       },
       "temporal": {
         "accuracy": 66.7,
         "avg_score": 0.833,
-        "p50_latency_ms": 569.0
+        "p50_latency_ms": 41.0
       },
       "multi-hop": {
         "accuracy": 100.0,
         "avg_score": 1.0,
-        "p50_latency_ms": 644.0
+        "p50_latency_ms": 38.0
       }
     }
   },
   "chunked_800": {
     "strategy": "chunked_800",
     "notes": 8,
-    "ingest_time_s": 56.5,
+    "ingest_time_s": 0.1,
     "accuracy": 75.0,
-    "avg_score": 0.875,
-    "p50_latency_ms": 706.0,
-    "p95_latency_ms": 2729.0,
+    "avg_score": 0.85,
+    "p50_latency_ms": 52.0,
+    "p95_latency_ms": 59.0,
     "by_category": {
       "tool-attribution": {
         "accuracy": 40.0,
         "avg_score": 0.7,
-        "p50_latency_ms": 1299.0
+        "p50_latency_ms": 50.0
       },
       "cve-linkage": {
         "accuracy": 75.0,
-        "avg_score": 0.875,
-        "p50_latency_ms": 795.0
+        "avg_score": 0.75,
+        "p50_latency_ms": 52.0
       },
       "attribution": {
         "accuracy": 100.0,
         "avg_score": 1.0,
-        "p50_latency_ms": 535.0
+        "p50_latency_ms": 52.0
       },
       "temporal": {
         "accuracy": 66.7,
         "avg_score": 0.833,
-        "p50_latency_ms": 772.0
+        "p50_latency_ms": 54.0
       },
       "multi-hop": {
         "accuracy": 100.0,
         "avg_score": 1.0,
-        "p50_latency_ms": 741.0
+        "p50_latency_ms": 33.0
       }
     }
   }
62 changes: 62 additions & 0 deletions benchmarks/instrument_lookups.py
@@ -0,0 +1,62 @@
+#!/usr/bin/env python3
+"""Instrument note-lookup volume per recall stage.
+
+Counts store.get_note_by_id calls (total vs unique ids) and graph result
+sizes per query to locate the redundant-lookup source the profiler exposed
+(~476 lookups/query on an 8-note corpus).
+
+Usage:
+    python benchmarks/instrument_lookups.py
+"""
+import os
+import tempfile
+
+os.environ.setdefault('ZETTELFORGE_ENRICHMENT_ENABLED', 'false')
+
+from cti_retrieval_benchmark import CTI_QUERIES, CTI_REPORTS
+
+from zettelforge import MemoryManager
+from zettelforge.graph_retriever import GraphRetriever
+
+
+def main() -> None:
+    tmpdir = tempfile.mkdtemp(prefix='instr_lookups_')
+    mm = MemoryManager(jsonl_path=f'{tmpdir}/notes.jsonl', lance_path=f'{tmpdir}/vectordb')
+    for report in CTI_REPORTS:
+        mm.remember(report['content'], source_type='threat_report', source_ref=report['id'], domain='cti')
+
+    # Wrap get_note_by_id with a counter
+    calls = {'total': 0, 'ids': []}
+    orig = mm.store.get_note_by_id
+
+    def counting(nid):
+        calls['total'] += 1
+        calls['ids'].append(nid)
+        return orig(nid)
+
+    mm.store.get_note_by_id = counting
+
+    # Wrap graph retrieval to report result sizes
+    orig_retrieve = GraphRetriever.retrieve_note_ids
+    graph_sizes = []
+
+    def counting_retrieve(self, query_entities, max_depth=2):
+        res = orig_retrieve(self, query_entities, max_depth=max_depth)
+        graph_sizes.append(len(res))
+        return res
+
+    GraphRetriever.retrieve_note_ids = counting_retrieve
+
+    print(f'{"query":<48} {"lookups":>8} {"unique":>7} {"graph_n":>8}')
+    for qa in CTI_QUERIES:
+        calls['total'] = 0
+        calls['ids'] = []
+        graph_sizes.clear()
+        mm.recall(qa['question'], k=10, exclude_superseded=False)
+        uniq = len(set(calls['ids']))
+        gsz = graph_sizes[0] if graph_sizes else 0
+        print(f'{qa["question"][:46]:<48} {calls["total"]:>8} {uniq:>7} {gsz:>8}')
+
+
+if __name__ == '__main__':
+    main()
51 changes: 40 additions & 11 deletions benchmarks/locomo_benchmark.py
@@ -32,6 +32,10 @@
 from typing import List, Dict, Optional, Tuple
 from datetime import datetime

+# Must be set before any zettelforge import resolves the config singleton:
+# benchmark ingestion is deterministic, no background LLM enrichment.
+os.environ.setdefault("ZETTELFORGE_ENRICHMENT_ENABLED", "false")
+
 from zettelforge import MemoryManager


@@ -127,18 +131,44 @@ def ingest_conversations(mm: MemoryManager, turns: List[Dict], batch_sessions: b
             sessions[key] = {"date": turn["date"], "lines": [], "sample_id": turn["sample_id"], "session": turn["session"]}
         sessions[key]["lines"].append(f"{turn['speaker']}: {turn['text']}")

+    # LOCOMO_CHUNK_SIZE > 0 stores each session as ~chunk-size pieces
+    # (MemPalace-style granularity) with the [date] header repeated per
+    # chunk, and avoids the 4000-char truncation that drops session tails.
+    chunk_size = int(os.environ.get("LOCOMO_CHUNK_SIZE", "0"))
+
     for key, session in sessions.items():
-        content = f"[{session['date']}] Conversation session {session['session']}:\n" + "\n".join(session["lines"])
-        # Truncate very long sessions to avoid overwhelming the embedding
-        if len(content) > 4000:
-            content = content[:4000]
+        header = f"[{session['date']}] Conversation session {session['session']}:"
+        source_ref = f"locomo:{session['sample_id']}:session_{session['session']}"
+        if chunk_size > 0:
+            pieces: List[str] = []
+            current: List[str] = []
+            current_len = 0
+            for line in session["lines"]:
+                if current and current_len + len(line) + 1 > chunk_size:
+                    pieces.append("\n".join(current))
+                    current = []
+                    current_len = 0
+                current.append(line)
+                current_len += len(line) + 1
+            if current:
+                pieces.append("\n".join(current))
+            contents = [f"{header}\n{piece}" for piece in pieces]
+        else:
+            content = f"{header}\n" + "\n".join(session["lines"])
+            # Truncate very long sessions to avoid overwhelming the embedding
+            if len(content) > 4000:
+                content = content[:4000]
+            contents = [content]
+
         try:
-            mm.remember(
-                content=content,
-                source_type="dialogue",
-                source_ref=f"locomo:{session['sample_id']}:session_{session['session']}",
-                domain="locomo",
-            )
+            for i, content in enumerate(contents):
+                ref = source_ref if len(contents) == 1 else f"{source_ref}#c{i}"
+                mm.remember(
+                    content=content,
+                    source_type="dialogue",
+                    source_ref=ref,
+                    domain="locomo",
+                )
             ingested += 1
         except RuntimeError as e:
             errors += 1
@@ -443,7 +473,6 @@ def run_benchmark(
     mm = MemoryManager(
         jsonl_path=f"{tmpdir}/notes.jsonl",
         lance_path=f"{tmpdir}/vectordb",
-        disable_enrichment=True,
     )

     # Ingest