feat(osint): implement live OSINT enrichers into graph (AGE-120)#167
rolandpg wants to merge 27 commits into
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Patrick Roland <48327651+rolandpg@users.noreply.github.com>
Restores the benchmark isolation the removed disable_enrichment kwarg provided. ZETTELFORGE_ENRICHMENT_ENABLED=false gates causal extraction, LLM NER, and neighbor evolution dispatch. LoCoMo harness repaired (dead kwarg removed) and pinned to deterministic ingestion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sentence-boundary greedy packing to chunk_size with ordinal source_ref provenance. Unblocks the CTI benchmark chunked_800 strategy and the MemPalace-granularity LoCoMo experiment. CTI harness pinned to deterministic (enrichment-off) ingestion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
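A minimal sketch of sentence-boundary greedy packing as described in this commit, not the chunker it adds. The regex sentence splitter, the `chunk_sentences` name, and the `#chunk-N` form of `source_ref` are illustrative assumptions.

```python
import re


def chunk_sentences(text: str, chunk_size: int = 800) -> list[dict]:
    """Greedily pack whole sentences into chunks of at most chunk_size chars."""
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > chunk_size:
            # The next sentence would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Ordinal provenance; the "#chunk-N" format is hypothetical.
    return [{"text": c, "source_ref": f"#chunk-{i}"} for i, c in enumerate(chunks)]
```

In this sketch a single sentence longer than `chunk_size` is kept whole as its own oversized chunk rather than split mid-sentence.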
_update_knowledge_graph writes MENTIONED_IN edges to the manager's storage backend, but _recall_inner traversed the process-global JSONL KG (~109MB on this host). Isolated stores saw up to ~2000 phantom note IDs per entity query (each a wasted SQLite lookup) and never saw their own graph, so the graph signal was dead in any custom-data-dir deployment. Adds StorageBackend.get_kg_edges_from and a StoreGraphSource adapter; GraphRetriever now accepts any GraphSource. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rerank_enabled / rerank_max_candidates / rerank_doc_chars on RetrievalConfig plus ZETTELFORGE_RERANK_ENABLED kill switch. Only the head of the blended ranking is reranked; the tail keeps blended order. Defaults preserve prior behavior pending benchmark-tuned values. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
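The head-only rerank policy can be sketched as below. This is an illustration under assumptions, not the PR's code: `rerank_head` and the `score` callable stand in for the cross-encoder, and only the idea of reranking the first `max_candidates` items while leaving the tail in blended order comes from the commit message.

```python
from typing import Callable


def rerank_head(
    ranked: list[str],
    score: Callable[[str], float],
    max_candidates: int,
    enabled: bool = True,
) -> list[str]:
    """Rerank only the first max_candidates items; the tail keeps its order."""
    if not enabled or max_candidates <= 0:
        return list(ranked)
    head, tail = ranked[:max_candidates], ranked[max_candidates:]
    # sorted() is stable, so equal scores keep their blended order.
    return sorted(head, key=score, reverse=True) + tail
```

With `enabled=False` the function returns the blended order unchanged, which is the behavior a kill switch would need.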
SmartCache (config.cache sizing) keyed by (model, sha256(text)) in front of embedding compute. First integration of the previously dormant cache.py module. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
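A small sketch of the `(model, sha256(text))` key scheme, assuming a plain dict in place of SmartCache (whose sizing and eviction are not shown here). `cache_key`, `cached_embedding`, and the `compute` callable are hypothetical names.

```python
import hashlib
from typing import Callable

# Stand-in for SmartCache: an unbounded dict, for illustration only.
_cache: dict[tuple[str, str], list[float]] = {}


def cache_key(model: str, text: str) -> tuple[str, str]:
    """Key on the model name plus a digest of the text, not the text itself."""
    return (model, hashlib.sha256(text.encode("utf-8")).hexdigest())


def cached_embedding(
    text: str, model: str, compute: Callable[[str, str], list[float]]
) -> list[float]:
    key = cache_key(model, text)
    hit = _cache.get(key)
    if hit is not None:
        return hit
    vector = compute(text, model)
    _cache[key] = vector
    return vector
```

Including the model in the key keeps vectors from different embedding models from colliding on the same text.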
…ference fetch
The gate was 93% of remember() latency at 50 references (1.1s/ingest on LoCoMo): leave-one-out calibration ran O(n^2) pure-Python 768-dim cosines and rebuilt every reference's n-gram Counter n times per ingest, and the call site fetched the entire domain per write.
- numpy pairwise cosine + one-shot leave-one-out JSD over a shared vocabulary (Counter subtraction from the pooled total is exact)
- content-hash keyed n-gram counter cache
- get_recent_notes_by_domain bounded SQL fetch (4x overfetch window)
- pure-Python originals retained as degenerate-shape fallbacks
Characterization tests pin score/threshold/flag equivalence to 1e-9 against the verbatim original math. Warm-path calibration: 75ms -> 1.6ms on synthetic 50x700-word references; full evaluate ~3.4ms.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
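The "Counter subtraction from the pooled total" idea can be illustrated in pure Python. This is a sketch only: the commit uses numpy, and the bigram tokenizer, function names, and base-2 JSD here are assumptions rather than the gate's actual math.

```python
import math
from collections import Counter


def ngrams(text: str, n: int = 2) -> Counter:
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def jsd(p: Counter, q: Counter) -> float:
    """Jensen-Shannon divergence (base 2) between two count vectors."""
    p_total, q_total = sum(p.values()), sum(q.values())
    if p_total == 0 or q_total == 0:
        return 0.0  # degenerate input; a real gate would handle this explicitly
    total = 0.0
    for key in p.keys() | q.keys():
        pi, qi = p[key] / p_total, q[key] / q_total
        mi = 0.5 * (pi + qi)
        if pi:
            total += 0.5 * pi * math.log2(pi / mi)
        if qi:
            total += 0.5 * qi * math.log2(qi / mi)
    return total


def leave_one_out_jsd(texts: list[str]) -> list[float]:
    counters = [ngrams(t) for t in texts]  # each Counter is built once
    pooled = Counter()
    for c in counters:
        pooled.update(c)
    # pooled - c equals the sum of all other Counters, because every count
    # in c is <= the matching pooled count, so nothing is clipped.
    return [jsd(c, pooled - c) for c in counters]
```

The subtraction replaces rebuilding the "all other references" Counter for every reference, which is where the repeated work described above came from.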
…didates) CTI grid 2026-06-09: accuracy holds at 75% from 512c-50n down to 128c-8n; p50 drops 91ms to 51ms at 256c-8n in-grid. 256c-8n picked over 128c-8n for rerank-context headroom. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…k_model knob 20-core oversubscription thrashed small batches: rerank 8x256c pairs 23.7ms -> 11.5ms and single-query embedding 5.9ms -> 4.5ms at 8 threads (GB10 measurements). rerank_model makes the cross-encoder swappable; model grid kept ms-marco-MiniLM-L-6-v2 (jina tiny/turbo no better). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Person names were only extracted from 'Name:' dialogue lines, so conversational queries produced no entities and graph traversal never fired on them (RFC-001 gap). Single capitalized tokens in running text now qualify, filtered by sentence position, proper-noun-phrase adjacency, and an expanded stopword list (demonyms, vendors, celebrations). CTI suite unchanged at 75%. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Free-text person extraction regressed LoCoMo 11% -> 5%: speaker names map to every session, so graph traversal flooded blended recall with undiscriminative notes. Query entities whose KG out-degree exceeds retrieval.entity_max_fanout (default 25) are now skipped by the graph/causal/entity-augmentation stages. KG out-degree is the right signal: supersession prunes the entity index but MENTIONED_IN edges accumulate one per note. Also: LOCOMO_CHUNK_SIZE harness knob for MemPalace-granularity chunked ingestion (removes 4000-char truncation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gue-only again Measured 11% -> 5% overall (single-hop/multi-hop unchanged at 0): persons extracted from turn bodies reshuffled supersession chains at ingest, changing which notes survive in the entity index. The fan-out gate could not recover it because the damage is write-side. Expanded stopword list and the gate itself are kept. Decision and data recorded in the test docstring; revisit via RFC-001 LLM NER. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ession profile_recall (cProfile attribution), instrument_lookups (note-lookup volume per stage), rerank_grid (policy tuning grid), mine_phase_timings (OCSF log phase aggregation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Same-machine, deterministic-config before/after: LoCoMo 7.0% -> 11.0% accuracy (+57% relative; 13.0% chunked config), p50 336 -> 170ms (-49%), p95 -50%, ingest 1.0 -> 8.0 turns/s (7.8x). CTI held 75.0% with p50 79 -> 39ms (-51%). Raw logs under benchmarks/results/session_2026-06-09/. Includes the recorded negative result (free-text person extraction) and the chunked-ingestion configuration trade-off. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…clusions
AGE-119. Realizes the Flowsint enricher vendoring under the AGE-118 CONDITIONAL GO. Reality check changed the shape of the work:
- Framework NOT vendored: every flowsint enricher + its registry import flowsint_core (forbidden by AGE-118: LGPL psycopg2 + Docker control), and ZettelForge already has an equivalent decoupled framework (RFC-016 transform_registry + executor). Reused it instead of duplicating.
- Type gaps ported from flowsint-types v1.2.8 @ 2a4878c8 (Apache-2.0): CryptoWallet, Transaction, SocialAccount, with edges + canonicalization. ASN/CIDR were NOT ported (already exist as ASNumber / Netblock).
- Compliance artifacts under osint/THIRD_PARTY/: Apache LICENSE, carried-forward NOTICE, third-party notices, and PROVENANCE.md (pinned SHA, post-relicense date, telemetry-grep PASS, exclusions).
- Enforced exclusions: neutralized the pre-existing holehe_collector GPL-3.0 import path to a permanent compliant no-op.
Tests: 117 passing (new gap-type validation + KG-persistence test). mypy --strict and ruff clean on changed source.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Builds on the AGE-119 vendoring foundation. Implements the five live enrichers as native ZettelForge collectors (RFC-016 transform_registry + executor), feeding the graph backend:
- whois_collector: also emit the registrant EmailAddress via a new registered_by edge (Organization branch unchanged).
- dns_collector: reverse PTR for IPv4/IPv6 seeds -> DomainName via the existing hosts edge; non-global IPs are skipped.
- maigret_collector (new, people tier): Alias -> SocialAccount via has_account, backed by maigret/sherlock (MIT), lazy-imported and fail-closed without the dependency.
- hibp_collector: native HIBP v3 REST -> Breach via appeared_in_breach. Replaces the excluded LGPL hibpwned path; key read from env, never logged.
- wallet_collector (new, financial tier): CryptoWallet -> Transaction via an Etherscan-style explorer API; sent_transaction / received_transaction. EVM hex wallets only; key read from env, never logged.
Supporting changes:
- ontology: Breach entity, registered_by + appeared_in_breach edges, canonicalize_email / canonicalize_alias / canonicalize_breach helpers.
- executor: EmailAddress / Alias / CryptoWallet seed types plus endpoint prop-key and required-field wiring for the new entity types.
- entity_resolver: canonical-key branches for the new seed/output types.
- pyproject [osint]: add maigret / sherlock-project (both MIT); notices moved from planned to active in THIRD_PARTY_NOTICES.md.
AGE-118 gates: pip-audit on the resolved [osint] closure (core + maigret/sherlock) reports no known vulnerabilities; evidence in THIRD_PARTY/AGE-120-pip-audit.md. No GPL/LGPL packages, no Docker tool wrappers, secrets only from env and never logged.
Tests: tests/test_osint_enrichers_age120.py (21 mocked-seam tests); full OSINT suite green; ruff check/format clean on src.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 74a9c0f315
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    _logger = get_logger("zettelforge.osint.collectors.wallet")

    API_KEY_ENV = "ETHERSCAN_API_KEY"
    API_URL = "https://api.etherscan.io/api"
Switch wallet lookup to Etherscan V2
For real ETHERSCAN_API_KEY usage, this V1 base URL now hits Etherscan's deprecated API path; the official V2 migration docs say to use /v2/api with a chainid. The deprecated endpoint returns a status-0 result string, which _fetch_transactions() treats as “no result,” so CryptoWallet enrichment silently returns [] for every wallet instead of ingesting transactions.
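A hedged sketch of what a V2-style lookup could look like. The `/v2/api` path and `chainid` parameter come from the review comment above; the `account`/`txlist` parameters and the status/message shapes reflect my understanding of the Etherscan API and should be checked against its documentation. `build_txlist_params` and `parse_txlist` are hypothetical helper names, not functions in this PR.

```python
API_URL_V2 = "https://api.etherscan.io/v2/api"


def build_txlist_params(wallet: str, api_key: str, chain_id: int = 1) -> dict[str, str]:
    """Query parameters for a normal-transaction listing on one chain."""
    return {
        "chainid": str(chain_id),  # 1 = Ethereum mainnet
        "module": "account",
        "action": "txlist",
        "address": wallet,
        "sort": "desc",
        "apikey": api_key,
    }


def parse_txlist(payload: dict) -> list[dict]:
    """Separate 'wallet has no transactions' from an API-level error."""
    result = payload.get("result")
    if payload.get("status") == "1" and isinstance(result, list):
        return result
    if payload.get("message") == "No transactions found":
        return []
    # Anything else (for example a deprecation notice) should surface as an
    # error instead of looking like an empty wallet.
    raise RuntimeError(f"explorer API error: {payload.get('message')!r}")
```

Raising on an unexpected status keeps a deprecated or misconfigured endpoint from being mistaken for a wallet with no activity.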
    db = MaigretDatabase().load_from_path(maigret.settings.Settings().sites_db_path)
    sites = db.ranked_sites_dict(top=MAX_ACCOUNTS)
    results = asyncio.run(
        maigret.search(username=username, site_dict=sites, timeout=30, no_progressbar=True)
Load Maigret settings and pass a logger
When maigret is installed, this live path fails before returning rows: Settings() is not loaded before sites_db_path is read, and Maigret's documented library signature requires a logger argument for search. Because the broad handler converts either exception into [], every Alias seed produces no SocialAccount in production while the mocked _search_username tests still pass.
    # (Supersession prunes the entity index but MENTIONED_IN
    # edges accumulate one per note.)
    node = self.store.get_kg_node(etype, value)
    fanout = len(self.store.get_kg_edges_from(node["node_id"])) if node else 0
Count only note fan-out before dropping entities
When a query entity has many outgoing non-note relationships, this drops it even if it is mentioned in only a few notes. _update_knowledge_graph() also writes actor/tool/CVE/asset edges from the same node, while the new config describes this threshold as note fan-out; a well-connected CTI actor with >25 tools/CVEs can therefore lose both graph traversal and entity-augmented recall for actor queries.
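One way to count only note fan-out, sketched under the assumption that each edge is a dict with a `relationship` key (the real edge shape returned by `get_kg_edges_from` is not shown in this thread). `note_fanout` and `exceeds_fanout` are hypothetical helpers.

```python
def note_fanout(edges: list[dict], note_relationship: str = "MENTIONED_IN") -> int:
    """Count only entity -> note edges, ignoring actor/tool/CVE/asset links."""
    return sum(1 for edge in edges if edge.get("relationship") == note_relationship)


def exceeds_fanout(edges: list[dict], max_fanout: int = 25) -> bool:
    return note_fanout(edges) > max_fanout
```

Under this rule an actor with 30 tool edges and 3 note mentions stays eligible for graph traversal, while a speaker name mentioned in hundreds of notes is still skipped.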
    embedding: EmbeddingConfig = field(default_factory=EmbeddingConfig)
    llm: LLMConfig = field(default_factory=LLMConfig)
    llm_ner: LLMNerConfig = field(default_factory=LLMNerConfig)
    enrichment: EnrichmentConfig = field(default_factory=EnrichmentConfig)
Apply enrichment settings from config files
When operators disable background enrichment in config.yaml, this new section is never copied by _apply_yaml (unlike llm_ner, extraction, and retrieval), so enrichment.enabled: false is ignored unless the environment variable is also set. In offline ingestion or benchmarks that rely on file config, LLM NER/evolution jobs still dispatch unexpectedly.
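A minimal sketch of copying an `enrichment` section from parsed YAML onto the config dataclass. The dataclasses below are simplified stand-ins, and `apply_section` / `apply_yaml` are illustrative names; the real `_apply_yaml` may be structured differently.

```python
from dataclasses import dataclass, field, fields


@dataclass
class EnrichmentConfig:
    enabled: bool = True


@dataclass
class Config:
    enrichment: EnrichmentConfig = field(default_factory=EnrichmentConfig)


def apply_section(target: object, data: dict) -> None:
    """Copy only keys that are declared fields on the target dataclass."""
    known = {f.name for f in fields(target)}
    for key, value in data.items():
        if key in known:
            setattr(target, key, value)


def apply_yaml(cfg: Config, raw: dict) -> None:
    section = raw.get("enrichment")
    if isinstance(section, dict):
        apply_section(cfg.enrichment, section)
```

Whether the environment variable should then override the file value is a precedence decision this sketch does not make.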
        response.raise_for_status()
        payload = response.json()
    except httpx.HTTPError as exc:
        _logger.warning("hibp_collector_http_error", email=email, error=str(exc))
Redact queried emails from HIBP error logs
When HIBP returns a transient HTTP/JSON/shape error, these warnings include the full email address being investigated. In production or shared logs this exposes sensitive breach-lookup subjects; log a hash/redacted value or omit the address while preserving the error context.
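One possible redaction, sketched with a keyed hash so log lines for the same address still correlate without exposing it. `email_fingerprint` is a hypothetical helper and the default key is a placeholder; a real deployment would load the key from secret configuration.

```python
import hashlib
import hmac


def email_fingerprint(email: str, key: bytes = b"log-redaction-placeholder") -> str:
    """Short keyed digest of a normalized address, safe to put in a log field."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()[:12]
```

The warning could then log `email_fp=email_fingerprint(email)` in place of `email=email`. A keyed digest is used here because a plain SHA-256 of an email address can be reversed by hashing a list of candidate addresses.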
        response.raise_for_status()
        payload = response.json()
    except httpx.HTTPError as exc:
        _logger.warning("wallet_collector_http_error", wallet=wallet, error=str(exc))
Avoid logging Etherscan API keys
When Etherscan returns a non-2xx response or a request error, str(exc) from httpx includes the request URL, and this request URL contains the apikey query parameter. That makes transient upstream errors leak ETHERSCAN_API_KEY into application logs despite the collector's key-handling guarantee; log the status/error type without the full URL or redact the query string.
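A sketch of query-string redaction using only the standard library, for the case where the URL is worth keeping in logs. `redact_url` and the `SENSITIVE_PARAMS` set are illustrative names, not part of this PR.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SENSITIVE_PARAMS = {"apikey"}  # hypothetical list; extend as needed


def redact_url(url: str) -> str:
    """Replace the values of sensitive query parameters before logging."""
    parts = urlsplit(url)
    query = [
        (name, "REDACTED" if name.lower() in SENSITIVE_PARAMS else value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))
```

A simpler alternative is to log only `type(exc).__name__` and the status code and drop the URL entirely.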
    from_entity_type="Transaction",
    to_entity_type="CryptoWallet",
Make received transactions reachable from wallet seeds
For wallets with inbound-only transfers, this persists received_transaction as Transaction -> CryptoWallet, but the public graph traversal APIs used elsewhere (get_neighbors, traverse, and GraphRetriever) walk outgoing edges from the seed. As a result, collection can succeed while starting from the CryptoWallet shows no received transactions in the graph.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 76d2bcb21e
    # overfetch leaves margin for notes the gate filters out.
    _defense_cfg = get_config().governance.memory_defense
    _fetch_limit = max(200, 4 * _defense_cfg.max_reference_notes)
    reference_notes = self.store.get_recent_notes_by_domain(domain, _fetch_limit)
Keep fetching until enough valid defense references exist
When a domain has more than this bounded window of recent notes with missing/zero embeddings (for example legacy imports or an embedding outage), _select_reference_notes() filters them out and the gate returns calibration_insufficient, so block/quarantine mode silently allows a write that the previous full-domain fetch would have scored against older valid notes. Consider falling back or paging until min_calibration_notes valid-vector references are available.
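The paging fallback suggested here could look roughly like this. It is a sketch under assumptions: `fetch_page(offset, limit)` is a hypothetical paged accessor (the `get_recent_notes_by_domain` call shown above takes only a limit), and `is_valid` stands in for the valid-vector filter.

```python
from typing import Callable


def collect_valid_references(
    fetch_page: Callable[[int, int], list[dict]],
    is_valid: Callable[[dict], bool],
    needed: int,
    page_size: int = 200,
    max_pages: int = 10,
) -> list[dict]:
    """Page through recent notes until enough valid references are found."""
    valid: list[dict] = []
    for page in range(max_pages):
        batch = fetch_page(page * page_size, page_size)
        valid.extend(note for note in batch if is_valid(note))
        # Stop when satisfied or when the store has no more notes.
        if len(valid) >= needed or len(batch) < page_size:
            break
    return valid[:needed]
```

The `max_pages` cap keeps the worst case bounded, so a domain full of zero-vector notes cannot bring back the full-domain fetch cost.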
    cached: list[float] | None = cache.get(key)
    if cached is not None:
        return cached
    embedding = _compute_embedding(text, model)
    cache.set(key, embedding)
Guard the shared embedding cache
When the background enrichment worker re-embeds evolved notes while a foreground remember()/recall() also calls get_embedding(), both threads share this SmartCache, but only singleton creation is locked; cache.get() and cache.set() mutate and expire/evict the same dict without synchronization. Under concurrent expiry or misses this can raise KeyError/RuntimeError or abort the user operation, so the cache access itself should be protected or made thread-safe.
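A minimal sketch of guarding cache access with a lock. `LockedCache` is a hypothetical wrapper around a plain dict, not SmartCache; expiry and eviction are omitted, but in a real cache they would need to run under the same lock.

```python
import threading
from typing import Callable, Hashable


class LockedCache:
    """Dict-backed cache whose reads and writes are serialized by one lock."""

    def __init__(self) -> None:
        self._data: dict[Hashable, object] = {}
        self._lock = threading.Lock()

    def get_or_compute(self, key: Hashable, compute: Callable[[], object]) -> object:
        with self._lock:
            if key in self._data:
                return self._data[key]
        # Compute outside the lock so a slow embedding call does not block
        # other threads; two threads may compute the same key concurrently.
        value = compute()
        with self._lock:
            # setdefault keeps whichever value was stored first.
            return self._data.setdefault(key, value)
```

The trade-off is possible duplicate computation on a concurrent miss in exchange for never holding the lock during model inference.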
    kg = get_knowledge_graph()
    graph_retriever = GraphRetriever(kg)
    graph_retriever = GraphRetriever(StoreGraphSource(self.store))
Persist OSINT tuples into the scoped store graph
When AGE-120 collectors are run through the public run_osint_collection() default path, the executor still writes to get_knowledge_graph(), but recall now traverses only StoreGraphSource(self.store). That makes collected breach/social/wallet tuples invisible to MemoryManager.recall() graph augmentation unless they were separately copied into the manager's SQLite KG; add a StorageBackend-backed OSINT persistence path or bridge these writes into the active manager store.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
75fd9be to 2ddebcc
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ddebcceff
    db = MaigretDatabase().load_from_path(settings.sites_db_path)
    sites = db.ranked_sites_dict(top=MAX_ACCOUNTS)
    backend_logger = logging.getLogger("zettelforge.osint.collectors.maigret.backend")
    results = asyncio.run(
Handle active event loops in Maigret searches
When Alias enrichment is invoked from an async host such as FastAPI/aiohttp or a notebook, an event loop is already running, and Maigret's library guide says to await its async API in that context rather than call asyncio.run. This call raises RuntimeError, which the broad handler converts to [], so those deployments silently produce no SocialAccount rows for every Alias; run the search in a separate thread/private loop or provide an async path for this case.
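The "separate thread/private loop" option can be sketched with the standard library alone. `run_coro_blocking` is a hypothetical helper; it says nothing about Maigret's own API, only about how to drive an arbitrary coroutine from synchronous code whether or not a loop is already running.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_coro_blocking(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from sync code, with or without a loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread: asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running here, so give the coroutine its own loop in
    # a worker thread and wait for the result.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```

This still blocks the calling thread, and therefore the host's event loop, until the search finishes, so an awaitable collector path would be the cleaner long-term fix for async hosts.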
    load = getattr(settings, "load", None)
    if callable(load):
        load()
    db = MaigretDatabase().load_from_path(settings.sites_db_path)
Resolve Maigret's bundled database path
When Maigret is installed as a package and ZettelForge is launched from a normal application cwd, Settings.load() leaves sites_db_path as Maigret's relative resources/data.json, and MaigretDatabase.load_from_path() opens that path cwd-relative rather than package-relative. That raises FileNotFoundError, which the broad handler converts to [], so Alias enrichment silently returns no SocialAccount rows outside repos that happen to have resources/data.json; resolve the bundled DB path relative to Maigret's package or use its DB resolver before loading.
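A generic sketch of resolving a relative resource path against an installed package's directory instead of the process cwd. `resolve_package_path` is a hypothetical helper; whether Maigret ships its own resolver for this is not established in this thread, so check its documentation before relying on this approach.

```python
import importlib
from pathlib import Path


def resolve_package_path(package: str, relative: str) -> Path:
    """Resolve a relative path against a package's install directory."""
    candidate = Path(relative)
    if candidate.is_absolute():
        return candidate  # operator supplied an explicit location
    module = importlib.import_module(package)
    module_file = getattr(module, "__file__", None)
    if module_file is None:
        raise FileNotFoundError(f"cannot locate package directory for {package!r}")
    return Path(module_file).resolve().parent / candidate
```

For the case described above, the call would take the Maigret package name and the relative `sites_db_path` value, and the result would be passed to `load_from_path`.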
    if callable(load):
        load()
    db = MaigretDatabase().load_from_path(settings.sites_db_path)
    sites = db.ranked_sites_dict(top=MAX_ACCOUNTS)
Exclude disabled Maigret sites from searches
When the live Maigret path reaches site selection, ranked_sites_dict(top=MAX_ACCOUNTS) uses the library default that includes disabled site definitions rather than mirroring the CLI's normal --use-disabled opt-in. Those disabled/broken checks can consume the fixed 200-site budget and yield stale statuses, so Alias enrichment may miss enabled platforms or persist unreliable SocialAccount hits; pass disabled=False when building the site dict.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
CI is green on head
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Merge-up CI is green on head
Summary
Implements AGE-120 live OSINT enrichers as native RFC-016 collectors feeding the graph backend.
Security and Compliance
- hibpwned package.
- [osint] extra with MIT notices.
- src/zettelforge/osint/THIRD_PARTY/AGE-120-pip-audit.md records the AGE-120 audit evidence.
- CVE-2023-36464/GHSA-4vvm-4w3v-6mr8, in PyPDF2 3.0.1; it is risk-accepted with a CI ignore because GOV-009 blocks HIGH/CRITICAL, PyPDF2 has no patched release under that package name, and ZettelForge does not parse attacker-supplied PDFs or invoke Maigret report generation.
Validation
- PYTHONPATH=src python3 -m pytest tests/test_osint_enrichers_age120.py tests/test_osint_executor.py tests/test_osint_entity_resolver.py -> 38 passed
- python3 -m pip_audit --strict --vulnerability-service=osv -r <core-osint requirements> -> no known vulnerabilities found
- python3 -m pip_audit --strict --vulnerability-service=osv --ignore-vuln=CVE-2023-36464 -r <maigret/sherlock requirements> -> no known vulnerabilities found, 1 ignored
Closes AGE-120.