feat(osint): implement live OSINT enrichers into graph (AGE-120) #167

Open
rolandpg wants to merge 27 commits into master from feat/age-120-osint-enrichers

Conversation

@rolandpg

Owner

Summary

Implements AGE-120 live OSINT enrichers as native RFC-016 collectors feeding the graph backend.

  • Adds native collectors for WHOIS/DNS enrichment, Maigret/Sherlock username discovery, HIBP breach lookup, and blockchain wallet transactions.
  • Extends OSINT ontology, executor seed support, entity canonicalization/resolution, graph persistence, and relationship handling for new entity/edge types.
  • Adds passive enrichment documentation, dependency/license provenance, and pip-audit evidence for the AGE-118 supply-chain gate.

Security and Compliance

  • HIBP is implemented through native REST and does not use the excluded LGPL hibpwned package.
  • Maigret and Sherlock are declared under the [osint] extra with MIT notices.
  • src/zettelforge/osint/THIRD_PARTY/AGE-120-pip-audit.md records the AGE-120 audit evidence.
  • The refreshed audit found one medium-severity finding in a Maigret transitive dependency: CVE-2023-36464 / GHSA-4vvm-4w3v-6mr8 in PyPDF2 3.0.1. It is risk-accepted with a CI ignore because GOV-009 blocks only HIGH/CRITICAL findings, PyPDF2 has no patched release under that package name, and ZettelForge neither parses attacker-supplied PDFs nor invokes Maigret report generation.

Validation

  • PYTHONPATH=src python3 -m pytest tests/test_osint_enrichers_age120.py tests/test_osint_executor.py tests/test_osint_entity_resolver.py -> 38 passed
  • python3 -m pip_audit --strict --vulnerability-service=osv -r <core-osint requirements> -> no known vulnerabilities found
  • python3 -m pip_audit --strict --vulnerability-service=osv --ignore-vuln=CVE-2023-36464 -r <maigret/sherlock requirements> -> no known vulnerabilities found, 1 ignored

Closes AGE-120.

rolandpg and others added 21 commits June 8, 2026 19:55
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Patrick Roland <48327651+rolandpg@users.noreply.github.com>
Restores the benchmark isolation the removed disable_enrichment kwarg
provided. ZETTELFORGE_ENRICHMENT_ENABLED=false gates causal extraction,
LLM NER, and neighbor evolution dispatch. LoCoMo harness repaired
(dead kwarg removed) and pinned to deterministic ingestion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sentence-boundary greedy packing to chunk_size with ordinal source_ref
provenance. Unblocks the CTI benchmark chunked_800 strategy and the
MemPalace-granularity LoCoMo experiment. CTI harness pinned to
deterministic (enrichment-off) ingestion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
_update_knowledge_graph writes MENTIONED_IN edges to the manager's
storage backend, but _recall_inner traversed the process-global JSONL
KG (~109MB on this host). Isolated stores saw up to ~2000 phantom note
IDs per entity query (each a wasted SQLite lookup) and never saw their
own graph, so the graph signal was dead in any custom-data-dir
deployment. Adds StorageBackend.get_kg_edges_from and a
StoreGraphSource adapter; GraphRetriever now accepts any GraphSource.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rerank_enabled / rerank_max_candidates / rerank_doc_chars on
RetrievalConfig plus ZETTELFORGE_RERANK_ENABLED kill switch. Only the
head of the blended ranking is reranked; the tail keeps blended order.
Defaults preserve prior behavior pending benchmark-tuned values.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SmartCache (config.cache sizing) keyed by (model, sha256(text)) in
front of embedding compute. First integration of the previously
dormant cache.py module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ference fetch

The gate was 93% of remember() latency at 50 references (1.1s/ingest on
LoCoMo): leave-one-out calibration ran O(n^2) pure-Python 768-dim
cosines and rebuilt every reference's n-gram Counter n times per
ingest, and the call site fetched the entire domain per write.

- numpy pairwise cosine + one-shot leave-one-out JSD over a shared
  vocabulary (Counter subtraction from the pooled total is exact)
- content-hash keyed n-gram counter cache
- get_recent_notes_by_domain bounded SQL fetch (4x overfetch window)
- pure-Python originals retained as degenerate-shape fallbacks

Characterization tests pin score/threshold/flag equivalence to 1e-9
against the verbatim original math. Warm-path calibration: 75ms -> 1.6ms
on synthetic 50x700-word references; full evaluate ~3.4ms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…didates)

CTI grid 2026-06-09: accuracy holds at 75% from 512c-50n down to
128c-8n; p50 drops 91ms to 51ms at 256c-8n in-grid. 256c-8n picked
over 128c-8n for rerank-context headroom.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…k_model knob

20-core oversubscription thrashed small batches: rerank 8x256c pairs
23.7ms -> 11.5ms and single-query embedding 5.9ms -> 4.5ms at 8 threads
(GB10 measurements). rerank_model makes the cross-encoder swappable;
model grid kept ms-marco-MiniLM-L-6-v2 (jina tiny/turbo no better).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Person names were only extracted from 'Name:' dialogue lines, so
conversational queries produced no entities and graph traversal never
fired on them (RFC-001 gap). Single capitalized tokens in running text
now qualify, filtered by sentence position, proper-noun-phrase
adjacency, and an expanded stopword list (demonyms, vendors,
celebrations). CTI suite unchanged at 75%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Free-text person extraction regressed LoCoMo 11% -> 5%: speaker names
map to every session, so graph traversal flooded blended recall with
undiscriminative notes. Query entities whose KG out-degree exceeds
retrieval.entity_max_fanout (default 25) are now skipped by the
graph/causal/entity-augmentation stages. KG out-degree is the right
signal: supersession prunes the entity index but MENTIONED_IN edges
accumulate one per note. Also: LOCOMO_CHUNK_SIZE harness knob for
MemPalace-granularity chunked ingestion (removes 4000-char truncation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gue-only again

Measured 11% -> 5% overall (single-hop/multi-hop unchanged at 0):
persons extracted from turn bodies reshuffled supersession chains at
ingest, changing which notes survive in the entity index. The fan-out
gate could not recover it because the damage is write-side. Expanded
stopword list and the gate itself are kept. Decision and data recorded
in the test docstring; revisit via RFC-001 LLM NER.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ession

profile_recall (cProfile attribution), instrument_lookups (note-lookup
volume per stage), rerank_grid (policy tuning grid), mine_phase_timings
(OCSF log phase aggregation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Same-machine, deterministic-config before/after: LoCoMo 7.0% -> 11.0%
accuracy (+57% relative; 13.0% chunked config), p50 336 -> 170ms
(-49%), p95 -50%, ingest 1.0 -> 8.0 turns/s (7.8x). CTI held 75.0%
with p50 79 -> 39ms (-51%). Raw logs under
benchmarks/results/session_2026-06-09/. Includes the recorded negative
result (free-text person extraction) and the chunked-ingestion
configuration trade-off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…clusions

AGE-119. Realizes the Flowsint enricher vendoring under the AGE-118
CONDITIONAL GO. Reality check changed the shape of the work:

- Framework NOT vendored: every flowsint enricher + its registry import
  flowsint_core (forbidden by AGE-118: LGPL psycopg2 + Docker control), and
  ZettelForge already has an equivalent decoupled framework (RFC-016
  transform_registry + executor). Reused it instead of duplicating.
- Type gaps ported from flowsint-types v1.2.8 @ 2a4878c8 (Apache-2.0):
  CryptoWallet, Transaction, SocialAccount, with edges + canonicalization.
  ASN/CIDR were NOT ported (already exist as ASNumber / Netblock).
- Compliance artifacts under osint/THIRD_PARTY/: Apache LICENSE, carried-
  forward NOTICE, third-party notices, and PROVENANCE.md (pinned SHA, post-
  relicense date, telemetry-grep PASS, exclusions).
- Enforced exclusions: neutralized the pre-existing holehe_collector GPL-3.0
  import path to a permanent compliant no-op.

Tests: 117 passing (new gap-type validation + KG-persistence test). mypy
--strict and ruff clean on changed source.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Builds on the AGE-119 vendoring foundation. Implements the five live
enrichers as native ZettelForge collectors (RFC-016 transform_registry +
executor), feeding the graph backend:

- whois_collector: also emit the registrant EmailAddress via a new
  registered_by edge (Organization branch unchanged).
- dns_collector: reverse PTR for IPv4/IPv6 seeds -> DomainName via the
  existing hosts edge; non-global IPs are skipped.
- maigret_collector (new, people tier): Alias -> SocialAccount via
  has_account, backed by maigret/sherlock (MIT), lazy-imported and
  fail-closed without the dependency.
- hibp_collector: native HIBP v3 REST -> Breach via appeared_in_breach.
  Replaces the excluded LGPL hibpwned path; key read from env, never logged.
- wallet_collector (new, financial tier): CryptoWallet -> Transaction via
  an Etherscan-style explorer API; sent_transaction / received_transaction.
  EVM hex wallets only; key read from env, never logged.

Supporting changes:
- ontology: Breach entity, registered_by + appeared_in_breach edges,
  canonicalize_email / canonicalize_alias / canonicalize_breach helpers.
- executor: EmailAddress / Alias / CryptoWallet seed types plus endpoint
  prop-key and required-field wiring for the new entity types.
- entity_resolver: canonical-key branches for the new seed/output types.
- pyproject [osint]: add maigret / sherlock-project (both MIT); notices
  moved from planned to active in THIRD_PARTY_NOTICES.md.

AGE-118 gates: pip-audit on the resolved [osint] closure (core + maigret/
sherlock) reports no known vulnerabilities; evidence in
THIRD_PARTY/AGE-120-pip-audit.md. No GPL/LGPL packages, no Docker tool
wrappers, secrets only from env and never logged.

Tests: tests/test_osint_enrichers_age120.py (21 mocked-seam tests); full
OSINT suite green; ruff check/format clean on src.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rolandpg rolandpg requested a review from mbower June 16, 2026 16:45

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 74a9c0f315

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

_logger = get_logger("zettelforge.osint.collectors.wallet")

API_KEY_ENV = "ETHERSCAN_API_KEY"
API_URL = "https://api.etherscan.io/api"

[P2] Switch wallet lookup to Etherscan V2

For real ETHERSCAN_API_KEY usage, this V1 base URL now hits Etherscan's deprecated API path; the official V2 migration docs say to use /v2/api with a chainid. The deprecated endpoint returns a status-0 result string, which _fetch_transactions() treats as “no result,” so CryptoWallet enrichment silently returns [] for every wallet instead of ingesting transactions.
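
A minimal sketch of what the V2 request and a stricter response check might look like. The `/v2/api` path and `chainid` parameter come from the comment above; the helper names `build_txlist_url` and `extract_transactions` are hypothetical, and the handling of the "no transactions" case assumes the explorer returns an empty list for it.

```python
from urllib.parse import urlencode

# Assumption: V2 base URL as described in the review comment above.
API_URL_V2 = "https://api.etherscan.io/v2/api"
ETHEREUM_MAINNET_CHAIN_ID = 1


def build_txlist_url(wallet: str, api_key: str, chain_id: int = ETHEREUM_MAINNET_CHAIN_ID) -> str:
    """Build a V2 account/txlist URL (hypothetical helper, not the PR's code)."""
    params = {
        "chainid": chain_id,
        "module": "account",
        "action": "txlist",
        "address": wallet,
        "sort": "desc",
        "apikey": api_key,
    }
    return f"{API_URL_V2}?{urlencode(params)}"


def extract_transactions(payload: dict) -> list[dict]:
    """Return transaction rows, raising on API errors instead of returning []."""
    result = payload.get("result")
    if payload.get("status") == "1" and isinstance(result, list):
        return result
    # Assumed shape: an empty list means the wallet simply has no transactions.
    if isinstance(result, list) and not result:
        return []
    # A string result (for example a deprecation notice) is a real error.
    raise RuntimeError(f"explorer API error: {payload.get('message')!r}")
```

Raising on a string `result` keeps a deprecated-endpoint response from being mistaken for an empty wallet.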

Comment on lines +61 to +64
db = MaigretDatabase().load_from_path(maigret.settings.Settings().sites_db_path)
sites = db.ranked_sites_dict(top=MAX_ACCOUNTS)
results = asyncio.run(
maigret.search(username=username, site_dict=sites, timeout=30, no_progressbar=True)

[P2] Load Maigret settings and pass a logger

When maigret is installed, this live path fails before returning rows: Settings() is not loaded before sites_db_path is read, and Maigret's documented library signature requires a logger argument for search. Because the broad handler converts either exception into [], every Alias seed produces no SocialAccount in production while the mocked _search_username tests still pass.

Comment thread src/zettelforge/memory_manager.py Outdated
# (Supersession prunes the entity index but MENTIONED_IN
# edges accumulate one per note.)
node = self.store.get_kg_node(etype, value)
fanout = len(self.store.get_kg_edges_from(node["node_id"])) if node else 0

[P2] Count only note fan-out before dropping entities

When a query entity has many outgoing non-note relationships, this drops it even if it is mentioned in only a few notes. _update_knowledge_graph() also writes actor/tool/CVE/asset edges from the same node, while the new config describes this threshold as note fan-out; a well-connected CTI actor with >25 tools/CVEs can therefore lose both graph traversal and entity-augmented recall for actor queries.
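
One way to express the suggested count, as a sketch only. The `MENTIONED_IN` edge name is from the PR; the edge dict shape and its `relationship` key are assumptions, since the return type of `get_kg_edges_from` is not shown here.

```python
from typing import Iterable, Mapping

NOTE_EDGE_TYPE = "MENTIONED_IN"  # edge name from the PR; dict shape below is assumed


def note_fanout(edges: Iterable[Mapping[str, str]]) -> int:
    """Count only entity -> note edges, ignoring actor/tool/CVE/asset edges."""
    return sum(1 for edge in edges if edge.get("relationship") == NOTE_EDGE_TYPE)
```

With this count, an actor linked to 30 tools but mentioned in 3 notes would have a fan-out of 3 and stay below the default threshold of 25.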

Comment thread src/zettelforge/config.py
embedding: EmbeddingConfig = field(default_factory=EmbeddingConfig)
llm: LLMConfig = field(default_factory=LLMConfig)
llm_ner: LLMNerConfig = field(default_factory=LLMNerConfig)
enrichment: EnrichmentConfig = field(default_factory=EnrichmentConfig)

[P2] Apply enrichment settings from config files

When operators disable background enrichment in config.yaml, this new section is never copied by _apply_yaml (unlike llm_ner, extraction, and retrieval), so enrichment.enabled: false is ignored unless the environment variable is also set. In offline ingestion or benchmarks that rely on file config, LLM NER/evolution jobs still dispatch unexpectedly.
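
A minimal sketch of copying an `enrichment` section from parsed YAML onto the config dataclass. The `EnrichmentConfig.enabled` field is named in the comment above; `apply_section` and `apply_yaml` are hypothetical stand-ins, not the project's actual `_apply_yaml`.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class EnrichmentConfig:
    enabled: bool = True


@dataclass
class Config:
    enrichment: EnrichmentConfig = field(default_factory=EnrichmentConfig)


def apply_section(target: Any, values: dict[str, Any]) -> None:
    """Copy known keys from a parsed YAML mapping onto a dataclass instance."""
    known = {f.name for f in fields(target)}
    for key, value in values.items():
        if key in known:  # unknown keys are ignored rather than set blindly
            setattr(target, key, value)


def apply_yaml(config: Config, data: dict[str, Any]) -> None:
    """Apply the enrichment section if the file provides one."""
    section = data.get("enrichment")
    if isinstance(section, dict):
        apply_section(config.enrichment, section)
```

How the environment variable should take precedence over the file value is left out here and would need to match the existing sections.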

response.raise_for_status()
payload = response.json()
except httpx.HTTPError as exc:
_logger.warning("hibp_collector_http_error", email=email, error=str(exc))

[P2] Redact queried emails from HIBP error logs

When HIBP returns a transient HTTP/JSON/shape error, these warnings include the full email address being investigated. In production or shared logs this exposes sensitive breach-lookup subjects; log a hash/redacted value or omit the address while preserving the error context.
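
A sketch of the hashed-value option, assuming a short digest is enough to correlate log lines. The `redact_email` name and the 12-character truncation are illustrative choices.

```python
import hashlib


def redact_email(email: str) -> str:
    """Return a stable, non-reversible token for an email address."""
    normalized = email.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"sha256:{digest[:12]}"
```

The warning could then pass `email=redact_email(email)`. An unsalted hash of a known address can still be confirmed by someone who guesses it, so a keyed HMAC or dropping the field entirely would be stricter.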

response.raise_for_status()
payload = response.json()
except httpx.HTTPError as exc:
_logger.warning("wallet_collector_http_error", wallet=wallet, error=str(exc))

[P1] Avoid logging Etherscan API keys

When Etherscan returns a non-2xx response or a request error, str(exc) from httpx includes the request URL, and this request URL contains the apikey query parameter. That makes transient upstream errors leak ETHERSCAN_API_KEY into application logs despite the collector's key-handling guarantee; log the status/error type without the full URL or redact the query string.
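
A sketch of the redaction approach using only the standard library. The `apikey` parameter name is from the comment above; `redact_url` and `safe_error_fields` are hypothetical helpers.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SENSITIVE_PARAMS = frozenset({"apikey"})


def redact_url(url: str) -> str:
    """Replace sensitive query-string values with a fixed placeholder."""
    parts = urlsplit(url)
    query = [
        (key, "REDACTED" if key.lower() in SENSITIVE_PARAMS else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


def safe_error_fields(exc: Exception, url: str) -> dict[str, str]:
    """Log fields that keep the error type but never the raw exception text."""
    return {"error_type": type(exc).__name__, "url": redact_url(url)}
```

Logging `type(exc).__name__` instead of `str(exc)` avoids relying on the HTTP client's message format at all.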

Comment on lines +141 to +142
from_entity_type="Transaction",
to_entity_type="CryptoWallet",

[P2] Make received transactions reachable from wallet seeds

For wallets with inbound-only transfers, this persists received_transaction as Transaction -> CryptoWallet, but the public graph traversal APIs used elsewhere (get_neighbors, traverse, and GraphRetriever) walk outgoing edges from the seed. As a result, collection can succeed while starting from the CryptoWallet shows no received transactions in the graph.
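
One possible fix, sketched under assumptions: emit both edge types with the CryptoWallet as the source so that outgoing-edge traversal reaches them. The `Edge` tuple and `wallet_edges` helper are hypothetical; the `from`, `to`, and `hash` keys assume an Etherscan-style transaction row.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    from_type: str
    from_value: str
    relationship: str
    to_type: str
    to_value: str


def wallet_edges(wallet: str, tx: dict[str, str]) -> list[Edge]:
    """Emit wallet -> transaction edges for both sent and received transfers."""
    wallet = wallet.lower()
    edges: list[Edge] = []
    if tx.get("from", "").lower() == wallet:
        edges.append(Edge("CryptoWallet", wallet, "sent_transaction", "Transaction", tx["hash"]))
    if tx.get("to", "").lower() == wallet:
        edges.append(Edge("CryptoWallet", wallet, "received_transaction", "Transaction", tx["hash"]))
    return edges
```

This changes the stored direction of `received_transaction`; the alternative is to leave the ontology alone and make the traversal APIs follow incoming edges.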

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76d2bcb21e

Comment thread src/zettelforge/memory_manager.py Outdated
# overfetch leaves margin for notes the gate filters out.
_defense_cfg = get_config().governance.memory_defense
_fetch_limit = max(200, 4 * _defense_cfg.max_reference_notes)
reference_notes = self.store.get_recent_notes_by_domain(domain, _fetch_limit)

[P2] Keep fetching until enough valid defense references exist

When a domain has more than this bounded window of recent notes with missing/zero embeddings (for example legacy imports or an embedding outage), _select_reference_notes() filters them out and the gate returns calibration_insufficient, so block/quarantine mode silently allows a write that the previous full-domain fetch would have scored against older valid notes. Consider falling back or paging until min_calibration_notes valid-vector references are available.

Comment thread src/zettelforge/vector_memory.py Outdated
Comment on lines +142 to +146
cached: list[float] | None = cache.get(key)
if cached is not None:
return cached
embedding = _compute_embedding(text, model)
cache.set(key, embedding)

[P2] Guard the shared embedding cache

When the background enrichment worker re-embeds evolved notes while a foreground remember()/recall() also calls get_embedding(), both threads share this SmartCache, but only singleton creation is locked; cache.get() and cache.set() mutate and expire/evict the same dict without synchronization. Under concurrent expiry or misses this can raise KeyError/RuntimeError or abort the user operation, so the cache access itself should be protected or made thread-safe.
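
A minimal sketch of a lock-guarded cache keyed by `(model, sha256(text))`. `LockedCache` is a simplified stand-in for SmartCache (it has no expiry or eviction), and `embedding_key` is a hypothetical helper.

```python
import hashlib
import threading
from typing import Callable


def embedding_key(model: str, text: str) -> tuple[str, str]:
    """Cache key of the form (model, sha256(text))."""
    return (model, hashlib.sha256(text.encode("utf-8")).hexdigest())


class LockedCache:
    """Dict-backed cache whose reads and writes are serialized by one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: dict[tuple[str, str], list[float]] = {}

    def get_or_compute(
        self, key: tuple[str, str], compute: Callable[[], list[float]]
    ) -> list[float]:
        with self._lock:
            cached = self._data.get(key)
        if cached is not None:
            return cached
        # Compute outside the lock so a slow embedding call does not block readers.
        value = compute()
        with self._lock:
            # If another thread stored a value first, keep that one.
            return self._data.setdefault(key, value)
```

Two threads that miss at the same time may both compute the embedding, which wastes work but cannot corrupt the dict.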

- kg = get_knowledge_graph()
- graph_retriever = GraphRetriever(kg)
+ graph_retriever = GraphRetriever(StoreGraphSource(self.store))

[P2] Persist OSINT tuples into the scoped store graph

When AGE-120 collectors are run through the public run_osint_collection() default path, the executor still writes to get_knowledge_graph(), but recall now traverses only StoreGraphSource(self.store). That makes collected breach/social/wallet tuples invisible to MemoryManager.recall() graph augmentation unless they were separately copied into the manager's SQLite KG; add a StorageBackend-backed OSINT persistence path or bridge these writes into the active manager store.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@rolandpg rolandpg force-pushed the feat/age-120-osint-enrichers branch from 75fd9be to 2ddebcc Compare June 16, 2026 19:05

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ddebcceff

db = MaigretDatabase().load_from_path(settings.sites_db_path)
sites = db.ranked_sites_dict(top=MAX_ACCOUNTS)
backend_logger = logging.getLogger("zettelforge.osint.collectors.maigret.backend")
results = asyncio.run(

[P2] Handle active event loops in Maigret searches

When Alias enrichment is invoked from an async host such as FastAPI/aiohttp or a notebook, an event loop is already running, and Maigret's library guide says to await its async API in that context rather than call asyncio.run. This call raises RuntimeError, which the broad handler converts to [], so those deployments silently produce no SocialAccount rows for every Alias; run the search in a separate thread/private loop or provide an async path for this case.
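
A sketch of the separate-thread option. `run_coro_blocking` is a hypothetical helper; it falls back to a worker thread with its own event loop only when a loop is already running in the caller.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_coro_blocking(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from sync code, with or without a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread: asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running here, so give the coroutine a private loop
    # in a worker thread and wait for its result.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```

This still blocks the calling thread, so an async host would stall its own loop for the duration of the search; a true async collector path avoids that.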

load = getattr(settings, "load", None)
if callable(load):
load()
db = MaigretDatabase().load_from_path(settings.sites_db_path)

[P2] Resolve Maigret's bundled database path

When Maigret is installed as a package and ZettelForge is launched from a normal application cwd, Settings.load() leaves sites_db_path as Maigret's relative resources/data.json, and MaigretDatabase.load_from_path() opens that path cwd-relative rather than package-relative. That raises FileNotFoundError, which the broad handler converts to [], so Alias enrichment silently returns no SocialAccount rows outside repos that happen to have resources/data.json; resolve the bundled DB path relative to Maigret's package or use its DB resolver before loading.
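
A sketch of resolving a relative resource path against an installed package's directory rather than the current working directory. `resolve_package_resource` is a hypothetical helper; whether Maigret ships its own resolver that should be preferred is not confirmed here.

```python
import importlib.util
from pathlib import Path


def resolve_package_resource(package: str, relative_path: str) -> Path:
    """Anchor a relative path at the package directory; pass absolute paths through."""
    candidate = Path(relative_path)
    if candidate.is_absolute():
        return candidate
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"package {package!r} is not installed")
    # spec.origin points at the package's __init__.py.
    return Path(spec.origin).parent / candidate
```

The collector could then load from `resolve_package_resource("maigret", settings.sites_db_path)`, assuming the bundled database lives under the package directory as the comment above describes.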

if callable(load):
load()
db = MaigretDatabase().load_from_path(settings.sites_db_path)
sites = db.ranked_sites_dict(top=MAX_ACCOUNTS)

[P2] Exclude disabled Maigret sites from searches

When the live Maigret path reaches site selection, ranked_sites_dict(top=MAX_ACCOUNTS) uses the library default that includes disabled site definitions rather than mirroring the CLI's normal --use-disabled opt-in. Those disabled/broken checks can consume the fixed 200-site budget and yield stale statuses, so Alias enrichment may miss enabled platforms or persist unreliable SocialAccount hits; pass disabled=False when building the site dict.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@rolandpg

Owner Author

CI is green on head fe433fe after the fastembed cache/matrix fix. PR #167 is ready for the requested reviewer pass; mbower remains requested.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@rolandpg

Owner Author

Merge-up CI is green on head 8302d15: lint, pip-audit, governance, build, CodeQL, Snyk, GitGuardian, and Python 3.12/3.13 tests all pass. PR #167 remains ready for the requested mbower review.

2 participants