Skip to content

feat: add generic imported artifact byte reads#1058

Open
lupuszr wants to merge 3 commits into
mainfrom
codex/generic-imported-artifact-bytes
Open

feat: add generic imported artifact byte reads#1058
lupuszr wants to merge 3 commits into
mainfrom
codex/generic-imported-artifact-bytes

Conversation

@lupuszr

@lupuszr lupuszr commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements a generic imported assertion artifact byte API that supports source, markdown, and original artifact kinds without introducing a separate authorization system.

The artifact byte path now reuses the existing context graph sync authorization stack:

  • buildSyncRequestEnvelope
  • parseSyncRequest
  • authorizeSyncRequest
  • existing private sync identity, delegation, participant, allowed-peer, and curator-refresh logic

Artifact-specific code only validates request shape, auth-envelope binding, metadata linkage, hash consistency, and bounded byte serving.

What Changed

  • Added POST /api/assertion/artifact/read
  • Added generic peer protocol /dkg/10.0.2/get-assertion-artifact
  • Added signed sync-envelope fields:
    • authPurpose
    • authSelector
  • Added generic artifact selector binding for imported artifact byte requests
  • Added bounded FileStore.stat() and FileStore.readRange()
  • Added requester-side paged fetch and hash verification before cache promotion
  • Reused existing imported artifact metadata resolution for:
    • sourceFileHash
    • mdIntermediateHash
    • Markdown fallback for native text/markdown
    • durable _meta
    • SWM/shared-memory fallback
  • Preserved existing /api/assertion/import-artifact/read-markdown behavior

Authorization Model

No new source-blob-specific authorization flow was added.

The responder:

  1. Parses the signed sync request envelope.
  2. Verifies the envelope is bound to the artifact request.
  3. Calls existing authorizeSyncRequest.
  4. Only after authorization, resolves artifact metadata linkage and serves bounded bytes.

Transport

Retains the useful transport ideas from PR #1043:

  • offset
  • maxBytes
  • bounded page reads
  • totalBytes
  • nextOffset
  • truncated
  • hash verification
  • cache promotion only after full artifact verification

Tests

  • pnpm --filter @origintrail-official/dkg-agent exec vitest run --config vitest.unit.config.ts test/imported-artifact.test.ts test/sync-envelope-cursor.test.ts
  • pnpm --filter @origintrail-official/dkg exec vitest run test/import-artifact-routes.test.ts

Notes

  • sourcePeerId is currently required for remote peer fetches; source-agent-to-peer discovery can be added as a follow-up.
  • original currently resolves to the linked source bytes unless distinct original-artifact metadata is added later.

const info = await handle.stat();
if (offset >= info.size) return Buffer.alloc(0);
const cappedLength = Math.min(length, info.size - offset);
const buffer = Buffer.alloc(cappedLength);
? (requestedHash ? { fileHash: requestedHash } : {})
: {}),
},
undefined,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: resolveAssertionArtifact() drops the existing owner guard by passing undefined here. That means the new /api/assertion/artifact/read route can read another agent's imported bytes locally just by providing its assertionUri, even though the older resolve/read-markdown paths still bind reads to requestAgentAddress (with the public-open relaxation). Thread the same owner-guard policy through this path before resolving the artifact.

return denied(req);
}

const authorized = await this.authorizeSyncRequest(syncReq, fromPeerId);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: authorizeSyncRequest() only proves the caller may sync this context graph; it does not enforce the imported-artifact owner rule that the HTTP routes already rely on. As written, any peer that can sync the CG can call this protocol directly and fetch another agent's source/markdown bytes. Resolve the assertion owner here and apply the same owner/public-open policy before serving the blob.

Comment thread packages/agent/src/imported-artifact.ts Outdated
includeSharedMemory: false,
targetPeerId: params.sourcePeerId,
requesterPeerId: this.peerId,
needsAuth: true,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: forcing needsAuth: true makes public artifact reads require a local signing key. On nodes without a chain identity or claimed-agent private key, buildSyncRequestEnvelope() will throw before the request is sent, even though handleGetImportedArtifact() currently accepts public-CG requests via authorizeSyncRequest(). Compute whether the target CG actually needs auth and only require a signed envelope for private/curated fetches, or make the responder reject unsigned public requests consistently.

return denied(req);
}

const authorized = await this.authorizeSyncRequest(syncReq, fromPeerId);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: authorizeSyncRequest() only gates private context graphs. On public-but-not-open graphs this will return true for any peer-bound envelope, so the new P2P artifact route can bypass the owner/public-open checks that /api/assertion/artifact/read still enforces. Add the same read-policy check here before resolving metadata/bytes, otherwise imported artifacts from non-open public CGs become remotely readable.

}
const chunks = [Buffer.from(first.bytesB64, 'base64')];
let nextOffset = first.nextOffset;
while (first.truncated && nextOffset != null && nextOffset < total) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: this loop never verifies that the responder made forward progress. A peer that keeps replying with truncated: true and the same nextOffset (or zero-byte pages) will keep this caller in an endless fetch loop. Validate that each page advances the offset and abort on repeated/non-increasing nextOffset or inconsistent page metadata.

? mdIntermediateHash ?? (contentType === 'text/markdown' ? fileHash : undefined)
: fileHash;
if (!linkedHash) return null;
if (args.hash && args.hash !== linkedHash) return 'hash_mismatch';

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: validateContentHash() accepts bare 64-hex hashes, but this comparison requires the exact canonical string. A caller that sends a bare hash will be told hash_mismatch even when it refers to the same blob. Either normalize hashes before comparing or reject bare hashes at the API boundary so validation matches runtime behavior.

@lupuszr lupuszr force-pushed the codex/generic-imported-artifact-bytes branch from 2c7f710 to fea30a5 Compare June 8, 2026 16:15
largeLiteralStorage?: LargeLiteralStorageConfig;
/** Out-of-Oxigraph immutable public SWM operation snapshots. Defaults on when dataDir is set. */
sharedMemoryPublicSnapshotStorage?: SharedMemoryPublicSnapshotStorageConfig;
importedArtifactByteStore?: ImportedArtifactByteStore;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: exposing importedArtifactByteStore on DKGAgentConfig is misleading right now because DKGAgent.create() never consumes it or registers the peer handler. Non-CLI embedders can pass this config and still fail every get-assertion-artifact request until they manually call registerImportedArtifactByteStore(). Either wire the store during agent construction or keep it off the public config surface.

ctx,
{
...raw,
...(kind === 'source' || kind === 'original'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: threading requestedHash into fileHash here preserves the old 400 fileHash does not match import metadata path for source/original, while markdown mismatches return the new 200 { status: "hash_mismatch" } shape. That makes local artifact reads inconsistent by kind and with the peer protocol. Resolve first, then compare requestedHash against sourceFileHash so all kinds report hash_mismatch uniformly.

return { response: first };
}
const total = first.totalBytes;
if (!params.cache || total == null || total > IMPORTED_ARTIFACT_MAX_CACHE_BYTES) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: this returns remote bytes without hashing them whenever cache is false, the request starts at a nonzero offset, or the artifact is over the cache cap. The HTTP route then serves those unverified bytes back to callers as a normal success path. Verify full-file responses even when you are not caching, and avoid returning partial remote pages as trusted bytes unless you can prove their integrity.

return { response: page };
}
chunks.push(Buffer.from(page.bytesB64, 'base64'));
nextOffset = page.nextOffset;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: there is no forward-progress check in this pagination loop. A buggy or malicious peer can keep replying with truncated: true and the same nextOffset, which makes this spin forever and hang the caller. Validate that each page starts at the requested offset and that nextOffset increases strictly before continuing.

@lupuszr lupuszr force-pushed the codex/generic-imported-artifact-bytes branch from fea30a5 to d1a4e88 Compare June 9, 2026 06:58
return denied(req);
}

const authorized = await this.authorizeSyncRequest(syncReq, fromPeerId);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: authorizeSyncRequest() only verifies the envelope for private context graphs. On public-but-not-open graphs it returns true without validating requesterAgentAddress, but the next line still uses that unverified field for the owner check. A peer can forge requesterAgentAddress here and read another agent's imported artifact. Either verify the signed binding whenever owner-gating depends on requesterAgentAddress, or reject owner-gated peer reads unless the envelope was actually authenticated.

const durable = await agent.store.query(`
SELECT ?fileHash ?contentType ?extractionStatus ?structuralTripleCount ?mdIntermediateHash WHERE {
GRAPH <${contextGraphMetaUri(args.contextGraphId)}> {
<${args.assertionUri}> <${DKG_ONTOLOGY}sourceFileHash> ?fileHash .

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: this lookup uses args.assertionUri verbatim. The rest of the import-artifact flow still accepts legacy ownerless assertion URIs and derives subGraphName from the canonical URI, but this protocol never normalizes before querying _meta or SWM. Requests with a supported legacy URI, or a sub-graph assertion where the caller omitted subGraphName, will miss existing metadata and incorrectly return unavailable. Parse/canonicalize the assertion URI once and reuse the derived URI + sub-graph for the lookups and selector binding.

@Jurij89 Jurij89 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found blocking gaps against the stated goal of allowing joined/allowlisted Context Graph participants to open imported Markdown/source bytes. I also avoided duplicating the older unresolved auth/CodeQL review threads.

throw new ImportArtifactRouteError(400, 'Invalid hash');
}

const artifact = await resolveImportedArtifactForRead(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: this route still resolves the artifact through the owner/public-open read guard. For a curated CG member in the allowList, ctx.requestAgentAddress is allowed to read/sync the graph but is not the assertion owner, and isPublicOpenContextGraph() is false, so this throws 403 before the SWM fallback can resolve the promoted Markdown/source linkage. That is the exact scenario this PR is supposed to enable. Please make the read policy accept authorized curated participants while keeping write/enrichment paths owner-only, and add a regression for an allowlisted non-owner reading kind: "markdown"/source on a private CG.

Comment thread packages/agent/src/imported-artifact.ts Outdated

const authorized = await this.authorizeSyncRequest(syncReq, fromPeerId);
if (!authorized) return denied(req);
if (!await canReadImportedArtifactByOwnerPolicy(this, req, syncReq)) return denied(req);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: the same access-policy issue exists on the P2P responder. authorizeSyncRequest() can return true for a private/curated CG when the requester is a valid participant/delegate, but canReadImportedArtifactByOwnerPolicy() then reduces the policy back to owner-or-public-open only. A joined allowlisted agent can sync the triples but still gets denied for the actual bytes. The artifact byte gate should preserve the signed selector-bound auth check and allow non-owner private CG participants that passed sync authorization, instead of requiring assertion ownership.

}
}

const sourcePeerId = typeof raw.sourcePeerId === 'string' && raw.sourcePeerId.trim()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking for end-to-end hydration: the route requires the caller to already know sourcePeerId, but when it is missing the response only returns source.agentAddress. A normal SWM reader can resolve the assertion owner and artifact hash from replicated triples, but not the source libp2p peer id, so this returns fetchable without enough information to make the next request. PR #1043 had durable/SWM publisher-peer plus agent-registry candidate resolution and retry; PR #1058 drops that discovery path. Please either resolve/retry candidate source peers server-side or surface persisted sourcePeerIds in artifact metadata so joined readers can actually open the MD file.

});
}

const statResult = await fileStore.stat(resolved.hash);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: this local hit is not scoped to the assertion or to a verified remote fetch. For owner self-reads that is fine, but for cross-agent/SWM fallback reads the file store is a global content-addressed cache: if this node happens to already have resolved.hash from another import/context, the route returns those bytes without proving they came from this artifact or from the source peer. PR #1043 avoided that by requiring a verified cache marker after full-hash remote hydration before reusing local bytes for owner-relaxed SWM reads. Please carry over that protection for non-owner/shared-memory artifact reads, otherwise this becomes a byte oracle for any linked hash already present on the node.

@lupuszr lupuszr force-pushed the codex/generic-imported-artifact-bytes branch from d1a4e88 to bbba0fe Compare June 16, 2026 11:16
if (!resolvedHash) {
throw new ImportArtifactRouteError(404, `No ${kind} artifact is linked to assertionUri`);
}
if (requestedHash && requestedHash !== resolvedHash) return 'hash_mismatch';

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: validateContentHash() accepts bare 64-hex hashes, but this exact-string comparison only matches the prefixed form stored in metadata. A caller sending a valid bare hash will get a false hash_mismatch here (and the same pattern exists in the peer read path). Canonicalize both hashes before comparing, or tighten validation to reject bare hashes.

}

return jsonResponse(res, 200, {
status: remote.verifiedBytes ? 'fetched' : 'fetchable',

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: remote.verifiedBytes is set even when cache === false, so this reports status: "fetched" for responses that were verified but never written to the local file store. That makes the status lie about local availability. Base the status on whether the cache promotion actually ran, or introduce a separate status for "verified-but-not-cached".

assertionUri: resolved.assertionUri,
kind: resolved.kind,
hash: resolved.hash,
contentType: page.contentType ?? resolved.contentType,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: page.contentType comes from the remote peer and is not covered by the hash verification you do above. Returning it preferentially lets a buggy or malicious peer change the MIME type the HTTP client sees even when local metadata says otherwise. Prefer the locally-resolved contentType, or reject mismatches before responding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants