MarsPain/agent_data_synthesis

Agent Data Synthesis

Agent Data Synthesis is a local-first Python framework for generating, executing, validating, and packaging agent training data. It is not an instruction-response expander: accepted records are grounded in executable environment state, tool calls, observations, verifier results, lineage, and quality evidence.

The repository is still early-stage, but it now has a working synchronous pipeline with two deterministic domains, source governance, profile-based runs, runtime episode evidence, replay checks, reward-label export, and release admission artifacts. Canonical design detail lives in docs/.

What Works Now

  • main.py runs the local synchronous foundation pipeline and writes outputs under artifacts/foundation/ by default.
  • Contacts and synthetic mobile-message domains run through a shared domain pipeline boundary.
  • run_profile_v1 and run_profile_v2 fixtures configure deterministic local runs, scale probes, release candidates, and profile-local governed sources.
  • Profile-local JSON sources are admitted through shared source governance, then parsed by domain-owned importers for contacts or mobile messages.
  • Candidate processing validates task contracts, executes policies, verifies final answers and expected state, classifies rejections, and merges outcomes deterministically.
  • Opt-in reports can add held-out evaluation, profile decisions, dataset release admission, release packs, release quality audits, episode quality, executable replay, and deterministic reward labels.
  • Remote LLM generation is supported through an OpenAI-compatible API. Local LLM serving, distributed workers, external MCP servers, Agentic RL rollout collection, and full AWM runtime extraction are intentionally deferred.
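
As a rough illustration of the candidate-processing step listed above, the sketch below validates a toy task contract, runs a policy, verifies the final answer, classifies rejections, and merges outcomes in a deterministic order. Every name here (Candidate, Outcome, process_candidate, merge_outcomes, and the rejection reason strings) is hypothetical and is not taken from the repository's code; the real contracts are described in docs/.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Candidate:
    candidate_id: str
    task: dict  # hypothetical contract: requires "question" and "expected_answer"


@dataclass(frozen=True)
class Outcome:
    candidate_id: str
    accepted: bool
    rejection_reason: Optional[str] = None


def process_candidate(candidate: Candidate, policy: Callable[[dict], str]) -> Outcome:
    # 1. Validate the (toy) task contract.
    missing = {"question", "expected_answer"} - set(candidate.task)
    if missing:
        return Outcome(candidate.candidate_id, False, "invalid_contract")
    # 2. Execute the policy; any exception is classified as an execution failure.
    try:
        answer = policy(candidate.task)
    except Exception:
        return Outcome(candidate.candidate_id, False, "execution_error")
    # 3. Verify the final answer against the expected value.
    if answer != candidate.task["expected_answer"]:
        return Outcome(candidate.candidate_id, False, "wrong_answer")
    return Outcome(candidate.candidate_id, True)


def merge_outcomes(outcomes):
    # Sorting by candidate id makes the merged result independent of the
    # order in which candidates happened to be processed.
    ordered = sorted(outcomes, key=lambda outcome: outcome.candidate_id)
    accepted = [o for o in ordered if o.accepted]
    rejected = [o for o in ordered if not o.accepted]
    return accepted, rejected
```

Sorting before the merge is only one way to get a deterministic result; the point of the sketch is that every rejection carries a classified reason rather than being silently dropped.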

Quick Start

uv run python main.py
uv run python -m unittest
uv run python scripts/validate_docs.py

Default output is written to artifacts/foundation/ and includes samples.jsonl, rejections.jsonl, manifest.json, and quality_report.json.
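
To inspect a finished run, a short script can count the records in the two JSONL files and load the two JSON documents. This is a minimal sketch that assumes only the file names listed above, that each .jsonl file holds one JSON object per line, and that manifest.json and quality_report.json are top-level JSON objects. It makes no assumption about the fields inside any record, and the helper names read_jsonl and summarize_run are hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list:
    """Parse a JSON Lines file, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def summarize_run(output_dir: str = "artifacts/foundation") -> dict:
    """Return record counts and top-level keys for a default run directory."""
    root = Path(output_dir)
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
    quality = json.loads((root / "quality_report.json").read_text(encoding="utf-8"))
    return {
        "samples": len(read_jsonl(root / "samples.jsonl")),
        "rejections": len(read_jsonl(root / "rejections.jsonl")),
        # Only the key names are reported, since the schemas are not assumed here.
        "manifest_keys": sorted(manifest),
        "quality_report_keys": sorted(quality),
    }
```

Calling summarize_run() with no arguments reads from the default artifacts/foundation/ directory; pass the --output-dir value of a run to inspect that run instead.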

Common Runs

# Contacts foundation variants
uv run python main.py --enable-branching --output-dir artifacts/foundation-branching
uv run python main.py --enable-task-expansion --output-dir artifacts/foundation-task-expansion
uv run python main.py --enable-source-governance-fixture --output-dir artifacts/foundation-source-governance

# Profile-driven contacts and mobile runs
uv run python main.py --run-profile tests/fixtures/run_profiles/foundation-scale-probe-25.json --write-evaluation-report --write-profile-decision-report --output-dir artifacts/foundation-scale-probe
uv run python main.py --run-profile tests/fixtures/run_profiles/profile-local-contacts.json --output-dir artifacts/profile-local-contacts
uv run python main.py --run-profile tests/fixtures/run_profiles/profile-local-mobile-messages.json --write-episode-replay-report --write-reward-label-report --output-dir artifacts/profile-local-mobile

# Controlled no-network test of the HTTPS source path
uv run python main.py --enable-network-source --source-url https://allowed.example.test/contacts.json --source-license-label cc-by-4.0 --allowed-source-host allowed.example.test --mock-source-fixture tests/fixtures/contacts.json --output-dir artifacts/foundation-network-source

# Release-candidate evidence pack
uv run python main.py --run-profile tests/fixtures/run_profiles/foundation-release-candidate.json --write-evaluation-report --write-profile-decision-report --write-dataset-release-report --write-dataset-release-pack --write-release-quality-audit --write-dataset-release-card --output-dir artifacts/foundation-release-candidate
uv run python scripts/verify_dataset_release.py --output-dir artifacts/foundation-release-candidate
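
The network-source run above pairs --source-url with --allowed-source-host. One plausible way to enforce such an allowlist is sketched below using only the standard library. The function name and the exact rules (HTTPS only, exact host match) are assumptions for illustration, not the repository's actual source-governance logic.

```python
from urllib.parse import urlsplit


def is_allowed_source(url: str, allowed_hosts: set) -> bool:
    """Accept only HTTPS URLs whose host exactly matches an allowlisted host."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # urlsplit lowercases the hostname and strips userinfo and port,
    # so "https://allowed.example.test@evil.test/" resolves to "evil.test".
    host = parts.hostname
    return host is not None and host in {h.lower() for h in allowed_hosts}
```

Comparing the parsed hostname rather than searching the raw URL string avoids accepting look-alike URLs that merely contain the allowed host as a prefix or as userinfo.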

Optional LLM Configuration

The deterministic fixture path runs without provider credentials. Pass --use-llm to generate candidates through a remote OpenAI-compatible /chat/completions API:

export AGENT_DATA_LLM_BASE_URL="https://provider.example/v1"
export AGENT_DATA_API_KEY="..."
export AGENT_DATA_LLM_MODEL="model-id"
uv run python main.py --use-llm --output-dir artifacts/foundation-llm

Provider calls are routed through role contracts and sanitized lineage. API keys, headers, raw provider payloads, prompts, source payload rows, profile paths, and host paths must not be written to public artifacts.
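
A minimal sketch of the kind of sanitization this rule implies: before a lineage record is written, drop every field whose name marks it as sensitive. The field names in SENSITIVE_KEYS and the record shape are hypothetical; the repository's real role contracts and lineage schema are defined in docs/.

```python
# Hypothetical field names standing in for the categories listed above.
SENSITIVE_KEYS = frozenset(
    {
        "api_key",
        "headers",
        "raw_payload",
        "prompt",
        "source_rows",
        "profile_path",
        "host_path",
    }
)


def sanitize(value):
    """Return a copy of value with sensitive keys removed at every nesting level."""
    if isinstance(value, dict):
        return {
            key: sanitize(item)
            for key, item in value.items()
            if key not in SENSITIVE_KEYS
        }
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value
```

A key-based filter like this is only a first line of defense: it does not catch a secret embedded inside an otherwise harmless string field.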

Artifact Families

  • Default dataset artifacts: samples.jsonl, rejections.jsonl, manifest.json, quality_report.json.
  • Source and sandbox audits: source_events.jsonl, sandbox_audits.jsonl.
  • Evaluation and release artifacts: evaluation_report.json, profile_decision_report.json, dataset_release_report.json, dataset_release_pack.json, release_quality_audit.json, dataset_release_card.md.
  • Runtime evidence artifacts: episodes.jsonl, episode_quality_report.json, episode_replay_report.json, reward_labels.jsonl, reward_label_report.json.

All non-default artifact families are explicit opt-ins. They are evidence for local quality, replay, release, or future training workflows; they are not proof of downstream model improvement.
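
Because most families are opt-in, it can be useful to check which artifacts a given run actually produced. The sketch below groups the file names listed above by family and reports which ones exist in an output directory. The family keys and the function name present_artifacts are hypothetical labels chosen for this example.

```python
from pathlib import Path

# File names copied from the artifact list above; the dictionary keys are
# illustrative labels, not identifiers used by the pipeline.
ARTIFACT_FAMILIES = {
    "default": [
        "samples.jsonl",
        "rejections.jsonl",
        "manifest.json",
        "quality_report.json",
    ],
    "source_and_sandbox": ["source_events.jsonl", "sandbox_audits.jsonl"],
    "evaluation_and_release": [
        "evaluation_report.json",
        "profile_decision_report.json",
        "dataset_release_report.json",
        "dataset_release_pack.json",
        "release_quality_audit.json",
        "dataset_release_card.md",
    ],
    "runtime_evidence": [
        "episodes.jsonl",
        "episode_quality_report.json",
        "episode_replay_report.json",
        "reward_labels.jsonl",
        "reward_label_report.json",
    ],
}


def present_artifacts(output_dir: str) -> dict:
    """Map each family to the artifact files that exist in output_dir."""
    root = Path(output_dir)
    return {
        family: [name for name in names if (root / name).is_file()]
        for family, names in ARTIFACT_FAMILIES.items()
    }
```

An empty list for a family simply means the corresponding opt-in flags were not passed for that run.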

Repository Rules

  • Keep root files concise and use them as navigation entrypoints.
  • Treat docs/ as the source of truth for architecture, data contracts, security rules, and implementation plans.
  • Keep runtime pipeline outputs under artifacts/.
  • Update affected docs and implementation together when workflows, schemas, commands, or architecture boundaries change.
