feat: extend benchmarks with agentic variation mode support and comparison #8

Merged
danielbentes merged 7 commits into main from feature/benchmark-agentic-comparison
Mar 28, 2026

@danielbentes

Summary

Extends the benchmark infrastructure to support the hybrid agentic evolution feature (v1.1.0) and adds a comparison benchmark script for evaluating all 3 variation modes.

  • Config wiring: SelfImprovementConfig and EvolutionBenchmarkConfig now accept AgenticVariationConfig, threaded to EvolutionScheduler
  • CLI flags: --variation-mode, --max-inner-iterations, --agent-model added to run_self_improvement_benchmark.py
  • Comparison script: New run_agentic_comparison.py runs all 3 modes (single_turn, agentic, adaptive) on the same dataset
  • Comparison report: New AgenticComparisonReport generates side-by-side analysis with winner analysis and recommendations
  • Agentic metrics tracking: GenerationSnapshot.agentic_stats captures inner iterations and budget per generation
  • Report updates: Self-improvement reports include variation mode in metadata and configuration
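
The config wiring described above can be pictured with a minimal sketch. It is not the code from this PR: the field names on `AgenticVariationConfig`, the `create_scheduler` helper, and the fallback to `single_turn` when no config is given are all assumptions made for illustration. Only the class names and the `None` default come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgenticVariationConfig:
    # Hypothetical fields, modeled on the CLI flags listed above.
    mode: str = "single_turn"  # one of: single_turn, agentic, adaptive
    max_inner_iterations: int = 5
    agent_model: Optional[str] = None


@dataclass
class SelfImprovementConfig:
    generations: int = 3
    # None keeps the pre-existing behavior (backward compatible).
    agentic_config: Optional[AgenticVariationConfig] = None


class EvolutionScheduler:
    """Stand-in scheduler that only records the config it was given."""

    def __init__(self, agentic_config: Optional[AgenticVariationConfig] = None):
        self.agentic_config = agentic_config

    @property
    def variation_mode(self) -> str:
        # Assumption: no agentic config means the original single-turn path.
        return self.agentic_config.mode if self.agentic_config else "single_turn"


def create_scheduler(config: SelfImprovementConfig) -> EvolutionScheduler:
    # Thread the optional config straight through, unchanged.
    return EvolutionScheduler(agentic_config=config.agentic_config)


default_scheduler = create_scheduler(SelfImprovementConfig())
adaptive_scheduler = create_scheduler(
    SelfImprovementConfig(agentic_config=AgenticVariationConfig(mode="adaptive"))
)
```

Keeping the new field optional means existing callers that never set it construct the scheduler exactly as before.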

Usage

# Quick comparison (3 generations, 20 samples)
python -m siare.benchmarks.scripts.run_agentic_comparison \
    --provider ollama --model llama3.1:8b \
    --reasoning-model deepseek-r1:7b \
    --dataset-tier 1 --quick

# Full comparison
python -m siare.benchmarks.scripts.run_agentic_comparison \
    --provider openai --model gpt-4o-mini \
    --reasoning-model gpt-4o \
    --dataset-tier 1 --generations 10 --samples 50

# Single mode via existing benchmark
python -m siare.benchmarks.scripts.run_self_improvement_benchmark \
    --provider ollama --model llama3.1:8b \
    --variation-mode adaptive --quick
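
For readers unfamiliar with how such flags are typically declared, here is a hedged `argparse` sketch. The flag names and the three mode values come from this PR; the defaults, types, and the `build_parser` helper are assumptions and may differ from the real script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Illustrative flag parser")
    parser.add_argument(
        "--variation-mode",
        choices=["single_turn", "agentic", "adaptive"],
        default=None,  # None = leave the existing behavior untouched
    )
    # Assumed to be an integer budget for the inner agentic loop.
    parser.add_argument("--max-inner-iterations", type=int, default=None)
    parser.add_argument("--agent-model", default=None)
    parser.add_argument("--quick", action="store_true")
    return parser


# Parse the flags used in the "single mode" command above.
args = build_parser().parse_args(["--variation-mode", "adaptive", "--quick"])
```

`argparse` turns dashes into underscores, so the values are read as `args.variation_mode`, `args.max_inner_iterations`, and `args.agent_model`.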

New files

| File | Purpose |
| --- | --- |
| siare/benchmarks/scripts/run_agentic_comparison.py | CLI for running all 3 modes |
| siare/benchmarks/reports/agentic_comparison_report.py | Side-by-side comparison report |
| tests/unit/test_benchmark_agentic_config.py | 14 tests for config wiring and reports |

Test plan

  • 389 tests pass (375 existing + 14 new), 0 regressions
  • ruff check siare/ — all checks passed
  • Backward compatible — all defaults are None (existing behavior)
  • Run comparison benchmark with real LLM provider

🤖 Generated with Claude Code

danielbentes and others added 7 commits March 27, 2026 18:08
- Add agentic_config field to SelfImprovementConfig and
  EvolutionBenchmarkConfig (defaults to None = current behavior)
- Thread to EvolutionScheduler in both _create_scheduler() methods
- Add agentic_stats to GenerationSnapshot for tracking inner
  iterations and budget usage per generation
- Add --variation-mode, --max-inner-iterations, --agent-model
  CLI flags to run_self_improvement_benchmark.py
- Add variation mode to self_improvement_report metadata and
  configuration sections
- Extend SelfImprovementResult.summary() with agentic metrics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New run_agentic_comparison.py script that runs SelfImprovementBenchmark
in all 3 variation modes (single_turn, agentic, adaptive) on the same
dataset and generates a side-by-side comparison report.

New AgenticComparisonReport class generates markdown reports with:
- Mode comparison table (accuracy, improvement, convergence, timing)
- Winner analysis (quality, speed, efficiency dimensions)
- Agentic-specific metrics (inner iterations, supervisor redirections)
- Data-driven recommendations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
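
One plausible shape for the winner analysis is sketched below. It is an assumption, not the logic in `AgenticComparisonReport`: the summary keys (`evolved_accuracy`, `improvement`, `total_time_s`) and the definition of efficiency as improvement per second are hypothetical, and the numbers are placeholders rather than benchmark results.

```python
from typing import Dict


def pick_winners(summaries: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return the best mode per dimension from per-mode summary dicts."""
    return {
        # Quality: highest evolved accuracy wins.
        "quality": max(summaries, key=lambda m: summaries[m]["evolved_accuracy"]),
        # Speed: lowest wall-clock time wins.
        "speed": min(summaries, key=lambda m: summaries[m]["total_time_s"]),
        # Efficiency: most improvement gained per second of runtime.
        "efficiency": max(
            summaries,
            key=lambda m: summaries[m]["improvement"] / summaries[m]["total_time_s"],
        ),
    }


# Placeholder values for illustration only, not measured results.
example = {
    "single_turn": {"evolved_accuracy": 0.70, "improvement": 0.05, "total_time_s": 100.0},
    "agentic": {"evolved_accuracy": 0.78, "improvement": 0.13, "total_time_s": 400.0},
    "adaptive": {"evolved_accuracy": 0.76, "improvement": 0.11, "total_time_s": 200.0},
}
winners = pick_winners(example)
```

Splitting the verdict by dimension avoids declaring a single overall winner when one mode is more accurate and another is cheaper.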
- New test_benchmark_agentic_config.py with 14 tests covering:
  config passthrough, stats extraction, summary generation,
  comparison report, and self-improvement report rendering
- Fix improvement_pct parsing in comparison report (handles both
  float and "21.8%" string formats from SelfImprovementResult)

389 tests pass, lint clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
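
The `improvement_pct` fix mentioned in this commit handles two input shapes. A minimal sketch of that kind of normalization follows; the function name `parse_improvement_pct` is hypothetical and the actual implementation in the report may differ.

```python
from typing import Union


def parse_improvement_pct(value: Union[float, int, str]) -> float:
    """Normalize an improvement value to a float percentage."""
    if isinstance(value, (int, float)):
        return float(value)
    # Strings such as "21.8%" or " -5% ": drop whitespace and a trailing "%".
    return float(value.strip().rstrip("%"))
```

Normalizing once at the boundary lets the rest of the report treat the value as a plain number.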
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Create siare/providers/ollama_provider.py with OllamaProvider for
  local model inference via Ollama API
- Fix run_agentic_comparison.py to handle OllamaProvider import
  (try direct import, fall back to factory)
- Add comprehensive benchmarks guide (docs/guides/benchmarks.md)
  covering all modes, datasets, metrics, and CLI usage
- Update docs/GLOSSARY.md with agentic evolution and benchmark terms
- Update docs/CONFIGURATION.md with agentic evolution settings
- Update docs/README.md with benchmark guide link

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P1 fixes:
- Fix AgenticComparisonReport constructor call (pass summaries +
  configs, not raw result objects)
- Fix report.save_markdown → report.save (correct API)
- Fix "final" → "evolved" key in comparison report accuracy lookup

P2 fixes:
- Fix negative improvement formatting (+-5% → -5.0%)
- Extract done_reason from Ollama response instead of hardcoded "stop"
- Fix default model llama3.2:7b → llama3.2:3b (7b doesn't exist)
- Fix docs config example to match actual default (gpt-5)

P3 fixes:
- Catch ValueError for JSON parse errors in OllamaProvider

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@danielbentes left a comment

Self-Review: PR #8

Issues Found and Fixed (commit bbec2f6)

P1 Fixes (4 — were blocking)

| # | Issue | Fix |
| --- | --- | --- |
| 1 | AgenticComparisonReport(mode_results) missing 2nd arg + wrong type | Pass (mode_summaries, mode_configs_dict) |
| 2 | report.save_markdown() method doesn't exist | Changed to report.save(str(output_dir)) |
| 3 | metric_data.get("final"): key is "evolved" | Fixed key name |
| 4 | (Combined with #1) dataclass objects instead of dicts | Used mode_summaries |

P2 Fixes (4)

| # | Issue | Fix |
| --- | --- | --- |
| 1 | +-5.0% for negative improvements | Use {pct:+.1f}% format |
| 2 | finish_reason hardcoded to "stop" | Extract done_reason from Ollama response |
| 3 | Default model llama3.2:7b doesn't exist | Changed to llama3.2:3b |
| 4 | Docs config says gpt-4o, actual default is gpt-5 | Fixed YAML example |
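
The first P2 fix relies on Python's `+` format flag, which emits the sign for both positive and negative numbers. The snippet below contrasts it with a reconstruction of the kind of code that produces `+-5.0%`. The buggy version is an assumption about what the original looked like, and both function names are hypothetical.

```python
def format_improvement_buggy(pct: float) -> str:
    # Assumed original: a hardcoded "+" prefix collides with the minus sign.
    return f"+{pct:.1f}%"


def format_improvement(pct: float) -> str:
    # The "+" flag makes the format spec emit "+" or "-" itself.
    return f"{pct:+.1f}%"
```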

P3 Fixes (1)

  • Catch ValueError for JSON parse errors in OllamaProvider
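
The `done_reason` and `ValueError` fixes can be illustrated together. This is a sketch, not the `OllamaProvider` code: the `parse_ollama_body` helper, the returned dict shape, and the `"error"` finish reason are assumptions. It relies on two facts: `json.JSONDecodeError` is a subclass of `ValueError`, and Ollama's generate responses carry `response` and `done_reason` fields.

```python
import json
from typing import Any, Dict


def parse_ollama_body(raw: str) -> Dict[str, Any]:
    """Parse a raw Ollama response body into content plus finish reason."""
    try:
        payload = json.loads(raw)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        return {"content": "", "finish_reason": "error", "error": str(exc)}
    if not isinstance(payload, dict):
        # Valid JSON, but not the object shape this sketch expects.
        return {"content": "", "finish_reason": "error", "error": "unexpected payload"}
    return {
        "content": payload.get("response", ""),
        # Use the server-reported reason; fall back to "stop" only if absent.
        "finish_reason": payload.get("done_reason", "stop"),
    }
```

Reading `done_reason` lets callers tell a natural stop apart from a truncation such as `"length"`, which a hardcoded `"stop"` would hide.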

Remaining P2 (deferred)

  • agentic_config: Any type safety — requires dataclass/Pydantic compatibility work
  • check_ollama_model() substring match — minor, existing behavior
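
To show why the deferred substring match is worth revisiting, the sketch below contrasts it with an exact tag comparison. The real `check_ollama_model()` is not shown in this PR, so both functions are hypothetical reconstructions; the `:latest` default for untagged names follows Ollama's naming convention.

```python
from typing import List


def substring_match(requested: str, installed: List[str]) -> bool:
    # Loose check: any installed name containing the requested text passes.
    return any(requested in name for name in installed)


def exact_match(requested: str, installed: List[str]) -> bool:
    # A bare name such as "llama3.1" implies the ":latest" tag.
    wanted = requested if ":" in requested else f"{requested}:latest"
    return wanted in installed


# Hypothetical list of locally installed models.
installed_models = ["llama3.1:70b", "deepseek-r1:7b"]
```

With this list, `substring_match("llama3.1:7", installed_models)` passes because the text appears inside `llama3.1:70b`, while `exact_match` rejects it.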

Quality Gates

  • 389 tests pass
  • Lint clean
  • All P1s fixed

🤖 Generated with Claude Code

@danielbentes merged commit f04928d into main Mar 28, 2026
6 checks passed
@danielbentes deleted the feature/benchmark-agentic-comparison branch March 28, 2026 13:17