feat: extend benchmarks with agentic variation mode support and comparison by danielbentes · Pull Request #8 · synaptiai/siare

danielbentes · 2026-03-27T17:15:37Z

Summary

Extends the benchmark infrastructure to support the hybrid agentic evolution feature (v1.1.0) and adds a comparison benchmark script for evaluating all 3 variation modes.

Config wiring: SelfImprovementConfig and EvolutionBenchmarkConfig now accept AgenticVariationConfig, threaded to EvolutionScheduler
CLI flags: --variation-mode, --max-inner-iterations, --agent-model added to run_self_improvement_benchmark.py
Comparison script: New run_agentic_comparison.py runs all 3 modes (single_turn, agentic, adaptive) on the same dataset
Comparison report: New AgenticComparisonReport generates side-by-side analysis with winner analysis and recommendations
Agentic metrics tracking: GenerationSnapshot.agentic_stats captures inner iterations and budget per generation
Report updates: Self-improvement reports include variation mode in metadata and configuration
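
A minimal sketch of the config wiring listed above, assuming plain dataclasses. Only the class names and the `agentic_config` field come from this PR; the `AgenticVariationConfig` fields, `max_generations`, and the `build_scheduler_kwargs` helper are hypothetical stand-ins, not the actual SIARE definitions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class AgenticVariationConfig:
    # Hypothetical fields that mirror the CLI flags added in this PR.
    mode: str = "single_turn"  # single_turn | agentic | adaptive
    max_inner_iterations: Optional[int] = None
    agent_model: Optional[str] = None


@dataclass
class SelfImprovementConfig:
    max_generations: int = 10  # hypothetical pre-existing field
    # None keeps the existing (non-agentic) behavior.
    agentic_config: Optional[AgenticVariationConfig] = None


def build_scheduler_kwargs(config: SelfImprovementConfig) -> Dict[str, Any]:
    """Collect the keyword arguments that would be passed to the scheduler."""
    return {
        "max_generations": config.max_generations,
        # Passed through unchanged, so a None default stays None downstream.
        "agentic_config": config.agentic_config,
    }
```

With the default config the scheduler would receive `agentic_config=None`, which is what keeps existing benchmark runs unchanged.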

Usage

# Quick comparison (3 generations, 20 samples)
python -m siare.benchmarks.scripts.run_agentic_comparison \
    --provider ollama --model llama3.1:8b \
    --reasoning-model deepseek-r1:7b \
    --dataset-tier 1 --quick

# Full comparison
python -m siare.benchmarks.scripts.run_agentic_comparison \
    --provider openai --model gpt-4o-mini \
    --reasoning-model gpt-4o \
    --dataset-tier 1 --generations 10 --samples 50

# Single mode via existing benchmark
python -m siare.benchmarks.scripts.run_self_improvement_benchmark \
    --provider ollama --model llama3.1:8b \
    --variation-mode adaptive --quick
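
For reference, the three new flags could be declared roughly as follows. This is a sketch using `argparse`, not the script's actual parser: the flag names and mode values come from this PR, while the types, the `None` defaults, and the `build_parser` helper are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser containing only the flags added in this PR."""
    parser = argparse.ArgumentParser(description="Sketch of the agentic flags")
    parser.add_argument(
        "--variation-mode",
        choices=["single_turn", "agentic", "adaptive"],
        default=None,  # assumed: None means existing behavior
    )
    parser.add_argument("--max-inner-iterations", type=int, default=None)
    parser.add_argument("--agent-model", default=None)
    return parser


# Usage: argparse exposes dashes in flag names as underscores.
args = build_parser().parse_args(["--variation-mode", "adaptive"])
```

Restricting `--variation-mode` with `choices` makes a typo fail at parse time rather than partway through a benchmark run.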

New files

| File | Purpose |
|------|---------|
| `siare/benchmarks/scripts/run_agentic_comparison.py` | CLI for running all 3 modes |
| `siare/benchmarks/reports/agentic_comparison_report.py` | Side-by-side comparison report |
| `tests/unit/test_benchmark_agentic_config.py` | 14 tests for config wiring and reports |

Test plan

389 tests pass (375 existing + 14 new), 0 regressions
ruff check siare/ — all checks passed
Backward compatible — all defaults are None (existing behavior)
Run comparison benchmark with real LLM provider

🤖 Generated with Claude Code

- Add agentic_config field to SelfImprovementConfig and EvolutionBenchmarkConfig (defaults to None = current behavior)
- Thread to EvolutionScheduler in both _create_scheduler() methods
- Add agentic_stats to GenerationSnapshot for tracking inner iterations and budget usage per generation
- Add --variation-mode, --max-inner-iterations, --agent-model CLI flags to run_self_improvement_benchmark.py
- Add variation mode to self_improvement_report metadata and configuration sections
- Extend SelfImprovementResult.summary() with agentic metrics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New run_agentic_comparison.py script that runs SelfImprovementBenchmark in all 3 variation modes (single_turn, agentic, adaptive) on the same dataset and generates a side-by-side comparison report.

New AgenticComparisonReport class generates markdown reports with:
- Mode comparison table (accuracy, improvement, convergence, timing)
- Winner analysis (quality, speed, efficiency dimensions)
- Agentic-specific metrics (inner iterations, supervisor redirections)
- Data-driven recommendations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- New test_benchmark_agentic_config.py with 14 tests covering: config passthrough, stats extraction, summary generation, comparison report, and self-improvement report rendering
- Fix improvement_pct parsing in comparison report (handles both float and "21.8%" string formats from SelfImprovementResult)

389 tests pass, lint clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
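
The `improvement_pct` fix above accepts both a float and a string such as `"21.8%"`. A minimal sketch of that kind of normalization follows; the helper name `parse_improvement_pct` is hypothetical and the real report code may differ.

```python
from typing import Union


def parse_improvement_pct(value: Union[int, float, str]) -> float:
    """Return the improvement as a float, whether given 21.8 or "21.8%"."""
    if isinstance(value, (int, float)):
        return float(value)
    # Drop surrounding whitespace and a trailing percent sign, then convert.
    return float(value.strip().rstrip("%"))
```

Normalizing once at the boundary lets the rest of the report compare and format plain floats.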

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Create siare/providers/ollama_provider.py with OllamaProvider for local model inference via Ollama API
- Fix run_agentic_comparison.py to handle OllamaProvider import (try direct import, fall back to factory)
- Add comprehensive benchmarks guide (docs/guides/benchmarks.md) covering all modes, datasets, metrics, and CLI usage
- Update docs/GLOSSARY.md with agentic evolution and benchmark terms
- Update docs/CONFIGURATION.md with agentic evolution settings
- Update docs/README.md with benchmark guide link

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

P1 fixes:
- Fix AgenticComparisonReport constructor call (pass summaries + configs, not raw result objects)
- Fix report.save_markdown → report.save (correct API)
- Fix "final" → "evolved" key in comparison report accuracy lookup

P2 fixes:
- Fix negative improvement formatting (+-5% → -5.0%)
- Extract done_reason from Ollama response instead of hardcoded "stop"
- Fix default model llama3.2:7b → llama3.2:3b (7b doesn't exist)
- Fix docs config example to match actual default (gpt-5)

P3 fixes:
- Catch ValueError for JSON parse errors in OllamaProvider

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

danielbentes

Self-Review: PR #8

Issues Found and Fixed (commit `bbec2f6`)

P1 Fixes (4 — were blocking)

| # | Issue | Fix |
|---|-------|-----|
| 1 | `AgenticComparisonReport(mode_results)` was missing the 2nd argument and passed the wrong type | Pass `(mode_summaries, mode_configs_dict)` |
| 2 | `report.save_markdown()` method doesn't exist | Changed to `report.save(str(output_dir))` |
| 3 | `metric_data.get("final")` used the wrong key; the key is `"evolved"` | Fixed key name |
| 4 | (Combined with #1) Dataclass objects were passed instead of dicts | Used `mode_summaries` |

P2 Fixes (4)

| # | Issue | Fix |
|---|-------|-----|
| 1 | `+-5.0%` for negative improvements | Use `{pct:+.1f}%` format |
| 2 | `finish_reason` hardcoded to `"stop"` | Extract `done_reason` from Ollama response |
| 3 | Default model `llama3.2:7b` doesn't exist | Changed to `llama3.2:3b` |
| 4 | Docs config says `gpt-4o`, actual default is `gpt-5` | Fixed YAML example |
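
The sign fix in row 1 relies on the `+` flag in Python's format spec, which emits the sign for both positive and negative numbers. In the sketch below, `format_improvement_buggy` is only a guess at how the `+-5.0%` output may have arisen, and both function names are hypothetical.

```python
def format_improvement_buggy(pct: float) -> str:
    # Assumed shape of the bug: a literal "+" placed before a negative number.
    return f"+{pct:.1f}%"


def format_improvement(pct: float) -> str:
    # The "+" flag lets the formatter choose the sign itself.
    return f"{pct:+.1f}%"
```

With the flag, `-5.0` renders as `-5.0%` and `21.8` as `+21.8%`, so no separate branch for negative values is needed.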

P3 Fixes (1)

Catch ValueError for JSON parse errors in OllamaProvider
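
Catching `ValueError` covers JSON parse failures because `json.JSONDecodeError` is a subclass of `ValueError`. The sketch below combines that with the `done_reason` extraction from the P2 fixes. It is not the actual OllamaProvider code: the function name, the returned dict shape, the `"error"` marker, the `"stop"` fallback, and reading the text from a `response` key are all assumptions.

```python
import json
from typing import Any, Dict


def parse_ollama_body(raw: str) -> Dict[str, Any]:
    """Parse a raw JSON response body into a small result dict (hypothetical shape)."""
    try:
        payload = json.loads(raw)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        return {"content": "", "finish_reason": "error", "error": str(exc)}
    return {
        # Assumed: the generated text sits under the "response" key.
        "content": payload.get("response", ""),
        # Use the server-reported reason; fall back to "stop" only if absent.
        "finish_reason": payload.get("done_reason", "stop"),
    }
```

Reading `done_reason` from the payload means a truncated generation is reported as such instead of always appearing as a clean stop.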

Remaining P2 (deferred)

agentic_config: Any type safety — requires dataclass/Pydantic compatibility work
check_ollama_model() substring match — minor, existing behavior
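
For the deferred substring-match item, one possible tightening is an exact comparison on the full `name:tag` string. This is a sketch only: `model_available` and its signature are hypothetical, it does not reflect how `check_ollama_model()` is written, and it assumes a bare model name should resolve to the `latest` tag.

```python
from typing import Iterable


def model_available(requested: str, installed: Iterable[str]) -> bool:
    """Return True only if the exact name:tag is among the installed models."""
    # Assumed: a name without a tag refers to the ":latest" tag.
    wanted = requested if ":" in requested else f"{requested}:latest"
    return wanted in set(installed)
```

A substring test would accept `llama3.2:3` when only `llama3.2:3b` is installed; the exact comparison rejects it.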

Quality Gates

389 tests pass
Lint clean
All P1s fixed

🤖 Generated with Claude Code

danielbentes and others added 7 commits March 27, 2026 18:08

chore: remove accidentally staged worktrees and plans

56b2aee

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: add datasets and benchmark results to gitignore

f64fa18

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

danielbentes commented Mar 28, 2026


danielbentes merged commit f04928d into main Mar 28, 2026
6 checks passed

danielbentes deleted the feature/benchmark-agentic-comparison branch March 28, 2026 13:17
