feat: extend benchmarks with agentic variation mode support and comparison #8
Merged
- Add agentic_config field to SelfImprovementConfig and EvolutionBenchmarkConfig (defaults to None = current behavior)
- Thread to EvolutionScheduler in both _create_scheduler() methods
- Add agentic_stats to GenerationSnapshot for tracking inner iterations and budget usage per generation
- Add --variation-mode, --max-inner-iterations, --agent-model CLI flags to run_self_improvement_benchmark.py
- Add variation mode to self_improvement_report metadata and configuration sections
- Extend SelfImprovementResult.summary() with agentic metrics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
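A minimal sketch of the "defaults to None = current behavior" pattern described in this commit. The class names `SketchAgenticConfig` and `SketchBenchmarkConfig`, their fields, and the `scheduler_kwargs` helper are hypothetical stand-ins, not the actual SIARE definitions; the real code may simply pass `None` straight through to the scheduler.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SketchAgenticConfig:
    """Hypothetical stand-in for the agentic variation settings."""
    mode: str = "single_turn"
    max_inner_iterations: int = 3


@dataclass
class SketchBenchmarkConfig:
    """Hypothetical stand-in for a benchmark config with an optional agentic field."""
    max_generations: int = 10
    # None means "no agentic settings supplied", so existing callers are unaffected.
    agentic_config: Optional[SketchAgenticConfig] = None


def scheduler_kwargs(config: SketchBenchmarkConfig) -> Dict[str, Any]:
    """Build scheduler keyword arguments, adding agentic settings only when set."""
    kwargs: Dict[str, Any] = {"max_generations": config.max_generations}
    if config.agentic_config is not None:
        kwargs["agentic_config"] = config.agentic_config
    return kwargs


legacy = scheduler_kwargs(SketchBenchmarkConfig())
agentic = scheduler_kwargs(
    SketchBenchmarkConfig(agentic_config=SketchAgenticConfig(mode="agentic"))
)
```

With this shape, a config created without the new field produces exactly the arguments it produced before the change.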
New run_agentic_comparison.py script that runs SelfImprovementBenchmark in all 3 variation modes (single_turn, agentic, adaptive) on the same dataset and generates a side-by-side comparison report.

New AgenticComparisonReport class generates markdown reports with:
- Mode comparison table (accuracy, improvement, convergence, timing)
- Winner analysis (quality, speed, efficiency dimensions)
- Agentic-specific metrics (inner iterations, supervisor redirections)
- Data-driven recommendations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New test_benchmark_agentic_config.py with 14 tests covering: config passthrough, stats extraction, summary generation, comparison report, and self-improvement report rendering
- Fix improvement_pct parsing in comparison report (handles both float and "21.8%" string formats from SelfImprovementResult)

389 tests pass, lint clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
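One way the dual-format `improvement_pct` handling mentioned above could look. The helper name `parse_improvement_pct` is hypothetical; the actual fix in the comparison report may be structured differently.

```python
from typing import Union


def parse_improvement_pct(value: Union[int, float, str]) -> float:
    """Hypothetical helper: normalize an improvement value to a float percentage.

    Accepts either a number (21.8) or a formatted string ("21.8%").
    """
    if isinstance(value, (int, float)):
        return float(value)
    # Strings: drop surrounding whitespace and a trailing percent sign, then parse.
    return float(value.strip().rstrip("%"))
```

Both `parse_improvement_pct(21.8)` and `parse_improvement_pct("21.8%")` return the same float, so downstream comparisons do not need to care which format the summary used.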
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Create siare/providers/ollama_provider.py with OllamaProvider for local model inference via Ollama API
- Fix run_agentic_comparison.py to handle OllamaProvider import (try direct import, fall back to factory)
- Add comprehensive benchmarks guide (docs/guides/benchmarks.md) covering all modes, datasets, metrics, and CLI usage
- Update docs/GLOSSARY.md with agentic evolution and benchmark terms
- Update docs/CONFIGURATION.md with agentic evolution settings
- Update docs/README.md with benchmark guide link

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P1 fixes:
- Fix AgenticComparisonReport constructor call (pass summaries + configs, not raw result objects)
- Fix report.save_markdown → report.save (correct API)
- Fix "final" → "evolved" key in comparison report accuracy lookup

P2 fixes:
- Fix negative improvement formatting (+-5% → -5.0%)
- Extract done_reason from Ollama response instead of hardcoded "stop"
- Fix default model llama3.2:7b → llama3.2:3b (7b doesn't exist)
- Fix docs config example to match actual default (gpt-5)

P3 fixes:
- Catch ValueError for JSON parse errors in OllamaProvider

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
danielbentes (Contributor, Author) left a comment on Mar 28, 2026
Self-Review: PR #8
Issues Found and Fixed (commit bbec2f6)
P1 Fixes (4 — were blocking)
| # | Issue | Fix |
|---|---|---|
| 1 | `AgenticComparisonReport(mode_results)` missing 2nd arg + wrong type | Pass `(mode_summaries, mode_configs_dict)` |
| 2 | `report.save_markdown()` method doesn't exist | Changed to `report.save(str(output_dir))` |
| 3 | `metric_data.get("final")` used, but the key is `"evolved"` | Fixed key name |
| 4 | (Combined with #1) dataclass objects instead of dicts | Used `mode_summaries` |
P2 Fixes (4)
| # | Issue | Fix |
|---|---|---|
| 1 | `+-5.0%` for negative improvements | Use `{pct:+.1f}%` format |
| 2 | `finish_reason` hardcoded to `"stop"` | Extract `done_reason` from Ollama response |
| 3 | Default model `llama3.2:7b` doesn't exist | Changed to `llama3.2:3b` |
| 4 | Docs config says `gpt-4o`, actual default is `gpt-5` | Fixed YAML example |
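To illustrate P2 fix 1: the `+` flag in a Python format spec emits the sign for both positive and negative numbers, so no literal `+` needs to be prepended. The "buggy" variant below is an assumed reconstruction of how `+-5.0%` could have arisen, not the original code.

```python
def format_improvement_buggy(pct: float) -> str:
    """Assumed shape of the old behavior: a literal '+' is always prepended."""
    return f"+{pct}%"


def format_improvement(pct: float) -> str:
    """Let the format spec choose the sign and fix the precision to one decimal."""
    return f"{pct:+.1f}%"


print(format_improvement_buggy(-5.0))  # +-5.0%
print(format_improvement(-5.0))        # -5.0%
print(format_improvement(21.8))        # +21.8%
```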
P3 Fixes (1)
- Catch `ValueError` for JSON parse errors in OllamaProvider
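A rough sketch of how the `done_reason` extraction and the `ValueError` handling might fit together. The function name `parse_ollama_response`, the returned dict shape, the `"unknown"` fallback, and the use of `RuntimeError` are assumptions for illustration; the `response` and `done_reason` keys are assumed to follow the shape of an Ollama generate response. The actual OllamaProvider code may differ.

```python
import json
from typing import Any, Dict


def parse_ollama_response(raw_body: str) -> Dict[str, Any]:
    """Hypothetical parser for a non-streaming Ollama response body."""
    try:
        payload = json.loads(raw_body)
    except ValueError as exc:
        # json.JSONDecodeError is a subclass of ValueError, so this covers parse errors.
        raise RuntimeError(f"Ollama returned invalid JSON: {exc}") from exc

    if not isinstance(payload, dict):
        raise RuntimeError("Ollama response was not a JSON object")

    return {
        "content": payload.get("response", ""),
        # Report the reason the server gave instead of assuming "stop".
        "finish_reason": payload.get("done_reason") or "unknown",
    }
```

Reading `done_reason` from the payload lets callers distinguish, for example, a response cut off by a length limit from one that ended normally.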
Remaining P2 (deferred)
- `agentic_config: Any` type safety: requires dataclass/Pydantic compatibility work
- `check_ollama_model()` substring match: minor, existing behavior
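For context on the deferred substring-match item, the sketch below contrasts substring matching with exact tag matching. Both functions and the installed-model list are hypothetical; the real `check_ollama_model()` may work differently. The only Ollama convention relied on is that a model name without a tag refers to the `latest` tag.

```python
from typing import Iterable


def substring_match(requested: str, installed: Iterable[str]) -> bool:
    """Loose check: true if the requested name appears inside any installed name."""
    return any(requested in name for name in installed)


def exact_match(requested: str, installed: Iterable[str]) -> bool:
    """Stricter check: compare full name:tag strings, defaulting to ':latest'."""
    wanted = requested if ":" in requested else f"{requested}:latest"
    return wanted in set(installed)


# Illustrative list only; these tags are made up for the example.
installed_models = ["llama3.2:3b-custom", "mistral:latest"]

print(substring_match("llama3.2:3b", installed_models))  # True (false positive)
print(exact_match("llama3.2:3b", installed_models))      # False
print(exact_match("mistral", installed_models))          # True
```

The substring check reports a model as available when only a differently tagged variant is installed, which is the kind of mismatch the deferred item refers to.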
Quality Gates
- 389 tests pass
- Lint clean
- All P1s fixed
🤖 Generated with Claude Code
Summary
Extends the benchmark infrastructure to support the hybrid agentic evolution feature (v1.1.0) and adds a comparison benchmark script for evaluating all 3 variation modes.
- `SelfImprovementConfig` and `EvolutionBenchmarkConfig` now accept `AgenticVariationConfig`, threaded to `EvolutionScheduler`
- `--variation-mode`, `--max-inner-iterations`, `--agent-model` added to `run_self_improvement_benchmark.py`
- `run_agentic_comparison.py` runs all 3 modes (single_turn, agentic, adaptive) on the same dataset
- `AgenticComparisonReport` generates side-by-side analysis with winner analysis and recommendations
- `GenerationSnapshot.agentic_stats` captures inner iterations and budget per generation
Usage
New files
- `siare/benchmarks/scripts/run_agentic_comparison.py`
- `siare/benchmarks/reports/agentic_comparison_report.py`
- `tests/unit/test_benchmark_agentic_config.py`
Test plan
- `ruff check siare/`: all checks passed

🤖 Generated with Claude Code