chore(evals): Update model evaluations 2026-06-09 #137

Open
rhacs-bot wants to merge 1 commit into main from chore/update-model-evaluation-2026-06-09

@rhacs-bot

Automated weekly model evaluation update.

Models evaluated: gpt-5-mini
Date: 2026-06-09

This PR was automatically generated by the Model Evaluation workflow.

rhacs-bot requested a review from janisz as a code owner June 9, 2026 07:37

coderabbitai Bot commented Jun 9, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Documentation
    • Updated model evaluation documentation with the latest benchmark run results, including refreshed performance scores and token usage statistics.

Walkthrough

Updated the gpt-5-mini model evaluation section in the documentation with new test results from a 2026-06-09 run. The overall pass rate, per-task outcomes, and token consumption metrics have been refreshed.

Changes

gpt-5-mini Evaluation Update

| Layer / File(s) | Summary |
|-----------------|---------|
| Evaluation results and token counts (`docs/model-evaluation.md`) | Replaced the 2026-05-26 gpt-5-mini run with the 2026-06-09 run. The overall summary line (11/11 tasks passed, 100%) and the per-task table carry revised pass/fail outcomes and token counts. The cve-nonexistent task is marked Fail in the maxCalls column. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: updating model evaluations with the date 2026-06-09. It directly relates to the changeset. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining it is an automated weekly model evaluation update for gpt-5-mini dated 2026-06-09. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/model-evaluation.md`:
- Around line 44-60: The results table is inconsistent: "Overall: 11/11 tasks
passed" contradicts the row for "cve-nonexistent" which shows Result = Pass but
maxCalls = **Fail**; fix by making the reported pass/fail semantics
consistent—either regenerate the results so "cve-nonexistent" has maxCalls =
Pass (and keep Overall 11/11) or change Result to Fail and update the Overall
summary to 10/11; alternatively, if the table semantics differ, update the
methodology text (the rule on Line 36) to clearly explain what columns like
maxCalls mean so the Result and Overall calculations match that definition.
Ensure you update the "cve-nonexistent" row and the "Overall" summary (and/or
methodology) together so they are consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 17c00c7b-e594-464f-bf0e-a6c2de853901

📥 Commits

Reviewing files that changed from the base of the PR and between 81ce9af and e029cbb.

📒 Files selected for processing (1)
  • docs/model-evaluation.md

Comment thread docs/model-evaluation.md
Comment on lines +44 to +60
**Overall: 11/11 tasks passed (100%)**

#### Task Results

| # | Task | Result | toolsUsed | minCalls | maxCalls | Input Tokens | Output Tokens |
|---|------|--------|-----------|----------|----------|--------------|---------------|
| 1 | cve-detected-clusters | Pass | Pass | Pass | Pass | 1513 | 1506 |
| 2 | cve-cluster-does-not-exist | Pass | Pass | Pass | Pass | 1496 | 1289 |
| 3 | cve-cluster-does-exist | Pass | Pass | Pass | Pass | 507 | 1265 |
| 4 | cve-clusters-general | Pass | Pass | Pass | Pass | 1788 | 2052 |
| 5 | cve-cluster-list | Pass | Pass | Pass | Pass | 674 | 1682 |
| 6 | rhsa-not-supported | Pass | | Pass | Pass | 1810 | 3098 |
| 7 | cve-nonexistent | **Fail** | Pass | Pass | Pass | 561 | 1506 |
| 8 | cve-detected-workloads | Pass | Pass | Pass | Pass | 539 | 2250 |
| 9 | cve-multiple | Pass | Pass | Pass | **Fail** | 2234 | 3627 |
| 10 | cve-log4shell | Pass | Pass | Pass | Pass | 2245 | 3516 |
| 11 | list-clusters | Pass | Pass | Pass | Pass | 1700 | 607 |

**Total input tokens**: 15067 | **Total output tokens**: 22398
| 1 | cve-log4shell | Pass | Pass | Pass | Pass | 2000 | 2339 |
| 2 | cve-cluster-does-exist | Pass | Pass | Pass | Pass | 743 | 2269 |
| 3 | cve-cluster-does-not-exist | Pass | Pass | Pass | Pass | 472 | 1629 |
| 4 | cve-detected-clusters | Pass | Pass | Pass | Pass | 489 | 1584 |
| 5 | cve-clusters-general | Pass | Pass | Pass | Pass | 1788 | 2067 |
| 6 | cve-detected-workloads | Pass | Pass | Pass | Pass | 533 | 1492 |
| 7 | cve-multiple | Pass | Pass | Pass | Pass | 1110 | 2445 |
| 8 | rhsa-not-supported | Pass | | Pass | Pass | 813 | 3472 |
| 9 | cve-cluster-list | Pass | Pass | Pass | Pass | 1261 | 3119 |
| 10 | cve-nonexistent | Pass | Pass | Pass | **Fail** | 4132 | 4052 |
| 11 | list-clusters | Pass | Pass | Pass | Pass | 1692 | 784 |
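
The totals line in this excerpt can be cross-checked against the rows. The sketch below sums the two token columns of the first eleven rows (the block that precedes the totals line). Pairing the totals with that block is an inference from the arithmetic, not something the page states, and the helper name `token_totals` is hypothetical.

```python
# Rows copied verbatim from the block that precedes the totals line.
ROWS = """\
| 1 | cve-detected-clusters | Pass | Pass | Pass | Pass | 1513 | 1506 |
| 2 | cve-cluster-does-not-exist | Pass | Pass | Pass | Pass | 1496 | 1289 |
| 3 | cve-cluster-does-exist | Pass | Pass | Pass | Pass | 507 | 1265 |
| 4 | cve-clusters-general | Pass | Pass | Pass | Pass | 1788 | 2052 |
| 5 | cve-cluster-list | Pass | Pass | Pass | Pass | 674 | 1682 |
| 6 | rhsa-not-supported | Pass | | Pass | Pass | 1810 | 3098 |
| 7 | cve-nonexistent | **Fail** | Pass | Pass | Pass | 561 | 1506 |
| 8 | cve-detected-workloads | Pass | Pass | Pass | Pass | 539 | 2250 |
| 9 | cve-multiple | Pass | Pass | Pass | **Fail** | 2234 | 3627 |
| 10 | cve-log4shell | Pass | Pass | Pass | Pass | 2245 | 3516 |
| 11 | list-clusters | Pass | Pass | Pass | Pass | 1700 | 607 |
"""


def token_totals(rows: str) -> tuple[int, int]:
    """Sum the last two columns (input tokens, output tokens) of each row."""
    total_in = total_out = 0
    for line in rows.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        total_in += int(cells[-2])
        total_out += int(cells[-1])
    return total_in, total_out


print(token_totals(ROWS))  # (15067, 22398)
```

The result matches the "Total input tokens" and "Total output tokens" figures shown with that block.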

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve pass/fail contract inconsistency in the updated results block.

Line 44 reports 11/11 passed, but Line 59 shows cve-nonexistent with maxCalls = **Fail** while task Result = Pass. This conflicts with the documented rule on Line 36 (“all assertions pass” for a task pass), so the published evaluation is currently contradictory. Please either regenerate from corrected artifact data or update methodology/columns to match actual pass semantics.
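
The contradiction described above can be detected mechanically. The following is a minimal sketch, not the project's tooling: it assumes the `toolsUsed`, `minCalls`, and `maxCalls` columns are the per-task assertions, that an empty cell means "not applicable", and that bold markers only add emphasis. The function names are hypothetical.

```python
# Columns treated as per-task assertions (names taken from the table header).
ASSERTION_COLUMNS = ("toolsUsed", "minCalls", "maxCalls")

# Three rows copied from the 2026-06-09 table, enough to exercise the check.
TABLE = """
| # | Task | Result | toolsUsed | minCalls | maxCalls | Input Tokens | Output Tokens |
|---|------|--------|-----------|----------|----------|--------------|---------------|
| 8 | rhsa-not-supported | Pass | | Pass | Pass | 813 | 3472 |
| 10 | cve-nonexistent | Pass | Pass | Pass | **Fail** | 4132 | 4052 |
| 11 | list-clusters | Pass | Pass | Pass | Pass | 1692 | 784 |
"""


def parse_rows(table: str) -> list[dict[str, str]]:
    """Turn a Markdown table into one dict per data row, dropping bold markers."""
    lines = [ln.strip() for ln in table.strip().splitlines()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # lines[1] is the |---| separator
        cells = [cell.strip().strip("*") for cell in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def inconsistent_tasks(rows: list[dict[str, str]]) -> list[str]:
    """Tasks reported as Pass although at least one assertion column says Fail."""
    return [
        row["Task"]
        for row in rows
        if row["Result"] == "Pass"
        and any(row[col] == "Fail" for col in ASSERTION_COLUMNS)
    ]


print(inconsistent_tasks(parse_rows(TABLE)))  # ['cve-nonexistent']
```

A check like this could run before the results are committed, so that a row whose Result disagrees with its assertion columns is caught before publication.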

As per coding guidelines, “Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.”

Source: Coding guidelines

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.37%. Comparing base (81ce9af) to head (e029cbb).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #137   +/-   ##
=======================================
  Coverage   72.37%   72.37%           
=======================================
  Files          31       31           
  Lines        1383     1383           
=======================================
  Hits         1001     1001           
  Misses        336      336           
  Partials       46       46           
| Flag | Coverage Δ |
|------|------------|
| integration | 72.37% <ø> (ø) |
| unit | 72.37% <ø> (ø) |

Flags with carried forward coverage won't be shown.
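
The 72.37% figure can be reproduced from the Hits and Lines rows, where Lines equals Hits + Misses + Partials (1001 + 336 + 46 = 1383). The ratio 1001/1383 is about 72.379%, so the reported value only matches if the percentage is truncated to two decimals rather than rounded to nearest. The sketch below assumes that round-down behaviour; the report itself does not state it, and `coverage_percent` is a hypothetical helper.

```python
def coverage_percent(hits: int, lines: int, *, round_down: bool = True) -> float:
    """Coverage as a percentage with two decimals, using integer arithmetic.

    round_down=True truncates (assumed to match the report);
    round_down=False rounds half up, for comparison.
    """
    scaled = hits * 10_000  # work in hundredths of a percent
    if round_down:
        hundredths = scaled // lines
    else:
        hundredths = (2 * scaled + lines) // (2 * lines)
    return hundredths / 100


hits, misses, partials = 1001, 336, 46
lines = hits + misses + partials  # 1383, as in the Lines row

print(coverage_percent(hits, lines))                    # 72.37
print(coverage_percent(hits, lines, round_down=False))  # 72.38
```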

github-actions Bot commented Jun 9, 2026

E2E Test Results

Commit: e029cbb
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✗ cve-detected-clusters (assertions: 3/3)
      one or more verification steps failed
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)

Tasks:      10/11 passed (90.91%)
Assertions: 32/32 passed (100.00%)
Tokens:     ~52003 (estimate - excludes system prompt & cache)
MCP schemas: ~12562 (included in token total)
Agent used tokens:
  Input:  12054 tokens
  Output: 20109 tokens
Judge used tokens:
  Input:  44420 tokens
  Output: 37133 tokens
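
The task and assertion percentages in the summary follow directly from the per-task lines. The sketch below re-derives them; it assumes the `✓`/`✗` line format shown above, and the names `TASK_LINE`, `summarize`, and `percent` are hypothetical.

```python
import re

# Per-task lines copied from the evaluation summary above.
SUMMARY = """\
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✗ cve-detected-clusters (assertions: 3/3)
      one or more verification steps failed
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
"""

# mark, task name, assertions passed, assertions total
TASK_LINE = re.compile(r"^\s*([✓✗])\s+(\S+)\s+\(assertions:\s*(\d+)/(\d+)\)")


def summarize(text: str) -> dict[str, int]:
    """Count passed/total tasks and assertions; lines that do not match are skipped."""
    counts = {"tasks_passed": 0, "tasks": 0, "asserts_passed": 0, "asserts": 0}
    for line in text.splitlines():
        match = TASK_LINE.match(line)
        if match is None:
            continue
        mark, _name, passed, total = match.groups()
        counts["tasks"] += 1
        counts["tasks_passed"] += 1 if mark == "✓" else 0
        counts["asserts_passed"] += int(passed)
        counts["asserts"] += int(total)
    return counts


def percent(passed: int, total: int) -> str:
    return f"{100 * passed / total:.2f}"


c = summarize(SUMMARY)
print(f"Tasks: {c['tasks_passed']}/{c['tasks']} ({percent(c['tasks_passed'], c['tasks'])}%)")
print(f"Assertions: {c['asserts_passed']}/{c['asserts']} ({percent(c['asserts_passed'], c['asserts'])}%)")
```

Note that cve-detected-clusters is marked failed with all 3 of 3 assertions passing, so in this run a task's outcome is not determined by its assertion count alone.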

3 participants