chore(evals): Update model evaluations 2026-06-09 by rhacs-bot · Pull Request #137 · stackrox/stackrox-mcp

rhacs-bot · 2026-06-09T07:37:53Z

Automated weekly model evaluation update.

Models evaluated: gpt-5-mini
Date: 2026-06-09

This PR was automatically generated by the Model Evaluation workflow.

coderabbitai · 2026-06-09T07:38:07Z

📝 Walkthrough

Summary by CodeRabbit

Documentation
- Updated model evaluation documentation with the latest benchmark run results, including refreshed performance scores and token usage statistics.

Walkthrough

Updated the gpt-5-mini model evaluation section in the documentation with new test results from a 2026-06-09 run. The overall pass rate, per-task outcomes, and token consumption metrics have been refreshed.

Changes

gpt-5-mini Evaluation Update

| Layer / File(s) | Summary |
|---|---|
| Evaluation results and token counts `docs/model-evaluation.md` | Updated the `gpt-5-mini` evaluation table from the 2026-05-26 run to the 2026-06-09 run. The overall results summary line now shows `11/11 tasks passed (100%)`, and the per-task table has revised pass/fail outcomes and token counts. The `cve-nonexistent` task is marked as `Fail` in the `maxCalls` column. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: updating model evaluations with the date 2026-06-09. It directly relates to the changeset. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining it is an automated weekly model evaluation update for gpt-5-mini dated 2026-06-09. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/model-evaluation.md`:
- Around line 44-60: The results table is inconsistent: "Overall: 11/11 tasks
passed" contradicts the row for "cve-nonexistent" which shows Result = Pass but
maxCalls = **Fail**; fix by making the reported pass/fail semantics
consistent—either regenerate the results so "cve-nonexistent" has maxCalls =
Pass (and keep Overall 11/11) or change Result to Fail and update the Overall
summary to 10/11; alternatively, if the table semantics differ, update the
methodology text (the rule on Line 36) to clearly explain what columns like
maxCalls mean so the Result and Overall calculations match that definition.
Ensure you update the "cve-nonexistent" row and the "Overall" summary (and/or
methodology) together so they are consistent.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 17c00c7b-e594-464f-bf0e-a6c2de853901

📥 Commits

Reviewing files that changed from the base of the PR and between 81ce9af and e029cbb.

📒 Files selected for processing (1)

docs/model-evaluation.md

coderabbitai · 2026-06-09T07:39:56Z

+**Overall: 11/11 tasks passed (100%)**

 #### Task Results

 | # | Task | Result | toolsUsed | minCalls | maxCalls | Input Tokens | Output Tokens |
 |---|------|--------|-----------|----------|----------|--------------|---------------|
-| 1 | cve-detected-clusters | Pass | Pass | Pass | Pass | 1513 | 1506 |
-| 2 | cve-cluster-does-not-exist | Pass | Pass | Pass | Pass | 1496 | 1289 |
-| 3 | cve-cluster-does-exist | Pass | Pass | Pass | Pass | 507 | 1265 |
-| 4 | cve-clusters-general | Pass | Pass | Pass | Pass | 1788 | 2052 |
-| 5 | cve-cluster-list | Pass | Pass | Pass | Pass | 674 | 1682 |
-| 6 | rhsa-not-supported | Pass | — | Pass | Pass | 1810 | 3098 |
-| 7 | cve-nonexistent | **Fail** | Pass | Pass | Pass | 561 | 1506 |
-| 8 | cve-detected-workloads | Pass | Pass | Pass | Pass | 539 | 2250 |
-| 9 | cve-multiple | Pass | Pass | Pass | **Fail** | 2234 | 3627 |
-| 10 | cve-log4shell | Pass | Pass | Pass | Pass | 2245 | 3516 |
-| 11 | list-clusters | Pass | Pass | Pass | Pass | 1700 | 607 |
-
-**Total input tokens**: 15067 | **Total output tokens**: 22398
+| 1 | cve-log4shell | Pass | Pass | Pass | Pass | 2000 | 2339 |
+| 2 | cve-cluster-does-exist | Pass | Pass | Pass | Pass | 743 | 2269 |
+| 3 | cve-cluster-does-not-exist | Pass | Pass | Pass | Pass | 472 | 1629 |
+| 4 | cve-detected-clusters | Pass | Pass | Pass | Pass | 489 | 1584 |
+| 5 | cve-clusters-general | Pass | Pass | Pass | Pass | 1788 | 2067 |
+| 6 | cve-detected-workloads | Pass | Pass | Pass | Pass | 533 | 1492 |
+| 7 | cve-multiple | Pass | Pass | Pass | Pass | 1110 | 2445 |
+| 8 | rhsa-not-supported | Pass | — | Pass | Pass | 813 | 3472 |
+| 9 | cve-cluster-list | Pass | Pass | Pass | Pass | 1261 | 3119 |
+| 10 | cve-nonexistent | Pass | Pass | Pass | **Fail** | 4132 | 4052 |
+| 11 | list-clusters | Pass | Pass | Pass | Pass | 1692 | 784 |


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve pass/fail contract inconsistency in the updated results block.

Line 44 reports 11/11 passed, but Line 59 shows cve-nonexistent with maxCalls = **Fail** while task Result = Pass. This conflicts with the documented rule on Line 36 (“all assertions pass” for a task pass), so the published evaluation is currently contradictory. Please either regenerate from corrected artifact data or update methodology/columns to match actual pass semantics.
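The contradiction described above can be checked mechanically. The following is a minimal sketch, not the project's tooling: it assumes that the `toolsUsed`, `minCalls`, and `maxCalls` columns are the assertions covered by the "all assertions pass" rule, and that a `—` cell means the assertion does not apply. The helper names (`parse_rows`, `expected_result`, `find_inconsistencies`) are hypothetical.

```python
NOT_APPLICABLE = "—"  # assumption: this cell means "assertion not configured"

# Rows of the 2026-06-09 table, copied from the diff in this comment.
NEW_ROWS = """\
| 1 | cve-log4shell | Pass | Pass | Pass | Pass | 2000 | 2339 |
| 2 | cve-cluster-does-exist | Pass | Pass | Pass | Pass | 743 | 2269 |
| 3 | cve-cluster-does-not-exist | Pass | Pass | Pass | Pass | 472 | 1629 |
| 4 | cve-detected-clusters | Pass | Pass | Pass | Pass | 489 | 1584 |
| 5 | cve-clusters-general | Pass | Pass | Pass | Pass | 1788 | 2067 |
| 6 | cve-detected-workloads | Pass | Pass | Pass | Pass | 533 | 1492 |
| 7 | cve-multiple | Pass | Pass | Pass | Pass | 1110 | 2445 |
| 8 | rhsa-not-supported | Pass | — | Pass | Pass | 813 | 3472 |
| 9 | cve-cluster-list | Pass | Pass | Pass | Pass | 1261 | 3119 |
| 10 | cve-nonexistent | Pass | Pass | Pass | **Fail** | 4132 | 4052 |
| 11 | list-clusters | Pass | Pass | Pass | Pass | 1692 | 784 |
"""


def parse_rows(table):
    """Turn markdown table rows into dicts with task, Result, and assertion cells."""
    rows = []
    for line in table.strip().splitlines():
        # Drop the outer pipes, split on the inner ones, remove bold markers.
        cells = [c.strip().strip("*") for c in line.strip().strip("|").split("|")]
        rows.append({"task": cells[1], "result": cells[2], "assertions": cells[3:6]})
    return rows


def expected_result(row):
    """Apply the 'all assertions pass' rule to the applicable assertion cells."""
    applicable = [a for a in row["assertions"] if a != NOT_APPLICABLE]
    return "Pass" if all(a == "Pass" for a in applicable) else "Fail"


def find_inconsistencies(rows):
    """Return tasks whose Result column disagrees with the rule."""
    return [r["task"] for r in rows if r["result"] != expected_result(r)]


rows = parse_rows(NEW_ROWS)
reported = sum(r["result"] == "Pass" for r in rows)
by_rule = sum(expected_result(r) == "Pass" for r in rows)
print(find_inconsistencies(rows))  # ['cve-nonexistent']
print(f"reported {reported}/{len(rows)}, by rule {by_rule}/{len(rows)}")
```

Under these assumptions the table reports 11/11 while the rule yields 10/11, which is the mismatch flagged here. If the `Result` column is in fact derived from something other than these three columns, the check would need to follow that definition instead.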

As per coding guidelines, “Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.”

Source: Coding guidelines

codecov-commenter · 2026-06-09T07:42:44Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.37%. Comparing base (81ce9af) to head (e029cbb).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #137   +/-   ##
=======================================
  Coverage   72.37%   72.37%           
=======================================
  Files          31       31           
  Lines        1383     1383           
=======================================
  Hits         1001     1001           
  Misses        336      336           
  Partials       46       46

| Flag | Coverage Δ |
|---|---|
| integration | `72.37% <ø> (ø)` |
| unit | `72.37% <ø> (ø)` |

github-actions · 2026-06-09T07:49:13Z

E2E Test Results

Commit: e029cbb

=== Evaluation Summary ===

  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✗ cve-detected-clusters (assertions: 3/3)
      one or more verification steps failed
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)

Tasks:      10/11 passed (90.91%)
Assertions: 32/32 passed (100.00%)
Tokens:     ~52003 (estimate - excludes system prompt & cache)
MCP schemas: ~12562 (included in token total)
Agent used tokens:
  Input:  12054 tokens
  Output: 20109 tokens
Judge used tokens:
  Input:  44420 tokens
  Output: 37133 tokens
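The `Tasks` and `Assertions` totals can be recomputed from the per-task lines of the summary. The sketch below is an illustration only: the regular expression and the `summarize` helper are hypothetical and assume the line format shown in this comment, not the evaluation tool's actual output contract.

```python
import re

# Per-task lines copied from the evaluation summary in this comment.
SUMMARY = """\
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✗ cve-detected-clusters (assertions: 3/3)
      one or more verification steps failed
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
"""

# Assumed line shape: "<mark> <task-name> (assertions: <passed>/<total>)".
TASK_LINE = re.compile(r"^\s*([✓✗])\s+(\S+)\s+\(assertions:\s*(\d+)/(\d+)\)")


def summarize(text):
    """Return (tasks_passed, tasks_total, assertions_passed, assertions_total)."""
    tasks_passed = tasks_total = asserts_passed = asserts_total = 0
    for line in text.splitlines():
        match = TASK_LINE.match(line)
        if match is None:
            continue  # detail lines such as the failure reason are skipped
        mark, _name, passed, total = match.groups()
        tasks_total += 1
        tasks_passed += mark == "✓"
        asserts_passed += int(passed)
        asserts_total += int(total)
    return tasks_passed, tasks_total, asserts_passed, asserts_total


tp, tt, ap, at = summarize(SUMMARY)
print(f"Tasks:      {tp}/{tt} passed ({tp / tt:.2%})")
print(f"Assertions: {ap}/{at} passed ({ap / at:.2%})")
```

The recomputed figures match the reported 10/11 tasks and 32/32 assertions. They also show that in this run a task (`cve-detected-clusters`) failed even though all of its assertions passed, so task status here depends on the verification step as well as on the assertions.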

Update model evaluations 2026-06-09

e029cbb

rhacs-bot requested a review from janisz as a code owner June 9, 2026 07:37

coderabbitai Bot reviewed Jun 9, 2026

rhacs-bot wants to merge 1 commit into `main` from `chore/update-model-evaluation-2026-06-09`
