30 changes: 15 additions & 15 deletions docs/model-evaluation.md
@@ -39,27 +39,27 @@ A task passes when **all** its assertions pass **and** the LLM judge approves th

<!-- model:gpt-5-mini start -->

-### gpt-5-mini — 2026-05-26
+### gpt-5-mini — 2026-06-09

-**Overall: 10/11 tasks passed (90%)**
+**Overall: 11/11 tasks passed (100%)**

#### Task Results

| # | Task | Result | toolsUsed | minCalls | maxCalls | Input Tokens | Output Tokens |
|---|------|--------|-----------|----------|----------|--------------|---------------|
-| 1 | cve-detected-clusters | Pass | Pass | Pass | Pass | 1513 | 1506 |
-| 2 | cve-cluster-does-not-exist | Pass | Pass | Pass | Pass | 1496 | 1289 |
-| 3 | cve-cluster-does-exist | Pass | Pass | Pass | Pass | 507 | 1265 |
-| 4 | cve-clusters-general | Pass | Pass | Pass | Pass | 1788 | 2052 |
-| 5 | cve-cluster-list | Pass | Pass | Pass | Pass | 674 | 1682 |
-| 6 | rhsa-not-supported | Pass | | Pass | Pass | 1810 | 3098 |
-| 7 | cve-nonexistent | **Fail** | Pass | Pass | Pass | 561 | 1506 |
-| 8 | cve-detected-workloads | Pass | Pass | Pass | Pass | 539 | 2250 |
-| 9 | cve-multiple | Pass | Pass | Pass | **Fail** | 2234 | 3627 |
-| 10 | cve-log4shell | Pass | Pass | Pass | Pass | 2245 | 3516 |
-| 11 | list-clusters | Pass | Pass | Pass | Pass | 1700 | 607 |
-
-**Total input tokens**: 15067 | **Total output tokens**: 22398
+| 1 | cve-log4shell | Pass | Pass | Pass | Pass | 2000 | 2339 |
+| 2 | cve-cluster-does-exist | Pass | Pass | Pass | Pass | 743 | 2269 |
+| 3 | cve-cluster-does-not-exist | Pass | Pass | Pass | Pass | 472 | 1629 |
+| 4 | cve-detected-clusters | Pass | Pass | Pass | Pass | 489 | 1584 |
+| 5 | cve-clusters-general | Pass | Pass | Pass | Pass | 1788 | 2067 |
+| 6 | cve-detected-workloads | Pass | Pass | Pass | Pass | 533 | 1492 |
+| 7 | cve-multiple | Pass | Pass | Pass | Pass | 1110 | 2445 |
+| 8 | rhsa-not-supported | Pass | | Pass | Pass | 813 | 3472 |
+| 9 | cve-cluster-list | Pass | Pass | Pass | Pass | 1261 | 3119 |
+| 10 | cve-nonexistent | Pass | Pass | Pass | **Fail** | 4132 | 4052 |
+| 11 | list-clusters | Pass | Pass | Pass | Pass | 1692 | 784 |
Comment on lines +44 to +60


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve pass/fail contract inconsistency in the updated results block.

Line 44 reports 11/11 passed, but Line 59 shows cve-nonexistent with maxCalls = **Fail** while task Result = Pass. This conflicts with the documented rule on Line 36 (“all assertions pass” for a task pass), so the published evaluation is currently contradictory. Please either regenerate from corrected artifact data or update methodology/columns to match actual pass semantics.
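
One way to catch this class of contradiction before publishing is a small consistency check over the results table. The following is a minimal sketch, not the project's actual tooling: the function names `parse_rows` and `inconsistent_tasks` are hypothetical, the sample rows are copied from the updated table, and it assumes an empty assertion cell (as in `rhsa-not-supported`) means "not applicable" rather than a failure.

```python
# Hypothetical consistency check for the Markdown results table.
# Assumption: the columns toolsUsed, minCalls and maxCalls are the
# per-task assertions, and an empty cell means "not applicable".

ASSERTION_COLUMNS = ("toolsUsed", "minCalls", "maxCalls")

SAMPLE_TABLE = """
| # | Task | Result | toolsUsed | minCalls | maxCalls | Input Tokens | Output Tokens |
|---|------|--------|-----------|----------|----------|--------------|---------------|
| 1 | cve-log4shell | Pass | Pass | Pass | Pass | 2000 | 2339 |
| 8 | rhsa-not-supported | Pass | | Pass | Pass | 813 | 3472 |
| 10 | cve-nonexistent | Pass | Pass | Pass | **Fail** | 4132 | 4052 |
"""


def parse_rows(table: str) -> list[dict]:
    """Turn a Markdown pipe table into a list of {header: cell} dicts."""
    lines = [ln.strip() for ln in table.strip().splitlines()
             if ln.strip().startswith("|")]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the header row and the |---| separator
        # Drop surrounding whitespace and Markdown bold markers (**Fail**).
        cells = [cell.strip().strip("*") for cell in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def inconsistent_tasks(rows: list[dict]) -> list[str]:
    """Return tasks marked Pass although at least one assertion column is Fail."""
    return [
        row["Task"]
        for row in rows
        if row["Result"] == "Pass"
        and any(row[col] == "Fail" for col in ASSERTION_COLUMNS)
    ]


print(inconsistent_tasks(parse_rows(SAMPLE_TABLE)))  # ['cve-nonexistent']
```

A check like this could run on the generated block before it is committed. It covers only the assertion columns, since the judge verdict is not visible in the table.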

As per coding guidelines, “Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/model-evaluation.md` around lines 44 - 60, The results table is
inconsistent: "Overall: 11/11 tasks passed" contradicts the row for
"cve-nonexistent" which shows Result = Pass but maxCalls = **Fail**; fix by
making the reported pass/fail semantics consistent—either regenerate the results
so "cve-nonexistent" has maxCalls = Pass (and keep Overall 11/11) or change
Result to Fail and update the Overall summary to 10/11; alternatively, if the
table semantics differ, update the methodology text (the rule on Line 36) to
clearly explain what columns like maxCalls mean so the Result and Overall
calculations match that definition. Ensure you update the "cve-nonexistent" row
and the "Overall" summary (and/or methodology) together so they are consistent.

Source: Coding guidelines


+**Total input tokens**: 15033 | **Total output tokens**: 25252

<!-- model:gpt-5-mini end -->
