♟️ LLM Chess Evaluation Harness

A sophisticated evaluation framework for testing the chess-playing capabilities of Large Language Models (LLMs) against Stockfish. 🤖 The harness uses a Model Context Protocol (MCP) server to maintain secure, stateful chess games and provides a rich live-updating terminal interface. 🖥️

It supports any OpenAI-compatible API endpoint, allowing you to test OpenAI models, Anthropic models (via proxies), or local LLMs (via vLLM, Ollama, etc.). 🌐

✨ Features

📺 Live Terminal Dashboard: A beautiful, real-time TUI built with rich, displaying the board, move history, side-to-move, and live game status.
🛠️ MCP Backend: Chess state and rules are fully mediated by a local MCP server (chess_mcp_server.py), providing robust move validation and state management.
🐟 Stockfish Integration: Test against Stockfish at configurable ELO levels to estimate the playing strength of your LLM.
🛡️ Robust Error Handling: Automatically detects illegal LLM moves and prompts the LLM to retry, tracking "illegal move attempts" as a metric.
💾 PGN Export: Automatically saves all finished games to standard Portable Game Notation (PGN) files for later review in any chess GUI. Filenames include the model id and Stockfish ELO for easy sorting.
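
The illegal-move retry behaviour described above can be pictured roughly as the loop below. This is only a sketch, not the harness's actual code: `request_legal_move`, `ask_llm`, `is_legal`, and `max_attempts` are hypothetical names, and a plain set of strings stands in for the move validation that the MCP server performs.

```python
from typing import Callable, Optional, Tuple


def request_legal_move(
    ask_llm: Callable[[Optional[str]], str],
    is_legal: Callable[[str], bool],
    max_attempts: int = 3,  # assumed limit, not taken from the harness
) -> Tuple[Optional[str], int]:
    """Ask for a move until a legal one arrives or attempts run out.

    Returns (move, illegal_attempts); move is None if every attempt failed.
    """
    illegal_attempts = 0
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        move = ask_llm(feedback).strip()
        if is_legal(move):
            return move, illegal_attempts
        # Count the failure and tell the model what went wrong on the retry.
        illegal_attempts += 1
        feedback = f"'{move}' is not a legal move here. Please choose another."
    return None, illegal_attempts


# Stand-in for the model: the first reply is illegal, the second is legal.
replies = iter(["e2e5", "e2e4"])
legal_moves = {"e2e4", "d2d4", "g1f3"}

move, illegal = request_legal_move(lambda _fb: next(replies), legal_moves.__contains__)
print(move, illegal)  # e2e4 1
```

Returning the counter alongside the move keeps the "illegal move attempts" metric easy to sum across a whole game.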

📋 Requirements

🐍 Python 3.12+
⚙️ Stockfish: The Stockfish chess engine binary must be installed on your system.
- 🐧 Ubuntu/Debian: sudo apt install stockfish
- 🍎 macOS: brew install stockfish
- 🪟 Windows: Download from the Stockfish website and add it to your PATH, or specify the path via configuration.
🔑 API Key: An API key for your chosen LLM provider (e.g., OpenAI).

🚀 Installation

📥 Clone this repository.
📦 Install the required dependencies:

pip install -r requirements.txt
# or using uv:
# uv sync

🔧 Configure your environment variables. You can create a .env file in the project root:

LLM_API_KEY=your_api_key_here
# Optional:
# LLM_BASE_URL=https://api.your-custom-provider.com/v1
# STOCKFISH_PATH=/path/to/custom/stockfish
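
As a rough illustration of how these variables might be resolved at startup, the sketch below reads them from a mapping such as `os.environ`. The function name `resolve_settings`, the precedence of `LLM_API_KEY` over `OPENAI_API_KEY`, and the `PATH` lookup for Stockfish are assumptions made for the example, not confirmed behaviour of the harness. It also does not load the .env file itself.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect API and engine settings from environment-style variables."""
    env = os.environ if env is None else env
    # Assumed precedence: LLM_API_KEY first, then OPENAI_API_KEY.
    api_key = env.get("LLM_API_KEY") or env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set LLM_API_KEY (or OPENAI_API_KEY) before running.")
    # Fall back to whatever "stockfish" binary is on PATH, if any.
    stockfish = env.get("STOCKFISH_PATH") or shutil.which("stockfish")
    return {
        "api_key": api_key,
        "base_url": env.get("LLM_BASE_URL"),  # None: use the client's default endpoint
        "stockfish_path": stockfish,  # None: no binary was found
    }


settings = resolve_settings(
    {"LLM_API_KEY": "sk-example", "STOCKFISH_PATH": "/usr/games/stockfish"}
)
print(settings["stockfish_path"])  # /usr/games/stockfish
```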

⚡ Direct Execution (No Cloning/Installation Required)

You can run the harness or the MCP server directly from the GitHub repository using uv or uvx without cloning the project.

1. Using `uv run`

Run the evaluation harness command directly:

uv run --with git+https://github.com/PythonicVarun/llm-chess-evaluation-harness.git eval [options]

2. Using `uvx` (or `uv tool run`)

Run the evaluation harness:

uvx --from git+https://github.com/PythonicVarun/llm-chess-evaluation-harness.git eval [options]

Or start the standalone MCP Chess server:

uvx --from git+https://github.com/PythonicVarun/llm-chess-evaluation-harness.git mcp

Note: Make sure your OPENAI_API_KEY (or LLM_API_KEY) and STOCKFISH_PATH environment variables are exported in your terminal before running the direct command.

🎮 Usage

Run the evaluation harness using the main script:

python chess_eval.py

By default, this plays a 3-game match with gpt-4o-mini as White against Stockfish (1320 ELO, the --elo default) as Black. ⚔️

🎛️ Command Line Arguments

You can override the default configuration using CLI arguments:

--model <name>: 🧠 LLM model name (default: gpt-4o-mini).
--reasoning-effort <low|medium|high>: 🧠 Reasoning effort for supported models.
--color <white|black>: 🎨 The color the LLM will play (default: white).
--games <n>: 🔢 Number of games to play in the match (default: 3).
--elo <n>: 📈 Stockfish target ELO (default: 1320).
--base-url <url>: 🔗 Custom OpenAI-compatible endpoint URL.
--api-key <key>: 🗝️ Override the API key explicitly.
--stockfish <path>: 📍 Path to the Stockfish binary.
--output-dir <path>: 📁 Directory to save PGN files (default: pgn_output).
--temperature <n>: 🌡️ Model temperature (default: 0.2).
--log-level <DEBUG|INFO|WARNING|ERROR>: 📝 Set internal logging verbosity (saved to logs/<model>_elo<stockfish-elo>.log).
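
For readers who want to see how these flags fit together, here is a minimal `argparse` sketch that mirrors the documented options and defaults. It is not the parser from chess_eval.py: `build_parser` is a hypothetical name, and the `INFO` default for `--log-level` is an assumption, since no default is documented.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented CLI flags."""
    parser = argparse.ArgumentParser(prog="chess_eval.py")
    parser.add_argument("--model", default="gpt-4o-mini")
    # Values are left unrestricted here; the results table also lists "none".
    parser.add_argument("--reasoning-effort")
    parser.add_argument("--color", choices=["white", "black"], default="white")
    parser.add_argument("--games", type=int, default=3)
    parser.add_argument("--elo", type=int, default=1320)
    parser.add_argument("--base-url")
    parser.add_argument("--api-key")
    parser.add_argument("--stockfish")
    parser.add_argument("--output-dir", default="pgn_output")
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument(
        "--log-level",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        default="INFO",  # assumed default
    )
    return parser


args = build_parser().parse_args(["--model", "my-local-model", "--games", "5"])
print(args.model, args.games, args.elo, args.color)  # my-local-model 5 1320 white
```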

💡 Example: Testing a local model via vLLM

If you are running a local model server compatible with the OpenAI API:

python chess_eval.py --model meta-llama/Meta-Llama-3-8B-Instruct --reasoning-effort medium --base-url http://localhost:8000/v1 --games 5 --elo 1320

📊 Evaluation Results

Below is a summary of the matches played between various LLM models and Stockfish at different target ELO levels. The table shows the number of wins, losses, draws, total games played, and the resulting win rate for each model at each ELO level.

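| Model | Reasoning | Stockfish ELO | Wins | Losses | Draws | Total Games | Win Rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemini-3.5-flash | default | 2000 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.5 | none | 2000 | 0 | 3 | 0 | 3 | 0.0% |
| gemini-3.5-flash | default | 1900 | 0 | 2 | 1 | 3 | 0.0% |
| gpt-5.5 | none | 1900 | 0 | 2 | 1 | 3 | 0.0% |
| gemini-3.5-flash | default | 1800 | 1 | 1 | 1 | 3 | 33.3% |
| gpt-5.5 | none | 1700 | 0 | 2 | 1 | 3 | 0.0% |
| gpt-5.5 | low | 1700 | 0 | 3 | 0 | 3 | 0.0% |
| gemini-3.5-flash | default | 1600 | 1 | 2 | 0 | 3 | 33.3% |
| gemini-3.5-flash | default | 1500 | 2 | 1 | 0 | 3 | 66.7% |
| gpt-5.4-mini | high | 1500 | 0 | 2 | 0 | 2 | 0.0% |
| gpt-5.4-nano | default | 1500 | 0 | 2 | 0 | 2 | 0.0% |
| gpt-5.4-nano | none | 1500 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-nano | low | 1500 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-nano | medium | 1500 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.5 | none | 1500 | 0 | 2 | 1 | 3 | 0.0% |
| gpt-5.5 | low | 1500 | 1 | 1 | 1 | 3 | 33.3% |
| gemini-3.5-flash | default | 1450 | 1 | 2 | 0 | 3 | 33.3% |
| gemini-3.5-flash | default | 1400 | 1 | 1 | 1 | 3 | 33.3% |
| gemini-3.1-flash-lite | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gemini-3.5-flash | default | 1320 | 2 | 1 | 0 | 3 | 66.7% |
| gpt-4.1 | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-4.1-mini | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-4.1-nano | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-4o | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-4o-mini | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4 | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-mini | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-mini | none | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-mini | low | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-mini | medium | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-mini | high | 1320 | 0 | 2 | 1 | 3 | 0.0% |
| gpt-5.4-nano | default | 1320 | 0 | 6 | 0 | 6 | 0.0% |
| gpt-5.4-nano | none | 1320 | 0 | 2 | 1 | 3 | 0.0% |
| gpt-5.4-nano | low | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-nano | medium | 1320 | 0 | 4 | 2 | 6 | 0.0% |
| gpt-5.4-nano | high | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.5 | default | 1320 | 3 | 0 | 0 | 3 | 100.0% |
| gpt-5.5 | none | 1320 | 0 | 1 | 1 | 2 | 0.0% |
| gpt-5.5 | low | 1320 | 1 | 1 | 1 | 3 | 33.3% |
| gpt-5.5 | medium | 1320 | 2 | 0 | 1 | 3 | 66.7% |
| gpt-5.5 | high | 1320 | 1 | 0 | 0 | 1 | 100.0% |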
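
The Win Rate column appears to be wins divided by total games, with draws counted as non-wins (for example, 0 wins, 2 losses, and 1 draw gives 0.0%). A small sketch of that arithmetic, where `win_rate` is a hypothetical helper rather than a function from the harness:

```python
def win_rate(wins: int, losses: int, draws: int) -> str:
    """Format wins / total games as a percentage with one decimal place."""
    total = wins + losses + draws
    if total == 0:
        return "n/a"  # no games played yet
    return f"{100 * wins / total:.1f}%"


print(win_rate(2, 1, 0))  # 66.7%
print(win_rate(0, 2, 1))  # 0.0%
```

With three games per row, a single result moves the rate by about 33 percentage points, so the figures are best read as rough indications.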

🪙 Token & Cost Summarization

At the end of each evaluation match, the harness prints a detailed Token Usage & Cost Summary in your terminal. This shows the total prompt (input) tokens, completion (output) tokens, and the estimated total cost in USD for the model run.

Model costs are stored separately in eval_config.py and are based on the following official developer pricing guides:

OpenAI Models: Pricing is taken from the official OpenAI Developer Pricing.
Google Gemini Models: Pricing is taken from the official Google Gemini API Pricing.
Anthropic Claude Models: Pricing is taken from the official Anthropic Claude Pricing.

For custom or local models where pricing is not known, the estimated cost defaults to $0.00000. You can configure or override custom model costs by editing the model_pricing dictionary in eval_config.py or overriding it in your custom code.
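
The cost estimate can be thought of as a lookup followed by a weighted sum. The sketch below assumes prices are stored as USD per one million tokens in an (input, output) pair; the real layout of `model_pricing` in eval_config.py may differ, and the price and the `estimate_cost` helper shown here are made up for illustration.

```python
from typing import Dict, Tuple

# Hypothetical prices in USD per 1M tokens: (input, output).
model_pricing: Dict[str, Tuple[float, float]] = {
    "example-model": (1.00, 4.00),
}


def estimate_cost(
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    pricing: Dict[str, Tuple[float, float]] = model_pricing,
) -> float:
    """Estimate the cost in USD; models without a price entry cost 0."""
    in_price, out_price = pricing.get(model, (0.0, 0.0))
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000


print(f"${estimate_cost('example-model', 120_000, 8_000):.5f}")   # $0.15200
print(f"${estimate_cost('my-local-model', 120_000, 8_000):.5f}")  # $0.00000
```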

🏗️ Architecture

🎬 chess_eval.py: The orchestrator and UI layer. It starts the MCP server, initializes Stockfish, queries the LLM for moves, and updates the dashboard.
⚙️ chess_mcp_server.py: A stateful script running as an MCP server over stdio. It wraps the python-chess library and holds the game state, exposing tools to read the board (get_board_state), validate moves (validate_move), apply moves (make_move), and export the game (export_pgn).
📝 eval_config.py: Centralized dataclass for evaluation settings and defaults.
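
To make the "centralized dataclass" idea concrete, here is a hedged sketch of what such a settings object could look like. The class name `EvalConfig`, its field names, and the `log_path` helper are assumptions; only the default values and the log filename pattern come from the command line documentation above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical settings container using the documented CLI defaults."""

    model: str = "gpt-4o-mini"
    color: str = "white"
    games: int = 3
    elo: int = 1320
    temperature: float = 0.2
    output_dir: str = "pgn_output"
    base_url: Optional[str] = None

    def log_path(self) -> str:
        # Follows the documented pattern logs/<model>_elo<stockfish-elo>.log
        return f"logs/{self.model}_elo{self.elo}.log"


# Override a few defaults without mutating the base configuration.
config = replace(EvalConfig(), model="my-local-model", games=5)
print(config.log_path())  # logs/my-local-model_elo1320.log
```

Freezing the dataclass is a design choice for the sketch: each match then runs against one immutable set of settings, and overrides produce a new object.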

📜 License

This project is licensed under the MIT License. ⚖️
