A sophisticated evaluation framework for testing the chess-playing capabilities of Large Language Models (LLMs) against Stockfish. 🤖 The harness uses a Model Context Protocol (MCP) server to maintain secure, stateful chess games and provides a rich live-updating terminal interface. 🖥️
It supports any OpenAI-compatible API endpoint, allowing you to test OpenAI models, Anthropic models (via proxies), or local LLMs (via vLLM, Ollama, etc.). 🌐
- 📺 Live Terminal Dashboard: A beautiful, real-time TUI built with `rich`, displaying the board, move history, side-to-move, and live game status.
- 🛠️ MCP Backend: Chess state and rules are fully mediated by a local MCP server (`chess_mcp_server.py`), providing robust move validation and state management.
- 🐟 Stockfish Integration: Test against Stockfish at configurable ELO levels to estimate the playing strength of your LLM.
- 🛡️ Robust Error Handling: Automatically detects illegal LLM moves and prompts the LLM to retry, tracking "illegal move attempts" as a metric.
- 💾 PGN Export: Automatically saves all finished games to standard Portable Game Notation (PGN) files for later review in any chess GUI. Filenames include the model id and Stockfish ELO for easy sorting.
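
The illegal-move handling in the list above can be pictured as a bounded retry loop. The sketch below is only an illustration and not the harness's actual code: `request_move_with_retries`, `ask_llm`, `is_legal`, and the `max_retries` limit are hypothetical names, and in the real harness legality is decided by the MCP server rather than a local callback.

```python
from typing import Callable, Optional, Tuple


def request_move_with_retries(
    ask_llm: Callable[[Optional[str]], str],
    is_legal: Callable[[str], bool],
    max_retries: int = 3,  # hypothetical limit; the README does not state one
) -> Tuple[Optional[str], int]:
    """Ask for a move until a legal one arrives or the retry budget runs out.

    Returns (move, illegal_attempts); move is None if every attempt was illegal.
    """
    illegal_attempts = 0
    feedback: Optional[str] = None  # error text passed back to the model on a retry
    for _ in range(max_retries + 1):  # first attempt plus max_retries retries
        move = ask_llm(feedback).strip()
        if is_legal(move):
            return move, illegal_attempts
        illegal_attempts += 1
        feedback = f"'{move}' is not a legal move in this position. Try again."
    return None, illegal_attempts


# Toy stand-ins: a scripted "model" and a fixed set of legal moves.
replies = iter(["Ke9", "e5", "e4"])
legal_moves = {"e4", "d4", "Nf3"}
move, attempts = request_move_with_retries(
    ask_llm=lambda feedback: next(replies),
    is_legal=lambda m: m in legal_moves,
)
print(move, attempts)  # e4 2
```

Returning the attempt count alongside the move keeps the "illegal move attempts" metric separate from the game logic, so it can be summed per game or per match.
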
- 🐍 Python 3.12+
- ⚙️ Stockfish: The Stockfish chess engine binary must be installed on your system.
  - 🐧 Ubuntu/Debian: `sudo apt install stockfish`
  - 🍎 macOS: `brew install stockfish`
  - 🪟 Windows: Download from the Stockfish website and add it to your `PATH`, or specify the path via configuration.
- 🔑 API Key: An API key for your chosen LLM provider (e.g., OpenAI).
- 📥 Clone this repository.
- 📦 Install the required dependencies:

      pip install -r requirements.txt
      # or using uv:
      # uv sync

- 🔧 Configure your environment variables. You can create a `.env` file in the project root:

      LLM_API_KEY=your_api_key_here
      # Optional:
      # LLM_BASE_URL=https://api.your-custom-provider.com/v1
      # STOCKFISH_PATH=/path/to/custom/stockfish

You can run the harness or the MCP server directly from the GitHub repository using `uv` or `uvx`, without cloning the project.

Run the evaluation harness with `uv run`:

    uv run --with git+https://github.com/PythonicVarun/llm-chess-evaluation-harness.git eval [options]

Or run it with `uvx`:

    uvx --from git+https://github.com/PythonicVarun/llm-chess-evaluation-harness.git eval [options]

Or start the standalone MCP Chess server:

    uvx --from git+https://github.com/PythonicVarun/llm-chess-evaluation-harness.git mcp

Note: Make sure your `OPENAI_API_KEY` (or `LLM_API_KEY`) and `STOCKFISH_PATH` environment variables are exported in your terminal before running the direct command.

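
A quick pre-flight check can catch a missing key or engine binary before a match starts. The following is a minimal sketch, assuming the variable names mentioned in this README; the helpers `resolve_settings` and `missing_settings` are hypothetical, and both the precedence of `LLM_API_KEY` over `OPENAI_API_KEY` and the fallback to searching `PATH` for `stockfish` are assumptions rather than documented behavior.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_settings(env: Mapping[str, str]) -> dict:
    """Collect the settings this README mentions from an environment mapping."""
    # Assumption: LLM_API_KEY is preferred and OPENAI_API_KEY is the fallback.
    api_key = env.get("LLM_API_KEY") or env.get("OPENAI_API_KEY")
    # Assumption: an explicit STOCKFISH_PATH wins; otherwise search PATH.
    stockfish: Optional[str] = env.get("STOCKFISH_PATH") or shutil.which("stockfish")
    return {
        "api_key": api_key,
        "base_url": env.get("LLM_BASE_URL"),  # optional custom endpoint
        "stockfish_path": stockfish,
    }


def missing_settings(settings: dict) -> list:
    """Return the names of required settings that are still unset."""
    return [name for name in ("api_key", "stockfish_path") if not settings[name]]


if __name__ == "__main__":
    problems = missing_settings(resolve_settings(os.environ))
    print("ready" if not problems else f"missing: {', '.join(problems)}")
```

Passing the environment in as a mapping, instead of reading `os.environ` inside the helper, makes the check easy to exercise with a plain dictionary.
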
Run the evaluation harness using the main script:

    python chess_eval.py

By default, this will play a 3-game match using gpt-4o-mini as White against Stockfish (1320 ELO, the `--elo` default) as Black. ⚔️
You can override the default configuration using CLI arguments:

- `--model <name>`: 🧠 LLM model name (default: `gpt-4o-mini`).
- `--reasoning-effort <low|medium|high>`: 🧠 Reasoning effort for supported models.
- `--color <white|black>`: 🎨 The color the LLM will play (default: `white`).
- `--games <n>`: 🔢 Number of games to play in the match (default: `3`).
- `--elo <n>`: 📈 Stockfish target ELO (default: `1320`).
- `--base-url <url>`: 🔗 Custom OpenAI-compatible endpoint URL.
- `--api-key <key>`: 🗝️ Override the API key explicitly.
- `--stockfish <path>`: 📍 Path to the Stockfish binary.
- `--output-dir <path>`: 📁 Directory to save PGN files (default: `pgn_output`).
- `--temperature <n>`: 🌡️ Model temperature (default: `0.2`).
- `--log-level <DEBUG|INFO|WARNING|ERROR>`: 📝 Set internal logging verbosity (saved to `logs/<model>_elo<stockfish-elo>.log`).

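
To make the flags and defaults above concrete, here is a minimal `argparse` sketch that mirrors them. It is an assumption about how such a parser could look, not the code in `chess_eval.py`; `build_parser` is a hypothetical name, and the default for `--log-level` is left unset because it is not documented here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="chess_eval.py")
    parser.add_argument("--model", default="gpt-4o-mini")
    parser.add_argument("--reasoning-effort", choices=["low", "medium", "high"])
    parser.add_argument("--color", choices=["white", "black"], default="white")
    parser.add_argument("--games", type=int, default=3)
    parser.add_argument("--elo", type=int, default=1320)
    parser.add_argument("--base-url")
    parser.add_argument("--api-key")
    parser.add_argument("--stockfish")
    parser.add_argument("--output-dir", default="pgn_output")
    parser.add_argument("--temperature", type=float, default=0.2)
    # The documented choices; the default level is not stated, so none is set.
    parser.add_argument("--log-level", choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser


# Flags that are not passed fall back to the documented defaults.
args = build_parser().parse_args(["--model", "gpt-4o", "--color", "black", "--games", "5"])
print(args.model, args.color, args.games, args.elo)  # gpt-4o black 5 1320
```
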
If you are running a local model server compatible with the OpenAI API:

    python chess_eval.py --model meta-llama/Meta-Llama-3-8B-Instruct --reasoning-effort medium --base-url http://localhost:8000/v1 --games 5 --elo 1320

Below is a summary of the matches played between various LLMs and Stockfish at different target ELO levels. The table shows the number of wins, losses, draws, total games played, and the resulting win rate for each model at each ELO level. The win rate is wins divided by total games, so draws count the same as losses.

| Model | Reasoning | Stockfish ELO | Wins | Losses | Draws | Total Games | Win Rate |
|---|---|---|---|---|---|---|---|
| gemini-3.5-flash | default | 2000 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.5 | none | 2000 | 0 | 3 | 0 | 3 | 0.0% |
| gemini-3.5-flash | default | 1900 | 0 | 2 | 1 | 3 | 0.0% |
| gpt-5.5 | none | 1900 | 0 | 2 | 1 | 3 | 0.0% |
| gemini-3.5-flash | default | 1800 | 1 | 1 | 1 | 3 | 33.3% |
| gpt-5.5 | none | 1700 | 0 | 2 | 1 | 3 | 0.0% |
| gpt-5.5 | low | 1700 | 0 | 3 | 0 | 3 | 0.0% |
| gemini-3.5-flash | default | 1600 | 1 | 2 | 0 | 3 | 33.3% |
| gemini-3.5-flash | default | 1500 | 2 | 1 | 0 | 3 | 66.7% |
| gpt-5.4-mini | high | 1500 | 0 | 2 | 0 | 2 | 0.0% |
| gpt-5.4-nano | default | 1500 | 0 | 2 | 0 | 2 | 0.0% |
| gpt-5.4-nano | none | 1500 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-nano | low | 1500 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-nano | medium | 1500 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.5 | none | 1500 | 0 | 2 | 1 | 3 | 0.0% |
| gpt-5.5 | low | 1500 | 1 | 1 | 1 | 3 | 33.3% |
| gemini-3.5-flash | default | 1450 | 1 | 2 | 0 | 3 | 33.3% |
| gemini-3.5-flash | default | 1400 | 1 | 1 | 1 | 3 | 33.3% |
| gemini-3.1-flash-lite | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gemini-3.5-flash | default | 1320 | 2 | 1 | 0 | 3 | 66.7% |
| gpt-4.1 | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-4.1-mini | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-4.1-nano | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-4o | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-4o-mini | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4 | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-mini | default | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-mini | none | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-mini | low | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-mini | medium | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-mini | high | 1320 | 0 | 2 | 1 | 3 | 0.0% |
| gpt-5.4-nano | default | 1320 | 0 | 6 | 0 | 6 | 0.0% |
| gpt-5.4-nano | none | 1320 | 0 | 2 | 1 | 3 | 0.0% |
| gpt-5.4-nano | low | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.4-nano | medium | 1320 | 0 | 4 | 2 | 6 | 0.0% |
| gpt-5.4-nano | high | 1320 | 0 | 3 | 0 | 3 | 0.0% |
| gpt-5.5 | default | 1320 | 3 | 0 | 0 | 3 | 100.0% |
| gpt-5.5 | none | 1320 | 0 | 1 | 1 | 2 | 0.0% |
| gpt-5.5 | low | 1320 | 1 | 1 | 1 | 3 | 33.3% |
| gpt-5.5 | medium | 1320 | 2 | 0 | 1 | 3 | 66.7% |
| gpt-5.5 | high | 1320 | 1 | 0 | 0 | 1 | 100.0% |
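
The Win Rate column can be reproduced from the other columns. The sketch below shows the arithmetic with a hypothetical `win_rate` helper; the `score_rate` variant, which counts a draw as half a point in the usual chess scoring convention, is included only for contrast and is not a column in the table.

```python
def win_rate(wins: int, losses: int, draws: int) -> float:
    """Percentage of games won; draws count the same as losses."""
    total = wins + losses + draws
    if total == 0:
        raise ValueError("no games played")
    return round(100.0 * wins / total, 1)


def score_rate(wins: int, losses: int, draws: int) -> float:
    """Percentage of points scored, with a draw worth half a point."""
    total = wins + losses + draws
    if total == 0:
        raise ValueError("no games played")
    return round(100.0 * (wins + 0.5 * draws) / total, 1)


print(win_rate(2, 1, 0))    # 66.7
print(win_rate(1, 1, 1))    # 33.3
print(score_rate(0, 2, 1))  # 16.7
```

With only one to six games per row, these percentages move in large steps, so small differences between rows should be read with caution.
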
At the end of each evaluation match, the harness prints a detailed Token Usage & Cost Summary in your terminal. This shows the total prompt (input) tokens, completion (output) tokens, and the estimated total cost in USD for the model run.
Per-model prices are stored in `eval_config.py` and are based on the following official developer pricing guides:
- OpenAI Models: Pricing is retrieved from the official OpenAI Developer Pricing.
- Google Gemini Models: Pricing is retrieved from the official Google Gemini API Pricing.
- Anthropic Claude Models: Pricing is retrieved from the official Anthropic Claude Pricing.
For custom or local models where pricing is not known, the estimated cost defaults to $0.00000. You can configure or override custom model costs by editing the `model_pricing` dictionary in `eval_config.py` or overriding it in your custom code.
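
The cost estimate is a weighted sum of token counts. The sketch below is a guess at the shape of that calculation, not the project's code: the layout of `model_pricing` (USD per one million input and output tokens) is an assumption, `estimate_cost` is a hypothetical helper, and the prices are placeholders rather than real rates.

```python
from typing import Dict, Tuple

# Hypothetical layout: model id -> (input USD, output USD) per 1M tokens.
# The numbers are placeholders, not real provider prices.
model_pricing: Dict[str, Tuple[float, float]] = {
    "example-model": (1.00, 4.00),
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of a run; unknown models cost 0.0."""
    input_price, output_price = model_pricing.get(model, (0.0, 0.0))
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000


print(f"${estimate_cost('example-model', 120_000, 8_000):.5f}")   # $0.15200
print(f"${estimate_cost('my-local-model', 120_000, 8_000):.5f}")  # $0.00000
```

Falling back to a zero price for unknown models matches the documented behavior for custom or local models.
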
- 🎬 `chess_eval.py`: The orchestrator and UI layer. It starts the MCP server, initializes Stockfish, queries the LLM for moves, and updates the dashboard.
- ⚙️ `chess_mcp_server.py`: A standalone script running as an MCP server over stdio. It holds the game state and wraps the `python-chess` library, exposing tools to read the board (`get_board_state`), validate moves (`validate_move`), apply moves (`make_move`), and export the game (`export_pgn`).
- 📝 `eval_config.py`: Centralized dataclass for evaluation settings and defaults.

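
As an illustration of the centralized settings dataclass, here is a minimal sketch that reuses the documented CLI defaults. The class name `EvalConfig`, its field names, and the `log_path` helper are hypothetical; replacing slashes in model ids is likewise an assumption, made so that an id such as `meta-llama/Meta-Llama-3-8B-Instruct` yields a single file name.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class EvalConfig:
    """Hypothetical settings container using the documented CLI defaults."""

    model: str = "gpt-4o-mini"
    color: str = "white"
    games: int = 3
    elo: int = 1320
    temperature: float = 0.2
    output_dir: str = "pgn_output"
    reasoning_effort: Optional[str] = None

    def log_path(self) -> Path:
        # Follows the documented pattern logs/<model>_elo<stockfish-elo>.log.
        # Assumption: slashes in the model id are replaced with underscores.
        safe_model = self.model.replace("/", "_")
        return Path("logs") / f"{safe_model}_elo{self.elo}.log"


config = EvalConfig(model="meta-llama/Meta-Llama-3-8B-Instruct", games=5)
print(config.log_path().as_posix())  # logs/meta-llama_Meta-Llama-3-8B-Instruct_elo1320.log
```
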
This project is licensed under the MIT License. ⚖️