CompMath-MCQ is a benchmark dataset of 1,528 multiple-choice questions designed to evaluate LLMs on graduate-level computational mathematics. All questions are original: they were written by university professors and are not drawn from existing textbooks or online repositories, which guards against data leakage.
Each question has three answer choices with exactly one correct answer, which enables fully automatic, deterministic evaluation with the lm_eval library.
| Topic | Description |
|---|---|
| Linear Algebra | Matrix norms, eigenvalues, definiteness, decompositions |
| Numerical Optimization | Convergence, gradient methods, constrained optimization |
| Vector Calculus | Gradients, divergence, Jacobians, integral theorems |
| Probability | Distributions, expectation, conditional probability, Bayes |
| Python | NumPy, SciPy, scientific computing idioms |
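Because every question has exactly one correct choice out of three and carries a topic label, scoring reduces to comparing predicted indices with gold indices, overall and per topic. The sketch below is a minimal illustration, not the lm_eval scoring code: it assumes predictions are already available as integer indices (0, 1, or 2), the helper name `accuracy_by_subtopic` is hypothetical, and the toy records are made up. The field names `correct_label` and `subtopic` follow the dataset schema.

```python
from collections import defaultdict

CHANCE_ACCURACY = 1 / 3  # uniform guessing over three choices


def accuracy_by_subtopic(records, predictions):
    """Return (overall_accuracy, {subtopic: accuracy}).

    records: iterable of dicts with 'correct_label' and 'subtopic' keys.
    predictions: iterable of predicted indices, aligned with records.
    """
    records = list(records)
    predictions = list(predictions)
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    if not records:
        raise ValueError("need at least one record")

    correct = defaultdict(int)
    total = defaultdict(int)
    for rec, pred in zip(records, predictions):
        total[rec["subtopic"]] += 1
        correct[rec["subtopic"]] += int(pred == rec["correct_label"])

    overall = sum(correct.values()) / len(records)
    per_topic = {topic: correct[topic] / total[topic] for topic in total}
    return overall, per_topic


# Toy, made-up records for illustration only (not real dataset rows).
toy_records = [
    {"correct_label": 0, "subtopic": "Linear Algebra"},
    {"correct_label": 2, "subtopic": "Linear Algebra"},
    {"correct_label": 1, "subtopic": "Probability"},
    {"correct_label": 1, "subtopic": "Python"},
]
toy_predictions = [0, 1, 1, 1]

overall, per_topic = accuracy_by_subtopic(toy_records, toy_predictions)
print(f"overall={overall:.2f} (chance={CHANCE_ACCURACY:.2f})")
print(per_topic)
```

With three choices per question, any accuracy close to 1/3 is indistinguishable from guessing, so the chance level is a useful reference when reading per-topic numbers.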
The easiest way to use CompMath-MCQ is to load it directly from Hugging Face Datasets:

```python
from datasets import load_dataset

dataset = load_dataset("biancaraimondi/CompMath-MCQ", split="test")

# Browse a sample
print(dataset[0])
# {
#   'question': 'Given the matrix A = ..., compute the 2-norm and 1-norm of A.',
#   'options': ['\\(\\|A\\|_2 = 4,\\ \\|A\\|_1 = 4\\)', ...],
#   'correct_label': 0,
#   'subtopic': 'Linear Algebra'
# }
```

You can also load it with pandas:
```python
import pandas as pd

# Reading hf:// paths requires the huggingface_hub package to be installed.
df = pd.read_json(
    "hf://datasets/biancaraimondi/CompMath-MCQ/data.json"
)

print(df.head())
# Number of questions per topic
print(df["subtopic"].value_counts())
```

| Field | Type | Description |
|---|---|---|
| `question` | string | The question text (LaTeX-formatted) |
| `options` | list[string] | 3 answer choices (LaTeX-formatted) |
| `correct_label` | int | Index of the correct answer (0, 1, or 2) |
| `subtopic` | string | One of: Linear Algebra, Numerical Optimization, Vector Calculus, Probability, Python |
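The schema above can be checked and turned into a prompt with a few lines of plain Python. This is a sketch under stated assumptions: the helper names `validate_record` and `format_prompt`, the A/B/C lettering, and the prompt layout are illustrative choices and are not taken from the task definition in `my_mcq_task.yaml`; the sample record is made up.

```python
VALID_SUBTOPICS = {
    "Linear Algebra",
    "Numerical Optimization",
    "Vector Calculus",
    "Probability",
    "Python",
}
CHOICE_LETTERS = "ABC"  # assumed lettering for the three options


def validate_record(record):
    """Raise ValueError if the record does not match the documented schema."""
    if not isinstance(record.get("question"), str):
        raise ValueError("'question' must be a string")
    options = record.get("options")
    if not (
        isinstance(options, list)
        and len(options) == 3
        and all(isinstance(opt, str) for opt in options)
    ):
        raise ValueError("'options' must be a list of 3 strings")
    label = record.get("correct_label")
    if isinstance(label, bool) or label not in (0, 1, 2):
        raise ValueError("'correct_label' must be 0, 1, or 2")
    if record.get("subtopic") not in VALID_SUBTOPICS:
        raise ValueError("unknown 'subtopic'")


def format_prompt(record):
    """Render the question and its lettered options as one prompt string."""
    lines = [record["question"]]
    for letter, option in zip(CHOICE_LETTERS, record["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


# Made-up record for illustration only (not a real dataset row).
sample = {
    "question": "What is the rank of the 2x2 identity matrix?",
    "options": ["1", "2", "0"],
    "correct_label": 1,
    "subtopic": "Linear Algebra",
}

validate_record(sample)
prompt = format_prompt(sample)
gold_letter = CHOICE_LETTERS[sample["correct_label"]]
print(prompt)
print("gold:", gold_letter)
```

Running `validate_record` over every row before an evaluation is a cheap way to catch a malformed export early, since a wrong `correct_label` would otherwise silently distort accuracy.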
CompMath-MCQ is designed for plug-and-play evaluation with the Language Model Evaluation Harness (lm_eval).

```bash
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Copy the task files into your lm_eval installation:
```bash
# Find your lm_eval tasks directory
TASK_DIR=$(python -c "import lm_eval; import os; print(os.path.join(os.path.dirname(lm_eval.__file__), 'tasks'))")

# Create the custom task folder and copy files
mkdir -p "$TASK_DIR/my_custom_task"
cp my_eval_task/mcq_lm_eval_data.jsonl "$TASK_DIR/my_custom_task/"
cp my_eval_task/my_mcq_task.yaml "$TASK_DIR/my_custom_task/"
```

Use the provided script or call lm_eval directly:
```bash
# Using the provided script (edit model paths inside first)
bash test_script.sh

# Or run directly (replace the pretrained= value and output path for other models)
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3-8B \
    --tasks my_mcq_task \
    --output_path results/llama3-8b \
    --batch_size auto
```

Results are saved to results/{model_name}/.
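When several models are evaluated, it can help to assemble the command programmatically so that each run writes to its own `results/{model_name}/` folder. The sketch below is a hypothetical helper (`build_lm_eval_command` is not part of this repository or of lm_eval); it only reassembles the flags shown in the command above and does not launch anything by itself.

```python
import shlex


def build_lm_eval_command(model_id, model_name, task="my_mcq_task"):
    """Return the lm_eval invocation as an argument list.

    Hypothetical helper: it reuses only the flags from the example
    command and directs output to results/<model_name>.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", task,
        "--output_path", f"results/{model_name}",
        "--batch_size", "auto",
    ]


# Model ID and output name taken from the example command above.
cmd = build_lm_eval_command("meta-llama/Llama-3-8B", "llama3-8b")
print(shlex.join(cmd))

# To actually launch the run (requires lm_eval and the task files installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting issues when a model ID or path contains unusual characters.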
```
CompMath-MCQ/
├── README.md
├── requirements.txt
├── test_script.sh                  # Evaluation runner script
├── my_eval_task/
│   ├── mcq_lm_eval_data.jsonl      # Dataset in lm_eval format
│   └── my_mcq_task.yaml            # lm_eval task definition
└── ...
```
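Given this layout, the step of copying the task files into lm_eval can also be done from Python, which avoids shell differences on Windows. This is a minimal sketch, not a script shipped with the repository: the function names `find_lm_eval_tasks_dir` and `install_task_files` are hypothetical, and the target folder name `my_custom_task` mirrors the shell commands in the setup instructions.

```python
import importlib.util
import shutil
from pathlib import Path

TASK_FILES = ("mcq_lm_eval_data.jsonl", "my_mcq_task.yaml")


def find_lm_eval_tasks_dir():
    """Locate the 'tasks' folder of the installed lm_eval package."""
    spec = importlib.util.find_spec("lm_eval")
    if spec is None or spec.origin is None:
        raise RuntimeError("lm_eval is not installed in this environment")
    # spec.origin points at lm_eval/__init__.py; its parent is the package dir.
    return Path(spec.origin).parent / "tasks"


def install_task_files(source_dir, tasks_dir, folder_name="my_custom_task"):
    """Copy the task files from source_dir into tasks_dir/folder_name."""
    source_dir = Path(source_dir)
    target = Path(tasks_dir) / folder_name
    target.mkdir(parents=True, exist_ok=True)  # equivalent of `mkdir -p`
    copied = []
    for name in TASK_FILES:
        src = source_dir / name
        if not src.is_file():
            raise FileNotFoundError(f"missing task file: {src}")
        copied.append(Path(shutil.copy(src, target / name)))
    return copied


# Usage, run from the repository root with lm_eval installed:
# for path in install_task_files("my_eval_task", find_lm_eval_tasks_dir()):
#     print(f"copied {path}")
```

Keeping the directory lookup separate from the copy makes it possible to try the copy against a scratch directory before touching the real lm_eval installation.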
If you use CompMath-MCQ in your research, please cite:
```bibtex
@article{raimondi2026compmath,
  title   = {The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?},
  author  = {Raimondi, Bianca and Pivi, Francesco and Evangelista, Davide and Gabbrielli, Maurizio},
  journal = {arXiv preprint arXiv:2603.03334},
  year    = {2026}
}
```

This dataset is released under a Creative Commons license. See the Hugging Face dataset card for full details.