CompMath-MCQ

Are LLMs Ready for Higher-Level Math?

CompMath-MCQ is a benchmark dataset of 1,528 multiple-choice questions designed to evaluate LLMs on graduate-level computational mathematics. All questions were written from scratch by university professors rather than drawn from existing textbooks or online repositories, which is intended to keep them out of public training corpora and avoid data leakage.

Each question provides 3 answer choices with exactly one correct answer, enabling fully automatic and deterministic evaluation via the lm_eval library.
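Because every question has exactly one correct index out of three, scoring reduces to an exact-match comparison. The snippet below is a minimal sketch of that idea, not the metric code used by lm_eval; the function name `accuracy` and the example predictions are made up for illustration.

```python
from typing import Sequence

NUM_CHOICES = 3  # each CompMath-MCQ question offers three options
CHANCE_ACCURACY = 1 / NUM_CHOICES  # expected score of uniform random guessing


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predicted option indices that match the correct labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty set of questions")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical predictions for four questions (not real model output).
labels = [0, 2, 1, 1]
predictions = [0, 1, 1, 1]
print(f"accuracy = {accuracy(predictions, labels):.2f}")
print(f"chance   = {CHANCE_ACCURACY:.2f}")
```

With three options per question, a model that guesses uniformly at random is expected to score about one third, so reported accuracies are best read against that baseline.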

Topics

Topic	Description
Linear Algebra	Matrix norms, eigenvalues, definiteness, decompositions
Numerical Optimization	Convergence, gradient methods, constrained optimization
Vector Calculus	Gradients, divergence, Jacobians, integral theorems
Probability	Distributions, expectation, conditional probability, Bayes
Python	NumPy, SciPy, scientific computing idioms

Quick Start

Load from Hugging Face

The easiest way to use CompMath-MCQ is to load it directly from Hugging Face Datasets:

from datasets import load_dataset
 
dataset = load_dataset("biancaraimondi/CompMath-MCQ", split="test")
 
# Browse a sample
print(dataset[0])
# {
#   'question': 'Given the matrix A = ..., compute the 2-norm and 1-norm of A.',
#   'options': ['\\(\\|A\\|_2 = 4,\\ \\|A\\|_1 = 4\\)', ...],
#   'correct_label': 0,
#   'subtopic': 'Linear Algebra'
# }

You can also load it with pandas:

import pandas as pd
 
df = pd.read_json(
    "hf://datasets/biancaraimondi/CompMath-MCQ/data.json"
)
print(df.head())
print(df["subtopic"].value_counts())

Dataset Schema

Field	Type	Description
`question`	`string`	The question text (LaTeX-formatted)
`options`	`list[string]`	3 answer choices (LaTeX-formatted)
`correct_label`	`int`	Index of the correct answer (0, 1, or 2)
`subtopic`	`string`	One of: Linear Algebra, Numerical Optimization, Vector Calculus, Probability, Python
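To make the schema concrete, here is a small sketch that checks a record against the fields above and renders it as a lettered prompt. The prompt layout, the helper names (`validate`, `format_prompt`), and the sample record are assumptions for illustration; the template actually used for evaluation is defined in `my_mcq_task.yaml` and may differ.

```python
from typing import TypedDict


class McqRecord(TypedDict):
    question: str
    options: list[str]
    correct_label: int
    subtopic: str


LETTERS = "ABC"  # one letter per option, in index order


def validate(record: McqRecord) -> None:
    """Raise ValueError if the record does not match the documented schema."""
    if len(record["options"]) != len(LETTERS):
        raise ValueError("expected exactly 3 options")
    if record["correct_label"] not in range(len(LETTERS)):
        raise ValueError("correct_label must be 0, 1, or 2")


def format_prompt(record: McqRecord) -> str:
    """Render the question and its options as a plain-text prompt."""
    validate(record)
    lines = [record["question"]]
    lines += [f"{letter}. {option}" for letter, option in zip(LETTERS, record["options"])]
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical record, written for this example (not taken from the dataset).
sample: McqRecord = {
    "question": "What is the rank of the 2x2 identity matrix?",
    "options": ["2", "1", "0"],
    "correct_label": 0,
    "subtopic": "Linear Algebra",
}

print(format_prompt(sample))
print("Gold answer:", LETTERS[sample["correct_label"]])
```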

Evaluation with `lm_eval`

CompMath-MCQ is designed for plug-and-play evaluation with the Language Model Evaluation Harness.

Setup

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
 
# Install dependencies
pip install -r requirements.txt

Register the Custom Task

Copy the task files into your lm_eval installation:

# Find your lm_eval tasks directory
TASK_DIR=$(python -c "import lm_eval; import os; print(os.path.join(os.path.dirname(lm_eval.__file__), 'tasks'))")
 
# Create the custom task folder and copy files
mkdir -p "$TASK_DIR/my_custom_task"
cp my_eval_task/mcq_lm_eval_data.jsonl "$TASK_DIR/my_custom_task/"
cp my_eval_task/my_mcq_task.yaml "$TASK_DIR/my_custom_task/"
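If you prefer a platform-independent alternative to the shell commands (for example on Windows), the same copy step can be sketched in Python. The helper `register_task` is hypothetical and not part of this repository or of lm_eval; it only mirrors the `mkdir` and `cp` commands above.

```python
import shutil
from pathlib import Path

# The two task files shipped in my_eval_task/.
TASK_FILES = ("mcq_lm_eval_data.jsonl", "my_mcq_task.yaml")


def register_task(source_dir: Path, tasks_dir: Path,
                  task_name: str = "my_custom_task") -> Path:
    """Copy the task files from source_dir into tasks_dir/task_name."""
    target = tasks_dir / task_name
    target.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    for filename in TASK_FILES:
        shutil.copy2(source_dir / filename, target / filename)
    return target


# Example usage, assuming lm_eval is installed and you run this from the
# repository root:
#
#   import lm_eval
#   tasks_dir = Path(lm_eval.__file__).parent / "tasks"
#   print(register_task(Path("my_eval_task"), tasks_dir))
```

Passing the tasks directory as an argument keeps the helper easy to try against a temporary directory before touching a real installation.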

Run Evaluation

Use the provided script or call lm_eval directly:

# Using the provided script (edit model paths inside first)
bash test_script.sh
 
# Or run directly
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B \
    --tasks my_mcq_task \
    --output_path results/llama3-8b \
    --batch_size auto

Results are saved to results/{model_name}/.

Repository Structure

CompMath-MCQ/
├── README.md
├── requirements.txt
├── test_script.sh              # Evaluation runner script
├── my_eval_task/
│   ├── mcq_lm_eval_data.jsonl  # Dataset in lm_eval format
│   └── my_mcq_task.yaml        # lm_eval task definition
└── ...

Citation

If you use CompMath-MCQ in your research, please cite:

@article{raimondi2026compmath,
  title   = {The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?},
  author  = {Raimondi, Bianca and Pivi, Francesco and Evangelista, Davide and Gabbrielli, Maurizio},
  journal = {arXiv preprint arXiv:2603.03334},
  year    = {2026}
}

License

This dataset is released under a Creative Commons license. See the Hugging Face dataset card for full details.
