Kai Xiong1*, Yanwei Huang2*, Rongjunchen Zhang1♠, Kun Chen1, Haipang Wu1, Yingcai Wu3
1HiThink Research
2HKUST
3Zhejiang University
*Equal Contribution ♠Corresponding Author
ACL 2026 Findings
[API Docs] | [Tutorials] | [Benchmark] | [Evaluation Toolkit]
Overview of the PuzzleClone framework.
- 🧭 Overview
- 📦 PC-83K Benchmark
- 📊 Benchmark Results
- 🔄 Data Synthesis Pipeline
- 🛠️ Quick Start
- 📚 Documentation
- ⚖️ License
- 📚 Citation
PuzzleClone is a data synthesis framework and comprehensive dataset for logical reasoning problems. It features:
- ✅ Guaranteed Verifiability: Every problem is generated with a ground-truth solution and is formally verifiable via a symbolic solver or deterministic program execution, ensuring correctness.
- 🎯 Granular Control: Offers fine-grained control over problem attributes like scale, structure, and difficulty through a set of adjustable parameters, enabling large-scale batch generation.
- ✨ Flexible Adaptation: Facilitates the easy customization of problem scenarios and translation into different languages or domains.
- 📊 Expansive and Diverse Coverage: Based on PuzzleClone, we have curated a benchmark including 83,657 unique logical reasoning puzzles procedurally generated from 86 seed questions. The dataset spans:
- Various applications of Satisfiability Modulo Theories (SMT) and SMT-like puzzles,
- Classic logical puzzles like Sudoku, the Knapsack problem, and linear optimization (LP).
- Diverse mathematical problems of varying difficulties.
- 🚀 State-of-the-Art Performance: Achieves SOTA results among open-source datasets, outperforming the public dataset by 18.4 points on SATBench (from 51.6 to 70.0).
Applying PuzzleClone, we construct PC-83K, a benchmark covering 83,657 unique logical reasoning puzzles. The generated puzzles span Satisfiability Modulo Theories (SMT), SMT-like reasoning tasks, classic puzzles such as Sudoku and knapsack, linear optimization, and diverse mathematical problems.
| Split | SFT | RL-Train | RL-Val | Total Train | Test |
|---|---|---|---|---|---|
| Normal | 2,161 | 50,738 | 430 | 51,168 | 5,730 |
| Hard | 2,139 | 23,616 | 430 | 24,046 | 2,713 |
| Sum | 4,300 | 74,354 | 860 | 75,214 | 8,443 |
Puzzle difficulty distribution before and after deduplication.
Current LLMs still show large gaps on complex logical reasoning. On PC-83K, stronger reasoning models achieve substantially higher accuracy, while post-training on PC-83K improves Qwen2.5-7B-Instruct from 14.5 to 66.0 average accuracy.
Baseline Performance on PC-83K (Click to Expand)
| Model | Normal | Hard | Avg. |
|---|---|---|---|
| ChatGPT-4o | 31.7 | 24.6 | 28.2 |
| ChatGPT-o3 | 87.1 | 83.4 | 85.3 |
| ChatGPT-5 | 91.1 | 86.3 | 88.7 |
| Gemini-2.0-flash | 42.0 | 31.6 | 36.8 |
| Gemini-2.5-pro | 75.8 | 67.2 | 71.5 |
| Gemini-3-pro | 86.5 | 83.0 | 84.8 |
| Claude-3.5-sonnet | 37.6 | 27.4 | 32.5 |
| Claude-4-sonnet | 62.7 | 47.8 | 55.3 |
| Seed1.6 | 87.8 | 82.4 | 85.1 |
| GLM-Z1-9B-0414 | 63.6 | 53.5 | 58.6 |
| GLM-Z1-32B-0414 | 71.1 | 60.9 | 66.0 |
| Qwen2.5-7B-Instruct | 16.8 | 12.1 | 14.5 |
| Qwen2.5-14B-Instruct | 24.3 | 17.9 | 21.1 |
| Qwen2.5-32B-Instruct | 31.4 | 23.5 | 27.4 |
| Qwen2.5-72B-Instruct | 32.8 | 25.3 | 29.0 |
| Qwen3-8B | 71.6 | 59.4 | 65.5 |
| Qwen3-14B | 78.6 | 67.0 | 72.8 |
| Qwen3-32B | 77.0 | 68.1 | 72.5 |
| Qwen3-235B-A22B | 82.9 | 73.8 | 78.3 |
| DeepSeek-R1-Distill-Qwen-14B | 47.9 | 38.4 | 43.1 |
| DeepSeek-R1-Distill-Qwen-32B | 53.3 | 43.2 | 48.3 |
| DeepSeek-R1-0528-Qwen3-8B | 76.0 | 66.8 | 71.4 |
| DeepSeek-R1-0528 | 88.7 | 82.6 | 85.6 |
Post-Training Results (Click to Expand)
| Model | PC-83K Normal | PC-83K Hard | PC-SL-35K | SATBench | BBEH-mini | AIME24 | AIME25 | AMC2023 | MATH500 | OlympiadBench |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 16.8 | 12.1 | 9.6 | 51.6 | 11.3 | 13.3 | 6.7 | 52.5 | 75.2 | 41.0 |
| SFT | 61.9 | 48.0 | 14.7 | 70.0 | 9.8 | 20.0 | 13.3 | 67.5 | 80.8 | 43.4 |
| RL (PC-83K) | 71.0 | 61.0 | 15.2 | 62.0 | 17.0 | 16.7 | 13.3 | 65.0 | 80.0 | 44.4 |
| SynLogic-7B | - | - | - | - | 8.0 | 10.0 | - | 55.0 | 71.8 | - |
| RL (PC-SL-35K) | 22.0 | 14.3 | 55.3 | 58.4 | 16.5 | 23.3 | 10.0 | 62.5 | 79.8 | 42.4 |
| RL (PC-83K+PC-SL-35K) | 64.8 | 54.1 | 54.2 | 57.2 | 17.0 | 16.7 | 16.7 | 60.0 | 80.4 | 52.2 |
Average accuracy of evaluated models on the PuzzleClone test set, grouped by seed puzzle.
PuzzleClone synthesizes data through three stages: puzzle encoding, puzzle generation, and config-based validation. Each seed puzzle is manually encoded into a DSL specification and a config file. The generator then produces randomized configs, renders new puzzle instances, computes reference answers, and validates correctness through deterministic reproduction.
The data synthesis pipeline of PuzzleClone.
git clone https://github.com/HiThink-Research/PuzzleClone.git
cd PuzzleClone
pip install -r requirements.txtRun the translator in test mode to generate a sample question from a specification file:
python translator.py -t path/to/spec.yamlThe generated data ({spec_name}_data.jsonl) and debugging files such as {spec_name}_synthesizer.py are written to temp/.
Run the translator in deployment mode to generate many puzzle instances:
python translator.py -d path/to/spec.yaml -o data.jsonlIf -o is omitted, the output is saved to output/{spec_name}_data.jsonl.
Use -g to load existing puzzle configs and render them with a new specification:
python translator.py -d path/to/new_spec.yaml -g old_data.jsonl -o new_data.jsonlScripts in data_processing_scripts/ transform generated data into standard benchmark formats. See data_processing_scripts/README.md for details.
Use PolyhedronEvaluator for benchmark evaluation.
- API documentation: https://puzzleclone.github.io/PuzzleClone/api/index.html
- Tutorials: https://puzzleclone.github.io/PuzzleClone/tutorial/
- Documentation build notes:
create_docs.md
This project is licensed under the Apache 2.0 License. See LICENSE for details.
If you find PuzzleClone useful, please cite:
@inproceedings{xiong2026puzzleclone,
title = {PuzzleClone: A DSL-Powered Framework for Synthesizing Verifiable Data},
author = {Xiong, Kai and Huang, Yanwei and Zhang, Rongjunchen and Chen, Kun and Wu, Haipang and Wu, Yingcai},
booktitle = {ACL 2026 Findings},
year = {2026},
url = {https://github.com/HiThink-Research/PuzzleClone}
}


