PuzzleClone: A DSL-Powered Framework for Synthesizing Verifiable Data

Kai Xiong1*, Yanwei Huang2*, Rongjunchen Zhang1♠, Kun Chen1, Haipang Wu1, Yingcai Wu3

1HiThink Research    2HKUST    3Zhejiang University
*Equal Contribution    ♠Corresponding Author

ACL 2026 Findings

[API Docs] | [Tutorials] | [Benchmark] | [Evaluation Toolkit]


Overview of the PuzzleClone framework.



🧭 Overview

PuzzleClone is a data synthesis framework and comprehensive dataset for logical reasoning problems. It features:

  • Guaranteed Verifiability: Every problem is generated with a ground-truth solution and is formally verifiable via a symbolic solver or deterministic program execution, ensuring correctness.
  • 🎯 Granular Control: Offers fine-grained control over problem attributes like scale, structure, and difficulty through a set of adjustable parameters, enabling large-scale batch generation.
  • Flexible Adaptation: Facilitates the easy customization of problem scenarios and translation into different languages or domains.
  • 📊 Expansive and Diverse Coverage: Based on PuzzleClone, we have curated a benchmark including 83,657 unique logical reasoning puzzles procedurally generated from 86 seed questions. The dataset spans:
    • Various applications of Satisfiability Modulo Theories (SMT) and SMT-like puzzles,
    • Classic logical puzzles like Sudoku, the knapsack problem, and linear optimization (LP), and
    • Diverse mathematical problems of varying difficulty.
  • 🚀 State-of-the-Art Performance: Achieves SOTA results among open-source datasets. Supervised fine-tuning (SFT) on PuzzleClone data raises Qwen2.5-7B-Instruct from 51.6 to 70.0 on SATBench, a gain of 18.4 points.

📦 PC-83K Benchmark

Applying PuzzleClone, we construct PC-83K, a benchmark covering 83,657 unique logical reasoning puzzles. The generated puzzles span Satisfiability Modulo Theories (SMT), SMT-like reasoning tasks, classic puzzles such as Sudoku and knapsack, linear optimization, and diverse mathematical problems.

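| Split  | SFT   | RL-Train | RL-Val | Total Train | Test  |
|--------|-------|----------|--------|-------------|-------|
| Normal | 2,161 | 50,738   | 430    | 51,168      | 5,730 |
| Hard   | 2,139 | 23,616   | 430    | 24,046      | 2,713 |
| Sum    | 4,300 | 74,354   | 860    | 75,214      | 8,443 |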
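
The split sizes can be cross-checked with a few lines of arithmetic. The sketch below assumes that "Total Train" is RL-Train plus RL-Val and that Total Train plus Test gives the benchmark size; both relations are inferred from the numbers in the table rather than stated explicitly. The SFT column does not enter this sum, and how the SFT subset relates to the other splits is not specified here.

```python
# Split sizes copied from the PC-83K table above.
splits = {
    "Normal": {"sft": 2161, "rl_train": 50738, "rl_val": 430,
               "total_train": 51168, "test": 5730},
    "Hard": {"sft": 2139, "rl_train": 23616, "rl_val": 430,
             "total_train": 24046, "test": 2713},
}


def check_split(row):
    # Inferred relation (an assumption): Total Train = RL-Train + RL-Val.
    return row["rl_train"] + row["rl_val"] == row["total_train"]


total_train = sum(row["total_train"] for row in splits.values())
total_test = sum(row["test"] for row in splits.values())

# Inferred relation (an assumption): benchmark size = Total Train + Test.
benchmark_size = total_train + total_test
print(total_train, total_test, benchmark_size)
```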

Puzzle difficulty distribution before and after deduplication.


📊 Benchmark Results

Current LLMs still show large gaps on complex logical reasoning. On PC-83K, stronger reasoning models achieve substantially higher accuracy, and RL post-training on PC-83K improves Qwen2.5-7B-Instruct from 14.5 to 66.0 in average accuracy (the mean of the Normal and Hard splits).

Baseline Performance on PC-83K

| Model                        | Normal | Hard | Avg. |
|------------------------------|--------|------|------|
| ChatGPT-4o                   | 31.7   | 24.6 | 28.2 |
| ChatGPT-o3                   | 87.1   | 83.4 | 85.3 |
| ChatGPT-5                    | 91.1   | 86.3 | 88.7 |
| Gemini-2.0-flash             | 42.0   | 31.6 | 36.8 |
| Gemini-2.5-pro               | 75.8   | 67.2 | 71.5 |
| Gemini-3-pro                 | 86.5   | 83.0 | 84.8 |
| Claude-3.5-sonnet            | 37.6   | 27.4 | 32.5 |
| Claude-4-sonnet              | 62.7   | 47.8 | 55.3 |
| Seed1.6                      | 87.8   | 82.4 | 85.1 |
| GLM-Z1-9B-0414               | 63.6   | 53.5 | 58.6 |
| GLM-Z1-32B-0414              | 71.1   | 60.9 | 66.0 |
| Qwen2.5-7B-Instruct          | 16.8   | 12.1 | 14.5 |
| Qwen2.5-14B-Instruct         | 24.3   | 17.9 | 21.1 |
| Qwen2.5-32B-Instruct         | 31.4   | 23.5 | 27.4 |
| Qwen2.5-72B-Instruct         | 32.8   | 25.3 | 29.0 |
| Qwen3-8B                     | 71.6   | 59.4 | 65.5 |
| Qwen3-14B                    | 78.6   | 67.0 | 72.8 |
| Qwen3-32B                    | 77.0   | 68.1 | 72.5 |
| Qwen3-235B-A22B              | 82.9   | 73.8 | 78.3 |
| DeepSeek-R1-Distill-Qwen-14B | 47.9   | 38.4 | 43.1 |
| DeepSeek-R1-Distill-Qwen-32B | 53.3   | 43.2 | 48.3 |
| DeepSeek-R1-0528-Qwen3-8B    | 76.0   | 66.8 | 71.4 |
| DeepSeek-R1-0528             | 88.7   | 82.6 | 85.6 |

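Post-Training Results

| Model                 | PC-83K Normal | PC-83K Hard | PC-SL-35K | SATBench | BBEH-mini | AIME24 | AIME25 | AMC2023 | MATH500 | OlympiadBench |
|-----------------------|---------------|-------------|-----------|----------|-----------|--------|--------|---------|---------|---------------|
| Qwen2.5-7B-Instruct   | 16.8          | 12.1        | 9.6       | 51.6     | 11.3      | 13.3   | 6.7    | 52.5    | 75.2    | 41.0          |
| SFT                   | 61.9          | 48.0        | 14.7      | 70.0     | 9.8       | 20.0   | 13.3   | 67.5    | 80.8    | 43.4          |
| RL (PC-83K)           | 71.0          | 61.0        | 15.2      | 62.0     | 17.0      | 16.7   | 13.3   | 65.0    | 80.0    | 44.4          |
| SynLogic-7B           | -             | -           | -         | -        | 8.0       | 10.0   | -      | 55.0    | 71.8    | -             |
| RL (PC-SL-35K)        | 22.0          | 14.3        | 55.3      | 58.4     | 16.5      | 23.3   | 10.0   | 62.5    | 79.8    | 42.4          |
| RL (PC-83K+PC-SL-35K) | 64.8          | 54.1        | 54.2      | 57.2     | 17.0      | 16.7   | 16.7   | 60.0    | 80.4    | 52.2          |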

Average accuracy of evaluated models on the PuzzleClone test set, grouped by seed puzzle.


🔄 Data Synthesis Pipeline

PuzzleClone synthesizes data through three stages: puzzle encoding, puzzle generation, and config-based validation. Each seed puzzle is manually encoded into a DSL specification and a config file. The generator then produces randomized configs, renders new puzzle instances, computes reference answers, and validates correctness through deterministic reproduction.

The data synthesis pipeline of PuzzleClone.
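
To make the generate-and-validate loop concrete, here is a minimal toy sketch in plain Python. It does not use PuzzleClone's DSL or config format: the puzzle (counting pairs that sum to a target), the config schema, and all function names are hypothetical and chosen only to illustrate the idea of drawing a randomized config, rendering an instance, computing a reference answer, and validating the record by deterministically reproducing it from the stored config.

```python
import itertools
import json
import random


def make_config(rng, n_items=5):
    """Draw a randomized config (hypothetical schema, not PuzzleClone's)."""
    values = rng.sample(range(1, 30), n_items)
    target = sum(rng.sample(values, 2))  # guarantees at least one valid pair
    return {"values": values, "target": target}


def render(config):
    """Render the puzzle text from a config."""
    return (f"Given the numbers {config['values']}, how many unordered pairs "
            f"sum to {config['target']}?")


def solve(config):
    """Compute the reference answer by exhaustive enumeration."""
    return sum(1 for a, b in itertools.combinations(config["values"], 2)
               if a + b == config["target"])


def generate(seed):
    """Produce one puzzle record from a seed."""
    rng = random.Random(seed)
    config = make_config(rng)
    return {"config": config, "question": render(config), "answer": solve(config)}


def validate(record):
    """Re-derive question and answer from the stored config alone."""
    config = json.loads(json.dumps(record["config"]))  # simulate a JSONL round trip
    return render(config) == record["question"] and solve(config) == record["answer"]


records = [generate(seed) for seed in range(3)]
print(all(validate(record) for record in records))
```

Because the question and answer are pure functions of the config, storing the config alongside each record is enough to re-check it later. In this toy the reference answer comes from brute-force enumeration, which stands in for the symbolic solver or program execution described above.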


🛠️ Quick Start

Environment Setup

git clone https://github.com/HiThink-Research/PuzzleClone.git
cd PuzzleClone
pip install -r requirements.txt

Generate a Single Test Case

Run the translator in test mode to generate a sample question from a specification file:

python translator.py -t path/to/spec.yaml

The generated data ({spec_name}_data.jsonl) and debugging files such as {spec_name}_synthesizer.py are written to temp/.

Generate a Full Dataset

Run the translator in deployment mode to generate many puzzle instances:

python translator.py -d path/to/spec.yaml -o data.jsonl

If -o is omitted, the output is saved to output/{spec_name}_data.jsonl.
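
Once a dataset has been generated, a quick way to inspect it is to load the JSONL file and list the top-level fields. The sketch below assumes only that each non-empty line is a JSON object; it makes no assumption about which fields PuzzleClone writes, and the helper names `load_jsonl` and `field_counts` are illustrative.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def field_counts(records):
    """Count how often each top-level key appears, to discover the schema."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts
```

For example, `field_counts(load_jsonl("data.jsonl"))` reports which keys are present and in how many records, which is a reasonable first check before writing any downstream conversion code.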

Apply a New Template

Use -g to load existing puzzle configs and render them with a new specification:

python translator.py -d path/to/new_spec.yaml -g old_data.jsonl -o new_data.jsonl
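
When scripting several runs, it can help to build the command line programmatically. The helper below is a hypothetical convenience wrapper, not part of PuzzleClone; it uses only the flags shown above (`-t`, `-d`, `-g`, `-o`) and only builds the argument list without running anything. Whether `-g` or `-o` can be combined with `-t` is not documented here, so that combination should be treated as untested.

```python
import shlex


def translator_command(spec, mode="deploy", output=None, configs=None):
    """Build an argv list for translator.py using only the flags shown above."""
    if mode not in ("test", "deploy"):
        raise ValueError("mode must be 'test' or 'deploy'")
    argv = ["python", "translator.py", "-t" if mode == "test" else "-d", spec]
    if configs is not None:
        argv += ["-g", configs]  # existing puzzle configs to re-render
    if output is not None:
        argv += ["-o", output]  # explicit output path
    return argv


cmd = translator_command("path/to/new_spec.yaml",
                         configs="old_data.jsonl",
                         output="new_data.jsonl")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to execute the run.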

Data Transformation

Scripts in data_processing_scripts/ transform generated data into standard benchmark formats. See data_processing_scripts/README.md for details.

Evaluation

Use PolyhedronEvaluator for benchmark evaluation.


⚖️ License

Code License

This project is licensed under the Apache 2.0 License. See LICENSE for details.


📚 Citation

If you find PuzzleClone useful, please cite:

@inproceedings{xiong2026puzzleclone,
  title     = {PuzzleClone: A DSL-Powered Framework for Synthesizing Verifiable Data},
  author    = {Xiong, Kai and Huang, Yanwei and Zhang, Rongjunchen and Chen, Kun and Wu, Haipang and Wu, Yingcai},
  booktitle = {ACL 2026 Findings},
  year      = {2026},
  url       = {https://github.com/HiThink-Research/PuzzleClone}
}
