MDEval

Welcome to MDEval! MDEval is a benchmark for evaluating how well large language model (LLM) chatbots handle Markdown content. Our goal is to advance research and applications in Markdown Awareness for LLMs.

🚀 Human Evaluation Platform

To validate the effectiveness of our benchmark, we have built a Human Evaluation Platform where we invited human reviewers to comprehensively evaluate the outputs of LLM chatbots. Experience our platform here: Human Evaluation Arena.

🔧 Quick Start

We recommend using virtualenv to create an isolated virtual environment. Use the following command to quickly install the required dependencies:

pip install -r requirements.txt

Before using MDEval, set your API key as an environment variable. MDEval accesses models through a third-party proxy, so models from different providers, such as Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet, can all be called through a single interface. Notable proxy providers include OpenRouter, Zhizengzeng, and No.1 API-Pro.
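As a rough illustration of the environment-variable setup described above, the sketch below reads an API key (and an optional proxy base URL) and fails early when the key is missing. The variable names `API_KEY` and `API_BASE_URL` and the helper `load_proxy_config` are assumptions made for this example, not names confirmed by the repository; check md_eval.py for the names it actually reads.

```python
import os

# Hypothetical variable names, used only for illustration.
API_KEY_VAR = "API_KEY"
BASE_URL_VAR = "API_BASE_URL"


def load_proxy_config(env=None):
    """Return (api_key, base_url) read from the environment.

    Raises RuntimeError when the API key is missing or blank, so a
    misconfigured run stops before any request is sent.
    """
    env = os.environ if env is None else env
    api_key = env.get(API_KEY_VAR, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set the {API_KEY_VAR} environment variable before running MDEval."
        )
    # None means "fall back to the client's default endpoint".
    base_url = env.get(BASE_URL_VAR)
    return api_key, base_url
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.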

📂 Sample Data

In the data folder, we provide partial data obtained through MDEval, while the human_data folder contains some human evaluation data. You can refer to the example script md_eval.py to understand how to obtain responses, rewrite them, and calculate scores for each model.

📈 Test Results

The results below were computed on a randomly selected subset of our full dataset:

- Accuracy: The predicted accuracy for MDEval's Markdown Awareness is 0.8240. You can run accuracy.py to check the accuracy for different models.
- Correlation: We calculated the following three correlation coefficients to compare MDEval's model rankings with human rankings. Run correlation.py to compute the results:
  - Spearman Correlation: 0.7313
  - Pearson Correlation: 0.8007
  - Kendall Correlation: 0.6021
- Average Scores: The following are the average scores for the models, calculated on a subset of test data. You can run md_rank.py to compute these scores:

Model	Average Score
deepseek-chat	0.939
chatgpt-4o-latest	0.911
gpt-4o-mini-2024-07-18	0.829
gemini-1.5-pro	0.795
gpt-4-turbo-2024-04-09	0.792
llama3.1:8b	0.711
claude-3-5-sonnet-20240620	0.566
gpt-3.5-turbo	0.475
baichuan2-13b-chat-v1	0.164
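To make the three correlation coefficients reported above concrete, here is a minimal pure-Python sketch of how Pearson, Spearman, and Kendall (tau-b) correlations between two score lists can be computed. It is not the code in correlation.py, which may well rely on a statistics library instead; the function names and the toy score lists are assumptions invented for this example and are not taken from the dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b: concordant minus discordant pairs, corrected for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (a1, b1), (a2, b2) in combinations(zip(x, y), 2):
        dx, dy = a1 - a2, b1 - b2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither factor
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + tied_y_only) * (untied + tied_x_only)
    )


# Toy values invented for illustration only.
human_scores = [1, 2, 3, 4, 5]
mdeval_scores = [2, 1, 4, 3, 5]
```

On the toy lists, two of the ten pairs are ordered differently by the two raters, so `kendall_tau_b` gives about 0.6, while `pearson` and `spearman` both give about 0.8.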

📖 BibTeX

@inproceedings{chen2025mdeval,
title={MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models},
author={Zhongpu Chen and Yinfeng Liu and Long Shi and Zhi-Jie Wang and Xingyan Chen and Yu Zhao and Fuji Ren},
booktitle={Proceedings of the ACM Web Conference},
year={2025}
}
