Version: v0.7.4

Datasets Benchmark

Get started with the Benchmark API

Create Dataset Benchmark Task

POST /api/v2/serve/evaluate/execute_benchmark_task
DBGPT_API_KEY=dbgpt

curl -X POST "http://localhost:5670/api/v2/serve/evaluate/execute_benchmark_task" \
-H "Authorization: Bearer $DBGPT_API_KEY" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{
  "scene_key": "dataset",
  "scene_value": "Falcon_benchmark_01",
  "model_list": ["DeepSeek-V3.1", "Qwen3-235B-A22B"]
}'

The Benchmark Request Object


scene_key string Required

The scene type of the evaluation; supported values include app, recall, and dataset


scene_value string Required

The scene value of the benchmark, i.e. the name identifying the evaluation task


model_list array Required

The list of model names the benchmark will execute, e.g. ["DeepSeek-V3.1","Qwen3-235B-A22B"]. Notice: use the model names configured on the DB-GPT platform.


temperature float

The temperature of the LLM. Default is 0.7


max_tokens int

The max tokens of the LLM. Default is None
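For reference, the request body above can be assembled in Python before handing it to any HTTP client. This is a minimal sketch; the helper name and the choice to omit unset optional fields are this example's own, not part of the API:

```python
import json

def build_benchmark_request(scene_key, scene_value, model_list,
                            temperature=0.7, max_tokens=None):
    # Assemble the body for POST /api/v2/serve/evaluate/execute_benchmark_task.
    # Defaults mirror the documented ones: temperature 0.7, max_tokens None.
    body = {
        "scene_key": scene_key,
        "scene_value": scene_value,
        "model_list": model_list,
        "temperature": temperature,
    }
    # Leave max_tokens out of the payload when it is not set.
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    return body

body = build_benchmark_request(
    scene_key="dataset",
    scene_value="Falcon_benchmark_01",
    model_list=["DeepSeek-V3.1", "Qwen3-235B-A22B"],
)
print(json.dumps(body))
```

The serialized `body` is what the `-d` flag of the curl example carries.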


The Benchmark Result


status string

The benchmark status, e.g. success, failed, running


Query Benchmark Task List

GET /api/v2/serve/evaluate/benchmark_task_list
DBGPT_API_KEY=dbgpt

curl -X GET "http://localhost:5670/api/v2/serve/evaluate/benchmark_task_list?page=1&page_size=20" \
-H "Authorization: Bearer $DBGPT_API_KEY" \
-H "accept: application/json" \
-H "Content-Type: application/json"

The Benchmark Task List Request Object


page int Required

The page number of the task list query. Default is 1


page_size int Required

The page size of the task list query. Default is 20
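Judging from the sample response below (total_count 80, page_size 20, total_pages 4), total_pages is the ceiling of total_count over page_size; a quick sketch of that relation:

```python
import math

def total_pages(total_count, page_size):
    # Pages needed to list all tasks; matches the total_pages
    # field in the task-list response (ceil(80 / 20) == 4).
    return math.ceil(total_count / page_size)

print(total_pages(80, 20))  # → 4
```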


The Benchmark Task List Result

{
  "success": true,
  "err_code": null,
  "err_msg": null,
  "data": {
    "items": [
      {
        "evaluate_code": "1ec15dcbf5d54124bd5a5d23992af35d",
        "scene_key": "dataset",
        "scene_value": "local_benchmark_task_for_Qwen",
        "datasets_name": "Falcon评测集",
        "input_file_path": "2025_07_27_public_500_standard_benchmark_question_list.xlsx",
        "output_file_path": "/DB-GPT/pilot/benchmark_meta_data/result/1ec15dcbf5d54124bd5a5d23992af35d/202510201650_multi_round_benchmark_result.xlsx",
        "model_list": [
          "Qwen3-Coder-480B-A35B-Instruct"
        ],
        "context": {
          "benchmark_config": "{\"file_parse_type\":\"EXCEL\", \"format_type\":\"TEXT\", \"content_type\":\"SQL\", \"benchmark_mode_type\":\"EXECUTE\", \"scene_key\":\"dataset\", \"temperature\":0.6, \"max_tokens\":6000}"
        },
        "user_name": null,
        "user_id": null,
        "sys_code": "benchmark_system",
        "parallel_num": 1,
        "state": "running",
        "temperature": null,
        "max_tokens": null,
        "log_info": null,
        "gmt_create": "2025-10-20 16:50:46",
        "gmt_modified": "2025-10-20 16:50:46",
        "cost_time": null,
        "round_time": 1
      }
    ],
    "total_count": 80,
    "total_pages": 4,
    "page": 1,
    "page_size": 20
  }
}
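Note that context.benchmark_config in the response above is itself a JSON-encoded string, so it takes a second decoding pass. For example:

```python
import json

# benchmark_config arrives as a JSON string nested inside the JSON
# response, so json.loads must be applied a second time.
context = {
    "benchmark_config": (
        '{"file_parse_type":"EXCEL", "format_type":"TEXT", '
        '"content_type":"SQL", "benchmark_mode_type":"EXECUTE", '
        '"scene_key":"dataset", "temperature":0.6, "max_tokens":6000}'
    )
}
config = json.loads(context["benchmark_config"])
print(config["temperature"], config["max_tokens"])  # → 0.6 6000
```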

evaluate_code string

The benchmark task unique code


scene_key string

The benchmark task scene, e.g. dataset


scene_value string

The benchmark task name


datasets_name string

The name of the dataset the benchmark runs on


input_file_path string

The benchmark dataset file path


output_file_path string

The file path of the benchmark execution result


model_list array

The list of models the benchmark executes


context object

The benchmark task context


user_name string

The benchmark task user name


user_id string

The benchmark task user id


sys_code string

The benchmark task system code, e.g. benchmark_system


parallel_num int

The number of parallel executions of the benchmark task


state string

The benchmark task state, e.g. running, success, failed


temperature float

The benchmark task LLM temperature


max_tokens int

The benchmark task LLM max tokens


log_info string

The error message, if the benchmark task fails during execution


gmt_create string

Task create time


gmt_modified string

Task last modified time


cost_time int

The elapsed time of the benchmark task


round_time int

The number of execution rounds of the benchmark task


Benchmark Compare Result

GET /api/v2/serve/evaluate/benchmark/result/{evaluate_code}
DBGPT_API_KEY=dbgpt

curl -X GET "http://localhost:5670/api/v2/serve/evaluate/benchmark/result/{evaluate_code}" \
-H "Authorization: Bearer $DBGPT_API_KEY" \
-H "accept: application/json" \
-H "Content-Type: application/json"

The Benchmark Request Object


evaluate_code string Required

The benchmark task unique code


The Benchmark Result

{
  "success": true,
  "err_code": null,
  "err_msg": null,
  "data": {
    "evaluate_code": "c827a274b4084f5dbce4c630f5267239",
    "scene_value": "Falcon评测集_benchmark",
    "summaries": [
      {
        "roundId": 1,
        "llmCode": "Qwen3-Coder-480B-A35B-Instruct",
        "right": 136,
        "wrong": 269,
        "failed": 95,
        "exception": 0,
        "accuracy": 0.272,
        "execRate": 0.81,
        "outputPath": "/DB-GPT/pilot/benchmark_meta_data/result/c827a274b4084f5dbce4c630f5267239/202510181449_multi_round_benchmark_result.xlsx"
      }
    ]
  }
}

roundId int

The execution round number of the benchmark task


llmCode string

The name of the model executed in the benchmark task


right int

The number of questions the benchmark answered correctly


wrong int

The number of questions the benchmark answered incorrectly


failed int

The number of questions that failed to execute


exception int

The number of questions that raised an exception during execution


accuracy float

The accuracy rate over the benchmark question list


execRate float

The executable rate over the benchmark question list


outputPath string

The output file path of the benchmark result
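The counters and rates in the summary are consistent with accuracy = right / total and execRate = (right + wrong) / total. This derivation is inferred from the sample numbers above, not from a documented formula:

```python
# Inferred from the sample summary (right=136, wrong=269, failed=95,
# exception=0): how the rates appear to relate to the counters.
right, wrong, failed, exception = 136, 269, 95, 0
total = right + wrong + failed + exception  # 500 questions in the run

accuracy = right / total              # 136 / 500 == 0.272
exec_rate = (right + wrong) / total   # 405 / 500 == 0.81: the SQL ran, right or wrong

print(accuracy, exec_rate)  # → 0.272 0.81
```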