Datasets Benchmark
Get started with the Benchmark API
Create Dataset Benchmark Task
POST /api/v2/serve/evaluate/execute_benchmark_task
DBGPT_API_KEY=dbgpt
curl -X POST "http://localhost:5670/api/v2/serve/evaluate/execute_benchmark_task" \
-H "Authorization: Bearer $DBGPT_API_KEY" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"scene_key": "dataset",
"scene_value": "Falcon_benchmark_01",
"model_list": ["DeepSeek-V3.1", "Qwen3-235B-A22B"]
}'
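The same request body can be assembled programmatically. The sketch below is a minimal illustration (the `build_benchmark_payload` helper is an assumption for this example, not part of any DB-GPT client library); it produces the JSON body that the curl command above sends:

```python
import json

def build_benchmark_payload(scene_key, scene_value, model_list,
                            temperature=0.7, max_tokens=None):
    """Assemble the JSON body for POST .../execute_benchmark_task.

    scene_key, scene_value and model_list are required; temperature and
    max_tokens fall back to the documented defaults (0.7 and None).
    """
    payload = {
        "scene_key": scene_key,
        "scene_value": scene_value,
        "model_list": model_list,
        "temperature": temperature,
    }
    if max_tokens is not None:  # None means "use the server default"
        payload["max_tokens"] = max_tokens
    return payload

body = json.dumps(build_benchmark_payload(
    "dataset", "Falcon_benchmark_01",
    ["DeepSeek-V3.1", "Qwen3-235B-A22B"]))
```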
The Benchmark Request Object
scene_key string Required
The scene type of the evaluation, e.g. app, recall
scene_value string Required
The scene value of the benchmark, i.e. the benchmark task name
model_list array Required
The list of model names the benchmark will run, e.g. ["DeepSeek-V3.1","Qwen3-235B-A22B"]. Note: use the model names as configured on the DB-GPT platform.
temperature float
The sampling temperature for the LLM. Default is 0.7
max_tokens int
The maximum number of tokens the LLM may generate. Default is None
The Benchmark Result
status string
The benchmark status, e.g. success, failed, running
Query Benchmark Task List
GET /api/v2/serve/evaluate/benchmark_task_list
DBGPT_API_KEY=dbgpt
curl -X GET "http://localhost:5670/api/v2/serve/evaluate/benchmark_task_list?page=1&page_size=20" \
-H "Authorization: Bearer $DBGPT_API_KEY" \
-H "accept: application/json" \
-H "Content-Type: application/json"
The Benchmark Task List Request Object
page int Required
The page number to query. Default is 1
page_size int Required
The number of tasks per page. Default is 20
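The paginated URL in the curl example above can be built from these parameters. A minimal sketch (the base address mirrors the examples in this document and is an assumption about your local deployment):

```python
from urllib.parse import urlencode

BASE = "http://localhost:5670"  # assumed local DB-GPT instance

def task_list_url(page=1, page_size=20):
    """Build the paginated benchmark_task_list URL."""
    query = urlencode({"page": page, "page_size": page_size})
    return f"{BASE}/api/v2/serve/evaluate/benchmark_task_list?{query}"

url = task_list_url(page=2, page_size=20)
```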
The Benchmark Task List Result
{
"success": true,
"err_code": null,
"err_msg": null,
"data": {
"items": [
{
"evaluate_code": "1ec15dcbf5d54124bd5a5d23992af35d",
"scene_key": "dataset",
"scene_value": "local_benchmark_task_for_Qwen",
"datasets_name": "Falcon评测集",
"input_file_path": "2025_07_27_public_500_standard_benchmark_question_list.xlsx",
"output_file_path": "/DB-GPT/pilot/benchmark_meta_data/result/1ec15dcbf5d54124bd5a5d23992af35d/202510201650_multi_round_benchmark_result.xlsx",
"model_list": [
"Qwen3-Coder-480B-A35B-Instruct"
],
"context": {
"benchmark_config": "{\"file_parse_type\":\"EXCEL\", \"format_type\":\"TEXT\", \"content_type\":\"SQL\", \"benchmark_mode_type\":\"EXECUTE\", \"scene_key\":\"dataset\", \"temperature\":0.6, \"max_tokens\":6000}"
},
"user_name": null,
"user_id": null,
"sys_code": "benchmark_system",
"parallel_num": 1,
"state": "running",
"temperature": null,
"max_tokens": null,
"log_info": null,
"gmt_create": "2025-10-20 16:50:46",
"gmt_modified": "2025-10-20 16:50:46",
"cost_time": null,
"round_time": 1
}
],
"total_count": 80,
"total_pages": 4,
"page": 1,
"page_size": 20
}
}
evaluate_code string
The benchmark task unique code
scene_key string
The benchmark task scene, e.g. dataset
scene_value string
The benchmark task name
datasets_name string
The name of the dataset the benchmark runs on
input_file_path string
The benchmark dataset file path
output_file_path string
The benchmark result output file path
model_list array
The list of models the benchmark runs
context object
The benchmark task context
user_name string
The benchmark task user name
user_id string
The benchmark task user id
sys_code string
The benchmark task system code, e.g. benchmark_system
parallel_num int
The number of parallel executions for the benchmark task
state string
The benchmark task state, e.g. running, success, failed
temperature float
The benchmark task LLM temperature
max_tokens int
The benchmark task LLM max tokens
log_info string
If the benchmark task execution fails, this contains the error message
gmt_create string
Task create time
gmt_modified string
Task last modified time
cost_time int
The total execution time of the benchmark task
round_time int
The number of execution rounds of the benchmark task
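Given a response shaped like the sample above, a client can page through the results and pick out tasks that are still running. A sketch (the `unfinished_tasks` helper is illustrative, not part of the API):

```python
def unfinished_tasks(task_list_response):
    """Return (evaluate_codes of running tasks, total_pages) from a
    benchmark_task_list response body."""
    data = task_list_response["data"]
    running = [t["evaluate_code"] for t in data["items"]
               if t["state"] == "running"]
    return running, data["total_pages"]

# Trimmed sample mirroring the response above:
sample = {
    "success": True,
    "data": {
        "items": [
            {"evaluate_code": "1ec15dcbf5d54124bd5a5d23992af35d",
             "state": "running"},
        ],
        "total_count": 80,
        "total_pages": 4,
        "page": 1,
        "page_size": 20,
    },
}
running, pages = unfinished_tasks(sample)
```

A caller would then re-request with `page=2 .. pages` to walk the full list.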
Benchmark Compare Result
GET /api/v2/serve/evaluate/benchmark/result/{evaluate_code}
DBGPT_API_KEY=dbgpt
curl -X GET "http://localhost:5670/api/v2/serve/evaluate/benchmark/result/{evaluate_code}" \
-H "Authorization: Bearer $DBGPT_API_KEY" \
-H "accept: application/json" \
-H "Content-Type: application/json"
The Benchmark Request Object
evaluate_code string Required
The benchmark task unique code
The Benchmark Result
{
"success": true,
"err_code": null,
"err_msg": null,
"data": {
"evaluate_code": "c827a274b4084f5dbce4c630f5267239",
"scene_value": "Falcon评测集_benchmark",
"summaries": [
{
"roundId": 1,
"llmCode": "Qwen3-Coder-480B-A35B-Instruct",
"right": 136,
"wrong": 269,
"failed": 95,
"exception": 0,
"accuracy": 0.272,
"execRate": 0.81,
"outputPath": "/DB-GPT/pilot/benchmark_meta_data/result/c827a274b4084f5dbce4c630f5267239/202510181449_multi_round_benchmark_result.xlsx"
}
]
}
}
roundId int
The execution round number
llmCode string
The model name executed by the benchmark task
right int The number of questions answered correctly
wrong int The number of questions answered incorrectly
failed int The number of questions that failed to execute
exception int The number of questions that raised an exception
accuracy float The accuracy rate over the question list, i.e. right / total questions
execRate float The executable rate of the question list, i.e. (right + wrong) / total questions
outputPath string The output file path of the benchmark result
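The counts in each summary are consistent with the reported rates: in the sample above, right + wrong + failed + exception = 136 + 269 + 95 + 0 = 500 questions, accuracy = 136/500 = 0.272, and execRate = (136 + 269)/500 = 0.81. A small sketch of that sanity check (the `summary_rates` helper is illustrative):

```python
def summary_rates(summary):
    """Recompute accuracy and execRate from the raw counts."""
    total = (summary["right"] + summary["wrong"]
             + summary["failed"] + summary["exception"])
    accuracy = summary["right"] / total
    exec_rate = (summary["right"] + summary["wrong"]) / total
    return round(accuracy, 3), round(exec_rate, 3)

# The summary from the sample response above:
acc, rate = summary_rates(
    {"right": 136, "wrong": 269, "failed": 95, "exception": 0})
```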