Datasets Benchmark
Get started with the Benchmark API
Create Dataset Benchmark Task
POST /api/v2/serve/evaluate/execute_benchmark_task
DBGPT_API_KEY=dbgpt
curl -X POST "http://localhost:5670/api/v2/serve/evaluate/execute_benchmark_task" \
-H "Authorization: Bearer $DBGPT_API_KEY" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"scene_key": "dataset",
"scene_value": "Falcon_benchmark_01",
"model_list": ["DeepSeek-V3.1", "Qwen3-235B-A22B"]
}'
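The same request body can be assembled programmatically. The sketch below is a minimal illustration (the `build_benchmark_payload` helper is an assumption for this example, not part of any DB-GPT client library); it produces the JSON body that the curl command above sends:

```python
import json

def build_benchmark_payload(scene_key, scene_value, model_list,
                            temperature=0.7, max_tokens=None):
    """Assemble the JSON body for POST .../execute_benchmark_task.

    scene_key, scene_value and model_list are required; temperature and
    max_tokens fall back to the documented defaults (0.7 and None).
    """
    payload = {
        "scene_key": scene_key,
        "scene_value": scene_value,
        "model_list": model_list,
        "temperature": temperature,
    }
    if max_tokens is not None:  # None means "use the server default"
        payload["max_tokens"] = max_tokens
    return payload

body = json.dumps(build_benchmark_payload(
    "dataset", "Falcon_benchmark_01",
    ["DeepSeek-V3.1", "Qwen3-235B-A22B"]))
```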
The Benchmark Request Object
scene_key string Required
The scene type of the evaluation, e.g. app, recall
scene_value string Required
The scene value of the benchmark, i.e. the benchmark task name
model_list array Required
The list of model names the benchmark will run, e.g. ["DeepSeek-V3.1","Qwen3-235B-A22B"]. Note: use the model names as configured on the DB-GPT platform.
temperature float
The sampling temperature for the LLM. Default is 0.7
max_tokens int
The maximum number of tokens the LLM may generate. Default is None
The Benchmark Result
status string
The benchmark status, e.g. success, failed, running
Query Benchmark Task List
GET /api/v2/serve/evaluate/benchmark_task_list
DBGPT_API_KEY=dbgpt
curl -X GET "http://localhost:5670/api/v2/serve/evaluate/benchmark_task_list?page=1&page_size=20" \
-H "Authorization: Bearer $DBGPT_API_KEY" \
-H "accept: application/json" \
-H "Content-Type: application/json"
The Benchmark Task List Request Object
page int Required
The page number to query. Default is 1
page_size int Required
The number of tasks per page. Default is 20
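The paginated URL in the curl example above can be built from these parameters. A minimal sketch (the base address mirrors the examples in this document and is an assumption about your local deployment):

```python
from urllib.parse import urlencode

BASE = "http://localhost:5670"  # assumed local DB-GPT instance

def task_list_url(page=1, page_size=20):
    """Build the paginated benchmark_task_list URL."""
    query = urlencode({"page": page, "page_size": page_size})
    return f"{BASE}/api/v2/serve/evaluate/benchmark_task_list?{query}"

url = task_list_url(page=2, page_size=20)
```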
The Benchmark Task List Result
{
"success": true,
"err_code": null,
"err_msg": null,
"data": {
"items": [
{
"evaluate_code": "1ec15dcbf5d54124bd5a5d23992af35d",
"scene_key": "dataset",
"scene_value": "local_benchmark_task_for_Qwen",
"datasets_name": "Falcon评测集",
"input_file_path": "2025_07_27_public_500_standard_benchmark_question_list.xlsx",
"output_file_path": "/DB-GPT/pilot/benchmark_meta_data/result/1ec15dcbf5d54124bd5a5d23992af35d/202510201650_multi_round_benchmark_result.xlsx",
"model_list": [
"Qwen3-Coder-480B-A35B-Instruct"
],
"context": {
"benchmark_config": "{\"file_parse_type\":\"EXCEL\", \"format_type\":\"TEXT\", \"content_type\":\"SQL\", \"benchmark_mode_type\":\"EXECUTE\", \"scene_key\":\"dataset\", \"temperature\":0.6, \"max_tokens\":6000}"
},
"user_name": null,
"user_id": null,
"sys_code": "benchmark_system",
"parallel_num": 1,
"state": "running",
"temperature": null,
"max_tokens": null,
"log_info": null,
"gmt_create": "2025-10-20 16:50:46",
"gmt_modified": "2025-10-20 16:50:46",
"cost_time": null,
"round_time": 1
}
],
"total_count": 80,
"total_pages": 4,
"page": 1,
"page_size": 20
}
}
evaluate_code string
The benchmark task unique code
scene_key string
The benchmark task scene, e.g. dataset
scene_value string
The benchmark task name
datasets_name string
The name of the dataset the benchmark runs on
input_file_path string
The benchmark dataset file path
output_file_path string
The benchmark result output file path
model_list array
The list of models the benchmark runs
context object
The benchmark task context
user_name string
The benchmark task user name
user_id string
The benchmark task user id
sys_code string
The benchmark task system code, e.g. benchmark_system
parallel_num int
The number of parallel executions for the benchmark task
state string
The benchmark task state, e.g. running, success, failed
temperature float
The benchmark task LLM temperature
max_tokens int
The benchmark task LLM max tokens
log_info string
If the benchmark task execution fails, this contains the error message
gmt_create string
Task create time
gmt_modified string
Task last modified time
cost_time int
The total execution time of the benchmark task
round_time int
The number of execution rounds of the benchmark task
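Given a response shaped like the sample above, a client can page through the results and pick out tasks that are still running. A sketch (the `unfinished_tasks` helper is illustrative, not part of the API):

```python
def unfinished_tasks(task_list_response):
    """Return (evaluate_codes of running tasks, total_pages) from a
    benchmark_task_list response body."""
    data = task_list_response["data"]
    running = [t["evaluate_code"] for t in data["items"]
               if t["state"] == "running"]
    return running, data["total_pages"]

# Trimmed sample mirroring the response above:
sample = {
    "success": True,
    "data": {
        "items": [
            {"evaluate_code": "1ec15dcbf5d54124bd5a5d23992af35d",
             "state": "running"},
        ],
        "total_count": 80,
        "total_pages": 4,
        "page": 1,
        "page_size": 20,
    },
}
running, pages = unfinished_tasks(sample)
```

A caller would then re-request with `page=2 .. pages` to walk the full list.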
Benchmark Compare Result
GET /api/v2/serve/evaluate/benchmark/result/{evaluate_code}
DBGPT_API_KEY=dbgpt
curl -X GET "http://localhost:5670/api/v2/serve/evaluate/benchmark/result/{evaluate_code}" \
-H "Authorization: Bearer $DBGPT_API_KEY" \
-H "accept: application/json" \
-H "Content-Type: application/json"
The Benchmark Request Object
evaluate_code string Required
The benchmark task unique code
The Benchmark Result
{
"success": true,
"err_code": null,
"err_msg": null,
"data": {
"evaluate_code": "c827a274b4084f5dbce4c630f5267239",
"scene_value": "Falcon评测集_benchmark",
"summaries": [
{
"roundId": 1,
"llmCode": "Qwen3-Coder-480B-A35B-Instruct",
"right": 136,
"wrong": 269,
"failed": 95,
"exception": 0,
"accuracy": 0.272,
"execRate": 0.81,
"outputPath": "/DB-GPT/pilot/benchmark_meta_data/result/c827a274b4084f5dbce4c630f5267239/202510181449_multi_round_benchmark_result.xlsx"
}
]
}
}
roundId int
The execution round number
llmCode string
The model name executed by the benchmark task
right int The number of questions answered correctly
wrong int The number of questions answered incorrectly
failed int The number of questions that failed to execute
exception int The number of questions that raised an exception
accuracy float The accuracy rate over the question list, i.e. right / total questions
execRate float The executable rate of the question list, i.e. (right + wrong) / total questions
outputPath string The output file path of the benchmark result
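The counts in each summary are consistent with the reported rates: in the sample above, right + wrong + failed + exception = 136 + 269 + 95 + 0 = 500 questions, accuracy = 136/500 = 0.272, and execRate = (136 + 269)/500 = 0.81. A small sketch of that sanity check (the `summary_rates` helper is illustrative):

```python
def summary_rates(summary):
    """Recompute accuracy and execRate from the raw counts."""
    total = (summary["right"] + summary["wrong"]
             + summary["failed"] + summary["exception"])
    accuracy = summary["right"] / total
    exec_rate = (summary["right"] + summary["wrong"]) / total
    return round(accuracy, 3), round(exec_rate, 3)

# The summary from the sample response above:
acc, rate = summary_rates(
    {"right": 136, "wrong": 269, "failed": 95, "exception": 0})
```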