Version: v0.7.4

Ant Group Data Retrieval Benchmark Dataset Guide

For Text2SQL tasks, we provide a dataset benchmarking capability. It evaluates different large language models (LLMs) and agents on Text2SQL, covering syntax correctness, semantic accuracy, and execution validity. It outputs metrics such as executability rate and accuracy rate, and provides an evaluation report.

  1. Open-source Text2SQL dataset repository by Ant Group: Falcon
  2. DB-GPT supports LLM evaluation based on the Falcon benchmark dataset

Introduction

To objectively and fairly evaluate models on Text2SQL tasks, we provide a benchmarking module and dataset. This module supports comprehensive evaluation of all models in the DB-GPT framework and provides an evaluation report.

The benchmark dataset used by the module, Falcon, is a high-quality and evolving open-source Text2SQL dataset from Ant Group. The dataset aims to stress-test models in complex, cross-domain analysis scenarios, with a focus on:

  • SQL computation challenges — multi-table joins, nested CTEs, window functions, ranking, type casting, regex filters...
  • Language challenges — Chinese fuzzy time expressions, colloquial business terms, ellipsis, multi-intent questions...

The benchmark includes 28 datasets and 90 tables. As of now, 500 Chinese questions of varying difficulty have been officially released.

Of these, 151 questions are easy, 130 are medium, and 219 are hard.

Core Features Of Benchmark Dataset

  • ✅ Multi-dimensional evaluation: three-layer checks on syntax correctness, semantic accuracy, and execution validity
  • 🧠 Dynamic difficulty levels: 500 Chinese questions from Kaggle datasets (various difficulties), covering multi-step reasoning, complex nested queries, and advanced SQL features
  • ✍️ Detailed schema annotations: rich schema information including data types, natural language aliases, table relations, and sample data, helping models understand database structures
  • 🌐 Real-world scenario modeling: more ambiguous language expressions and more questions collected from Ant Group’s real production scenarios (in preparation)

System Design

Core capabilities of the benchmarking module:

  • Text2SQL evaluation API: provide APIs to create evaluation tasks
  • Benchmark execution framework: run Text2SQL tasks based on the benchmark questions
  • Result comparison framework: compare the results of the standard answers and the LLM-generated SQL, and aggregate the evaluation results (a minimal comparison sketch follows this list)
  • Dataset installation and database mapping: install the benchmark dataset and map data into the database to provide LLM SQL query service
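
At its core, the comparison step checks whether the result of the LLM-generated SQL contains the same rows as the reference result, optionally respecting row order (the 是否排序 flag described later in the dataset structure). The following is a minimal illustrative sketch of that idea, not the actual DB-GPT comparison code; the results_match helper and its normalization rules are assumptions made for this example.

```python
import json

def results_match(standard_result: str, llm_result: str, order_matters: bool) -> bool:
    """Compare two query results serialized as column-name -> value-list JSON.

    Illustrative only; the real framework's comparison logic may differ.
    """
    std_cols = json.loads(standard_result)
    llm_cols = json.loads(llm_result)
    if len(std_cols) != len(llm_cols):
        return False  # different number of result columns

    def rows(columns: dict) -> list:
        # Build row tuples and cast every value to str so "27.73" matches 27.73.
        row_tuples = list(zip(*[[str(v) for v in vals] for vals in columns.values()]))
        return row_tuples if order_matters else sorted(row_tuples)

    return rows(std_cols) == rows(llm_cols)

# Sample payloads in the format used by the benchmark:
standard = '{"性别":["Female","Male"],"平均年龄":["27.73","27.84"]}'
generated = '{"性别":["Female","Male"],"平均年龄":[27.73,27.84]}'
print(results_match(standard, generated, order_matters=True))  # True
```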

Evaluation Metrics

| Metric | Formula | Description |
| --- | --- | --- |
| Executability Rate | Number of syntactically correct samples / Total samples | The proportion of model-generated SQL statements that are syntactically correct and execute successfully in the database |
| Accuracy Rate | Number of semantically correct samples / Total samples | The proportion of model-generated SQL statements that are syntactically correct, execute successfully in the database, and return the semantically correct result |
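
As a quick worked example of the two formulas: suppose a model is evaluated on 500 benchmark questions, 450 of its generated SQL statements execute successfully, and 400 of those also return the correct result. The snippet below is only an illustration of the arithmetic; the variable names are not part of the framework.

```python
total_samples = 500        # questions in the benchmark run
executable_samples = 450   # generated SQL that parses and executes successfully
correct_samples = 400      # executable SQL whose result is also semantically correct

executability_rate = executable_samples / total_samples  # 450 / 500 = 0.90
accuracy_rate = correct_samples / total_samples           # 400 / 500 = 0.80

print(f"Executability rate: {executability_rate:.0%}")  # 90%
print(f"Accuracy rate: {accuracy_rate:.0%}")            # 80%
```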

Dataset Structure

Standard Benchmark Structure

| Field | Description | Example |
| --- | --- | --- |
| 编号 | Question serial number | 1, 2... |
| 数据集ID | Dataset ID | D2025050900161503000025249569, ... |
| 用户问题 | Question text | 各性别的平均年龄是多少,并按年龄顺序显示结果? |
| 自定义标签 | Question source and SQL type | KAGGLE_DS_1, CTE1 |
| 知识 | Knowledge context required | 暂无 |
| 标准答案SQL | Reference SQL for the question (based on Alibaba Cloud MaxCompute syntax) | SELECT gender, AVG(age) AS avg_age FROM users GROUP BY gender ORDER BY avg_age |
| 标准结果 | Result of the reference SQL on the Alibaba Cloud MaxCompute engine (some questions have multiple valid answers) | {"性别":["Female","Male"],"平均年龄":["27.73","27.84"]} |
| 是否排序 | Whether the question requires ordered results | {"性别":["Female","Male"],"平均年龄":[27.73,27.84]} |
| prompt | Model conversation prompt | 已知以下数据集,包含了字段名及其采样信息:... |
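
For orientation, a single benchmark record with the structure above could be represented as a plain Python dictionary like the one below (the 是否排序 ordering flag is omitted). This is only an illustration of the record layout, not the actual file format shipped with Falcon.

```python
benchmark_record = {
    "编号": 1,
    "数据集ID": "D2025050900161503000025249569",
    "用户问题": "各性别的平均年龄是多少,并按年龄顺序显示结果?",
    "自定义标签": "KAGGLE_DS_1, CTE1",
    "知识": "暂无",
    "标准答案SQL": (
        "SELECT gender, AVG(age) AS avg_age "
        "FROM users GROUP BY gender ORDER BY avg_age"
    ),
    "标准结果": '{"性别":["Female","Male"],"平均年龄":["27.73","27.84"]}',
    "prompt": "已知以下数据集,包含了字段名及其采样信息:...",
}
```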

How To Use

Environment Setup

  • Step1: Upgrade to V0.7.4 and upgrade the metadata database

    For SQLite, the table schema is upgraded automatically by default. For MySQL, you need to run the DDL manually. The file assets/schema/dbgpt.sql contains the complete DDL for the current version. Version-specific DDL changes can be found under assets/schema/upgrade. For example, if you are upgrading from v0.7.1 to v0.7.4, you can run the following DDL:

    mysql -h127.0.0.1 -uroot -p{your_password} < assets/schema/upgrade/v0_7_4/upgrade_to_v0.7.4.sql
  • Step2: Start the DB-GPT service and wait for the benchmark dataset to load automatically. When the loading-complete log line appears, the dataset has finished loading (this takes about 1 to 5 minutes).

  • Step3: Register LLM on the DB-GPT platform

Create Evaluation Task

  • Step1: Click "Create Benchmark" to create an evaluation task
  • Step2: Enter the task name and select the models to evaluate
  • Step3: Submit the task

  • Step4: Wait for the task to complete (evaluation may take a long time)

View Evaluation Results

  • When the status is "Completed", click "View Details" to see the evaluation report
  • The report shows:
    • Total number of models, number of questions, numbers of correct, incorrect, and failed questions
    • For each round and model: numbers of executed, correct, incorrect, and failed questions; executability rate; accuracy rate
    • Bar charts for executability rate and accuracy rate

Correct: the model answered the question correctly, i.e., the generated SQL executes and its result matches the reference answer. Incorrect: the generated SQL executes but returns a semantically wrong result. Failed: the SQL could not be executed or compared successfully, usually because the generated SQL itself has issues.
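
In other words, the evaluator first checks whether the SQL executed at all, and only then whether its result matches the reference answer. A minimal sketch of that decision, using a hypothetical classify_outcome helper; the returned values mirror the RIGHT/WRONG/FAILED labels found in the downloadable comparison report.

```python
def classify_outcome(execution_error: str | None, result_matches: bool) -> str:
    """Hypothetical helper mapping one evaluated question to a report category."""
    if execution_error:      # SQL failed to execute (syntax or runtime error)
        return "FAILED"
    if result_matches:       # executed and matches the reference result
        return "RIGHT"
    return "WRONG"           # executed, but the result is semantically wrong

print(classify_outcome(None, True))                           # RIGHT
print(classify_outcome(None, False))                          # WRONG
print(classify_outcome("table users does not exist", False))  # FAILED
```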

Download Evaluation Results

  • Click "Download Evaluation Results" to download the detailed Excel report
  • The Excel report includes LLM execution details and comparison results (shown in different sheets)

Dataset Details

  • Click "View Dataset Details" to view benchmark details
    • Shows tables, fields, and sample data of the Falcon dataset

Excel Evaluation Result Data Structure

Excel Evaluation Result Example

Execution Result Example

Execution Result Data Structure

  • Sheet name: dataset_evaluation_result
| Field | Description | Example |
| --- | --- | --- |
| 编号 | Question serial number | 1, 2... |
| 大模型名称 | Name of the evaluated model | DeepSeek-V3.1 |
| 轮次 | Evaluation round | 1 |
| 数据集ID | Dataset ID for the question | D2025050900161503000025249569 |
| 用户问题 | Evaluation question | 各性别的平均年龄是多少,并按年龄顺序显示结果? |
| 自定义标签 | Question source and SQL type | KAGGLE_DS_1, CTE1 |
| 知识 | Knowledge context required | 暂无 |
| prompt | Model conversation prompt | 已知以下数据集,包含了字段名及其采样信息:... |
| Cot长度 | CoT tokens consumed | 100 |
| LLM输出结果 | SQL generated by the LLM | select gender as gender, avg(cast(age as real)) as average_age from di_finance_data group by gender order by avg(cast(age as real)) |
| 结果执行 | Query result of the LLM-generated SQL | {"性别":["Female","Male"],"平均年龄":[27.73,27.84]} |
| 执行结果的报错信息 | Error message if the SQL fails | |
| traceId | Log trace ID | 暂无 |
| 耗时(秒) | Time consumed (seconds) | 10 |
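
If you want to post-process the downloaded report yourself, the execution sheet can be loaded with pandas. A minimal sketch; benchmark_report.xlsx is a placeholder for whatever file name you downloaded, and the column names are the Chinese field names listed above.

```python
import pandas as pd

# Placeholder file name for the downloaded evaluation report.
exec_df = pd.read_excel("benchmark_report.xlsx", sheet_name="dataset_evaluation_result")

# Example: average latency and CoT length per evaluated model.
summary = exec_df.groupby("大模型名称")[["耗时(秒)", "Cot长度"]].mean()
print(summary)
```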

Comparison Result Example


Comparison Result Data Structure

  • Sheet name: benchmark_compare_result
| Field | Description | Example |
| --- | --- | --- |
| serialNo | Question serial number | 1, 2... |
| analysisModelId | Dataset ID for the question | D2025050900161503000025249569 |
| question | Evaluation question | 各性别的平均年龄是多少,并按年龄顺序显示结果? |
| selfDefineTags | Question source and SQL type | KAGGLE_DS_1, CTE1 |
| prompt | Model conversation prompt | 已知以下数据集,包含了字段名及其采样信息:... |
| standardAnswerSql | Reference SQL for the question (based on Alibaba Cloud MaxCompute syntax) | select gender as gender, avg(cast(age as real)) as average_age from di_finance_data group by gender order by avg(cast(age as real)) |
| standardAnswer | Result of the reference SQL on the Alibaba Cloud MaxCompute engine (some questions have multiple valid answers) | {"性别": ["Female", "Male"], "平均年龄": ["27.73", "27.84"]} |
| llmCode | Evaluated model name | DeepSeek-V3.1 |
| llmOutput | SQL generated by the LLM | select gender as gender, avg(cast(age as real)) as average_age from di_finance_data group by gender order by avg(cast(age as real)) |
| executeResult | Query result of the LLM-generated SQL | {"性别":["Female","Male"],"平均年龄":[27.73,27.84]} |
| errorMsg | Comparison error message | |
| compareResult | Comparison result between the reference answer and the LLM output | RIGHT: correct; WRONG: incorrect; FAILED: failed (usually the SQL has issues) |
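
Similarly, the per-model executability and accuracy rates can be recomputed from the comparison sheet by counting the RIGHT/WRONG/FAILED labels. A minimal sketch under the same placeholder file name; it assumes executability counts both RIGHT and WRONG answers (i.e. every statement that executed), matching the metric definitions above.

```python
import pandas as pd

compare_df = pd.read_excel("benchmark_report.xlsx", sheet_name="benchmark_compare_result")

# Count RIGHT / WRONG / FAILED per model.
counts = (
    compare_df.groupby("llmCode")["compareResult"]
    .value_counts()
    .unstack(fill_value=0)
)
total = counts.sum(axis=1)

# Executability: the SQL executed (RIGHT or WRONG); accuracy: the result was correct.
executability_rate = (counts.get("RIGHT", 0) + counts.get("WRONG", 0)) / total
accuracy_rate = counts.get("RIGHT", 0) / total

print(pd.DataFrame({"executability_rate": executability_rate, "accuracy_rate": accuracy_rate}))
```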

Currently Supported Evaluation Capabilities

Metrics

| Metric | Supported |
| --- | --- |
| Executability Rate | ✅ |
| Accuracy Rate | ✅ |

Datasets

| Dataset | Supported |
| --- | --- |
| Falcon | ✅ |

Input/Output Formats

| Format | Supported |
| --- | --- |
| Excel | ✅ |
| CSV | |
| JSON | |
| Yuque | |

Databases

| Database Type | Supported |
| --- | --- |
| SQLite | ✅ |
| MySQL | |
| ODPS | |

Features

  • Support single-round, multi-model evaluation
  • Support Excel files
  • Support SQLite databases
  • Support multi-round evaluation
  • Support evaluating agents
  • Support different data sources
  • Support CSV, JSON, Yuque, and other file formats