Ant Group Data Retrieval Benchmark Dataset Guide
For Text2SQL tasks, we provide a dataset benchmarking capability. It evaluates different large language models (LLMs) and agents on Text2SQL across three dimensions: syntax correctness, semantic accuracy, and execution validity. It reports metrics such as executability rate and accuracy rate, and produces an evaluation report.
- Open-source Text2SQL dataset repository by Ant Group: Falcon
- DB-GPT supports LLM evaluation based on the Falcon benchmark dataset
Introduction
To objectively and fairly evaluate models on Text2SQL tasks, we provide a benchmarking module and dataset. This module supports comprehensive evaluation of all models in the DB-GPT framework and provides an evaluation report.
The benchmark dataset used by the module, Falcon, is a high-quality and evolving open-source Text2SQL dataset from Ant Group. The dataset aims to stress-test models in complex, cross-domain analysis scenarios, with a focus on:
- SQL computation challenges: multi-table joins, nested CTEs, window functions, ranking, type casting, regex filters, and more
- Language challenges: Chinese fuzzy time expressions, colloquial business terms, ellipsis, multi-intent questions, and more
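As a hypothetical illustration of the kind of SQL these challenges demand (the table, data, and question below are invented for this sketch, not drawn from Falcon), a "hard" question might require combining a CTE with a window function:

```python
import sqlite3

# Hypothetical schema and data -- illustrative only, not part of Falcon.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL);
INSERT INTO orders VALUES
  (1, 'East', 120.0), (2, 'East', 340.0),
  (3, 'West', 200.0), (4, 'West', 80.0), (5, 'West', 310.0);
""")

# Question (paraphrased): "What is the highest-value order in each region?"
# Answering it needs a CTE plus a RANK() window function.
sql = """
WITH ranked AS (
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rk
    FROM orders
)
SELECT region, amount FROM ranked WHERE rk = 1 ORDER BY region;
"""
for row in conn.execute(sql):
    print(row)  # ('East', 340.0) then ('West', 310.0)
```

The window function ranks orders within each region, and the outer query keeps only the top-ranked row per partition.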
The benchmark includes 28 datasets and 90 tables. As of now, 500 Chinese questions of varying difficulty have been officially released: 151 easy, 130 medium, and 219 hard.
Core Features of the Benchmark Dataset
- ✅ Multi-dimensional evaluation: three-layer checks on syntax correctness, semantic accuracy, and execution validity
- 🧠 Dynamic difficulty levels: 500 Chinese questions of varying difficulty drawn from Kaggle datasets, covering multi-step reasoning, complex nested queries, and advanced SQL features
- ✍️ Detailed schema annotations: rich schema information including data types, natural language aliases, table relations, and sample data, helping models understand database structures
- 🌐 Real-world scenario modeling: more ambiguous language expressions and additional questions collected from Ant Group's real production scenarios (in preparation)
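To make the schema-annotation feature concrete, an annotated table entry might look like the following sketch. The field names here are hypothetical, chosen only to illustrate the categories listed above (data types, natural-language aliases, table relations, sample data); they are not Falcon's actual on-disk format.

```python
# Hypothetical schema annotation -- field names are illustrative,
# not Falcon's actual format.
table_annotation = {
    "table": "orders",
    "aliases": ["order records", "sales orders"],  # natural-language aliases
    "columns": [
        {"name": "order_id", "type": "INTEGER", "alias": "order number"},
        {"name": "region",   "type": "TEXT",    "alias": "sales region"},
        {"name": "amount",   "type": "REAL",    "alias": "order amount"},
    ],
    # Cross-table relations help the model plan joins.
    "relations": [
        {"column": "order_id", "references": "shipments.order_id"},
    ],
    # A few sample rows ground the model's understanding of value formats.
    "sample_rows": [
        [1, "East", 120.0],
    ],
}
```

Annotations like this are what let a model resolve a colloquial phrase such as "sales region" to the concrete column `orders.region` before generating SQL.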
System Design
Core capabilities of the benchmarking module:
- Text2SQL evaluation API: provides APIs for creating evaluation tasks
- Benchmark execution framework: runs Text2SQL tasks against the benchmark questions
- Result comparison framework: compares the standard answers with the LLM-generated SQL and aggregates the evaluation results
- Dataset installation and database mapping: installs the benchmark dataset and maps the data into a database so that generated SQL can be executed
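The result-comparison step can be sketched as follows: execute both the reference SQL and the model's SQL against the mapped database, then compare their result sets. This is a simplified sketch under assumed conventions (the function name and return fields are not DB-GPT's actual API; a real comparator may also normalize column order, aliases, and duplicates):

```python
import sqlite3

def compare_results(conn: sqlite3.Connection, gold_sql: str, pred_sql: str) -> dict:
    """Run the reference and predicted SQL and report two booleans:
    'executable' (the prediction ran without error) and
    'match' (its result set equals the reference's, ignoring row order)."""
    gold = conn.execute(gold_sql).fetchall()  # reference answer assumed valid
    try:
        pred = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return {"executable": False, "match": False}
    # Compare as sorted multisets so row ordering does not affect equality.
    return {
        "executable": True,
        "match": sorted(map(repr, pred)) == sorted(map(repr, gold)),
    }
```

A prediction that fails to parse or execute counts toward neither metric; one that executes but returns a different result set counts only toward executability.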

Evaluation Metrics
| Metric | Formula | Description |
|---|---|---|
| Executability Rate | Number of executable samples / Total samples | The proportion of generated SQL statements that are syntactically correct and execute successfully in the database |
| Accuracy Rate | Number of semantically correct samples / Total samples | The proportion of generated SQL statements that are syntactically correct, execute successfully in the database, and return the semantically correct result |
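Given per-sample outcomes from the comparison step, both metrics reduce to simple ratios. The sketch below assumes each sample carries boolean `executable` and `match` flags (these field names are an assumption for illustration):

```python
def aggregate(results: list[dict]) -> dict:
    """Compute executability and accuracy rates from per-sample outcomes.
    Each result dict holds boolean 'executable' and 'match' flags."""
    total = len(results)
    executable = sum(r["executable"] for r in results)
    # A sample counts as accurate only if it executed AND matched the answer.
    accurate = sum(r["executable"] and r["match"] for r in results)
    return {
        "executability_rate": executable / total,
        "accuracy_rate": accurate / total,
    }

# Example: 4 samples -- 3 executed, of which 2 were semantically correct.
sample = [
    {"executable": True,  "match": True},
    {"executable": True,  "match": True},
    {"executable": True,  "match": False},
    {"executable": False, "match": False},
]
print(aggregate(sample))  # {'executability_rate': 0.75, 'accuracy_rate': 0.5}
```

By construction the accuracy rate can never exceed the executability rate, since every semantically correct sample must first execute.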