Evaluation
Get started with the Evaluation API
Create Evaluation
POST /api/v2/serve/evaluate/evaluation
Curl example:
DBGPT_API_KEY=dbgpt
SPACE_ID={YOUR_SPACE_ID}

curl -X POST "http://localhost:5670/api/v2/serve/evaluate/evaluation" \
  -H "Authorization: Bearer $DBGPT_API_KEY" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
        "scene_key": "recall",
        "scene_value": "'$SPACE_ID'",
        "context": {"top_k": 5},
        "sys_code": "xx",
        "evaluate_metrics": ["RetrieverHitRateMetric", "RetrieverMRRMetric", "RetrieverSimilarityMetric"],
        "datasets": [{
            "query": "what awel talked about",
            "doc_name": "awel.md"
        }]
      }'
Python example:

import asyncio

from dbgpt.client import Client
from dbgpt.client.evaluation import run_evaluation
from dbgpt.serve.evaluate.api.schemas import EvaluateServeRequest

DBGPT_API_KEY = "dbgpt"

client = Client(api_key=DBGPT_API_KEY)

request = EvaluateServeRequest(
    # The scene type of the evaluation: "app" or "recall"
    scene_key="recall",
    # The scene value: an app id when scene_key is "app", a space id when scene_key is "recall"
    scene_value="147",
    context={"top_k": 5},
    evaluate_metrics=[
        "RetrieverHitRateMetric",
        "RetrieverMRRMetric",
        "RetrieverSimilarityMetric",
    ],
    datasets=[
        {
            "query": "what awel talked about",
            "doc_name": "awel.md",
        }
    ],
)

# run_evaluation is a coroutine: await it from an async context,
# or drive it with asyncio.run(...) in a plain script as shown here.
data = asyncio.run(run_evaluation(client, request=request))
Request body
An Evaluation Request Object (see The Evaluation Request Object below).
When scene_key is app, the request body looks like this:
{
    "scene_key": "app",
    "scene_value": "2c76eea2-83b6-11ef-b482-acde48001122",
    "context": {"top_k": 5, "prompt": "942acd7e33b54ce28565f89f9b278044", "model": "zhipu_proxyllm"},
    "sys_code": "xx",
    "evaluate_metrics": ["AnswerRelevancyMetric"],
    "datasets": [{
        "query": "what awel talked about",
        "doc_name": "awel.md"
    }]
}
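For reference, the same app-scene request can also be sent through the Python client. The snippet below is a minimal sketch that reuses the placeholder ids from the JSON above (app id, prompt code, and model name); substitute your own values.

import asyncio

from dbgpt.client import Client
from dbgpt.client.evaluation import run_evaluation
from dbgpt.serve.evaluate.api.schemas import EvaluateServeRequest

client = Client(api_key="dbgpt")

# App-scene evaluation: scene_value is the app id, and the context carries
# the prompt code and LLM model name (all ids below are placeholders).
request = EvaluateServeRequest(
    scene_key="app",
    scene_value="2c76eea2-83b6-11ef-b482-acde48001122",
    context={
        "top_k": 5,
        "prompt": "942acd7e33b54ce28565f89f9b278044",
        "model": "zhipu_proxyllm",
    },
    evaluate_metrics=["AnswerRelevancyMetric"],
    datasets=[
        {
            "query": "what awel talked about",
            "doc_name": "awel.md",
        }
    ],
)

data = asyncio.run(run_evaluation(client, request=request))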
When scene_key is recall, the request body looks like this:
{
    "scene_key": "recall",
    "scene_value": "2c76eea2-83b6-11ef-b482-acde48001122",
    "context": {"top_k": 5, "prompt": "942acd7e33b54ce28565f89f9b278044", "model": "zhipu_proxyllm"},
    "evaluate_metrics": ["RetrieverHitRateMetric", "RetrieverMRRMetric", "RetrieverSimilarityMetric"],
    "datasets": [{
        "query": "what awel talked about",
        "doc_name": "awel.md"
    }]
}
Response body
Returns a list of Evaluation Result objects (see The Evaluation Result below).
The Evaluation Request Object
scene_key string Required
The scene type of the evaluation; supported values are app and recall
scene_value string Required
The scene value of the evaluation: the app id (when scene_key is app) or the space id (when scene_key is recall)
context object Required
The context of the evaluation
- top_k int Required
- prompt string The prompt code
- model string The LLM model name
evaluate_metrics array Required
The evaluation metrics to run; which metrics apply depends on scene_key (a toy computation of the recall metrics follows this field list):
- AnswerRelevancyMetric: the answer relevancy metric, i.e. how relevant the generated answer is to the query (when scene_key is app)
- RetrieverHitRateMetric: hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it’s about how often the system gets it right within the top few guesses. (when scene_key is recall)
- RetrieverMRRMetric: for each query, MRR evaluates the system’s accuracy by looking at the rank of the highest-placed relevant document. Specifically, it’s the average of the reciprocals of these ranks across all queries. So, if the first relevant document is the top result, the reciprocal rank is 1; if it’s second, the reciprocal rank is 1/2, and so on. (when scene_key is recall)
- RetrieverSimilarityMetric: the embedding similarity metric (when scene_key is recall)
datasets array Required
The evaluation dataset: a list of entries, each with a query and a doc_name (see the examples above)
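To make the recall metric definitions above concrete, here is a small, self-contained toy computation of hit rate and MRR. It is not DB-GPT code, just the arithmetic that the RetrieverHitRateMetric and RetrieverMRRMetric descriptions refer to, with made-up document names.

# Toy illustration of hit rate and MRR over three queries.
# Each entry lists the retrieved doc names (in top-k order) and the relevant doc.
results = [
    {"retrieved": ["awel.md", "agents.md", "rag.md"], "relevant": "awel.md"},   # rank 1
    {"retrieved": ["rag.md", "awel.md", "agents.md"], "relevant": "awel.md"},   # rank 2
    {"retrieved": ["rag.md", "agents.md", "intro.md"], "relevant": "awel.md"},  # miss
]

hits = 0
reciprocal_ranks = []
for r in results:
    if r["relevant"] in r["retrieved"]:
        hits += 1
        reciprocal_ranks.append(1 / (r["retrieved"].index(r["relevant"]) + 1))
    else:
        reciprocal_ranks.append(0.0)

hit_rate = hits / len(results)              # 2/3 ≈ 0.67
mrr = sum(reciprocal_ranks) / len(results)  # (1 + 0.5 + 0) / 3 = 0.5
print(f"hit_rate={hit_rate:.2f}, mrr={mrr:.2f}")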
The Evaluation Result
prediction string
The prediction result
contexts string
The context chunks retrieved by the RAG retriever
score float
The score of the prediction
passing bool
Whether the prediction passed the evaluation
metric_name string
The name of the metric that produced this result
prediction_cost int
The cost of producing the prediction
query string
The query that was evaluated
raw_dataset object
The raw dataset entry for this result
feedback string
Feedback from the LLM-based evaluation
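Putting it together, here is a minimal sketch for inspecting the returned results. It assumes data is the value returned by run_evaluation in the examples above and that each result exposes the fields listed here; the exact return shape (dicts vs. model objects, possibly grouped per metric) may vary by client version, so the code flattens defensively.

# Assumes `data` is the list returned by run_evaluation in the examples above.
def _as_dict(row):
    # Results may be plain dicts or pydantic-style model objects.
    return row if isinstance(row, dict) else row.dict()

rows = []
for item in data:
    # Some client versions group results per metric as nested lists.
    if isinstance(item, list):
        rows.extend(item)
    else:
        rows.append(item)

for row in map(_as_dict, rows):
    print(
        f"metric={row.get('metric_name')} "
        f"query={row.get('query')!r} "
        f"score={row.get('score')} "
        f"passing={row.get('passing')}"
    )
    if row.get("feedback"):
        print("  feedback:", row["feedback"])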