# LLM Evaluation Report

| Model                         | Date       | Total Response Time (s) | Tests Passed | Mean CodeBLEU (0-1) | Mean Usefulness Score (0-4) | Mean Functional Correctness Score (0-4) |
| ----------------------------- | ---------- | --------: | -----: | ----------------: | ------------: | --------------: |
| gpt-5.4                       | 2026-03-18 |   371.302 |    151 |           0.30016 |       3.85976 |         3.88415 |
| gpt-5                         | 2026-03-18 |   3307.15 |    160 |          0.312013 |       3.71951 |         3.82927 |
| gpt-5-mini                    | 2026-03-18 |   2223.12 |    161 |          0.305418 |       3.79268 |         3.93293 |
| claude-opus-4-6               | 2026-03-18 |   630.643 |    164 |           0.38829 |       3.87195 |         3.90854 |
| claude-sonnet-4-6             | 2026-03-18 |    604.89 |    161 |          0.379059 |       3.85366 |         3.90854 |
| claude-opus-4-1               | 2026-03-18 |   635.166 |    157 |          0.349491 |       3.85366 |         3.92683 |
| claude-sonnet-4-5             | 2026-03-18 |    546.74 |    162 |          0.331766 |       3.89024 |         3.95732 |
| claude-haiku-4-5              | 2026-03-18 |   280.497 |    154 |          0.317284 |       3.84756 |         3.92073 |
| gemini-3.1-pro-preview        | 2026-03-18 |   3339.78 |    162 |          0.395161 |       3.73171 |         3.82317 |
| gemini-3.1-flash-lite-preview | 2026-03-18 |   176.493 |    148 |          0.370935 |       3.77439 |         3.87805 |
| gemini-3-flash-preview        | 2026-03-18 |   2146.97 |    142 |          0.395257 |       3.59146 |         3.60366 |
| gemini-2.5-pro                | 2026-03-18 |   2788.94 |    118 |          0.373488 |        3.2561 |         3.38415 |
| gemini-2.5-flash              | 2026-03-18 |   952.543 |    148 |          0.338621 |        3.7439 |         3.83537 |

**Total Response Time (s):** The total time the model took to generate all of its outputs.

**Tests Passed:** The number of unit tests the model passed during the evaluation, out of 164 tests in total.
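To compare models on a common scale, the pass counts can be converted into pass rates. The following is a minimal sketch using three counts copied from the table and the 164-test total stated above; the function name `pass_rate` is illustrative and not part of any evaluation tooling.

```python
TOTAL_TESTS = 164  # total number of unit tests, as stated in this report

# Pass counts copied from the table for three of the models.
passed = {
    "claude-opus-4-6": 164,
    "gpt-5.4": 151,
    "gemini-2.5-pro": 118,
}


def pass_rate(passed_count: int, total: int = TOTAL_TESTS) -> float:
    """Return the fraction of unit tests passed (illustrative helper)."""
    if not 0 <= passed_count <= total:
        raise ValueError("passed_count must be between 0 and total")
    return passed_count / total


rates = {model: pass_rate(count) for model, count in passed.items()}

for model, rate in rates.items():
    print(f"{model}: {rate:.1%}")
```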

**Mean** [**CodeBLEU**](https://arxiv.org/abs/2009.10297)**:** The mean CodeBLEU score, a metric for the quality of generated code that is based on syntactic and semantic correctness.
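As defined in the linked paper, CodeBLEU is a weighted sum of four components: n-gram BLEU, keyword-weighted BLEU, syntactic (AST) match, and semantic (data-flow) match. The sketch below only shows that final combination step. The equal weights and the component values are assumptions for illustration; this report does not state which weights were used, and the function name `combine_codebleu` is hypothetical.

```python
def combine_codebleu(
    bleu: float,
    weighted_bleu: float,
    ast_match: float,
    dataflow_match: float,
    weights: tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Combine four component scores (each in [0, 1]) into one CodeBLEU score.

    The default equal weights are an assumption, not the report's configuration.
    """
    components = (bleu, weighted_bleu, ast_match, dataflow_match)
    if any(not 0.0 <= c <= 1.0 for c in components):
        raise ValueError("each component score must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * c for w, c in zip(weights, components))


# Hypothetical component scores for a single generated snippet.
score = combine_codebleu(bleu=0.2, weighted_bleu=0.3, ast_match=0.5, dataflow_match=0.6)
print(f"CodeBLEU: {score:.3f}")
```

Because every component lies in [0, 1] and the weights sum to 1, the combined score also stays within the 0-1 range shown in the table header.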

**Mean** [**Usefulness Score**](https://arxiv.org/abs/2304.14317)**:** The mean usefulness rating of the model's outputs, as judged by an LLM.

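* **0:** The snippet is not helpful at all and is irrelevant to the problem.
* **1:** The snippet is slightly helpful. It contains information relevant to the problem, but it is easier to write the solution from scratch.
* **2:** The snippet is somewhat helpful. It requires significant changes (relative to the size of the snippet), but it is still useful.
* **3:** The snippet is helpful, but it needs minor changes to solve the problem.
* **4:** The snippet is very helpful and solves the problem.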

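**Mean** [**Functional Correctness Score**](https://arxiv.org/abs/2304.14317)**:** The mean functional correctness rating of the model's outputs, which measures how well the output meets the functional requirements, as judged by an LLM.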

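* **0 (fails all possible tests):** The code snippet is completely incorrect and meaningless.
* **4 (passes all possible tests):** The code snippet is completely correct and handles all cases.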
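The two LLM-judged columns report averages of per-task ratings on the 0-4 scale. The sketch below assumes a plain arithmetic mean over integer ratings; the ratings list is hypothetical and `mean_score` is an illustrative name, not the report's actual tooling.

```python
from statistics import fmean


def mean_score(ratings: list[int]) -> float:
    """Average per-task ratings, each an integer on the 0-4 scale."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r not in range(5) for r in ratings):
        raise ValueError("each rating must be an integer from 0 to 4")
    return fmean(ratings)


# Hypothetical ratings for eight tasks (the report covers 164).
example_ratings = [4, 4, 3, 4, 2, 4, 4, 4]
average = mean_score(example_ratings)
print(f"mean rating: {average:.3f}")
```

Under this assumption, a reported mean multiplied by the number of rated tasks recovers the sum of the individual ratings.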

