# LLM Evaluation Report

| Model                         | Date       | Total Response Time (s) | Tests Passed | Mean CodeBLEU (0-1) | Mean Usefulness Score (0-4) | Mean Functional Correctness Score (0-4) |
| ----------------------------- | ---------- | ----------------------: | -----------: | ------------------: | --------------------------: | --------------------------------------: |
| gpt-5.4                       | 2026-03-18 |                 371.302 |          151 |             0.30016 |                     3.85976 |                                 3.88415 |
| gpt-5                         | 2026-03-18 |                 3307.15 |          160 |            0.312013 |                     3.71951 |                                 3.82927 |
| gpt-5-mini                    | 2026-03-18 |                 2223.12 |          161 |            0.305418 |                     3.79268 |                                 3.93293 |
| claude-opus-4-6               | 2026-03-18 |                 630.643 |          164 |             0.38829 |                     3.87195 |                                 3.90854 |
| claude-sonnet-4-6             | 2026-03-18 |                  604.89 |          161 |            0.379059 |                     3.85366 |                                 3.90854 |
| claude-opus-4-1               | 2026-03-18 |                 635.166 |          157 |            0.349491 |                     3.85366 |                                 3.92683 |
| claude-sonnet-4-5             | 2026-03-18 |                  546.74 |          162 |            0.331766 |                     3.89024 |                                 3.95732 |
| claude-haiku-4-5              | 2026-03-18 |                 280.497 |          154 |            0.317284 |                     3.84756 |                                 3.92073 |
| gemini-3.1-pro-preview        | 2026-03-18 |                 3339.78 |          162 |            0.395161 |                     3.73171 |                                 3.82317 |
| gemini-3.1-flash-lite-preview | 2026-03-18 |                 176.493 |          148 |            0.370935 |                     3.77439 |                                 3.87805 |
| gemini-3-flash-preview        | 2026-03-18 |                 2146.97 |          142 |            0.395257 |                     3.59146 |                                 3.60366 |
| gemini-2.5-pro                | 2026-03-18 |                 2788.94 |          118 |            0.373488 |                      3.2561 |                                 3.38415 |
| gemini-2.5-flash              | 2026-03-18 |                 952.543 |          148 |            0.338621 |                      3.7439 |                                 3.83537 |

**Total Response Time (s):** The total time, in seconds, that the model took to generate all of its outputs.
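Because the totals are hard to compare at a glance, a rough per-output latency can help. The sketch below divides a few totals from the table by 164; it assumes each model produced exactly one output per test, which this report does not state explicitly. The names `NUM_TASKS`, `total_time_s`, and `mean_time_per_task` are illustrative only.

```python
# Assumption: one generated output per test, 164 outputs per model.
NUM_TASKS = 164

# Total response times (s) copied from the table above.
total_time_s = {
    "gemini-3.1-flash-lite-preview": 176.493,
    "claude-haiku-4-5": 280.497,
    "gpt-5.4": 371.302,
    "gemini-3.1-pro-preview": 3339.78,
}


def mean_time_per_task(total_s: float, num_tasks: int = NUM_TASKS) -> float:
    """Return the average number of seconds spent per generated output."""
    return total_s / num_tasks


# Print models from fastest to slowest.
for model, total in sorted(total_time_s.items(), key=lambda kv: kv[1]):
    print(f"{model:32s} {mean_time_per_task(total):6.2f} s/task")
```

Under that assumption, the fastest and slowest models in the table differ by roughly a factor of 19 in time per output.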

**Tests Passed:** The number of unit tests that the model's outputs passed during evaluation, out of 164 tests in total.
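The raw counts can be turned into pass rates by dividing by the 164-test total. A minimal sketch, using three rows from the table; the helper name `pass_rate` is hypothetical and not part of any evaluation tooling described here.

```python
TOTAL_TESTS = 164  # total number of unit tests, as stated above

# "Tests Passed" values copied from the table.
tests_passed = {
    "claude-opus-4-6": 164,
    "gpt-5.4": 151,
    "gemini-2.5-pro": 118,
}


def pass_rate(passed: int, total: int = TOTAL_TESTS) -> float:
    """Return the fraction of tests passed, between 0 and 1."""
    if not 0 <= passed <= total:
        raise ValueError(f"passed must be between 0 and {total}, got {passed}")
    return passed / total


for model, passed in tests_passed.items():
    print(f"{model:16s} {passed}/{TOTAL_TESTS} = {pass_rate(passed):.1%}")
```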

**Mean** [**CodeBLEU**](https://arxiv.org/abs/2009.10297)**:** Average CodeBLEU score, a 0-1 metric that compares generated code with a reference solution by combining n-gram overlap with syntactic (abstract syntax tree) and semantic (data-flow) matching. Higher is better.
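For reference, the linked paper defines CodeBLEU as a weighted sum of four components. The weights used for this report are not stated on this page, so the symbols below are left unspecified.

$$
\text{CodeBLEU} = \alpha \cdot \text{BLEU} + \beta \cdot \text{BLEU}_{\text{weight}} + \gamma \cdot \text{Match}_{\text{ast}} + \delta \cdot \text{Match}_{\text{df}}
$$

Here $\text{BLEU}$ is the standard n-gram match, $\text{BLEU}_{\text{weight}}$ is an n-gram match that weights language keywords more heavily, $\text{Match}_{\text{ast}}$ is the syntactic match over abstract syntax trees, and $\text{Match}_{\text{df}}$ is the semantic match over data-flow graphs. Because every component is computed against a reference solution, a correct answer written differently from the reference can still receive a low score.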

**Mean** [**Usefulness Score**](https://arxiv.org/abs/2304.14317)**:** Average usefulness of the model's outputs, rated by an LLM on the following 0-4 scale:

* **0:** Snippet is not at all helpful, it is irrelevant to the problem.
* **1:** Snippet is slightly helpful, it contains information relevant to the problem, but it is easier to write the solution from scratch.
* **2:** Snippet is somewhat helpful, it requires significant changes (compared to the size of the snippet), but is still useful.
* **3:** Snippet is helpful, but needs to be slightly changed to solve the problem.
* **4:** Snippet is very helpful, it solves the problem.

**Mean** [**Functional Correctness Score**](https://arxiv.org/abs/2304.14317)**:** Average functional correctness of the model's outputs, that is, how well they meet the functional requirements, rated by an LLM on a 0-4 scale. The endpoints of the scale are:

* **0 (failing all possible tests):** The code snippet is totally incorrect and meaningless.
* **4 (passing all possible tests):** The code snippet is totally correct and can handle all cases.


