# LLM Evaluation Report

| Model                         | Date       | Total Response Time (s) | Tests Passed | Mean CodeBLEU (0-1) | Mean Usefulness Score (0-4) | Mean Functional Correctness Score (0-4) |
| ----------------------------- | ---------- | ----------------------: | -----------: | ------------------: | --------------------------: | --------------------------------------: |
| gpt-5.4                       | 2026-03-18 |                 371.302 |          151 |             0.30016 |                     3.85976 |                                 3.88415 |
| gpt-5                         | 2026-03-18 |                 3307.15 |          160 |            0.312013 |                     3.71951 |                                 3.82927 |
| gpt-5-mini                    | 2026-03-18 |                 2223.12 |          161 |            0.305418 |                     3.79268 |                                 3.93293 |
| claude-opus-4-6               | 2026-03-18 |                 630.643 |          164 |             0.38829 |                     3.87195 |                                 3.90854 |
| claude-sonnet-4-6             | 2026-03-18 |                  604.89 |          161 |            0.379059 |                     3.85366 |                                 3.90854 |
| claude-opus-4-1               | 2026-03-18 |                 635.166 |          157 |            0.349491 |                     3.85366 |                                 3.92683 |
| claude-sonnet-4-5             | 2026-03-18 |                  546.74 |          162 |            0.331766 |                     3.89024 |                                 3.95732 |
| claude-haiku-4-5              | 2026-03-18 |                 280.497 |          154 |            0.317284 |                     3.84756 |                                 3.92073 |
| gemini-3.1-pro-preview        | 2026-03-18 |                 3339.78 |          162 |            0.395161 |                     3.73171 |                                 3.82317 |
| gemini-3.1-flash-lite-preview | 2026-03-18 |                 176.493 |          148 |            0.370935 |                     3.77439 |                                 3.87805 |
| gemini-3-flash-preview        | 2026-03-18 |                 2146.97 |          142 |            0.395257 |                     3.59146 |                                 3.60366 |
| gemini-2.5-pro                | 2026-03-18 |                 2788.94 |          118 |            0.373488 |                      3.2561 |                                 3.38415 |
| gemini-2.5-flash              | 2026-03-18 |                 952.543 |          148 |            0.338621 |                      3.7439 |                                 3.83537 |

**Total Response Time (s):** The total wall-clock time the model took to generate outputs for all evaluation prompts.

**Tests Passed:** The number of unit tests the model's outputs passed during evaluation, out of 164 tests in total.
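The raw counts in the table can be read as pass rates; a minimal sketch, using two values taken from the table above:

```python
TOTAL_TESTS = 164  # total number of unit tests in the evaluation suite


def pass_rate(tests_passed: int, total: int = TOTAL_TESTS) -> float:
    """Fraction of unit tests passed."""
    return tests_passed / total


# Example values from the table: gpt-5.4 passed 151, claude-opus-4-6 passed 164.
for model, passed in [("gpt-5.4", 151), ("claude-opus-4-6", 164)]:
    print(f"{model}: {pass_rate(passed):.1%}")
```

So a "Tests Passed" of 151 corresponds to roughly a 92.1% pass rate, and 164 to a perfect 100%.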

**Mean** [**CodeBLEU**](https://arxiv.org/abs/2009.10297)**:** Average CodeBLEU score. CodeBLEU extends BLEU for code generation by combining n-gram match and weighted n-gram match with syntactic (AST) match and semantic (data-flow) match against the reference solution.
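The four components are combined as a weighted sum; a minimal sketch of that combination, where the component scores below are illustrative placeholders rather than values computed from real model outputs:

```python
def codebleu(ngram_bleu: float, weighted_ngram: float,
             ast_match: float, dataflow_match: float,
             weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Combine the four CodeBLEU components as a weighted sum (all in [0, 1])."""
    components = (ngram_bleu, weighted_ngram, ast_match, dataflow_match)
    return sum(w * c for w, c in zip(weights, components))


# Placeholder component scores for a single candidate/reference pair.
score = codebleu(0.20, 0.25, 0.45, 0.40)
print(round(score, 4))  # 0.325
```

With the default uniform weights, structurally similar code can score well on the AST and data-flow components even when the surface n-gram overlap is low, which is why CodeBLEU values for correct solutions often sit in the 0.3–0.4 range seen in the table.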

**Mean** [**Usefulness Score**](https://arxiv.org/abs/2304.14317)**:** Average rating of the usefulness of the model's output, as rated by an LLM judge:

* **0:** Snippet is not at all helpful; it is irrelevant to the problem.
* **1:** Snippet is slightly helpful; it contains information relevant to the problem, but it would be easier to write the solution from scratch.
* **2:** Snippet is somewhat helpful; it requires significant changes (relative to the size of the snippet), but is still useful.
* **3:** Snippet is helpful, but needs slight changes to solve the problem.
* **4:** Snippet is very helpful; it solves the problem.
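A minimal sketch of how per-output rubric ratings might be aggregated into the mean scores reported in the table; the `mean_score` helper and the example ratings are illustrative, not the actual evaluation harness:

```python
# The 0-4 usefulness rubric, as defined above.
USEFULNESS_RUBRIC = {
    0: "not at all helpful; irrelevant to the problem",
    1: "slightly helpful; easier to write the solution from scratch",
    2: "somewhat helpful; needs significant changes but still useful",
    3: "helpful; needs only slight changes to solve the problem",
    4: "very helpful; solves the problem",
}


def mean_score(ratings: list[int]) -> float:
    """Average a list of 0-4 judge ratings into a single mean score."""
    if any(r not in USEFULNESS_RUBRIC for r in ratings):
        raise ValueError("ratings must be integers in 0..4")
    return sum(ratings) / len(ratings)


# Hypothetical per-output ratings for one model across five prompts.
print(round(mean_score([4, 4, 3, 4, 4]), 4))  # 3.8
```

The means in the table are simply this average taken over all 164 evaluation prompts, which is why scores close to 4 indicate that nearly every output solved its problem outright.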

**Mean** [**Functional Correctness Score**](https://arxiv.org/abs/2304.14317)**:** Average rating of the functional correctness of the model's outputs, i.e. how well they meet the functional requirements, as rated by an LLM judge:

* **0 (failing all possible tests):** The code snippet is totally incorrect and meaningless.
* **4 (passing all possible tests):** The code snippet is totally correct and can handle all cases.
