LLM Evaluation Report
| Model | Date | Total Response Time (s) | Tests passed | Mean CodeBLEU | Mean Usefulness Score | Mean Functional Correctness Score |
|---|---|---|---|---|---|---|
| gpt-5.4 | 2026-03-18 | 371.302 | 151 | 0.30016 | 3.85976 | 3.88415 |
| gpt-5 | 2026-03-18 | 3307.15 | 160 | 0.312013 | 3.71951 | 3.82927 |
| gpt-5-mini | 2026-03-18 | 2223.12 | 161 | 0.305418 | 3.79268 | 3.93293 |
| claude-opus-4-6 | 2026-03-18 | 630.643 | 164 | 0.38829 | 3.87195 | 3.90854 |
| claude-sonnet-4-6 | 2026-03-18 | 604.89 | 161 | 0.379059 | 3.85366 | 3.90854 |
| claude-opus-4-1 | 2026-03-18 | 635.166 | 157 | 0.349491 | 3.85366 | 3.92683 |
| claude-sonnet-4-5 | 2026-03-18 | 546.74 | 162 | 0.331766 | 3.89024 | 3.95732 |
| claude-haiku-4-5 | 2026-03-18 | 280.497 | 154 | 0.317284 | 3.84756 | 3.92073 |
| gemini-3.1-pro-preview | 2026-03-18 | 3339.78 | 162 | 0.395161 | 3.73171 | 3.82317 |
| gemini-3.1-flash-lite-preview | 2026-03-18 | 176.493 | 148 | 0.370935 | 3.77439 | 3.87805 |
| gemini-3-flash-preview | 2026-03-18 | 2146.97 | 142 | 0.395257 | 3.59146 | 3.60366 |
| gemini-2.5-pro | 2026-03-18 | 2788.94 | 118 | 0.373488 | 3.2561 | 3.38415 |
| gemini-2.5-flash | 2026-03-18 | 952.543 | 148 | 0.338621 | 3.7439 | 3.83537 |
Total Response Time (s): The total time taken by the model to generate all the outputs.
Tests passed: The number of unit tests that the model has passed during evaluation, out of a total of 164 tests.
Mean CodeBLEU: Average CodeBLEU score, a metric for evaluating code generation quality based on both syntactic and semantic correctness.
Mean Usefulness Score: Average rating of how useful the model's outputs are, as rated by an LLM on the following 0 to 4 scale.
0: Snippet is not at all helpful, it is irrelevant to the problem.
1: Snippet is slightly helpful, it contains information relevant to the problem, but it is easier to write the solution from scratch.
2: Snippet is somewhat helpful, it requires significant changes (compared to the size of the snippet), but is still useful.
3: Snippet is helpful, but needs to be slightly changed to solve the problem.
4: Snippet is very helpful, it solves the problem.
Mean Functional Correctness Score: Average score for the functional correctness of the model's outputs, assessing how well they meet the functional requirements, as rated by an LLM on a 0 to 4 scale.
0 (failing all possible tests): The code snippet is totally incorrect and meaningless.
4 (passing all possible tests): The code snippet is totally correct and can handle all cases.
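As a rough illustration of how the first two metrics can be read, the sketch below converts "Tests passed" into a pass rate out of the stated 164 tests and divides the total response time by the same count. The helper names `pass_rate` and `seconds_per_output` are hypothetical, and the per-output figure assumes one generated output per test, which the report does not state explicitly. Only three rows from the table are copied in to keep the example short.

```python
TOTAL_TESTS = 164  # stated in the "Tests passed" definition

# (total response time in seconds, tests passed), copied from the report
results = {
    "gpt-5.4": (371.302, 151),
    "claude-opus-4-6": (630.643, 164),
    "gemini-2.5-pro": (2788.94, 118),
}

def pass_rate(tests_passed, total=TOTAL_TESTS):
    """Fraction of unit tests passed (hypothetical helper)."""
    return tests_passed / total

def seconds_per_output(total_seconds, total=TOTAL_TESTS):
    """Mean generation time per output.

    Assumption: exactly one output was generated per test.
    """
    return total_seconds / total

for model, (secs, passed) in results.items():
    print(f"{model}: {pass_rate(passed):.1%} passed, "
          f"{seconds_per_output(secs):.2f} s per output")
```

Normalizing in this way makes models easier to compare at a glance, but the per-output time is only as reliable as the one-output-per-test assumption.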