LLM Evaluation Report

Model

Date

Total Response Time (s)

Tests Passed

Mean CodeBLEU (0-1)

Mean Usefulness Score (0-4)

Mean Functional Correctness Score (0-4)

gpt-5

2025-10-01

2864.33

161

0.307856

3.84756

gpt-5-mini

2025-10-01

2529.73

160

0.309437

3.88415

3.92073

gpt-5-nano

2025-10-01

1681.91

152

0.305554

3.82927

3.85366

gpt-4.1

2025-10-01

252.895

156

0.337819

3.89634

3.92073

claude-opus-4-1-20250805

2025-10-01

761.552

161

0.35051

3.87195

3.92683

claude-opus-4-20250514

2025-10-01

705.543

159

0.347384

3.86585

3.93293

claude-sonnet-4-5-20250929

2025-10-01

632.707

162

0.335302

3.95122

3.96341

claude-sonnet-4-20250514

2025-10-01

578.039

161

0.321841

3.90854

3.95732

gemini-2.5-pro

2025-10-01

3375.77

141

0.365963

3.82927

3.90244

gemini-2.5-flash

2025-10-01

1324.6

151

0.331303

3.84756

3.92683

Total Response Time (s): The total time taken by the model to generate all the outputs.

Tests passed: The number of unit tests that the model has passed during evaluation, out of a total of 164 tests.

Mean CodeBLEU: Average CodeBLEU score, a metric for evaluating code generation quality based on both syntactic and semantic correctness.

Mean Usefulness Score: Average rating of the model's output usefulness as rated by a LLM model.

0: Snippet is not at all helpful, it is irrelevant to the problem.
1: Snippet is slightly helpful, it contains information relevant to the problem, but it is easier to write the solution from scratch.
2: Snippet is somewhat helpful, it requires significant changes (compared to the size of the snippet), but is still useful.
3: Snippet is helpful, but needs to be slightly changed to solve the problem.
4: Snippet is very helpful, it solves the problem.

Mean Functional Correctness Score: Average score of the functional correctness of the model's outputs, assessing how well the outputs meet the functional requirements, rated by a LLM model.

0 (failing all possible tests): The code snippet is totally incorrect and meaningless.
4 (passing all possible tests): The code snippet is totally correct and can handle all cases.

Last updated 2 months ago