LLM Evaluation Report

Last updated 2 days ago

| Model | Date | Total Response Time (s) | Tests Passed | Mean CodeBLEU (0-1) | Mean Usefulness Score (0-4) | Mean Functional Correctness Score (0-4) |
|---|---|---|---|---|---|---|
| claude-opus-4-20250514 | 2025-05-27 | 682.341 | 45 | 0.373498 | 3.68902 | 3.71951 |
| claude-sonnet-4-20250514 | 2025-05-27 | 685.546 | 112 | 0.317174 | 3.7378 | 3.65854 |
| claude-3-7-sonnet-20250219 | 2025-05-27 | 746.497 | 108 | 0.319258 | 3.65244 | 3.65244 |
| claude-3-5-sonnet-20241022 | 2025-05-27 | 445.549 | 114 | 0.332094 | 3.65244 | 3.72561 |
| gpt-4.1 | 2025-05-27 | 340.45 | 114 | 0.345565 | 3.71951 | 3.79878 |
| o4-mini | 2025-05-27 | 1380.26 | 128 | 0.322408 | 3.70122 | 3.7439 |
| o3 | 2025-05-27 | 1592.45 | 141 | 0.314449 | 3.71341 | 3.85366 |
| gpt-4o | 2025-05-27 | 254.478 | 123 | 0.305002 | 3.70732 | 3.7378 |
| gemini_gemini-2.0-flash | 2025-05-27 | 428.324 | 102 | 0.304022 | 3.65244 | 3.60976 |
| gemini_gemini-2.5-pro-preview-05-06 | 2025-05-27 | 1317.42 | 71 | 0.319577 | 2.45732 | 2.67683 |
| gemini_gemini-2.5-flash-preview-05-20 | 2025-05-27 | 1042.03 | 108 | 0.32728 | 3.39024 | 3.46341 |

Total Response Time (s): The total time taken by the model to generate all the outputs.

Tests Passed: The number of unit tests that the model passed during evaluation, out of a total of 164 tests.

Mean CodeBLEU (0-1): Average CodeBLEU score, a metric for evaluating code-generation quality based on both syntactic and semantic correctness.
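CodeBLEU is conventionally a weighted sum of four components: n-gram match, weighted n-gram match, syntactic (AST) match, and semantic data-flow match. The sketch below illustrates that composition only; the component scores are invented for illustration, and the equal 0.25 weights are the metric's common defaults, not necessarily what this evaluation used.

```python
# Illustrative sketch of how a CodeBLEU score is composed.
# Component scores below are hypothetical, not real measurements.

def codebleu_score(ngram_match: float,
                   weighted_ngram_match: float,
                   ast_match: float,
                   dataflow_match: float,
                   weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of CodeBLEU's four components, each in [0, 1]."""
    components = (ngram_match, weighted_ngram_match, ast_match, dataflow_match)
    return sum(w * c for w, c in zip(weights, components))

# Hypothetical component scores for one generated snippet:
score = codebleu_score(0.20, 0.30, 0.50, 0.40)
print(round(score, 2))  # 0.35 — the same order of magnitude as the table's means
```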

Mean Usefulness Score (0-4): Average rating of the usefulness of the model's output, as rated by an LLM judge:

  • 0: Snippet is not at all helpful; it is irrelevant to the problem.

  • 1: Snippet is slightly helpful; it contains information relevant to the problem, but it is easier to write the solution from scratch.

  • 2: Snippet is somewhat helpful; it requires significant changes (compared to the size of the snippet), but is still useful.

  • 3: Snippet is helpful, but needs to be slightly changed to solve the problem.

  • 4: Snippet is very helpful; it solves the problem.

Mean Functional Correctness Score (0-4): Average score of the functional correctness of the model's outputs, assessing how well the outputs meet the functional requirements, as rated by an LLM judge:

  • 0 (failing all possible tests): The code snippet is totally incorrect and meaningless.

  • 4 (passing all possible tests): The code snippet is totally correct and can handle all cases.

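Because Tests Passed is measured against a fixed suite of 164 unit tests, the raw counts convert directly to pass rates. A minimal sketch, using a few model names and counts copied from the table above:

```python
# Convert "Tests Passed" counts from the evaluation table into percentages.
TOTAL_TESTS = 164

tests_passed = {
    "o3": 141,
    "o4-mini": 128,
    "gpt-4o": 123,
    "gpt-4.1": 114,
    "claude-3-5-sonnet-20241022": 114,
    "claude-sonnet-4-20250514": 112,
    "claude-opus-4-20250514": 45,
}

pass_rates = {m: round(100 * n / TOTAL_TESTS, 1) for m, n in tests_passed.items()}
for model, rate in sorted(pass_rates.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rate}%")  # e.g. o3: 86.0%
```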