
LLM Evaluation Report


| Model | Date | Total Response Time (s) | Tests Passed | Mean CodeBLEU (0-1) | Mean Usefulness Score (0-4) | Mean Functional Correctness Score (0-4) |
| --- | --- | --- | --- | --- | --- | --- |
| o1-preview | 2025-04-02 | 3264.19 | 134 | 0.320351 | 3.60976 | 3.59756 |
| o1-mini | 2025-04-02 | 964.977 | 129 | 0.336816 | 3.69512 | 3.75 |
| gpt-4o | 2025-04-02 | 228.668 | 128 | 0.310692 | 3.71951 | 3.67073 |
| gpt-4o-mini | 2025-04-02 | 248.679 | 116 | 0.321981 | 3.62805 | 3.61585 |
| claude-3-5-sonnet-20240620 | 2025-04-02 | 276.394 | 108 | 0.30484 | 3.67683 | 3.66463 |
| claude-3-5-sonnet-20241022 | 2025-04-02 | 291.706 | 112 | 0.328969 | 3.68902 | 3.70732 |
| gemini-1.5-pro | 2025-04-02 | 518.354 | 103 | 0.327295 | 3.46951 | 3.41463 |
| gemini-1.5-flash | 2025-04-02 | 763.949 | 0 | 0.261228 | 0.792683 | 1.32317 |

Total Response Time (s): The total time taken by the model to generate all the outputs.

Tests Passed: The number of unit tests the model passed during evaluation, out of 164 in total.
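A total of 164 tests matches the size of the HumanEval benchmark, though the report does not name the suite or the harness. As an illustration only, a minimal pass/fail check for HumanEval-style problems might execute each generated completion together with its assert-based tests in a fresh interpreter:

```python
import subprocess
import sys
import tempfile

def passes_tests(completion: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run a generated completion against its unit tests in a separate
    interpreter process; a zero exit code counts as a pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(completion + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Hypothetical single problem, for illustration:
completion = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_tests(completion, tests))  # True
```

The Tests Passed column is then the count of problems for which a check like this succeeds.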

Mean CodeBLEU: Average CodeBLEU score, a metric for evaluating code generation quality based on both syntactic and semantic correctness.
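For illustration, a CodeBLEU score can be computed with the open-source `codebleu` package; the report does not say which implementation was used, so treat this as an assumption:

```python
# pip install codebleu  -- third-party implementation, assumed here
from codebleu import calc_codebleu

reference = "def add(a, b):\n    return a + b\n"
prediction = "def add(x, y):\n    return x + y\n"

result = calc_codebleu([reference], [prediction], lang="python")
print(result["codebleu"])  # overall score in [0, 1], as reported in the table
```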

Mean Usefulness Score: Average rating of the usefulness of the model's output, as judged by an LLM (see the sketch after this scale).

  • 0: Snippet is not at all helpful; it is irrelevant to the problem.

  • 1: Snippet is slightly helpful; it contains information relevant to the problem, but it would be easier to write the solution from scratch.

  • 2: Snippet is somewhat helpful; it requires significant changes (relative to its size), but is still useful.

  • 3: Snippet is helpful, but needs slight changes to solve the problem.

  • 4: Snippet is very helpful; it solves the problem.
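The report does not specify the judge model or the prompt. A minimal LLM-as-judge sketch using the OpenAI SDK, with an assumed judge model and rubric wording, could look like this:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Rate how useful the code snippet is for solving the problem,
on a 0-4 scale: 0 = irrelevant, 1 = slightly helpful, 2 = somewhat helpful,
3 = helpful but needs slight changes, 4 = solves the problem.
Answer with a single digit."""

def usefulness_score(problem: str, snippet: str) -> int:
    """Ask a judge model to apply the 0-4 usefulness rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; not named in the report
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"Problem:\n{problem}\n\nSnippet:\n{snippet}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```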

Mean Functional Correctness Score: Average score of the functional correctness of the model's outputs, assessing how well they meet the functional requirements, as judged by an LLM (a sketch of how per-problem scores roll up into the table follows the scale below).

  • 0 (failing all possible tests): The code snippet is totally incorrect and meaningless.

  • 4 (passing all possible tests): The code snippet is totally correct and can handle all cases.
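Given per-problem results, the table's summary columns are simple aggregates: one pass count and three means. A sketch with hypothetical data:

```python
from statistics import mean

# Hypothetical per-problem results for one model (164 entries in practice).
results = [
    {"passed": True,  "codebleu": 0.41, "usefulness": 4, "correctness": 4},
    {"passed": False, "codebleu": 0.22, "usefulness": 2, "correctness": 1},
]

row = {
    "Tests Passed": sum(r["passed"] for r in results),
    "Mean CodeBLEU (0-1)": mean(r["codebleu"] for r in results),
    "Mean Usefulness Score (0-4)": mean(r["usefulness"] for r in results),
    "Mean Functional Correctness Score (0-4)": mean(r["correctness"] for r in results),
}
print(row)
```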
