HuggingFace Integration

Using GenStudio as the front end for your HuggingFace models.

After deploying your inference endpoint on HuggingFace, you should see the following user interface:

On this page, you will need the following information:

  • Endpoint URL

  • Model Repository

  • API token. You can view this by checking the "Add API token" box in the Call Examples code block.

In addition to these, you will also need the context window of your model. This can be found on the model's information page.

After collecting this information, format it into JSON as shown in the example below:

    {
        "endpoint": "your_api_endpoint",
        "model_name": "meta-llama/Llama-2-7b-chat-hf",
        "context_window": 4096
    }

Next, paste this into the Credential field of your integration.
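If you would rather generate the credential than type it by hand, the snippet below is a minimal sketch that builds the same three keys shown in the example and prints valid JSON. The function name `build_credential` and the placeholder URL are hypothetical, and the HTTPS check is an assumption about what a deployed endpoint URL looks like, not a rule enforced by GenStudio.

```python
import json


def build_credential(endpoint: str, model_name: str, context_window: int) -> str:
    """Return the credential as a JSON string with the keys from the example above."""
    # Assumption: a deployed Inference Endpoint URL is a full HTTPS URL.
    if not endpoint.startswith("https://"):
        raise ValueError("endpoint should be the full HTTPS URL of your endpoint")
    if context_window <= 0:
        raise ValueError("context_window must be a positive number of tokens")

    credential = {
        "endpoint": endpoint,
        "model_name": model_name,
        "context_window": context_window,
    }
    # json.dumps guarantees balanced braces and correct quoting.
    return json.dumps(credential, indent=4)


if __name__ == "__main__":
    # Placeholder values; substitute the ones collected from your endpoint page.
    print(build_credential(
        "https://example.invalid",
        "meta-llama/Llama-2-7b-chat-hf",
        4096,
    ))
```

Copy the printed output into the Credential field as-is.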

Once the credential is successfully validated, you should see your HuggingFace model listed in GenStudio's model list:

Scaling HuggingFace Endpoints to Zero

Scaling to zero is a feature of HuggingFace Inference Endpoints designed to reduce resource usage and cost. It monitors request patterns and scales the number of replicas down to zero during idle periods, so you only use resources when necessary.

However, this does introduce a cold start period when traffic resumes, and there are a few considerations to be mindful of. For an in-depth look at how this feature functions, its benefits, and potential challenges, please refer to HuggingFace's guide on Autoscaling.
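If you call a scaled-to-zero endpoint from your own scripts, one way to absorb the cold start is to retry with exponential backoff. The sketch below assumes the endpoint answers with HTTP 503 while a replica is starting, which is how HuggingFace's autoscaling guide describes it; confirm this against your own endpoint. The function name `call_with_cold_start_retry` and the `send` callable are hypothetical. `send` stands for whatever code performs the request and returns a `(status_code, body)` pair.

```python
import time
from typing import Callable, Tuple


def call_with_cold_start_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 6,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a status other than 503 or attempts run out."""
    delay = initial_delay
    status, body = 503, ""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        # Assumption: 503 means the endpoint is still waking up.
        if status != 503:
            return status, body
        if attempt < max_attempts:
            sleep(delay)
            delay *= 2  # double the wait before the next attempt
    # Still 503 after the final attempt; hand the last response to the caller.
    return status, body
```

With the defaults, the function waits 1, 2, 4, 8, and 16 seconds between its six attempts. Cold starts for large models can take longer than that, so raise `max_attempts` or `initial_delay` to suit your model.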

Supported models

At the moment, we only support endpoints for models with a text-generation tag that are deployed as text-generation-inference containers. We are working to expand our list of supported models.
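Before adding an endpoint to GenStudio, it can help to confirm that it responds like a text-generation-inference container. The sketch below only builds the request, using Python's standard library. It assumes the endpoint accepts a POST at its root URL with an `inputs` string and a `parameters.max_new_tokens` value, as in HuggingFace's Call Examples for text-generation endpoints. The function name `build_tgi_request` and the placeholder URL and token are hypothetical.

```python
import json
import urllib.request


def build_tgi_request(
    endpoint: str, api_token: str, prompt: str, max_new_tokens: int = 20
) -> urllib.request.Request:
    """Build (but do not send) a text-generation request for the endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; this prints the request without contacting a server.
    req = build_tgi_request("https://example.invalid", "hf_your_token", "Hello")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

To actually send the request, pass the returned object to `urllib.request.urlopen` and read the response body. A JSON reply containing generated text suggests the endpoint is of the supported type.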
