
Serving LLMs with Wisp #

It’s very simple to serve LLMs in the cloud with Wisp. If you haven’t done so yet, check out the Quickstart to understand how Wisp works. Otherwise, keep reading here!

In practice, you can host your models with Wisp using any technology that exposes a port. We’re using vLLM because it greatly simplifies the process and supports Docker containers.

The LLM #

We’ll use vLLM to host a Mistral-7B model with Docker. To use other models, see the vLLM documentation.

Configuration #

If you haven’t done so yet, run wisp init to create the configuration file. Open wisp-config.yml and enter the following:


setup:
    project: local

run: |
    docker run --runtime nvidia --gpus all \
        --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
        -p 8000:8000 \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model mistralai/Mistral-7B-v0.1

resources:
    accelerators:
        compute_capability: 7.0+
        vram: 6+
    memory: 4+

io:
    # Expose port 8000 from the Docker container on the server
    ports: 8000
    # Require login with a Wisp account for the endpoint
    secure_endpoint: true

Launch the Server #

We’re ready to launch the server! In your terminal, run:

wisp run

Wisp will pull the image and run it using the supplied command. The command outputs an external IP address that you can access in your browser.
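
Once the server is up, you can query it through the OpenAI-compatible API that vLLM exposes on port 8000. The snippet below is a minimal sketch, assuming <EXTERNAL_IP> stands in for the address printed by wisp run; because secure_endpoint is enabled, your request may additionally need to be authenticated with your Wisp account.

# Replace <EXTERNAL_IP> with the IP printed by `wisp run`
curl http://<EXTERNAL_IP>:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-v0.1",
        "prompt": "The capital of France is",
        "max_tokens": 32
    }'

The response is a standard OpenAI-style completion object containing the generated text.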

You can see your job, along with its stats and a cost overview, in the dashboard under Jobs.