Serving LLMs with Wisp #
It’s very simple to serve LLMs in the cloud with Wisp. If you haven’t done so yet, check out the Quickstart to understand how Wisp works. Otherwise, keep reading here!
In practice, you can host your models with Wisp using any technology that exposes a port. We’re using vLLM because it greatly simplifies the process and supports Docker containers.
The LLM #
We’ll use vLLM to host a Mistral-7B model with Docker. To use other models, see the documentation for vLLM.
Configuration #
If you haven’t done so yet, run wisp init to create the configuration file. Open wisp-config.yml and enter the following information:
setup:
  project: local
run: |
  docker run --runtime nvidia --gpus all \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-v0.1
resources:
  accelerators:
    compute_capability: 7.0+
    vram: 6+
  memory: 4+
io:
  # Expose port 8000 from the Docker container on the server
  ports: 8000
  # Require login with a Wisp account for the endpoint
  secure_endpoint: true
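Replace <secret> with your own Hugging Face access token. To serve a different model, only the --model argument in the run command needs to change. As an illustrative sketch (the model name below is an assumption; check the vLLM documentation and the model’s availability on the Hugging Face Hub first):

run: |
  docker run --runtime nvidia --gpus all \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-Instruct-v0.2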
Launch the Server #
We’re ready to launch the server! In your terminal, run:
wisp run
Wisp will pull the image and run it using the supplied command. The command will output an external IP address that you can access in your browser.
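Once the server is up, vLLM exposes an OpenAI-compatible API on the forwarded port. Below is a minimal sketch of a completion request, assuming <external-ip> is the address reported by Wisp and that the endpoint is reachable from your machine (with secure_endpoint: true you may need to authenticate with your Wisp account first):

# Send a completion request to the OpenAI-compatible endpoint served by vLLM
curl http://<external-ip>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-v0.1",
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0.7
      }'

The response is JSON in the OpenAI completions format, with the generated text under choices[0].text.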
You can see your job, along with its stats and a cost overview, in the dashboard under Jobs.