vLLM Deploy — Docker Compose Generator

Your Specs

Pick a Model

Get Compose File

01 — Your Hardware

Hardware Configuration

// Tell us what you're working with, and we'll find models that fit

System RAM *

CPU Cores

GPU Type *

🖥 CPU Only

🟢 NVIDIA (CUDA)

🔴 AMD (ROCm)

🍎 Apple Silicon

GPU VRAM per Card

Number of GPUs

Host Port

HuggingFace Token (optional)

Required for gated models (Llama, Gemma)

Extras

FP16 precision

AWQ quantization

+ Open WebUI

Auto-restart

02 — Choose Your Model

Recommended Models

// Filtered for your hardware. US-origin, open-weight models only

Models compatible with your setup

Based on your specs, these models will run comfortably. Select one to continue.

03 — Your Compose File

Building optimized configuration...

docker-compose.yml — ready

04 — Deploy with Docker

Up and running in 4 steps.

Everything you need to go from zero to a live LLM endpoint.

Install Docker & Docker Compose

Install Docker Desktop (Mac/Windows) or Docker Engine + Compose plugin (Linux). For NVIDIA GPUs, also install the NVIDIA Container Toolkit.

Linux — Docker + NVIDIA Toolkit
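# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER   # log out and back in for this to take effect

# NVIDIA Container Toolkit (GPU only, Debian/Ubuntu)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Register the apt repository, signed with the key above
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker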

Save your Compose file

Generate your config above, download it, and place it in a dedicated project directory.

Terminal
mkdir -p ~/vllm-deploy && cd ~/vllm-deploy
# Move your downloaded docker-compose.yml here, then verify:
cat docker-compose.yml

Pull & start the services

Docker pulls the vLLM image and starts your containers. First model download may take several minutes depending on size.

Terminal
# Foreground (see live logs)
docker compose up

# Or run in background
docker compose up -d

# Follow logs when backgrounded
docker compose logs -f vllm
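
After the containers start, the model can take a while to load before the API answers. The following is a minimal sketch of a readiness poll, not part of the generated setup. It assumes the server is published on host port 8000 (as in the examples on this page) and exposes a /health route, as current vLLM releases do. The helper names http_probe and wait_until_ready are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe, attempts=60, delay=5.0, sleep=time.sleep):
    """Call `probe()` until it returns True; give up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt      # number of tries it took
        sleep(delay)
    return None                 # never became ready


# Usage (assumed URL: host port 8000 and the /health route):
#   tries = wait_until_ready(lambda: http_probe("http://localhost:8000/health"))
#   print("ready" if tries else "gave up waiting")
```

Passing the probe in as a function keeps the retry loop independent of the URL, so the same loop can watch a different port or route.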

Hit the API

vLLM exposes an OpenAI-compatible REST API. Works with curl, the OpenAI Python SDK, or any tool that accepts a custom base URL.

curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Python — OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
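
The model field in both requests has to match a name the server is actually serving. One way to find it, sketched below under the assumption that the server listens on host port 8000: the OpenAI-compatible /v1/models route returns a JSON list of served models. The helper names model_ids and fetch_model_ids are illustrative.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model ids from an already-parsed /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8000/v1"):
    """GET {base_url}/models and return the ids the server reports."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return model_ids(json.load(resp))


# Usage, once the server is up:
#   print(fetch_model_ids())
```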

📦

Model Caching

Models download from HuggingFace on first start and are cached in a Docker volume. Subsequent starts skip the download entirely; the only wait is the time it takes to load the model into memory.

🔐

Gated Models

Llama and Gemma require you to accept a license on HuggingFace, then provide your HF_TOKEN in Step 1 above.

⚡

Useful Commands

docker compose stop (pause)
docker compose down (teardown)
docker compose pull (update images)

05 — Glossary

The terms, plain English.

Plain-English definitions for the AI vocabulary on this page. No prior AI experience required.

LLM

Concept

The technology behind ChatGPT, Claude, and similar tools. An LLM (Large Language Model) is a program that's been trained on huge amounts of text and can then read and write text on its own: answer questions, summarize documents, generate code, translate, and so on. The "model" itself is essentially a large file (often 4 to 80 GB) that holds everything the system learned during training. Llama, Phi, Gemma, and GPT-OSS in this catalog are all LLMs you can download and run yourself.

Open-weight model

Concept

An AI model whose creator has made it freely available to download and run. ChatGPT and Claude are "closed": you can only use them through the company's servers, paying per request. Open-weight models like Llama (from Meta) or Phi (from Microsoft) you download once and run forever on your own hardware, with no monthly fee.

Note: "open-weight" is not exactly "open source." The model files are public, but the training data usually is not. Each model has its own license, which is usually quick to read and worth checking before commercial use.

vLLM

Software

The software that actually runs the model. Think of it like a web server: nginx serves websites, vLLM serves AI models. You point it at a model file, it loads the model into memory, and it accepts chat requests over HTTP and sends back answers. vLLM itself isn't an AI; it's the engine that runs one. It's built for production use, which is why it scales to handle multiple users at once.

Docker

Tooling

A tool that packages software into self-contained units called containers. Each container bundles the application with everything it needs to run (libraries, runtimes, config files) so it works identically on any machine, whether a laptop, a server, or a cloud VM. The one exception is the GPU driver, which stays installed on the host.

Why this site uses it: instead of installing vLLM, NVIDIA libraries, Python, and dozens of dependencies directly onto your server (and dealing with version conflicts forever), you run one container. Swap the container to update, delete it to remove. Nothing leaks into your operating system. Docker Compose, the format this site generates, is a way to describe a multi-container setup in a single YAML file.

VRAM

Hardware

Memory that lives on the graphics card itself, separate from your computer's main RAM. AI models work best when they fit entirely inside this memory. A rough rule of thumb at 16-bit precision (2 bytes per parameter): a model with 8 billion parameters needs around 16 GB of VRAM for its weights, and a 70-billion-parameter model needs around 140 GB (which means multiple GPUs working together). Working memory for active conversations comes on top of that. VRAM is the main reason hardware matters for AI: more VRAM lets you run bigger, smarter models.
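
The rule of thumb above is simple arithmetic: parameters times bytes per parameter. A small sketch, covering the weights only (the function name weights_gb is illustrative, and 1 GB is taken as one billion bytes):

```python
def weights_gb(params_billions, bits_per_param=16):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


for name, size in [("8B", 8), ("70B", 70)]:
    print(f"{name}: ~{weights_gb(size):.0f} GB of weights at 16-bit")
```

Treat the result as a floor rather than a budget: the server also needs memory for the conversations it is actively processing.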

Quantization

Optimization

A technique for shrinking AI models so they fit on smaller hardware. Models are made of billions of numbers; quantization rounds those numbers to use fewer digits each, which dramatically cuts the file size. FP16 stores each number in 16 bits, half the size of full 32-bit precision, with almost no quality loss, and is what most people use. AWQ 4-bit shrinks the model to roughly a quarter of its FP16 size with a small drop in quality. Practical effect: a model that wouldn't fit on your GPU at full size might fit perfectly after quantization.
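
To make "rounding to fewer digits" concrete, here is a toy sketch of plain round-to-nearest quantization with one shared scale. It is deliberately not AWQ, which chooses its scales far more carefully; the function names and sample values are made up for illustration.

```python
def quantize(values, bits=4):
    """Map floats to small signed integers that share one scale factor."""
    levels = 2 ** (bits - 1) - 1                  # 7 for 4-bit
    scale = max(abs(v) for v in values) / levels or 1.0
    codes = [round(v / scale) for v in values]    # each code fits in `bits` bits
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [1.4, -0.6, 0.25, 0.0]
codes, scale = quantize(weights)
restored = dequantize(codes, scale)
print(codes)      # small integers, cheap to store
print(restored)   # close to the originals, but not identical
```

The gap between weights and restored is the quality loss that quantization trades for size.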

Tokens & Context

Concept

AI models don't read words exactly; they read tokens, which are chunks of text. A token is roughly three quarters of a word, so 100 tokens is about 75 words. Context length is how much text the model can hold in mind during a single conversation. A 4,000-token context fits about 3,000 words, or roughly six pages. Longer context lets the model handle bigger documents but uses more VRAM and slows responses down.
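
The conversion above as a quick calculation. The 0.75 words-per-token figure comes from the text; the 500 words per page is an assumption chosen to match the "six pages" estimate, and real ratios vary by language and content.

```python
WORDS_PER_TOKEN = 0.75   # rule of thumb from the text above
WORDS_PER_PAGE = 500     # assumption: a dense single-spaced page


def context_estimate(tokens):
    """Return (approximate words, approximate pages) for a token budget."""
    words = tokens * WORDS_PER_TOKEN
    return words, words / WORDS_PER_PAGE


words, pages = context_estimate(4000)
print(f"4,000 tokens is about {words:.0f} words, or {pages:.0f} pages")
```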

Tensor Parallelism

Hardware

When a model is too big for a single graphics card, you can split it across multiple cards working together. Think of it as RAID for GPUs: instead of striping data across drives, you stripe a model across cards. vLLM handles all the coordination automatically; you just tell it how many GPUs you have. This is how the largest models (70B, 405B) get hosted without buying datacenter-class single GPUs.
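
A rough sketch of how the GPU count could be chosen and passed to vLLM through its --tensor-parallel-size flag. The helper names, the 20 percent headroom for working memory, and the restriction to power-of-two counts are all assumptions for illustration; power-of-two counts are used because the GPU count generally has to divide the model's attention heads evenly.

```python
def pick_tensor_parallel_size(weights_gb, vram_per_gpu_gb, max_gpus=8,
                              utilization=0.90, headroom=1.2):
    """Smallest power-of-two GPU count whose usable VRAM covers the model."""
    needed = weights_gb * headroom          # weights plus assumed working memory
    size = 1
    while size <= max_gpus:
        if size * vram_per_gpu_gb * utilization >= needed:
            return size
        size *= 2
    return None                             # does not fit on this machine


def vllm_args(model, tp_size):
    """Command-line arguments for the vLLM server."""
    return ["--model", model, "--tensor-parallel-size", str(tp_size)]


# A 140 GB model on 80 GB cards:
print(pick_tensor_parallel_size(140, 80))
```

The estimate is intentionally conservative; a real deployment should confirm the fit by watching the server's startup logs.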

Gated Model

Workflow

Some model creators (notably Meta and Google) ask you to accept their license terms on HuggingFace before you can download their models. It's free, just a one-time click-through, similar to accepting an EULA. Once accepted, you generate a token (a long random string) and paste it into the form on this page. vLLM uses that token to prove it's you when downloading the model files.

Mixture of Experts (MoE)

Architecture

A clever model design that runs faster than its size suggests. Instead of every part of the model being involved in every word it writes, only a small group of "experts" inside the model is used at a time. Imagine a 100-person company where only 3 specialists work on each task instead of all 100. Result: a 132-billion-parameter model that runs about as fast as a 36-billion one. DBRX and Phi-3.5 MoE in this catalog use this approach.

HuggingFace

Service

The internet's biggest library of open AI models, hosted by a company of the same name. If Docker Hub is where containers live, HuggingFace is where AI models live. When vLLM starts up, it downloads model files from HuggingFace automatically. Most models download anonymously and freely; a few need an account and a token (see "Gated Model" above).

OpenAI-compatible API

Integration

OpenAI publishes a standard way for software to talk to ChatGPT (a specific URL format and request shape). vLLM speaks the same language, so any tool, app, or script that already works with ChatGPT can be pointed at your vLLM server by changing the base URL and, in most cases, the model name in the request. Existing integrations keep working, and you swap the back-end from a paid commercial API to a free local one.

06 — Frequently Asked Questions

The questions that come up.

Practical answers to the questions that come up when you're looking at this for the first time.

How big are these models on disk?

Bigger than typical software downloads, but not unreasonable. A small model like Llama 3.2 3B takes around 6 GB. A mid-size like Llama 3.1 8B is about 16 GB. The really large ones run from 140 GB (Llama 70B) up to over 800 GB (Llama 405B).

The download happens once, then the model files sit in a Docker volume on your server. Restarts and reboots don't re-download anything.

Does it run offline after the first download?

Yes. The model downloads from HuggingFace the very first time vLLM starts, but after that everything runs locally. No outside connection is needed for daily use, and no chat data ever leaves your network.

For air-gapped or isolated environments, you can download the model on a connected computer, copy the model files onto a USB drive or storage volume, and mount that into the vLLM container on your isolated machine. The internet step only ever happens once.

How do I update the model or vLLM?

Updating works just like any other Docker container. To get a newer vLLM version, run docker compose pull followed by docker compose up -d and it restarts on the latest image.

To switch to a different model, open your compose file, change the model name in the --model line, and restart. The new model downloads automatically on first run; the old one stays cached in case you want to switch back.

What happens if the GPU is busy with something else?

vLLM reserves a large, fixed share of GPU memory the moment it starts up, not just what the model weights need. If something else is already using the graphics card (a 3D rendering job, another AI service, video transcoding, etc.), vLLM either fails to start or won't have enough memory to load its model.

The clean fix: dedicate the GPU to vLLM. The shared option: lower the --gpu-memory-utilization setting (default is 0.90, meaning 90 percent) to leave room for other tasks.
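
For the shared option, the setting can be worked out from how much memory the other workloads need. A minimal sketch with a hypothetical helper name; the value is a fraction of the card's total memory, and 0.90 is kept as the ceiling.

```python
def utilization_for(total_vram_gb, reserved_for_others_gb, ceiling=0.90):
    """Fraction of the card to hand to vLLM, capped at `ceiling`."""
    if not 0 <= reserved_for_others_gb < total_vram_gb:
        raise ValueError("reserved memory must be between 0 and total VRAM")
    free_fraction = (total_vram_gb - reserved_for_others_gb) / total_vram_gb
    return round(min(free_fraction, ceiling), 2)


# A 24 GB card where other jobs need about 6 GB:
print(f"--gpu-memory-utilization {utilization_for(24, 6)}")
```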

Can I run multiple models at the same time?

Yes, with a catch about memory. You can copy the vllm service block in your compose file, give the copy a different port and a different model name, and run them side by side. The catch: each model needs its own slice of VRAM. A 12 GB graphics card cannot host two 8 GB models at once.

For serving multiple models at scale, most organizations dedicate one GPU per model and put a reverse proxy (like nginx) in front to route each incoming request to the right model.

How many concurrent users can a single instance handle?

More than people expect. vLLM is good at handling many requests at once because it answers them in parallel batches rather than one at a time. A typical 8-billion-parameter model on a single high-end consumer GPU (like an RTX 4090) comfortably handles 10 to 30 simultaneous chat users. Heavy tasks like long-document analysis or code generation cut that number down.

The real answer depends on how your users actually use the system. Run a small pilot, watch the numbers, and scale from there.
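
A small pilot can be as simple as firing a batch of requests with a fixed number in flight and looking at the latencies. The sketch below is one way to do that, with illustrative names; send stands for any function that makes a single request to your server, for example a call to client.chat.completions.create from the Python example earlier on this page.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def timed(send):
    """Run one request and return how long it took, in seconds."""
    start = time.perf_counter()
    send()
    return time.perf_counter() - start


def run_pilot(send, users=10, requests_per_user=3):
    """Issue users * requests_per_user calls with `users` requests in flight."""
    total = users * requests_per_user
    with ThreadPoolExecutor(max_workers=users) as pool:
        latencies = list(pool.map(lambda _: timed(send), range(total)))
    return {"requests": total, "median_s": median(latencies), "max_s": max(latencies)}


# Usage (hypothetical `ask` function that sends one chat request):
#   print(run_pilot(ask, users=20))
```

Raising users step by step until the median latency becomes uncomfortable gives a first estimate of how many simultaneous users the instance can serve.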

What about logging, audit trails, and compliance?

Standard Docker logging applies. vLLM writes its logs to the container's standard output, which you can view with docker compose logs vllm. To send those logs to your existing log aggregator (Splunk, ELK, syslog server, etc.), configure Docker's logging driver, the same way you would for any other containerized service.

For chat history specifically, the Open WebUI add-on (toggleable in the form above) stores every conversation per-user in a separate volume. That's typically what FOIA requests and internal audits care about, and it's already in a format your existing tools can read.

Does this run on Windows Server?

Yes, with caveats. Windows Server can run this through Docker Desktop or WSL2 with GPU passthrough. It works, but Linux is the smoother path. Most production deployments end up on Ubuntu or RHEL because GPU drivers, Docker, and vLLM all play together more cleanly there.

Apple Silicon Macs (M-series chips) also work fine for evaluation and small workloads. They're less suitable for sustained production traffic because their built-in GPU performance, while impressive, isn't on par with a dedicated NVIDIA card.

What hardware do I actually need to get started?

To try it: any laptop or desktop in your office. The smaller models (Phi-3 Mini, Gemma 2B, Llama 3.2 1B) run on integrated graphics, an Apple Silicon Mac, or even on the CPU alone. CPU-only is slower than ideal, but enough to demo the experience without buying anything new.

For a small-team pilot: a single desktop with any 8 to 12 GB consumer graphics card (used cards in the few-hundred-dollar range work fine) handles the popular Llama 3.1 8B for a small group of users, provided you run it with 4-bit quantization; at full FP16 size the model needs about 16 GB. Most departments already have a machine like this in storage or repurposable from another project.

For department-wide rollout: a workstation or 1U server with a higher-VRAM GPU comfortably serves dozens of users on the larger, smarter models. This is the hardware to budget for after a pilot proves out, not the starting point.

Is there support and a long-term roadmap?

vLLM is an actively maintained open-source project. It started at UC Berkeley and now has contributors from large tech companies, universities, and a wide community of users. Updates ship regularly, including security patches and performance improvements.

If your procurement process requires a paid SLA or a vendor support agreement, several companies offer commercial support contracts on top of the open-source software.

Deploy Local AI
in minutes.

Data stays on your network

Start with the technology you have

OpenAI-compatible API
