Enter your hardware specs and we'll recommend the best US-built open Large Language Models that fit, then generate a production-ready Docker Compose file tuned to your machine.
Once the model is downloaded, nothing leaves the machine. No third-party API calls, no external logging, no data processing agreements to negotiate. Citizen records, internal documents, and draft policies are processed on hardware you control.
The smaller models in this catalog run on a regular laptop, an Apple Silicon Mac, or a desktop with any modern graphics card. No new hardware, no procurement cycle. Bigger machines unlock bigger models, but a datacenter is not the price of entry.
vLLM speaks the same protocol as ChatGPT's API. Existing tools, scripts, and SDK integrations work by changing one config value: the base URL. A drop-in replacement for downstream systems already wired up to commercial AI.
// Tell us what you're working with, we'll find models that fit
// Filtered for your hardware. US-origin, open-weight models only
Based on your specs, these models will run comfortably. Select one to continue.
Everything you need to go from zero to a live LLM endpoint.
Install Docker Desktop (Mac/Windows) or Docker Engine + Compose plugin (Linux). For NVIDIA GPUs, also install the NVIDIA Container Toolkit.
# Install Docker
curl -fsSL https://get.docker.com | sh
# Let your user run docker without sudo (log out and back in to apply)
sudo usermod -aG docker $USER
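# NVIDIA Container Toolkit (GPU only, Debian/Ubuntu)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Register the package repository, signed with the key above
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker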
Generate your config above, download it, and place it in a dedicated project directory.
mkdir -p ~/vllm-deploy && cd ~/vllm-deploy
# Move your downloaded docker-compose.yml here, then verify:
cat docker-compose.yml
Docker pulls the vLLM image and starts your containers. First model download may take several minutes depending on size.
# Foreground (see live logs)
docker compose up
# Or run in background
docker compose up -d
# Follow logs when backgrounded
docker compose logs -f vllm
vLLM exposes an OpenAI-compatible REST API. Works with curl, the OpenAI Python SDK, or any tool that accepts a custom base URL.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "your-model-name",
"messages": [{"role": "user", "content": "Hello!"}]
}'
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
model="your-model-name",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Models download from HuggingFace on first start and are cached in a Docker volume. Subsequent starts skip the download and only need to load the model into memory.
Llama and Gemma require you to accept a license on HuggingFace, then provide your HF_TOKEN in Step 1 above.
docker compose stop (pause)
docker compose down (teardown)
docker compose pull (update images)
Plain-English definitions for the AI vocabulary on this page. No prior AI experience required.
The technology behind ChatGPT, Claude, and similar tools. An LLM (Large Language Model) is a program that's been trained on huge amounts of text and can then read and write text on its own: answer questions, summarize documents, generate code, translate, and so on. The "model" itself is essentially a large file (often 4 to 80 GB) that holds everything the system learned during training. Llama, Phi, Gemma, and GPT-OSS in this catalog are all LLMs you can download and run yourself.
An AI model whose creator has made it freely available to download and run. ChatGPT and Claude are "closed": you can only use them through the company's servers, paying per request. Open-weight models like Llama (from Meta) or Phi (from Microsoft) you download once and run forever on your own hardware, with no monthly fee.
Note: "open-weight" is not exactly "open source." The model files are public, but the training data usually is not. Each model has its own license; it is quick to read and worth a glance before commercial use.
The software that actually runs the model. Think of it like a web server: nginx serves websites, vLLM serves AI models. You point it at a model file, it loads the model into memory, and it accepts chat requests over HTTP and sends back answers. vLLM itself isn't an AI; it's the engine that runs one. It's built for production use, which is why it scales to handle multiple users at once.
A tool that packages software into self-contained units called containers. Each container bundles the application with everything it needs to run (libraries, drivers, config files) so it works identically on any machine, whether a laptop, a server, or a cloud VM.
Why this site uses it: instead of installing vLLM, NVIDIA libraries, Python, and dozens of dependencies directly onto your server (and dealing with version conflicts forever), you run one container. Swap the container to update, delete it to remove. Nothing leaks into your operating system. Docker Compose, the format this site generates, is a way to describe a multi-container setup in a single YAML file.
Memory that lives on the graphics card itself, separate from your computer's main RAM. AI models work best when they fit entirely inside this memory. A rough rule of thumb at 16-bit precision (2 bytes per parameter): a model with 8 billion parameters needs around 16 GB of VRAM for its weights. A 70-billion-parameter model needs around 140 GB (which means multiple GPUs working together). VRAM is the main reason hardware matters for AI: more VRAM lets you run bigger, smarter models.
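The rule of thumb above is parameter count multiplied by bytes per parameter. A minimal sketch of that arithmetic, assuming 16-bit weights and ignoring the extra memory a running server needs for conversation context and overhead; `weights_gb` is a hypothetical helper name, not part of vLLM:

```python
def weights_gb(params_billions: float, bits_per_param: int = 16) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters x bytes each = billions of bytes = GB
    return params_billions * bytes_per_param


for size in (8, 70):
    print(f"{size}B parameters at 16-bit: about {weights_gb(size):.0f} GB")
```

Treat the result as a floor rather than a guarantee: a card with exactly that much VRAM leaves nothing for the context the model holds during a conversation.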
A technique for shrinking AI models so they fit on smaller hardware. Models are made of billions of numbers; quantization rounds those numbers to use fewer digits each, which dramatically cuts the file size. FP16 (16 bits per number) is half the size of full 32-bit precision with almost no quality loss, and is what most people use; the model sizes quoted on this page assume it. AWQ 4-bit cuts the FP16 size to roughly a quarter with a small drop in quality. Practical effect: a model that wouldn't fit on your GPU at full size might fit perfectly after quantization.
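To make "rounding numbers to use fewer digits" concrete, here is a toy sketch. It uses plain uniform rounding with one shared scale, which is an assumption chosen for illustration; it is not the AWQ algorithm, and the function names are hypothetical:

```python
def quantize(values, bits=4):
    """Map floats to small signed integers plus one shared scale factor."""
    levels = 2 ** (bits - 1) - 1              # 7 for 4-bit
    largest = max(abs(v) for v in values)
    scale = largest / levels if largest else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(codes, scale):
    """Turn the stored integers back into approximate floats."""
    return [c * scale for c in codes]


weights = [1.4, -0.6, 0.2, 0.0]
codes, scale = quantize(weights)
restored = dequantize(codes, scale)
print(codes)      # small integers, each fits in 4 bits
print(restored)   # close to the original weights, not identical
```

Each weight is now stored as an integer between -7 and 7, and the restored values differ from the originals by at most half of one rounding step. That small error is the "drop in quality" traded for the smaller file.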
AI models don't read words exactly; they read tokens, which are chunks of text. A token is roughly three quarters of a word, so 100 tokens is about 75 words. Context length is how much text the model can hold in mind during a single conversation. A 4,000-token context fits about 3,000 words, or roughly six pages. Longer context lets the model handle bigger documents but uses more VRAM and slows responses down.
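The conversions above are simple multiplication. A small sketch, using the 0.75 words-per-token rule of thumb from the text and assuming 500 words per page (the figure implied by "3,000 words, or roughly six pages"); real token counts vary by model and by language:

```python
WORDS_PER_TOKEN = 0.75   # rule of thumb from the text
WORDS_PER_PAGE = 500     # assumption implied by the six-page example


def tokens_to_words(tokens: int) -> float:
    return tokens * WORDS_PER_TOKEN


def tokens_to_pages(tokens: int) -> float:
    return tokens_to_words(tokens) / WORDS_PER_PAGE


for context in (100, 4000):
    print(context, "tokens is about", tokens_to_words(context), "words")
print("4000 tokens is about", tokens_to_pages(4000), "pages")
```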
When a model is too big for a single graphics card, you can split it across multiple cards working together. Think of it as RAID for GPUs: instead of striping data across drives, you stripe a model across cards. vLLM handles all the coordination automatically; you just tell it how many GPUs you have. This is how the largest models (70B, 405B) get hosted without buying datacenter-class single GPUs.
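The striping idea can be shown with a toy example in plain Python. It is a sketch of the principle only: the "GPUs" here are ordinary lists, and the function names are hypothetical. Each simulated card holds a block of rows from one weight matrix, computes its part of the output, and the parts are joined at the end:

```python
def matvec(rows, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def split_rows(rows, num_gpus):
    """Give each simulated GPU a contiguous block of rows."""
    chunk = -(-len(rows) // num_gpus)   # ceiling division
    return [rows[i:i + chunk] for i in range(0, len(rows), chunk)]


def sharded_matvec(rows, x, num_gpus):
    """Each shard computes its slice of the output; slices are concatenated."""
    output = []
    for shard in split_rows(rows, num_gpus):
        output.extend(matvec(shard, x))
    return output


weights = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
print(matvec(weights, x))             # one device holding everything
print(sharded_matvec(weights, x, 2))  # same answer from two half-size shards
```

No single shard ever holds the full matrix, yet the combined answer is identical. In vLLM the number of cards is typically passed with the `--tensor-parallel-size` option, and the real coordination happens over the GPU interconnect rather than in a Python loop.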
Some model creators (notably Meta and Google) ask you to accept their license terms on HuggingFace before you can download their models. It's free, just a one-time click-through, similar to accepting an EULA. Once accepted, you generate a token (a long random string) and paste it into the form on this page. vLLM uses that token to prove it's you when downloading the model files.
A clever model design that runs faster than its size suggests. Instead of every part of the model being involved in every word it writes, only a small group of "experts" inside the model is used at a time. Imagine a 100-person company where only 3 specialists work on each task instead of all 100. Result: a 132-billion-parameter model that runs about as fast as a 36-billion one. DBRX and Phi-3.5 MoE in this catalog use this approach.
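A toy sketch of the routing idea: a scorer rates every expert, and only the top few actually run. The experts here are trivial functions and their outputs are simply averaged, both simplifications made for illustration; real models learn the scores and weight the outputs by them:

```python
def pick_experts(scores, k):
    """Return the indices of the k highest-scoring experts, in index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def moe_layer(x, experts, scores, k):
    """Run only the chosen experts and average their outputs."""
    chosen = pick_experts(scores, k)
    return sum(experts[i](x) for i in chosen) / k, chosen


experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
scores = [0.1, 0.9, 0.3, 0.7]   # hypothetical router scores for one token

output, used = moe_layer(3, experts, scores, k=2)
print(used)     # which experts did the work
print(output)   # the other two experts were never called
```

The compute cost follows the experts that run, not the experts that exist, which is why the 132-billion-parameter example above behaves like a 36-billion-parameter model at run time. All experts still have to sit in memory, so the VRAM requirement follows the total size.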
The internet's biggest library of open AI models, hosted by a company of the same name. If Docker Hub is where containers live, HuggingFace is where AI models live. When vLLM starts up, it downloads model files from HuggingFace automatically. Most models download anonymously and freely; a few need an account and a token (see "Gated Model" above).
OpenAI publishes a standard way for software to talk to ChatGPT (a specific URL format and request shape). vLLM speaks the exact same language, so any tool, app, or script that already works with ChatGPT can be pointed at your vLLM server by changing a single setting: the URL. Existing integrations keep working, and you swap the back-end from a paid commercial API to a free local one.
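The "URL format and request shape" is small enough to build by hand. A sketch using only the Python standard library, assuming the server listens on localhost port 8000 as in the earlier steps; `build_chat_request` is a hypothetical helper and `your-model-name` is a placeholder:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a chat completion request in the OpenAI-style shape."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://localhost:8000/v1", "your-model-name", "Hello!")
print(request.full_url)

# To actually send it (requires the vLLM server to be running):
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

Only the first argument ties the request to a particular back-end. Everything else, including the path and the JSON body, stays the same whichever server answers.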
Practical answers to the questions that come up when you're looking at this for the first time.
Bigger than typical software downloads, but not unreasonable. A small model like Llama 3.2 3B takes around 6 GB. A mid-size model like Llama 3.1 8B is about 16 GB. The really large ones run from 140 GB (Llama 70B) up to over 800 GB (Llama 405B).
The download happens once, then the model files sit in a Docker volume on your server. Restarts and reboots don't re-download anything.
Yes. The model downloads from HuggingFace the very first time vLLM starts, but after that everything runs locally. No outside connection is needed for daily use, and no chat data ever leaves your network.
For air-gapped or isolated environments, you can download the model on a connected computer, copy the model files onto a USB drive or storage volume, and mount that into the vLLM container on your isolated machine. The internet step only ever happens once.
Updating works just like any other Docker container. To get a newer vLLM version, run docker compose pull followed by docker compose up -d and it restarts on the latest image.
To switch to a different model, open your compose file, change the model name in the --model line, and restart. The new model downloads automatically on first run; the old one stays cached in case you want to switch back.
vLLM reserves a large share of GPU memory the moment it starts up (90 percent by default), whether or not the model itself needs all of it. If something else is already using the graphics card (a 3D rendering job, another AI service, video transcoding, etc.), vLLM either fails to start or won't have enough memory to load its model.
The clean fix: dedicate the GPU to vLLM. The shared option: lower the --gpu-memory-utilization setting (default is 0.90, meaning 90 percent) to leave room for other tasks.
Yes, with a catch about memory. You can copy the vllm service block in your compose file, give the copy a different port and a different model name, and run them side by side. The catch: each model needs its own slice of VRAM. A 12 GB graphics card cannot host two 8 GB models at once.
For serving multiple models at scale, most organizations dedicate one GPU per model and use a load balancer (like nginx) to route incoming requests to whichever model is best suited.
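The VRAM catch described above can be checked before editing the compose file. A rough sketch, assuming each model needs its own weight size plus about 1 GB of working headroom; that headroom figure is an illustrative guess rather than a vLLM constant, and both function names are hypothetical:

```python
def fits_together(vram_gb: float, model_sizes_gb: list[float], headroom_gb: float = 1.0) -> bool:
    """True if every model, plus headroom for each, fits in the card's VRAM."""
    needed = sum(size + headroom_gb for size in model_sizes_gb)
    return needed <= vram_gb


def memory_share(vram_gb: float, model_gb: float, headroom_gb: float = 1.0) -> float:
    """Fraction of the card one model would need (weights plus headroom)."""
    return (model_gb + headroom_gb) / vram_gb


print(fits_together(12, [8, 8]))   # the 12 GB example from the text
print(fits_together(24, [8, 8]))   # a 24 GB card
print(memory_share(24, 8))         # candidate per-service memory fraction
```

When two services share a card, each one's `--gpu-memory-utilization` value has to be lowered so the fractions add up to less than 1; otherwise the first service to start claims most of the card and the second fails to load.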
More than people expect. vLLM is good at handling many requests at once because it answers them in parallel batches rather than one at a time. A typical 8-billion-parameter model on a single high-end consumer GPU (like an RTX 4090) comfortably handles 10 to 30 simultaneous chat users. Heavy tasks like long-document analysis or code generation cut that number down.
The real answer depends on how your users actually use the system. Run a small pilot, watch the numbers, and scale from there.
Standard Docker logging applies. vLLM writes its logs to the container's standard output, which you can view with docker compose logs vllm. To send those logs to your existing log aggregator (Splunk, ELK, syslog server, etc.), configure Docker's logging driver, the same way you would for any other containerized service.
For chat history specifically, the Open WebUI add-on (toggleable in the form above) stores every conversation per-user in a separate volume. That's typically what FOIA requests and internal audits care about, and it's already in a format your existing tools can read.
Yes, with caveats. Windows Server can run this through Docker Desktop or WSL2 with GPU passthrough. It works, but Linux is the smoother path. Most production deployments end up on Ubuntu or RHEL because GPU drivers, Docker, and vLLM all play together more cleanly there.
Apple Silicon Macs (M-series chips) also work fine for evaluation and small workloads. They're less suitable for sustained production traffic because their built-in GPU performance, while impressive, isn't on par with a dedicated NVIDIA card.
To try it: any laptop or desktop in your office. The smaller models (Phi-3 Mini, Gemma 2B, Llama 3.2 1B) run on integrated graphics, an Apple Silicon Mac, or even on the CPU alone. CPU-only is slower than ideal, but enough to demo the experience without buying anything new.
For a small-team pilot: a single desktop with any 8 to 12 GB consumer graphics card (used cards in the few-hundred-dollar range work fine) handles a 4-bit quantized build of the popular Llama 3.1 8B for a small group of users; the full 16-bit version needs about 16 GB. Most departments already have a machine like this in storage or repurposable from another project.
For department-wide rollout: a workstation or 1U server with a higher-VRAM GPU comfortably serves dozens of users on the larger, smarter models. This is the hardware to budget for after a pilot proves out, not the starting point.
vLLM is an actively maintained open-source project. It started at UC Berkeley and now has contributors from large tech companies, universities, and a wide community of users. Updates ship regularly, including security patches and performance improvements.
If your procurement process requires a paid SLA or a vendor support agreement, several companies offer commercial support contracts on top of the open-source software.