# Run llama-server from llama.cpp within a Docker Container

This guide walks you through setting up llama-server to run a model on your own computer.
## Build the Docker Image

### Install Dependencies

First, install docker-compose:

```shell
$ sudo apt install docker-compose
```

Then add your username to the docker group:

```shell
$ sudo usermod -aG docker $USER
```

For this to take effect, you need to log out and log in again. Alternatively, you can run:

```shell
$ exec su - $USER
```
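After re-logging in, a quick sanity check (not part of the original setup, but useful) confirms that Docker is usable without sudo:

```shell
# Should list your groups, including "docker":
$ id -nG

# Should reach the Docker daemon without a permission error:
$ docker info
```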
### Clone the Repository

```shell
$ git clone https://forgejo.cyberpica.de/pica/docker-llama.git
$ cd docker-llama
```
### Set Up Model Path and Build Options

#### Model Path

Create a directory `~/llama-models` to store your GGUF models. This directory will be mounted in the Docker container under `/models`:

```shell
$ mkdir ~/llama-models
```

Set your model name in the `.env` file:

```shell
MODEL_PATH=/models/<your-model>.gguf
```
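To get a model into that directory, one option is the `huggingface-cli` tool (the repository and file names below are placeholders; substitute those of the model you chose):

```shell
$ pip install -U "huggingface_hub[cli]"
$ huggingface-cli download <org>/<repo> <your-model>.gguf --local-dir ~/llama-models
```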
#### Build Options

To build llama.cpp with Vulkan or CUDA support, set the variables `USE_CUDA` or `USE_VULKAN` in the `.env` file. The defaults are:

```shell
USE_CUDA=false
USE_VULKAN=false
```
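Putting the pieces together, a complete `.env` might look like this (the values are illustrative; adjust them to your hardware and model):

```shell
# .env -- example values
MODEL_PATH=/models/<your-model>.gguf
USE_CUDA=false
USE_VULKAN=true
PORT=8080
```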
### Build the Image

```shell
$ docker-compose build
```
## Run the Docker Container

```shell
$ docker-compose up -d
```
## Access the Web Interface

Unless you set a different port via the `PORT` variable in `.env`, the web interface is available at http://127.0.0.1:8080.
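You can also exercise the server directly from the command line; llama-server exposes a health endpoint and an OpenAI-compatible API (the port below assumes the default):

```shell
# Liveness check:
$ curl -s http://127.0.0.1:8080/health

# OpenAI-compatible chat completion:
$ curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```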
## Stop the Docker Container

From within the `docker-llama` directory, run:

```shell
$ docker-compose down
```
## Change the Model

Adjust the `MODEL_PATH` variable in the `.env` file, and then run:

```shell
$ docker-compose down && docker-compose up -d
```
## Integration into VSCode

To use your local model in VSCode, install the Continue extension. Here's an example configuration for `Qwen3.5-9B-UD-Q4_K_XL.gguf`:
```yaml
name: local-qwen
version: 0.0.1
schema: v1
models:
  - name: Qwen3.5-9B-UD-Q4_K_XL.gguf
    model: Qwen3.5-9B-UD-Q4_K_XL.gguf
    provider: llama.cpp
    apiBase: http://127.0.0.1:8080
    contextLength: 8192
    roles:
      - chat
      - autocomplete
```
## Model Recommendations
| RAM | GPU | Model | Size on Disk | Notes |
|---|---|---|---|---|
| 8GB | No | Phi-3 Mini (Q4_K_M) | ~2.4GB | Good for coding/chat; punches above its weight |
| 8GB | No | Mistral 7B (Q2_K) | ~2.8GB | Reduced quality but fits comfortably |
| 8GB | No | Gemma 2B (Q8) | ~2.7GB | Good general purpose; very fast |
| 8GB | Yes (4GB VRAM) | Phi-3 Mini (Q4_K_M) | ~2.4GB | Fully GPU accelerated; very fast |
| 8GB | Yes (4GB VRAM) | Mistral 7B (Q4_K_M) | ~4.1GB | Fits in VRAM; excellent quality/size ratio |
| 16GB | No | Mistral 7B (Q4_K_M) | ~4.1GB | Comfortable fit; good performance |
| 16GB | No | LLaMA 3 8B (Q4_K_M) | ~4.6GB | Strong general purpose; Meta's latest small model |
| 16GB | No | Mistral 7B (Q8) | ~7.2GB | Near full quality; still fits |
| 16GB | Yes (8GB VRAM) | LLaMA 3 8B (Q4_K_M) | ~4.6GB | Fully in VRAM; fast inference |
| 16GB | Yes (8GB VRAM) | Mistral 7B (Q8) | ~7.2GB | Near full quality; fully GPU accelerated |
| 32GB | No | LLaMA 3 8B (Q8) | ~8.5GB | Full quality; very comfortable |
| 32GB | No | Mistral 22B (Q4_K_M) | ~13GB | Big step up in reasoning quality |
| 32GB | No | Qwen2 14B (Q4_K_M) | ~8.4GB | Excellent coding + multilingual |
| 32GB | No | LLaMA 3 70B (Q2_K) | ~26GB | Fits but degraded quality |
| 32GB | Yes (16GB VRAM) | Mistral 22B (Q4_K_M) | ~13GB | Fully in VRAM; excellent quality |
| 32GB | Yes (16GB VRAM) | Qwen2 14B (Q8) | ~15GB | Near full quality; fast |
| 64GB | No | LLaMA 3 70B (Q4_K_M) | ~40GB | Best open-source quality at this size |
| 64GB | No | Mixtral 8x7B (Q4_K_M) | ~26GB | MoE architecture; very capable |
| 64GB | No | Qwen2 72B (Q4_K_M) | ~43GB | Excellent multilingual + coding |
| 64GB | Yes (24GB VRAM) | LLaMA 3 70B (Q4_K_M) | ~40GB | Split across VRAM+RAM; fast |
| 64GB | Yes (24GB VRAM) | Mixtral 8x7B (Q4_K_M) | ~26GB | Fully in VRAM+RAM; very fast |
### Key Points
- Q4_K_M is ideal for most use cases: good quality with reasonable size.
- Q8 is near-lossless but roughly twice the size.
- Q2_K is a last resort: noticeable quality degradation.
- Models fully fitting in VRAM are significantly faster than CPU or split inference.
- All of the models above are available in GGUF format on Hugging Face.
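As a rough way to predict whether a quantization will fit, disk size is approximately parameters × bits-per-weight / 8. The bits-per-weight figures below are ballpark assumptions (Q2_K ≈ 3.0, Q4_K_M ≈ 4.8, Q8_0 ≈ 8.5), not exact values:

```shell
# Estimate GGUF file size in GB from parameter count and quantization density.
estimate_gb() {
  # usage: estimate_gb <billions-of-params> <bits-per-weight>
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 7 4.8    # Mistral 7B at Q4_K_M -> prints 4.2 (table above: ~4.1GB)
estimate_gb 70 4.8   # LLaMA 3 70B at Q4_K_M -> prints 42.0 (table above: ~40GB)
```

The estimates land within a few percent of the table; actual files vary slightly because different tensor groups use different quantization levels within one model.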