Run llama-server from llama-cpp within a Docker container

This guide will help you set up llama-server to run a model on your own computer.

Build the Docker Image

Install Dependencies

First, install docker-compose:

$ sudo apt install docker-compose

Then add your username to the docker group:

$ sudo usermod -aG docker $USER

In order for this to take effect, you need to log out and log in again. Alternatively, you can run:

$ exec su - $USER
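To confirm that the group change took effect, you can run Docker's hello-world test image without sudo (it pulls a tiny image on first run; the fallback message below is just for convenience if the group change is not active yet):

```shell
# Smoke test: runs Docker's hello-world image without sudo.
# Prints a fallback message if docker is not usable yet.
docker run --rm hello-world || echo "docker not usable yet - log out and back in"
```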

Clone the Repository

$ git clone https://forgejo.cyberpica.de/pica/docker-llama.git
$ cd docker-llama

Set Up Model Path and Build Options

Model Path

Create a directory ~/llama-models to store your GGUF models. This directory will be mounted in the Docker container under /models.

$ mkdir ~/llama-models

Set your model name in the .env file:

MODEL_PATH=/models/<your-model>.gguf
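For example, if you downloaded Qwen3.5-9B-UD-Q4_K_XL.gguf (the model used in the VSCode section below) into ~/llama-models, the entry would be:

```
MODEL_PATH=/models/Qwen3.5-9B-UD-Q4_K_XL.gguf
```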

Build Options

To build llama-cpp with Vulkan or CUDA support, set the variables USE_CUDA or USE_VULKAN in the .env file. The defaults are:

USE_CUDA=false
USE_VULKAN=false
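For example, to enable GPU offloading via Vulkan (pick whichever backend matches your hardware; only one should be enabled):

```
USE_CUDA=false
USE_VULKAN=true
```

After changing build options, rebuild the image as described in the next step.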

Build the Image

$ docker-compose build

Run the Docker Container

$ docker-compose up -d

Access the Web Interface

Unless you set a different port via the PORT variable in .env, the web interface is available at http://127.0.0.1:8080.
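To verify that the server is up without opening a browser, you can query llama-server's /health endpoint (adjust the port if you changed PORT):

```shell
# Returns a small JSON status object once the model has finished loading.
# If nothing is listening yet, the fallback message is printed instead.
curl -s http://127.0.0.1:8080/health || echo "llama-server is not reachable"
```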

Stop the Docker Container

From within the docker-llama directory, run:

$ docker-compose down

Change the Model

Adjust the MODEL_PATH variable in the .env file, and then run:

$ docker-compose down && docker-compose up -d

Integration into VSCode

To use your local model in VSCode, install the Continue plugin. Here's an example configuration for Qwen3.5-9B-UD-Q4_K_XL.gguf:

name: local-qwen
version: 0.0.1
schema: v1

models:
  - name: Qwen3.5-9B-UD-Q4_K_XL.gguf
    model: Qwen3.5-9B-UD-Q4_K_XL.gguf
    provider: llama.cpp
    apiBase: http://127.0.0.1:8080
    contextLength: 8192
    roles:
      - chat
      - autocomplete
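Before wiring up the plugin, you can check that the server answers OpenAI-style requests; llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, which is what the apiBase above points at (adjust the port if you changed PORT):

```shell
# Send a minimal chat request; prints the JSON response,
# or a fallback message if the server is not running.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello"}]}' \
  || echo "llama-server is not reachable"
```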

Model Recommendations

RAM  | GPU             | Model                 | Size on Disk | Notes
-----|-----------------|-----------------------|--------------|------
8GB  | No              | Phi-3 Mini (Q4_K_M)   | ~2.4GB       | Good for coding/chat; punches above its weight
8GB  | No              | Mistral 7B (Q2_K)     | ~2.8GB       | Reduced quality but fits comfortably
8GB  | No              | Gemma 2B (Q8)         | ~2.7GB       | Good general purpose; very fast
8GB  | Yes (4GB VRAM)  | Phi-3 Mini (Q4_K_M)   | ~2.4GB       | Fully GPU accelerated; very fast
8GB  | Yes (4GB VRAM)  | Mistral 7B (Q4_K_M)   | ~4.1GB       | Fits in VRAM; excellent quality/size ratio
16GB | No              | Mistral 7B (Q4_K_M)   | ~4.1GB       | Comfortable fit; good performance
16GB | No              | LLaMA 3 8B (Q4_K_M)   | ~4.6GB       | Strong general purpose; Meta's latest small model
16GB | No              | Mistral 7B (Q8)       | ~7.2GB       | Near full quality; still fits
16GB | Yes (8GB VRAM)  | LLaMA 3 8B (Q4_K_M)   | ~4.6GB       | Fully in VRAM; fast inference
16GB | Yes (8GB VRAM)  | Mistral 7B (Q8)       | ~7.2GB       | Near full quality; fully GPU accelerated
32GB | No              | LLaMA 3 8B (Q8)       | ~8.5GB       | Full quality; very comfortable
32GB | No              | Mistral 22B (Q4_K_M)  | ~13GB        | Big step up in reasoning quality
32GB | No              | Qwen2 14B (Q4_K_M)    | ~8.4GB       | Excellent coding + multilingual
32GB | No              | LLaMA 3 70B (Q2_K)    | ~26GB        | Fits but degraded quality
32GB | Yes (16GB VRAM) | Mistral 22B (Q4_K_M)  | ~13GB        | Fully in VRAM; excellent quality
32GB | Yes (16GB VRAM) | Qwen2 14B (Q8)        | ~15GB        | Near full quality; fast
64GB | No              | LLaMA 3 70B (Q4_K_M)  | ~40GB        | Best open-source quality at this size
64GB | No              | Mixtral 8x7B (Q4_K_M) | ~26GB        | MoE architecture; very capable
64GB | No              | Qwen2 72B (Q4_K_M)    | ~43GB        | Excellent multilingual + coding
64GB | Yes (24GB VRAM) | LLaMA 3 70B (Q4_K_M)  | ~40GB        | Split across VRAM+RAM; fast
64GB | Yes (24GB VRAM) | Mixtral 8x7B (Q4_K_M) | ~26GB        | Fully in VRAM+RAM; very fast

Key Points:

  • Q4_K_M is ideal for most use cases: good quality with reasonable size.
  • Q8 is near-lossless but roughly twice the size.
  • Q2_K is a last resort: noticeable quality degradation.
  • Models fully fitting in VRAM are significantly faster than CPU or split inference.
  • All models above are available in GGUF format on HuggingFace.