Run llama-server from llama-cpp within a Docker container

This guide will help you set up llama-server to run a model on your own computer.

Build the Docker Image

Install Dependencies

First, install docker-compose:

$ sudo apt install docker-compose

Then add your username to the docker group:

$ sudo usermod -aG docker $USER

In order for this to take effect, you need to log out and log in again. Alternatively, you can run:

$ exec su - $USER
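To confirm that the group change took effect, you can run Docker's hello-world test image without sudo (it pulls a tiny image on first run; the fallback message below is just for convenience if the group change is not active yet):

```shell
# Smoke test: runs Docker's hello-world image without sudo.
# Prints a fallback message if docker is not usable yet.
docker run --rm hello-world || echo "docker not usable yet - log out and back in"
```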

Clone the Repository

$ git clone https://forgejo.cyberpica.de/pica/docker-llama.git
$ cd docker-llama

Set Up Model Path and Build Options

Model Path

Create a directory ~/llama-models to store your GGUF models. This directory will be mounted in the Docker container under /models.

$ mkdir ~/llama-models

Set your model name in the .env file:

MODEL_PATH=/models/<your-model>.gguf
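For example, if you downloaded Qwen3.5-9B-UD-Q4_K_XL.gguf (the model used in the VSCode section below) into ~/llama-models, the entry would be:

```
MODEL_PATH=/models/Qwen3.5-9B-UD-Q4_K_XL.gguf
```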

Build Options

To build llama-cpp with Vulkan or CUDA support, set the variables USE_CUDA or USE_VULKAN in the .env file. The defaults are:

USE_CUDA=false
USE_VULKAN=false
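For example, to enable GPU offloading via Vulkan (pick whichever backend matches your hardware; only one should be enabled):

```
USE_CUDA=false
USE_VULKAN=true
```

After changing build options, rebuild the image as described in the next step.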

Build the Image

$ docker-compose build

Run the Docker Container

$ docker-compose up -d

Access the Web Interface

Unless you set a different port via the PORT variable in .env, the web interface is available at http://127.0.0.1:8080.
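To verify that the server is up without opening a browser, you can query llama-server's /health endpoint (adjust the port if you changed PORT):

```shell
# Returns a small JSON status object once the model has finished loading.
# If nothing is listening yet, the fallback message is printed instead.
curl -s http://127.0.0.1:8080/health || echo "llama-server is not reachable"
```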

Stop the Docker Container

From within the docker-llama directory, run:

$ docker-compose down

Change the Model

Adjust the MODEL_PATH variable in the .env file, and then run:

$ docker-compose down && docker-compose up -d

Integration into VSCode

To use your local model in VSCode, install the Continue plugin. Here's an example configuration for Qwen3.5-9B-UD-Q4_K_XL.gguf:

name: local-qwen
version: 0.0.1
schema: v1

models:
  - name: Qwen3.5-9B-UD-Q4_K_XL.gguf
    model: Qwen3.5-9B-UD-Q4_K_XL.gguf
    provider: llama.cpp
    apiBase: http://127.0.0.1:8080
    contextLength: 8192
    roles:
      - chat
      - autocomplete
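Before wiring up the plugin, you can check that the server answers OpenAI-style requests; llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, which is what the apiBase above points at (adjust the port if you changed PORT):

```shell
# Send a minimal chat request; prints the JSON response,
# or a fallback message if the server is not running.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello"}]}' \
  || echo "llama-server is not reachable"
```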

Model Recommendations

RAM  | GPU             | Model                 | Size on Disk | Notes
-----|-----------------|-----------------------|--------------|------
8GB  | No              | Phi-3 Mini (Q4_K_M)   | ~2.4GB       | Good for coding/chat; punches above its weight
8GB  | No              | Mistral 7B (Q2_K)     | ~2.8GB       | Reduced quality but fits comfortably
8GB  | No              | Gemma 2B (Q8)         | ~2.7GB       | Good general purpose; very fast
8GB  | Yes (4GB VRAM)  | Phi-3 Mini (Q4_K_M)   | ~2.4GB       | Fully GPU accelerated; very fast
8GB  | Yes (4GB VRAM)  | Mistral 7B (Q4_K_M)   | ~4.1GB       | Fits in VRAM; excellent quality/size ratio
16GB | No              | Mistral 7B (Q4_K_M)   | ~4.1GB       | Comfortable fit; good performance
16GB | No              | LLaMA 3 8B (Q4_K_M)   | ~4.6GB       | Strong general purpose; Meta's latest small model
16GB | No              | Mistral 7B (Q8)       | ~7.2GB       | Near full quality; still fits
16GB | Yes (8GB VRAM)  | LLaMA 3 8B (Q4_K_M)   | ~4.6GB       | Fully in VRAM; fast inference
16GB | Yes (8GB VRAM)  | Mistral 7B (Q8)       | ~7.2GB       | Near full quality; fully GPU accelerated
32GB | No              | LLaMA 3 8B (Q8)       | ~8.5GB       | Full quality; very comfortable
32GB | No              | Mistral 22B (Q4_K_M)  | ~13GB        | Big step up in reasoning quality
32GB | No              | Qwen2 14B (Q4_K_M)    | ~8.4GB       | Excellent coding + multilingual
32GB | No              | LLaMA 3 70B (Q2_K)    | ~26GB        | Fits but degraded quality
32GB | Yes (16GB VRAM) | Mistral 22B (Q4_K_M)  | ~13GB        | Fully in VRAM; excellent quality
32GB | Yes (16GB VRAM) | Qwen2 14B (Q8)        | ~15GB        | Near full quality; fast
64GB | No              | LLaMA 3 70B (Q4_K_M)  | ~40GB        | Best open-source quality at this size
64GB | No              | Mixtral 8x7B (Q4_K_M) | ~26GB        | MoE architecture; very capable
64GB | No              | Qwen2 72B (Q4_K_M)    | ~43GB        | Excellent multilingual + coding
64GB | Yes (24GB VRAM) | LLaMA 3 70B (Q4_K_M)  | ~40GB        | Split across VRAM+RAM; fast
64GB | Yes (24GB VRAM) | Mixtral 8x7B (Q4_K_M) | ~26GB        | Fully in VRAM+RAM; very fast

Key Points:

  • Q4_K_M is ideal for most use cases: good quality with reasonable size.
  • Q8 is near-lossless but roughly twice the size.
  • Q2_K is a last resort: noticeable quality degradation.
  • Models fully fitting in VRAM are significantly faster than CPU or split inference.
  • All models above are available in GGUF format on HuggingFace.