ML/AI · Jan 25, 2026 · 15 min read

Building a Full AI Stack on Proxmox

A complete reference for deploying production-grade LLMs, vector databases, RAG pipelines, autonomous agents, and full observability on self-hosted Proxmox infrastructure -- from GPU passthrough to cost analysis.

Why Self-Host AI Infrastructure

Cloud-hosted AI APIs are convenient, but they introduce three structural risks that compound over time: unpredictable costs at scale, limited control over model behaviour and versioning, and data-sovereignty concerns that are increasingly difficult to reconcile with European regulation. Running your own AI stack on Proxmox eliminates all three while giving you bare-metal performance and full operational visibility.

Core Benefits of Self-Hosted AI

  • Cost control -- Fixed hardware CAPEX replaces variable per-token billing. Break-even typically occurs within 4-6 months of moderate usage.
  • Data privacy -- All prompts, embeddings, and training data remain on-premises. No third-party data processing agreements required.
  • Operational sovereignty -- You control model versions, quantisation, context windows, and rate limits. No upstream deprecations or policy changes.
  • Latency -- Inference requests never leave the local network, eliminating internet round-trips and provider-side queueing; tail latency is bounded by local GPU load rather than an upstream rate limiter.

The stack we will build spans eight layers: LLM inference, vector storage, RAG orchestration, agent frameworks, experiment tracking, training pipelines, observability, and production deployment patterns. Every component runs inside Proxmox VMs or LXC containers, orchestrated with Docker Compose and monitored through Prometheus and Grafana.

The reference hardware for this guide is a single Proxmox node with 128 GB RAM, two NVIDIA A4000 GPUs (16 GB VRAM each), and 2 TB NVMe storage. The architecture scales horizontally by adding nodes to a Proxmox cluster, but the single-node configuration is sufficient for teams of up to ten engineers running concurrent inference and training workloads.

LLM Deployment with Ollama and vLLM

Local LLM inference is the foundation of the stack. We run two complementary engines: Ollama for rapid prototyping and interactive use, and vLLM for high-throughput production serving with continuous batching and PagedAttention.

GPU Passthrough Configuration

Proxmox requires IOMMU groups to be enabled at the kernel level before GPUs can be passed through to VMs. Edit the GRUB configuration, then update the VFIO module bindings.

# /etc/default/grub -- enable IOMMU (use amd_iommu=on on AMD hosts)
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

# Update GRUB and reboot
update-grub
reboot

# Verify IOMMU groups
dmesg | grep -e DMAR -e IOMMU

# Identify GPU PCI IDs
lspci -nn | grep NVIDIA
# Example output:
# 01:00.0 VGA compatible controller [0300]: NVIDIA ... [10de:2560]
# 01:00.1 Audio device [0403]: NVIDIA ... [10de:228e]

# /etc/modules -- load VFIO modules
# (on kernel 6.2+ / Proxmox 8, vfio_virqfd is built into vfio -- omit it)
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

# /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:2560,10de:228e disable_vga=1

After rebooting, create a VM with the GPU assigned. In the Proxmox web UI, navigate to Hardware, Add PCI Device, and select the GPU. Alternatively, use the CLI:

# Create a VM for LLM inference. The q35 machine type is required for
# pcie=1 passthrough; OVMF is recommended for modern GPUs.
qm create 200 \
  --name llm-inference \
  --machine q35 \
  --bios ovmf \
  --efidisk0 local-lvm:1 \
  --memory 32768 \
  --cores 8 \
  --scsihw virtio-scsi-single \
  --net0 virtio,bridge=vmbr0 \
  --ide2 local:iso/ubuntu-22.04-server.iso,media=cdrom \
  --scsi0 local-lvm:64,ssd=1

# Attach GPU via PCI passthrough
qm set 200 -hostpci0 01:00,pcie=1,x-vga=0

Ollama Setup

# Install Ollama inside the VM
curl -fsSL https://ollama.com/install.sh | sh

# Pull models
ollama pull llama3.1:70b-instruct-q4_K_M
ollama pull mistral:7b-instruct-v0.3-q5_K_M
ollama pull codellama:34b-instruct-q4_K_M

# Verify GPU utilisation
nvidia-smi

# Test inference
ollama run llama3.1:70b-instruct-q4_K_M "Explain IOMMU in two sentences."

vLLM Production Server

# docker-compose.vllm.yml
version: "3.9"
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    command: >
      --model meta-llama/Meta-Llama-3.1-8B-Instruct
      --tensor-parallel-size 1
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --enforce-eager
      --dtype float16
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Performance Benchmarks (NVIDIA A4000, 16 GB VRAM per GPU)

  • Llama 3.1 8B Q4_K_M -- 42 tok/s generation, 1.8s TTFT
  • Mistral 7B Q5_K_M -- 48 tok/s generation, 1.5s TTFT
  • CodeLlama 34B Q4_K_M -- 18 tok/s generation, 3.2s TTFT
  • Llama 3.1 70B Q4_K_M (2x A4000) -- 12 tok/s generation, 5.1s TTFT
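These figures can be reproduced against Ollama's /api/generate endpoint, whose non-streaming JSON response includes eval_count, eval_duration, and prompt_eval_duration (durations in nanoseconds). A small helper to turn one response into the tok/s and TTFT numbers above -- the sample values below are illustrative, not measured:

```python
def throughput_stats(resp: dict) -> dict:
    """Compute generation speed and time-to-first-token from one
    non-streaming Ollama /api/generate response."""
    gen_seconds = resp["eval_duration"] / 1e9
    tokens_per_s = resp["eval_count"] / gen_seconds
    # Prompt processing time approximates time-to-first-token.
    ttft_s = resp["prompt_eval_duration"] / 1e9
    return {
        "tokens_per_s": round(tokens_per_s, 1),
        "ttft_s": round(ttft_s, 2),
    }

# Illustrative response fragment
sample = {
    "eval_count": 210,
    "eval_duration": 5_000_000_000,         # 5.0 s generating
    "prompt_eval_duration": 1_500_000_000,  # 1.5 s processing the prompt
}
print(throughput_stats(sample))  # {'tokens_per_s': 42.0, 'ttft_s': 1.5}
```

Running the same prompt set through each model gives directly comparable numbers across quantisations.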

Vector Database for RAG Pipelines

Retrieval-augmented generation requires a high-performance vector store. We deploy Qdrant as the primary engine for its HNSW index performance and gRPC API, with ChromaDB available as a lightweight alternative for development.

# docker-compose.vectordb.yml
version: "3.9"
services:
  qdrant:
    image: qdrant/qdrant:v1.12.1
    ports:
      - "6333:6333"   # REST API
      - "6334:6334"   # gRPC
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      - QDRANT__SERVICE__GRPC_PORT=6334
      - QDRANT__STORAGE__STORAGE_PATH=/qdrant/storage
      - QDRANT__STORAGE__OPTIMIZERS__MEMMAP_THRESHOLD_KB=20480

  chromadb:
    image: chromadb/chroma:0.5.23
    ports:
      - "8100:8000"
    volumes:
      - chroma_data:/chroma/chroma
    environment:
      - ANONYMIZED_TELEMETRY=false
      - IS_PERSISTENT=true

volumes:
  qdrant_data:
  chroma_data:

Embedding Generation

# embedding_service.py
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    VectorParams,
    PointStruct,
    OptimizersConfigDiff,
)
import hashlib
import uuid

class EmbeddingService:
    """Manages embedding generation and vector storage."""

    def __init__(
        self,
        model_name: str = "BAAI/bge-large-en-v1.5",
        qdrant_url: str = "http://localhost:6333",
    ):
        self.model = SentenceTransformer(model_name)
        self.client = QdrantClient(url=qdrant_url)
        self.dimension = self.model.get_sentence_embedding_dimension()

    def create_collection(self, name: str) -> None:
        """Create a vector collection with HNSW index."""
        self.client.create_collection(
            collection_name=name,
            vectors_config=VectorParams(
                size=self.dimension,
                distance=Distance.COSINE,
            ),
            optimizers_config=OptimizersConfigDiff(
                indexing_threshold=20000,
                memmap_threshold=50000,
            ),
        )

    def embed_and_store(
        self,
        collection: str,
        documents: list[dict],
    ) -> int:
        """Embed documents and upsert into Qdrant."""
        texts = [doc["text"] for doc in documents]
        embeddings = self.model.encode(
            texts,
            batch_size=64,
            show_progress_bar=True,
            normalize_embeddings=True,
        )

        points = [
            PointStruct(
                id=str(uuid.uuid5(
                    uuid.NAMESPACE_DNS,
                    hashlib.sha256(doc["text"].encode()).hexdigest(),
                )),
                vector=emb.tolist(),
                payload={
                    "text": doc["text"],
                    "source": doc.get("source", ""),
                    "chunk_index": doc.get("chunk_index", 0),
                    "metadata": doc.get("metadata", {}),
                },
            )
            for doc, emb in zip(documents, embeddings)
        ]

        self.client.upsert(
            collection_name=collection,
            points=points,
        )
        return len(points)

    def search(
        self,
        collection: str,
        query: str,
        top_k: int = 5,
    ) -> list[dict]:
        """Semantic search with score threshold."""
        query_vector = self.model.encode(
            query, normalize_embeddings=True
        ).tolist()

        # Note: newer qdrant-client releases deprecate search() in
        # favour of query_points(); search() still works here.
        results = self.client.search(
            collection_name=collection,
            query_vector=query_vector,
            limit=top_k,
            score_threshold=0.72,
        )
        return [
            {
                "text": hit.payload["text"],
                "source": hit.payload.get("source", ""),
                "score": round(hit.score, 4),
            }
            for hit in results
        ]

Collection Design Patterns

  • Per-domain collections -- Separate collections for documentation, code, and conversation history to enable domain-specific retrieval tuning.
  • Payload indexing -- Create payload indexes on frequently filtered fields (source, timestamp) to accelerate filtered search.
  • Snapshot strategy -- Schedule nightly Qdrant snapshots to MinIO for disaster recovery. Retention: 7 daily, 4 weekly.
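The retention policy in the last bullet can be expressed as a small pruning helper. This is a sketch of the stated 7-daily / 4-weekly policy; the function name and date handling are illustrative, and a real job would map the kept dates back to Qdrant snapshot names before deleting the rest:

```python
from datetime import date, timedelta


def snapshots_to_keep(snapshot_dates: list[date]) -> set[date]:
    """Keep the last 7 daily snapshots plus the newest snapshot from
    each of the last 4 ISO weeks; everything else may be deleted."""
    dated = sorted(set(snapshot_dates), reverse=True)
    keep: set[date] = set(dated[:7])  # 7 most recent dailies
    weeks_seen: set[tuple[int, int]] = set()
    for d in dated:
        wk = (d.isocalendar()[0], d.isocalendar()[1])  # (ISO year, week)
        if wk not in weeks_seen:
            weeks_seen.add(wk)
            keep.add(d)  # newest snapshot in this week
        if len(weeks_seen) == 4:
            break
    return keep


# 30 nightly snapshots ending 2026-01-25
dates = [date(2026, 1, 25) - timedelta(days=i) for i in range(30)]
kept = snapshots_to_keep(dates)
print(len(kept))  # 10: 7 dailies plus 3 additional weekly keeps
```

Run nightly after the snapshot upload; anything not in the returned set is pruned from MinIO.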

RAG Pipeline Architecture

The retrieval-augmented generation pipeline connects document ingestion, chunking, embedding, retrieval, and synthesis into a single chain. We use LangChain as the orchestration layer with custom components for each stage.

Document Ingestion and Chunking

# ingestion.py
from langchain_community.document_loaders import (
    DirectoryLoader,
    PyPDFLoader,
    TextLoader,
    UnstructuredMarkdownLoader,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pathlib import Path


def load_documents(source_dir: str) -> list:
    """Load documents from a directory with format detection."""
    loaders = {
        ".pdf": (PyPDFLoader, {}),
        ".txt": (TextLoader, {"encoding": "utf-8"}),
        ".md": (UnstructuredMarkdownLoader, {}),
    }

    documents = []
    for ext, (loader_cls, kwargs) in loaders.items():
        loader = DirectoryLoader(
            source_dir,
            glob=f"**/*{ext}",
            loader_cls=loader_cls,
            loader_kwargs=kwargs,
            show_progress=True,
        )
        documents.extend(loader.load())

    return documents


def chunk_documents(
    documents: list,
    chunk_size: int = 1024,
    chunk_overlap: int = 128,
) -> list:
    """Split documents using recursive character splitting."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )

    chunks = splitter.split_documents(documents)

    # Enrich metadata
    for i, chunk in enumerate(chunks):
        chunk.metadata["chunk_index"] = i
        chunk.metadata["char_count"] = len(chunk.page_content)
        source = Path(chunk.metadata.get("source", ""))
        chunk.metadata["filename"] = source.name

    return chunks

Retrieval Chain

# rag_chain.py
from langchain_community.llms import VLLMOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


def build_rag_chain(retriever, model_url: str = "http://localhost:8000/v1"):
    """Build a RAG chain with source attribution."""

    llm = VLLMOpenAI(
        openai_api_base=model_url,
        openai_api_key="not-needed",
        model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
        temperature=0.1,
        max_tokens=2048,
    )

    template = ChatPromptTemplate.from_messages([
        (
            "system",
            "You are a technical assistant. Answer based strictly on "
            "the provided context. If the context does not contain "
            "enough information, state that explicitly. Cite sources "
            "using [Source: filename] notation.\n\n"
            "Context:\n{context}",
        ),
        ("human", "{question}"),
    ])

    def format_docs(docs):
        return "\n\n---\n\n".join(
            f"[Source: {d.metadata.get('filename', 'unknown')}]\n"
            f"{d.page_content}"
            for d in docs
        )

    chain = (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | template
        | llm
        | StrOutputParser()
    )

    return chain

RAG Quality Metrics

  • Retrieval precision@5 -- Fraction of top-5 retrieved chunks that are relevant to the query. Target: above 0.75.
  • Answer faithfulness -- Measures whether the generated answer is supported by the retrieved context. Evaluated with LLM-as-judge using a separate grading model.
  • Latency budget -- End-to-end RAG query should complete in under 3 seconds: retrieval 200ms, LLM generation 2500ms, overhead 300ms.
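Precision@5 is straightforward to compute once test queries have labelled relevance judgments. A minimal sketch (chunk identifiers are illustrative):

```python
def precision_at_k(
    retrieved_ids: list[str],
    relevant_ids: set[str],
    k: int = 5,
) -> float:
    """Fraction of the top-k retrieved chunks judged relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for cid in top_k if cid in relevant_ids)
    return hits / len(top_k)


# Example: 4 of the top 5 retrieved chunks are labelled relevant
retrieved = ["c1", "c7", "c3", "c9", "c2", "c4"]
relevant = {"c1", "c2", "c3", "c7"}
print(precision_at_k(retrieved, relevant))  # 0.8
```

Averaging this over a held-out query set gives the metric tracked against the 0.75 target.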

Agent Framework with LangGraph

Autonomous agents extend RAG pipelines with tool use, planning, and multi-step reasoning. We use LangGraph for stateful agent orchestration with explicit control flow, replacing the implicit ReAct loops of earlier frameworks.

# agent_graph.py
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_core.messages import HumanMessage, AIMessage
from typing import TypedDict, Annotated, Sequence
import operator


class AgentState(TypedDict):
    """State shared across all nodes in the agent graph."""
    messages: Annotated[Sequence[HumanMessage | AIMessage], operator.add]
    tool_calls_count: int
    max_iterations: int


def create_infrastructure_agent(llm, tools: list) -> StateGraph:
    """Create an infrastructure management agent."""

    tool_node = ToolNode(tools)

    def should_continue(state: AgentState) -> str:
        """Decide whether to use tools or finish."""
        last_message = state["messages"][-1]
        if state["tool_calls_count"] >= state["max_iterations"]:
            return "end"
        if hasattr(last_message, "tool_calls") and last_message.tool_calls:
            return "tools"
        return "end"

    def call_model(state: AgentState) -> dict:
        """Invoke the LLM with current state."""
        response = llm.invoke(state["messages"])
        return {"messages": [response]}

    def increment_counter(state: AgentState) -> dict:
        """Track tool invocation count for safety."""
        return {"tool_calls_count": state["tool_calls_count"] + 1}

    # Build the graph
    workflow = StateGraph(AgentState)

    workflow.add_node("agent", call_model)
    workflow.add_node("tools", tool_node)
    workflow.add_node("counter", increment_counter)

    workflow.set_entry_point("agent")

    workflow.add_conditional_edges(
        "agent",
        should_continue,
        {"tools": "counter", "end": END},
    )
    workflow.add_edge("counter", "tools")
    workflow.add_edge("tools", "agent")

    return workflow.compile()

Tool Registration

# tools.py
from langchain_core.tools import tool
import subprocess
import requests


@tool
def check_proxmox_node_status(node: str) -> str:
    """Check resource utilisation of a Proxmox node."""
    result = subprocess.run(
        ["pvesh", "get", f"/nodes/{node}/status", "--output-format=json"],
        capture_output=True,
        text=True,
        timeout=10,
    )
    return result.stdout


@tool
def query_prometheus(query: str, endpoint: str = "http://prometheus:9090") -> str:
    """Execute a PromQL query against the monitoring stack."""
    response = requests.get(
        f"{endpoint}/api/v1/query",
        params={"query": query},
        timeout=10,
    )
    response.raise_for_status()
    return response.text  # tools must return str, not a parsed dict


@tool
def restart_docker_service(service_name: str, compose_dir: str) -> str:
    """Restart a Docker Compose service."""
    result = subprocess.run(
        ["docker", "compose", "-f", f"{compose_dir}/docker-compose.yml",
         "restart", service_name],
        capture_output=True,
        text=True,
        timeout=60,
    )
    return f"stdout: {result.stdout}\nstderr: {result.stderr}"

Multi-Agent Communication Patterns

  • Supervisor pattern -- A coordinator agent delegates tasks to specialist agents (infrastructure, database, deployment) and synthesises their outputs into a unified response.
  • Message bus -- Agents communicate through a shared Redis Streams channel, enabling asynchronous task hand-offs and event-driven activation.
  • State checkpointing -- LangGraph persists agent state to PostgreSQL after each node execution, enabling recovery from failures without replaying the entire graph.
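A minimal sketch of the supervisor pattern's control flow, using keyword routing for clarity -- a production supervisor would ask the coordinator LLM to choose the specialist, but the delegate-and-return shape is the same. All names here are illustrative:

```python
from typing import Callable

# Specialist agents keyed by domain; each is any callable that takes a
# task string and returns a result string (stand-ins for real agents).
SPECIALISTS: dict[str, Callable[[str], str]] = {
    "infrastructure": lambda task: f"[infra] handled: {task}",
    "database": lambda task: f"[db] handled: {task}",
    "deployment": lambda task: f"[deploy] handled: {task}",
}

ROUTING_KEYWORDS = {
    "infrastructure": ("node", "vm", "gpu", "proxmox"),
    "database": ("postgres", "qdrant", "index", "query"),
    "deployment": ("deploy", "rollback", "release", "compose"),
}


def supervise(task: str) -> str:
    """Route a task to the first matching specialist. The keyword match
    is a placeholder for an LLM routing decision."""
    lowered = task.lower()
    for domain, keywords in ROUTING_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return SPECIALISTS[domain](task)
    return "No specialist matched; escalating to a human operator."


print(supervise("Restart the Qdrant index optimizer"))
```

In the LangGraph version, each specialist is its own compiled graph and the supervisor is a node whose conditional edges select among them.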

MLflow Experiment Platform

MLflow provides experiment tracking, model versioning, and artifact management. We deploy it with a PostgreSQL backend and MinIO for S3-compatible artifact storage, all running on Docker within the Proxmox VM.

# docker-compose.mlflow.yml
version: "3.9"
services:
  mlflow-db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    volumes:
      - mlflow_pg_data:/var/lib/postgresql/data
    secrets:
      - db_password

  minio:
    image: minio/minio:RELEASE.2025-01-20
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER_FILE: /run/secrets/minio_user
      MINIO_ROOT_PASSWORD_FILE: /run/secrets/minio_password
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio_data:/data
    secrets:
      - minio_user
      - minio_password

  mlflow:
    # NB: the stock image bundles neither psycopg2 nor boto3; build a
    # derived image that pip-installs them for Postgres + S3 support.
    image: ghcr.io/mlflow/mlflow:v2.19.0
    depends_on:
      - mlflow-db
      - minio
    ports:
      - "5000:5000"
    command: >
      mlflow server
      --host 0.0.0.0
      --port 5000
      --backend-store-uri postgresql://mlflow:${DB_PASSWORD}@mlflow-db:5432/mlflow
      --default-artifact-root s3://mlflow-artifacts/
    environment:
      # boto3 does not read Docker *_FILE secrets -- supply the real
      # values via an .env file or an entrypoint wrapper that exports
      # the secret file contents.
      AWS_ACCESS_KEY_ID: ${MINIO_USER}
      AWS_SECRET_ACCESS_KEY: ${MINIO_PASSWORD}
      MLFLOW_S3_ENDPOINT_URL: http://minio:9000
    secrets:
      - db_password
      - minio_user
      - minio_password

secrets:
  db_password:
    file: ./secrets/db_password.txt
  minio_user:
    file: ./secrets/minio_user.txt
  minio_password:
    file: ./secrets/minio_password.txt

volumes:
  mlflow_pg_data:
  minio_data:

Experiment Tracking Integration

# tracking.py
import mlflow
from mlflow.models import infer_signature
import time


def track_rag_experiment(
    chain,
    test_queries: list[str],
    experiment_name: str = "rag-pipeline",
) -> str:
    """Track RAG pipeline evaluation as an MLflow experiment."""

    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run() as run:
        # Log pipeline parameters
        mlflow.log_params({
            "embedding_model": "BAAI/bge-large-en-v1.5",
            "llm_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "chunk_size": 1024,
            "chunk_overlap": 128,
            "retrieval_top_k": 5,
            "temperature": 0.1,
        })

        # Evaluate on test queries
        latencies = []
        for query in test_queries:
            start = time.perf_counter()
            response = chain.invoke(query)
            elapsed = time.perf_counter() - start
            latencies.append(elapsed)

        # Log metrics
        mlflow.log_metrics({
            "avg_latency_s": sum(latencies) / len(latencies),
            "p95_latency_s": sorted(latencies)[
                int(len(latencies) * 0.95)
            ],
            "p99_latency_s": sorted(latencies)[
                int(len(latencies) * 0.99)
            ],
            "num_queries": len(test_queries),
        })

        # Log the chain as a model
        signature = infer_signature(
            test_queries[0],
            chain.invoke(test_queries[0]),
        )
        mlflow.langchain.log_model(
            chain,
            artifact_path="rag_chain",
            signature=signature,
            registered_model_name="rag-pipeline-v1",
        )

        return run.info.run_id

Training Infrastructure

Fine-tuning foundation models on domain-specific data requires careful GPU resource allocation. With two A4000 GPUs in our Proxmox node, we partition resources: one GPU dedicated to inference serving, the other available for training jobs. For larger fine-tuning runs, we schedule training during off-peak hours and temporarily reassign both GPUs.

GPU Resource Allocation

# Proxmox GPU assignment strategy
# VM 200: LLM Inference  -- GPU 0 (01:00.0) -- always running
# VM 201: Training        -- GPU 1 (02:00.0) -- on-demand

# Create the training VM (q35 is required for PCIe passthrough)
qm create 201 \
  --name llm-training \
  --machine q35 \
  --memory 65536 \
  --cores 16 \
  --scsi0 local-lvm:128,ssd=1 \
  --net0 virtio,bridge=vmbr0

# Attach second GPU
qm set 201 -hostpci0 02:00,pcie=1

# Snapshot before training (safety net)
qm snapshot 201 pre-training-$(date +%Y%m%d)

# Start the training VM
qm start 201

LoRA/QLoRA Fine-Tuning Pipeline

# finetune.py
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
import torch
import mlflow


def run_qlora_finetune(
    base_model: str = "meta-llama/Meta-Llama-3.1-8B-Instruct",
    dataset_path: str = "./data/training_pairs.jsonl",
    output_dir: str = "./checkpoints",
    num_epochs: int = 3,
    learning_rate: float = 2e-4,
) -> str:
    """Fine-tune a model with QLoRA and track with MLflow."""

    # 4-bit quantisation config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=bnb_config,
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )
    model = prepare_model_for_kbit_training(model)

    # LoRA configuration
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
    )
    model = get_peft_model(model, lora_config)

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable:,} / {total:,} "
          f"({100 * trainable / total:.2f}%)")

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    tokenizer.pad_token = tokenizer.eos_token

    dataset = load_dataset("json", data_files=dataset_path, split="train")

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=learning_rate,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,
        optim="paged_adamw_8bit",
        gradient_checkpointing=True,
        max_grad_norm=0.3,
        report_to="mlflow",
    )

    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment("llm-finetuning")

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        args=training_args,
        max_seq_length=2048,
        dataset_text_field="text",
    )

    with mlflow.start_run() as run:
        mlflow.log_params({
            "base_model": base_model,
            "lora_r": 16,
            "lora_alpha": 32,
            "quantization": "QLoRA-4bit-nf4",
            "trainable_params": trainable,
        })
        trainer.train()
        trainer.save_model(f"{output_dir}/final")

        return run.info.run_id

Distributed Training Considerations

For models that exceed single-GPU memory even with QLoRA, configure DeepSpeed ZeRO Stage 2 across both GPUs. Set NCCL_P2P_DISABLE=1 when GPUs are in separate IOMMU groups to avoid PCIe peer-to-peer transfer failures. Monitor GPU memory fragmentation with nvidia-smi dmon and restart training VMs between runs to reclaim fragmented VRAM.
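A minimal DeepSpeed ZeRO Stage 2 config matching the TrainingArguments above (bf16, micro-batch 4, gradient accumulation 4) might look like the following sketch; the bucket size is illustrative and should be tuned per model:

```python
import json

# Minimal ZeRO Stage 2 config mirroring the fine-tuning hyperparameters.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "gradient_clipping": 0.3,
    "zero_optimization": {
        "stage": 2,                    # shard optimizer state + gradients
        "overlap_comm": True,          # overlap reduce with backward pass
        "contiguous_gradients": True,  # reduce memory fragmentation
        "reduce_bucket_size": 5e8,     # illustrative; tune per model
    },
}

with open("ds_config.json", "w") as fh:
    json.dump(ds_config, fh, indent=2)
```

Pass the file via TrainingArguments(deepspeed="ds_config.json") and launch with `deepspeed --num_gpus=2 finetune.py`.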

Observability and Monitoring

LLM workloads require specialised monitoring beyond standard infrastructure metrics. We instrument the stack with Prometheus custom metrics for token throughput, inference latency distributions, and error classification, visualised through purpose-built Grafana dashboards.

Custom Prometheus Metrics

# metrics.py
from prometheus_client import (
    Counter,
    Histogram,
    Gauge,
    start_http_server,
)
import time
from functools import wraps

# Token throughput
TOKENS_GENERATED = Counter(
    "llm_tokens_generated_total",
    "Total tokens generated by the LLM",
    ["model", "endpoint"],
)

TOKENS_INPUT = Counter(
    "llm_tokens_input_total",
    "Total input tokens processed",
    ["model", "endpoint"],
)

# Latency distributions with LLM-specific buckets
INFERENCE_LATENCY = Histogram(
    "llm_inference_duration_seconds",
    "Time to complete an inference request",
    ["model", "request_type"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 30.0],
)

TIME_TO_FIRST_TOKEN = Histogram(
    "llm_time_to_first_token_seconds",
    "Time from request receipt to first token generation",
    ["model"],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0],
)

# Error tracking
INFERENCE_ERRORS = Counter(
    "llm_inference_errors_total",
    "Total inference errors",
    ["model", "error_type"],
)

# Resource utilisation
GPU_MEMORY_USED = Gauge(
    "gpu_memory_used_bytes",
    "GPU memory currently in use",
    ["gpu_id"],
)

GPU_MEMORY_TOTAL = Gauge(
    "gpu_memory_total_bytes",
    "Total GPU memory available",
    ["gpu_id"],
)

ACTIVE_REQUESTS = Gauge(
    "llm_active_requests",
    "Number of currently processing inference requests",
    ["model"],
)


def track_inference(model_name: str, request_type: str = "completion"):
    """Decorator to instrument LLM inference calls."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            ACTIVE_REQUESTS.labels(model=model_name).inc()
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                elapsed = time.perf_counter() - start
                INFERENCE_LATENCY.labels(
                    model=model_name,
                    request_type=request_type,
                ).observe(elapsed)
                return result
            except Exception as e:
                INFERENCE_ERRORS.labels(
                    model=model_name,
                    error_type=type(e).__name__,
                ).inc()
                raise
            finally:
                ACTIVE_REQUESTS.labels(model=model_name).dec()
        return wrapper
    return decorator

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "vllm"
    static_configs:
      - targets: ["vllm:8000"]
    metrics_path: /metrics

  - job_name: "llm-metrics"
    static_configs:
      - targets: ["llm-metrics:9090"]

  - job_name: "qdrant"
    static_configs:
      - targets: ["qdrant:6333"]
    metrics_path: /metrics

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "nvidia-gpu"
    static_configs:
      - targets: ["nvidia-dcgm-exporter:9400"]

rule_files:
  - "alerts/llm_alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

Alert Rules

# alerts/llm_alerts.yml
groups:
  - name: llm_alerts
    rules:
      - alert: HighInferenceLatency
        expr: >
          histogram_quantile(0.95,
            rate(llm_inference_duration_seconds_bucket[5m])
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM inference P95 latency exceeds 5s"

      - alert: HighErrorRate
        expr: >
          rate(llm_inference_errors_total[5m])
          / rate(llm_inference_duration_seconds_count[5m]) > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate exceeds 5%"

      - alert: GPUMemoryExhaustion
        expr: >
          gpu_memory_used_bytes / gpu_memory_total_bytes > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory utilisation above 95%"

Grafana Dashboard Panels

  • Token throughput -- rate(llm_tokens_generated_total[5m]) by model, displayed as a time-series graph with per-model colour coding.
  • Latency heatmap -- histogram_quantile over llm_inference_duration_seconds at P50, P95, and P99 percentiles.
  • GPU utilisation -- DCGM exporter metrics for GPU compute utilisation, memory bandwidth, and temperature per device.
  • Request queue depth -- llm_active_requests gauge showing concurrent request pressure per model.
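The latency heatmap relies on histogram_quantile, which estimates a percentile by linear interpolation within the bucket where the target rank falls. A pure-Python sketch of that calculation, using cumulative (le, count) pairs as Prometheus stores them:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (le, count) histogram buckets,
    interpolating linearly within the target bucket. A simplified sketch
    of what PromQL's histogram_quantile() computes."""
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # rank falls in the open +Inf bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return prev_le


# Cumulative counts for an llm_inference_duration_seconds series
buckets = [(0.5, 40), (1.0, 70), (2.0, 90), (5.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 3.5
```

This is why bucket boundaries matter: a P95 that lands in a wide bucket (here 2s-5s) is only accurate to that bucket's width, which motivates the LLM-specific buckets defined in metrics.py.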

Production Deployment Patterns

Deploying LLMs to production requires strategies for zero-downtime updates, safe rollbacks, and traffic management. We implement blue-green deployments at the Docker Compose level with an Nginx reverse proxy handling traffic switching.

Blue-Green Model Deployments

#!/usr/bin/env bash
# deploy.sh -- Blue-green deployment for vLLM
set -euo pipefail

ACTIVE_SLOT=$(cat /opt/llm/active_slot)  # "blue" or "green"
NEW_SLOT=$( [ "$ACTIVE_SLOT" = "blue" ] && echo "green" || echo "blue" )

echo "Deploying to $NEW_SLOT slot (current: $ACTIVE_SLOT)"

# Start the new model version in the inactive slot
docker compose -f "docker-compose.vllm-$NEW_SLOT.yml" up -d

# Wait for health check
echo "Waiting for $NEW_SLOT to become healthy..."
for i in $(seq 1 60); do
    if curl -sf "http://vllm-$NEW_SLOT:8000/health" > /dev/null 2>&1; then
        echo "$NEW_SLOT is healthy after ${i}s"
        break
    fi
    sleep 1
done

# Run smoke tests against the new slot. Test the command directly:
# under set -e, checking $? afterwards would never be reached on failure.
if ! python3 /opt/llm/smoke_test.py --endpoint "http://vllm-$NEW_SLOT:8000/v1"; then
    echo "Smoke tests failed. Rolling back."
    docker compose -f "docker-compose.vllm-$NEW_SLOT.yml" down
    exit 1
fi

# Switch traffic via Nginx
sed -i "s/vllm-$ACTIVE_SLOT/vllm-$NEW_SLOT/g" /etc/nginx/conf.d/llm.conf
nginx -t && nginx -s reload

# Record the switch
echo "$NEW_SLOT" > /opt/llm/active_slot
echo "Traffic switched to $NEW_SLOT"

# Drain and stop the old slot after a grace period
sleep 30
docker compose -f "docker-compose.vllm-$ACTIVE_SLOT.yml" down
echo "Old slot $ACTIVE_SLOT stopped. Deployment complete."

A/B Testing with Traffic Splitting

# nginx-ab-test.conf
upstream llm_backend {
    # 90% traffic to the current production model
    server vllm-blue:8000 weight=9;
    # 10% traffic to the candidate model
    server vllm-green:8000 weight=1;
}

server {
    listen 443 ssl;
    server_name llm.internal.example.com;

    ssl_certificate     /etc/ssl/certs/llm.pem;
    ssl_certificate_key /etc/ssl/private/llm.key;

    location /v1/ {
        proxy_pass http://llm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Request-ID $request_id;
        proxy_read_timeout 120s;

        # Rate limiting
        limit_req zone=llm_zone burst=20 nodelay;
        limit_req_status 429;
    }

    location /health {
        proxy_pass http://llm_backend;
    }
}

# Rate limit zone definition (in http block)
# limit_req_zone $binary_remote_addr zone=llm_zone:10m rate=30r/m;

Rollback Procedure

#!/usr/bin/env bash
# rollback.sh -- Immediate rollback to previous model version
set -euo pipefail

ACTIVE_SLOT=$(cat /opt/llm/active_slot)
PREVIOUS_SLOT=$( [ "$ACTIVE_SLOT" = "blue" ] && echo "green" || echo "blue" )

echo "Rolling back from $ACTIVE_SLOT to $PREVIOUS_SLOT"

# Verify the previous slot's compose file still parses
if ! docker compose -f "docker-compose.vllm-$PREVIOUS_SLOT.yml" config > /dev/null 2>&1; then
    echo "ERROR: Previous slot config not found. Manual intervention required."
    exit 1
fi

# Start the previous version
docker compose -f "docker-compose.vllm-$PREVIOUS_SLOT.yml" up -d

# Wait for health; abort if the slot never becomes ready
for i in $(seq 1 60); do
    if curl -sf "http://vllm-$PREVIOUS_SLOT:8000/health" > /dev/null 2>&1; then
        break
    fi
    if [ "$i" -eq 60 ]; then
        echo "ERROR: $PREVIOUS_SLOT failed health checks. Aborting rollback."
        exit 1
    fi
    sleep 1
done

# Switch traffic
sed -i "s/vllm-$ACTIVE_SLOT/vllm-$PREVIOUS_SLOT/g" /etc/nginx/conf.d/llm.conf
nginx -t && nginx -s reload
echo "$PREVIOUS_SLOT" > /opt/llm/active_slot

# Stop the failed version
docker compose -f "docker-compose.vllm-$ACTIVE_SLOT.yml" down

echo "Rollback complete. Active slot: $PREVIOUS_SLOT"
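Both scripts switch traffic with sed, which succeeds silently even when the pattern matches nothing -- for instance after a manual edit to llm.conf. A small post-switch check (a hypothetical helper, using the slot names above) catches a conf that still references the old slot or mixes both. It applies to the blue-green llm.conf only; the A/B config references both slots by design.

```python
import re


def referenced_slots(conf_text: str) -> set:
    """Slot names (blue/green) referenced anywhere in an Nginx conf."""
    return set(re.findall(r"vllm-(blue|green)\b", conf_text))


def assert_single_slot(conf_text: str, expected: str) -> None:
    """Raise if the conf references anything other than the expected slot."""
    slots = referenced_slots(conf_text)
    if slots != {expected}:
        raise RuntimeError(
            f"expected only vllm-{expected}, conf references {sorted(slots)}"
        )
```

Invoked between the sed and the nginx reload (for example via a short `python3 -c` call reading /etc/nginx/conf.d/llm.conf), it turns a silent mis-switch into a hard failure.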

Cost Analysis: Self-Hosted vs Cloud APIs

The economic case for self-hosting depends on sustained utilisation. Below is a detailed TCO comparison based on processing 2 million tokens per day (roughly 1,000 RAG queries with 2,000-token contexts).

Hardware CAPEX (One-Time)

Proxmox server (128GB RAM, 2x NVMe) -- EUR 2,800
2x NVIDIA A4000 (16GB VRAM each) -- EUR 2,200
10GbE networking -- EUR 300
UPS and rack -- EUR 500
Total CAPEX -- EUR 5,800

Monthly OPEX Comparison

Self-hosted (monthly):
Electricity (350W avg, EUR 0.30/kWh) -- EUR 76
Internet / bandwidth -- EUR 30
Maintenance / sysadmin (2h/month) -- EUR 150
Monthly total -- EUR 256

Cloud API (monthly, 2M tok/day):
GPT-4o (input + output) -- EUR 1,800
Embedding API (ada-002) -- EUR 120
Vector DB managed (Pinecone) -- EUR 230
Monthly total -- EUR 2,150

Break-Even Analysis

Monthly savings: EUR 2,150 - EUR 256 = EUR 1,894. Break-even point: EUR 5,800 / EUR 1,894 = 3.06 months. After the break-even point, self-hosting saves approximately EUR 22,700 per year at this utilisation level.

At lower utilisation (500K tokens/day), the cloud API cost drops to approximately EUR 540/month, pushing break-even to 20 months. The crossover point where self-hosting becomes economically favourable is roughly 800K tokens per day sustained.
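The figures above scale every cloud cost component linearly with token volume. Under that assumption (a simplification -- managed-tier pricing is usually stepped rather than linear), a few standalone lines reproduce the quoted break-even points:

```python
# Break-even sensitivity vs daily token volume (standalone sketch).
# Assumes every cloud cost component scales linearly with volume,
# calibrated to EUR 2,150/month at 2M tokens/day from the tables above.

CAPEX = 5_800.0           # EUR, one-time hardware cost
SELF_OPEX = 256.0         # EUR/month, self-hosted running cost
CLOUD_BASELINE = 2_150.0  # EUR/month at the 2M tokens/day baseline


def cloud_monthly(daily_tokens: float) -> float:
    """Cloud bill under a purely linear scaling model."""
    return CLOUD_BASELINE * daily_tokens / 2_000_000


def break_even_months(daily_tokens: float) -> float:
    """Months until CAPEX is recovered; inf if cloud is already cheaper."""
    savings = cloud_monthly(daily_tokens) - SELF_OPEX
    return CAPEX / savings if savings > 0 else float("inf")


if __name__ == "__main__":
    for volume in (250_000, 500_000, 1_000_000, 2_000_000):
        months = break_even_months(volume)
        print(f"{volume:>9,} tokens/day -> break-even {months:5.1f} months")
```

At 500K tokens/day this model gives a cloud bill of about EUR 538 and a break-even of roughly 20.6 months, matching the figures quoted above; at 2M tokens/day it returns the 3.1-month break-even from the previous section.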

# cost_calculator.py
"""TCO calculator for self-hosted vs cloud AI infrastructure."""

from dataclasses import dataclass


@dataclass
class SelfHostedCost:
    capex_hardware: float = 5800.0       # EUR one-time
    monthly_electricity: float = 76.0     # EUR
    monthly_bandwidth: float = 30.0       # EUR
    monthly_maintenance: float = 150.0    # EUR (sysadmin hours)
    hardware_lifetime_months: int = 48    # 4-year depreciation

    @property
    def monthly_opex(self) -> float:
        return (
            self.monthly_electricity
            + self.monthly_bandwidth
            + self.monthly_maintenance
        )

    @property
    def monthly_depreciation(self) -> float:
        return self.capex_hardware / self.hardware_lifetime_months

    @property
    def monthly_tco(self) -> float:
        return self.monthly_opex + self.monthly_depreciation


@dataclass
class CloudAPICost:
    daily_tokens: int = 2_000_000
    # Illustrative blended prices, calibrated so the defaults reproduce the
    # EUR 1,800/month LLM figure from the tables above; substitute your
    # provider's current rates.
    input_price_per_1k: float = 0.02     # EUR per 1K input tokens
    output_price_per_1k: float = 0.06    # EUR per 1K output tokens
    input_ratio: float = 0.75            # 75% of tokens are input
    monthly_embedding: float = 120.0     # EUR
    monthly_vector_db: float = 230.0     # EUR

    @property
    def monthly_llm_cost(self) -> float:
        monthly_tokens = self.daily_tokens * 30
        input_tokens = monthly_tokens * self.input_ratio
        output_tokens = monthly_tokens * (1 - self.input_ratio)
        input_cost = (input_tokens / 1000) * self.input_price_per_1k
        output_cost = (output_tokens / 1000) * self.output_price_per_1k
        return input_cost + output_cost

    @property
    def monthly_tco(self) -> float:
        return (
            self.monthly_llm_cost
            + self.monthly_embedding
            + self.monthly_vector_db
        )


def break_even_months(self_hosted: SelfHostedCost, cloud: CloudAPICost) -> float:
    """Calculate months until self-hosting pays for itself.

    Savings are measured against OPEX only: the CAPEX being recovered is
    the depreciation, so using monthly_tco here would double-count it.
    """
    monthly_savings = cloud.monthly_tco - self_hosted.monthly_opex
    if monthly_savings <= 0:
        return float("inf")
    return self_hosted.capex_hardware / monthly_savings


if __name__ == "__main__":
    sh = SelfHostedCost()
    cl = CloudAPICost()

    print(f"Self-hosted monthly OPEX: EUR {sh.monthly_opex:.0f}")
    print(f"Self-hosted monthly TCO:  EUR {sh.monthly_tco:.0f}")
    print(f"Cloud API monthly TCO:    EUR {cl.monthly_tco:.0f}")
    print(f"Monthly savings:          EUR {cl.monthly_tco - sh.monthly_opex:.0f}")
    print(f"Break-even:               {break_even_months(sh, cl):.1f} months")

Conclusion

A self-hosted AI stack on Proxmox is not a weekend project -- it is a production platform that demands the same engineering rigour as any other critical infrastructure. But the return on that investment is substantial: full control over your models, your data, and your costs.

The architecture described here -- LLM inference with vLLM, vector storage with Qdrant, RAG orchestration with LangChain, agent frameworks with LangGraph, experiment tracking with MLflow, QLoRA fine-tuning, Prometheus/Grafana observability, and blue-green deployments -- provides a complete foundation. Each layer is independently replaceable and horizontally scalable.

Start with inference and RAG. Add training and agents as your requirements grow. Monitor everything from the beginning. The observability layer is not optional -- it is how you maintain confidence in a system whose behaviour is inherently probabilistic.