Why Self-Hosted Infrastructure
The prevailing narrative in infrastructure engineering defaults to cloud-first. AWS, Azure, and GCP offer undeniable convenience -- but that convenience comes at a cost that compounds rapidly once workloads stabilise. For predictable, long-running services the economics of bare-metal hosting are difficult to ignore: a dedicated Hetzner server with 64 GB RAM and 1 TB NVMe storage costs roughly EUR 45/month. The equivalent compute and storage on AWS EC2 with EBS would exceed EUR 350/month before factoring in data transfer fees.
The question is not whether bare metal is cheaper -- it plainly is for stable workloads -- but whether you can achieve enterprise-grade reliability without the managed services that cloud providers bundle into their pricing. This article documents exactly how: a Proxmox VE cluster running on Hetzner dedicated servers, delivering 99.99% measured uptime across seven production services over the past twelve months.
The architecture prioritises three principles: repeatability through infrastructure-as-code, defence-in-depth through layered security, and observability through comprehensive monitoring. Every configuration described below is version-controlled, every change is auditable, and every service is instrumented.
Cost Comparison: Bare Metal vs Cloud (Monthly)
- Hetzner AX41-NVMe (dedicated): EUR 45.00 -- 64 GB RAM, 2x 512 GB NVMe, Ryzen 5 3600
- AWS equivalent (m5.xlarge + 1 TB gp3 + transfer): EUR 380.00+
- Annual savings: approx. EUR 4,000 per server node
- Trade-off: self-managed updates, monitoring, and disaster recovery
The goal of this project was never to replicate every AWS service. It was to prove that a small, carefully designed bare-metal cluster can run a portfolio of production services with reliability metrics that rival managed cloud offerings -- at roughly 11% of the cost.
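The headline saving is simple arithmetic on the two monthly figures; a quick sanity check:

```shell
# Back-of-the-envelope check of the per-node saving (figures from the comparison above)
bare=45          # EUR/month, Hetzner AX41-NVMe
cloud=380        # EUR/month, AWS equivalent (lower bound)
annual=$(( (cloud - bare) * 12 ))
echo "annual savings per node: EUR ${annual}"
```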
Hardware Architecture
The cluster runs on Hetzner dedicated servers selected for their price-to-performance ratio and hardware reliability track record. Each node is provisioned with identical specifications to simplify live migration and capacity planning.
# Node Specifications (per server)
# ─────────────────────────────────────────────
Model: Hetzner AX41-NVMe
CPU: AMD Ryzen 5 3600 (6C/12T @ 3.6 GHz)
RAM: 64 GB DDR4 ECC
Storage: 2x 512 GB NVMe SSD (Samsung PM9A3)
Network: 1 Gbps dedicated uplink
Location: Falkenstein, DE (FSN1-DC14)
OS: Proxmox VE 8.1
Kernel: 6.5.13-3-pve
# Storage Layout
# ─────────────────────────────────────────────
# ZFS mirror across both NVMe drives
# - rpool: root filesystem (50 GB)
# - datapool: VM/CT storage (400 GB)
# - swap: 8 GB zvol on rpool

ZFS was chosen as the filesystem for its built-in data integrity verification, snapshot capabilities, and native integration with Proxmox. The mirror configuration across both NVMe drives provides redundancy against single-drive failure without sacrificing read performance. ZFS checksumming catches silent data corruption that would go undetected on ext4 or XFS.
# ZFS Pool Configuration
# ─────────────────────────────────────────────
zpool create -f -o ashift=12 \
    -O compression=lz4 \
    -O atime=off \
    -O xattr=sa \
    -O dnodesize=auto \
    -O normalization=formD \
    -O mountpoint=none \
    datapool mirror /dev/nvme0n1p3 /dev/nvme1n1p3
# Verify pool status
zpool status datapool
  pool: datapool
 state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        datapool       ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme0n1p3  ONLINE       0     0     0
            nvme1n1p3  ONLINE       0     0     0
# ZFS dataset hierarchy
# datapool/
# ├── ct/ -- LXC container rootfs
# ├── vm/ -- QEMU disk images
# ├── backups/ -- local backup staging
# └── templates/ -- ISO and CT templates

Network Topology
Internet
│
├── Cloudflare (DNS + WAF + CDN)
│ │
│ └── Cloudflare Tunnel (cloudflared)
│ │
├── vmbr0 (Public Bridge)
│ │ IP: 10.0.0.1/24
│ │
│ ├── VLAN 10: Production Services
│ │ ├── CT 100: nginx-proxy
│ │ ├── CT 101: portfolio-site
│ │ ├── CT 102: open-edx
│ │ └── CT 103: booking-system
│ │
│ ├── VLAN 20: Data + ML
│ │ ├── CT 200: mlflow-server
│ │ ├── CT 201: geoserver
│ │ └── VM 202: ai-agents
│ │
│ ├── VLAN 30: Monitoring
│ │ ├── CT 300: prometheus
│ │ ├── CT 301: grafana
│ │ └── CT 302: loki
│ │
│ └── VLAN 40: Management
│ ├── Proxmox WebUI (:8006)
│ └── PBS WebUI (:8007)
│
└── vmbr1 (Internal Bridge)
│ IP: 192.168.100.1/24
│ (inter-node cluster traffic)
        └── Corosync + migration traffic

Each VLAN is isolated at the bridge level within Proxmox. Traffic between VLANs passes through the nginx reverse proxy container on VLAN 10, which acts as the sole ingress point for external requests. Inter-node cluster communication (Corosync heartbeat and live migration) runs on a dedicated internal bridge that is never exposed to the public network.
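For reference, the bridge layout above corresponds to an ifupdown2 configuration along these lines -- a sketch only, since the physical NIC name (enp35s0 here) and the use of a Hetzner vSwitch VLAN for the internal bridge are assumptions:

```
# /etc/network/interfaces (sketch; NIC name and vSwitch VLAN are assumed)
auto vmbr0
iface vmbr0 inet static
    address 10.0.0.1/24
    bridge-ports enp35s0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 10,20,30,40

auto vmbr1
iface vmbr1 inet static
    address 192.168.100.1/24
    bridge-ports enp35s0.4000
    bridge-stp off
    bridge-fd 0
```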
Proxmox Cluster Design
The cluster is configured for high availability with automatic failover. Proxmox's built-in HA manager monitors service health through Corosync quorum and triggers live migration or restart on the surviving node when a failure is detected. Fencing ensures that a failed node is isolated before its workloads are restarted elsewhere, preventing split-brain scenarios.
# /etc/pve/corosync.conf (excerpt)
# ─────────────────────────────────────────────
totem {
  version: 2
  secauth: on
  cluster_name: prod-cluster
  transport: knet
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.100.0
    mcastport: 5405
  }
  crypto_cipher: aes256
  crypto_hash: sha256
}

nodelist {
  node {
    ring0_addr: 192.168.100.1
    name: pve-node-01
    nodeid: 1
  }
  node {
    ring0_addr: 192.168.100.2
    name: pve-node-02
    nodeid: 2
  }
}

quorum {
  provider: corosync_votequorum
  expected_votes: 2
  two_node: 1
}

Resource pools partition the cluster into logical groups that match the VLAN segmentation. Pools themselves are organisational; per-guest CPU and memory limits prevent a runaway container from starving other services. The HA group configuration assigns preferred nodes to each service while allowing migration to the partner node under failure conditions.
# HA Group Configuration
# ─────────────────────────────────────────────
# pvesh create /cluster/ha/groups
# --group production
# --nodes pve-node-01,pve-node-02
# --nofailback 0
# --restricted 0
# HA Resource Registration
pvesh create /cluster/ha/resources \
--sid ct:100 \
--group production \
--max_restart 3 \
--max_relocate 2 \
--state started
pvesh create /cluster/ha/resources \
--sid ct:101 \
--group production \
--max_restart 3 \
--max_relocate 2 \
--state started
# Verify HA status
pvesh get /cluster/ha/status/current
# ┌──────────┬────────┬────────────┬─────────┐
# │ sid │ state │ node │ request │
# ├──────────┼────────┼────────────┼─────────┤
# │ ct:100 │ started│ pve-node-01│ │
# │ ct:101 │ started│ pve-node-01│ │
# │ ct:102 │ started│ pve-node-02│ │
# │ ct:200 │ started│ pve-node-01│ │
# │ ct:201 │ started│ pve-node-02│ │
# │ vm:202 │ started│ pve-node-01│ │
# └──────────┴────────┴────────────┴─────────┘

Backup scheduling is handled by Proxmox Backup Server (PBS) running on a separate partition. PBS deduplicates at the chunk level, meaning incremental backups of large VM disks complete in seconds when only a small percentage of blocks have changed. Full backup verification runs weekly to confirm restore integrity.
# Backup Schedule (/etc/pve/jobs.cfg excerpt)
# ─────────────────────────────────────────────
vzdump: daily-backup
enabled 1
schedule daily 02:00
storage pbs-local
mode snapshot
compress zstd
mailnotification always
mailto admin@neurodatalab.ai
all 1
prune-backups keep-daily=7,keep-weekly=4,keep-monthly=6
# PBS Datastore Pruning Policy
# keep-last: 3
# keep-daily: 7
# keep-weekly: 4
# keep-monthly: 6
# keep-yearly: 1

Live Migration Procedure
Live migration moves a running VM between nodes with no user-visible downtime: memory pages are transferred iteratively until the remaining delta is small enough for a final switchover that typically takes under 100ms. LXC containers are moved in restart mode -- a brief stop, transfer, and start -- which for the small containers in this cluster completes in seconds.
- Pre-check: verify target node has sufficient resources
- Phase 1: bulk memory copy over internal bridge (vmbr1)
- Phase 2: iterative dirty-page sync (typically 2-3 rounds)
- Phase 3: final pause, last-page copy, resume on target (<100ms)
- Post-migration: ARP announcement to update network switches
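In day-to-day operation the whole procedure reduces to a single command per guest; a sketch using IDs and node names from this cluster:

```
# VM 202: live migration -- memory is streamed while the guest keeps running
qm migrate 202 pve-node-02 --online --with-local-disks

# CT 101: containers move in restart mode (brief stop, transfer, start)
pct migrate 101 pve-node-02 --restart
```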
Container and VM Strategy
Proxmox supports two virtualisation technologies: LXC containers (OS-level) and QEMU/KVM virtual machines (full hardware). The decision between them is driven by the workload's requirements for kernel isolation, resource overhead tolerance, and operational complexity.
LXC vs KVM Decision Matrix
Criterion              LXC Container      KVM Virtual Machine
───────────────────────────────────────────────────────────
Boot time              < 2 seconds        15-30 seconds
Memory overhead        ~20 MB             ~256 MB
Kernel isolation       Shared host        Full isolation
Storage efficiency     Thin provision     Thick/thin QCOW2
Live migration speed   Fast (no RAM)      Slower (RAM copy)
Docker inside          Nested (config)    Native support
Custom kernel          No                 Yes
GPU passthrough        No                 Yes (IOMMU)
───────────────────────────────────────────────────────────
Use case in cluster:
  LXC: nginx, web apps, databases, monitoring
  KVM: AI agents (Docker-in-Docker), GPU workloads
Six of the seven production services run as LXC containers. The exception is the AI agent service, which requires Docker Compose internally and therefore runs as a full KVM VM. LXC containers share the host kernel, eliminating the memory overhead of running separate kernels and reducing boot time to under two seconds.
# LXC Container Template (production base)
# /etc/pve/lxc/101.conf
# ─────────────────────────────────────────────
arch: amd64
cores: 2
memory: 2048
swap: 512
rootfs: datapool:ct/subvol-101-disk-0,size=20G
hostname: portfolio-site
nameserver: 1.1.1.1
searchdomain: internal.cluster
net0: name=eth0,bridge=vmbr0,tag=10,ip=10.0.10.101/24,gw=10.0.10.1
onboot: 1
startup: order=2,up=30,down=30
unprivileged: 1
features: nesting=1
lxc.apparmor.profile: generated
lxc.cap.drop:
# Resource Limits
lxc.cgroup2.memory.max: 2147483648
lxc.cgroup2.cpu.max: 200000 100000

# KVM VM Configuration (AI Agents)
# /etc/pve/qemu-server/202.conf
# ─────────────────────────────────────────────
agent: 1
balloon: 2048
boot: order=scsi0
cores: 4
cpu: host
memory: 8192
name: ai-agents
net0: virtio=XX:XX:XX:XX:XX:XX,bridge=vmbr0,tag=20
numa: 0
onboot: 1
ostype: l26
scsi0: datapool:vm-202-disk-0,iothread=1,size=80G
scsihw: virtio-scsi-single
serial0: socket
startup: order=5,up=60,down=60
vga: serial0
# Cloud-init drive for automated provisioning
ide2: datapool:vm-202-cloudinit,media=cdrom
ciuser: deploy
cipassword: [redacted]
sshkeys: ssh-ed25519%20AAAA...%20deploy@cluster
ipconfig0: ip=10.0.20.202/24,gw=10.0.20.1

Cloud-init integration allows new VMs to be provisioned from a template with a single command. The cloud-init drive injects SSH keys, network configuration, and an initial user -- eliminating manual setup entirely. Templates are updated monthly with the latest security patches and stored in the datapool templates dataset.
# Creating a VM from template with cloud-init
# ─────────────────────────────────────────────
qm clone 9000 203 --name new-service --full
qm set 203 --ciuser deploy
qm set 203 --sshkeys /root/.ssh/deploy_ed25519.pub
qm set 203 --ipconfig0 ip=10.0.20.203/24,gw=10.0.20.1
qm start 203
# Template creation workflow
qm create 9000 --memory 2048 --cores 2 --name ubuntu-template
qm importdisk 9000 ubuntu-24.04-cloudimg-amd64.img datapool
qm set 9000 --scsi0 datapool:vm-9000-disk-0
qm set 9000 --boot order=scsi0
qm set 9000 --ide2 datapool:cloudinit
qm set 9000 --agent enabled=1
qm template 9000

Network Architecture
External traffic never touches the server directly. All public requests are proxied through Cloudflare, which provides DNS resolution, DDoS mitigation, WAF rules, and CDN caching. The connection between Cloudflare and the cluster is secured via a Cloudflare Tunnel (formerly Argo Tunnel), which establishes an outbound-only encrypted connection from the server to Cloudflare's edge network. No inbound ports are opened on the host firewall for web traffic.
# Cloudflare Tunnel Configuration
# /etc/cloudflared/config.yml
# ─────────────────────────────────────────────
tunnel: a1b2c3d4-e5f6-7890-abcd-ef1234567890
credentials-file: /etc/cloudflared/credentials.json
ingress:
  - hostname: neurodatalab.ai
    service: http://10.0.10.101:3000
    originRequest:
      noTLSVerify: true
  - hostname: mlflow.neurodatalab.ai
    service: http://10.0.20.200:5000
    originRequest:
      noTLSVerify: true
  - hostname: geo.neurodatalab.ai
    service: http://10.0.20.201:8080
    originRequest:
      noTLSVerify: true
  - hostname: monitor.neurodatalab.ai
    service: http://10.0.30.301:3000
    originRequest:
      noTLSVerify: true
  - hostname: learn.neurodatalab.ai
    service: http://10.0.10.102:80
    originRequest:
      noTLSVerify: true
  - hostname: book.neurodatalab.ai
    service: http://10.0.10.103:8080
    originRequest:
      noTLSVerify: true
  # Catch-all rule (required)
  - service: http_status:404

Inside the cluster, Nginx acts as the reverse proxy for inter-service routing and TLS termination for internal HTTPS endpoints. Each service gets its own upstream block with health checks and connection limits.
# Nginx Reverse Proxy Configuration (excerpt)
# /etc/nginx/conf.d/services.conf
# ─────────────────────────────────────────────
# Rate limiting zone
limit_req_zone $binary_remote_addr zone=api:10m rate=30r/s;
limit_conn_zone $binary_remote_addr zone=conn:10m;

upstream portfolio {
    server 10.0.10.101:3000 max_fails=3 fail_timeout=30s;
    keepalive 16;
}

upstream mlflow {
    server 10.0.20.200:5000 max_fails=3 fail_timeout=30s;
    keepalive 8;
}

upstream geoserver {
    server 10.0.20.201:8080 max_fails=3 fail_timeout=30s;
    keepalive 8;
}

server {
    listen 80;
    server_name neurodatalab.ai;

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;
    add_header Referrer-Policy "strict-origin-when-cross-origin" always;
    add_header Content-Security-Policy "default-src 'self'" always;

    location / {
        proxy_pass http://portfolio;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        limit_req zone=api burst=50 nodelay;
        limit_conn conn 100;
    }

    # Health check endpoint
    location /health {
        access_log off;
        return 200 "ok";
        add_header Content-Type text/plain;
    }
}

Firewall Rules (Host Level)
# UFW rules on each Proxmox node
# ─────────────────────────────────────────────
ufw default deny incoming
ufw default allow outgoing

# SSH (key-only, rate-limited)
ufw limit 22/tcp

# Proxmox WebUI (restricted to admin VPN)
ufw allow from 10.10.0.0/24 to any port 8006

# Corosync cluster communication
ufw allow from 192.168.100.0/24 to any port 5405

# PBS (internal only)
ufw allow from 192.168.100.0/24 to any port 8007

# Cloudflare tunnel (outbound only -- no rule needed)
# All web traffic arrives via tunnel, not direct inbound
ufw enable
ufw status verbose
The VLAN segmentation ensures that even if one service is compromised, lateral movement is restricted. A container on VLAN 10 cannot initiate connections to VLAN 20 or 30 unless explicitly permitted by iptables rules on the host. This microsegmentation is the network equivalent of the principle of least privilege.
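To make that concrete, host rules of roughly this shape permit only the proxy's forward path and established replies, and drop everything else crossing segment boundaries (a sketch; the vmbr0.N subinterface naming is an assumption):

```
# Allow established return traffic first
iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
# Explicitly permit nginx (VLAN 10) to reach MLflow (VLAN 20) on its service port
iptables -A FORWARD -i vmbr0.10 -o vmbr0.20 -p tcp --dport 5000 -j ACCEPT
# Default-deny everything else between VLANs
iptables -A FORWARD -i vmbr0.10 -o vmbr0.20 -j DROP
iptables -A FORWARD -i vmbr0.10 -o vmbr0.30 -j DROP
```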
The Seven Mission-Critical Services
Each service is containerised and isolated within its assigned VLAN. The following documents the production configuration of all seven services, including their resource allocations and inter-service dependencies.
Service Registry
1. Portfolio Site -- Next.js 14 application serving neurodatalab.ai. Runs in LXC CT 101, VLAN 10. 2 cores, 2 GB RAM. Rebuilt on push via webhook.
2. MLflow Tracking Server -- Experiment tracking and model registry. LXC CT 200, VLAN 20. 2 cores, 4 GB RAM. PostgreSQL backend, S3-compatible artifact store.
3. GeoServer -- OGC-compliant geospatial data server. LXC CT 201, VLAN 20. 2 cores, 4 GB RAM. Serves WMS/WFS layers for mapping applications.
4. Monitoring Stack -- Prometheus, Grafana, and Loki. Distributed across CTs 300-302, VLAN 30. 4 cores total, 6 GB RAM total. 90-day retention.
5. AI Agent Fleet -- Python-based autonomous agents with Docker Compose orchestration. KVM VM 202, VLAN 20. 4 cores, 8 GB RAM. Runs 12 containerised agents.
6. Open edX -- Learning management system. LXC CT 102, VLAN 10. 4 cores, 8 GB RAM. Tutor-based deployment with MySQL and Elasticsearch.
7. Booking System -- Appointment scheduling service. LXC CT 103, VLAN 10. 1 core, 1 GB RAM. Node.js backend with PostgreSQL.
# Docker Compose for AI Agent Fleet (VM 202)
# /opt/agents/docker-compose.yml
# ─────────────────────────────────────────────
version: "3.9"
services:
  agent-orchestrator:
    image: ghcr.io/neurodatalab/agent-orchestrator:latest
    restart: unless-stopped
    environment:
      - REDIS_URL=redis://redis:6379/0
      - POSTGRES_URL=postgresql://agents:****@postgres:5432/agents
      - MLFLOW_TRACKING_URI=http://10.0.20.200:5000
    depends_on:
      - redis
      - postgres
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 1G
  sentinel-agent:
    image: ghcr.io/neurodatalab/sentinel-agent:latest
    restart: unless-stopped
    environment:
      - ORCHESTRATOR_URL=http://agent-orchestrator:8000
      - ALERT_WEBHOOK=https://hooks.internal/alerts
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 512M
  redis:
    image: redis:7-alpine
    restart: unless-stopped
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes --maxmemory 256mb
  postgres:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      - POSTGRES_DB=agents
      - POSTGRES_USER=agents
      - POSTGRES_PASSWORD_FILE=/run/secrets/db_password
    volumes:
      - pg-data:/var/lib/postgresql/data
    secrets:
      - db_password
volumes:
  redis-data:
  pg-data:
secrets:
  db_password:
    file: ./secrets/db_password.txt
networks:
  default:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16

The MLflow server uses a shared PostgreSQL backend for tracking metadata and a local MinIO instance (S3-compatible object storage) for artifacts. This configuration supports concurrent experiment tracking from multiple agents without contention, and artifacts are stored on the ZFS pool with automatic compression and snapshotting.
# MLflow Server Configuration (CT 200)
# /opt/mlflow/start.sh
# ─────────────────────────────────────────────
#!/bin/bash
# S3-compatible storage must be configured before the server starts
export MLFLOW_S3_ENDPOINT_URL=http://10.0.20.205:9000
export AWS_ACCESS_KEY_ID=mlflow-access
export AWS_SECRET_ACCESS_KEY=****

mlflow server \
    --backend-store-uri postgresql://mlflow:****@localhost:5432/mlflow \
    --default-artifact-root s3://mlflow-artifacts/ \
    --host 0.0.0.0 \
    --port 5000 \
    --workers 4 \
    --gunicorn-opts "--timeout 120 --keep-alive 5"

Monitoring and Observability
Observability is non-negotiable for self-hosted infrastructure. Without the managed monitoring that cloud providers offer, every failure mode must be anticipated and instrumented. The monitoring stack runs on dedicated containers in VLAN 30, isolated from production traffic to ensure that monitoring remains operational even during production incidents.
# Prometheus Configuration (CT 300)
# /etc/prometheus/prometheus.yml
# ─────────────────────────────────────────────
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

scrape_configs:
  # Proxmox node metrics
  - job_name: "proxmox-nodes"
    static_configs:
      - targets:
          - "192.168.100.1:9100"
          - "192.168.100.2:9100"
    relabel_configs:
      - source_labels: [__address__]
        regex: "192.168.100.1:.*"
        target_label: instance
        replacement: "pve-node-01"
      - source_labels: [__address__]
        regex: "192.168.100.2:.*"
        target_label: instance
        replacement: "pve-node-02"

  # Container metrics via cAdvisor
  - job_name: "containers"
    static_configs:
      - targets:
          - "10.0.10.100:8080"
    metrics_path: /metrics

  # Nginx metrics
  - job_name: "nginx"
    static_configs:
      - targets:
          - "10.0.10.100:9113"

  # ZFS metrics (custom exporter)
  - job_name: "zfs"
    static_configs:
      - targets:
          - "192.168.100.1:9134"
          - "192.168.100.2:9134"
    scrape_interval: 30s

  # Blackbox probes (endpoint availability)
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.neurodatalab.ai
          - https://mlflow.neurodatalab.ai
          - https://geo.neurodatalab.ai
          - https://learn.neurodatalab.ai
          - https://book.neurodatalab.ai
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.0.30.300:9115

# Alerting Rules
# /etc/prometheus/rules/critical.yml
# ─────────────────────────────────────────────
groups:
  - name: infrastructure
    rules:
      - alert: NodeDown
        expr: up{job="proxmox-nodes"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Proxmox node {{ $labels.instance }} is down"
          description: "Node has been unreachable for more than 2 minutes."

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% for 5 minutes."

      - alert: ZFSPoolDegraded
        expr: zfs_pool_health{state!="ONLINE"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is degraded"
          description: "Pool health state is {{ $labels.state }}."

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}:{{ $labels.mountpoint }}"

      - alert: ServiceDown
        expr: probe_success{job="blackbox-http"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "Blackbox probe has failed for 3 minutes."

      - alert: HighCPULoad
        # on(instance) is needed because node_load15 carries extra labels (e.g. job)
        expr: node_load15 / on(instance) count(node_cpu_seconds_total{mode="idle"}) by (instance) > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Sustained high CPU load on {{ $labels.instance }}"

      - alert: BackupFailed
        expr: time() - pbs_last_successful_backup_time > 172800
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Backup has not completed in 48 hours"

Log aggregation is handled by Loki, which receives logs from Promtail agents running on each container and node. Loki's label-based indexing keeps storage costs low while enabling fast querying through Grafana's Explore interface. Logs are retained for 90 days with automatic compaction.
# Promtail Configuration (deployed on each CT/VM)
# /etc/promtail/config.yml
# ─────────────────────────────────────────────
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://10.0.30.302:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          host: ${HOSTNAME}
          __path__: /var/log/syslog

  - job_name: containers
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          host: ${HOSTNAME}
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - docker: {}
      - timestamp:
          source: time
          format: RFC3339Nano

Grafana Dashboards
- Cluster Overview: node health, resource utilisation, HA status, migration events
- ZFS Health: pool status, scrub history, IO latency, compression ratios
- Service Uptime: blackbox probe results, response times, error rates per endpoint
- Nginx Traffic: requests/sec, latency percentiles, upstream health, error codes
- Backup Status: last backup time, size, dedup ratio, verification results
- Alert History: firing alerts, resolution times, escalation patterns
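Outside Grafana, the same Loki data can be queried ad hoc from a shell with logcli (a sketch; the address is CT 302 from the topology, label names match the Promtail config):

```
# Last hour of error lines from container logs
logcli --addr=http://10.0.30.302:3100 query --since=1h '{job="docker"} |= "error"'
```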
Backup and Disaster Recovery
The backup strategy follows the 3-2-1 rule: three copies of data, on two different media types, with one copy offsite. Proxmox Backup Server handles the first two requirements with on-node snapshots and deduplicated backups to the PBS datastore. Offsite replication pushes encrypted backups to a secondary Hetzner storage box via rsync over SSH.
Recovery Objectives
- RPO (Recovery Point Objective): 24 hours for all services, 1 hour for databases
- RTO (Recovery Time Objective): 15 minutes for container restore, 30 minutes for full VM
- Full disaster recovery (new node): 4 hours from bare metal to production
- Tested quarterly with documented runbooks for each scenario
# ZFS Snapshot Schedule (automated via cron)
# /etc/cron.d/zfs-snapshots
# ─────────────────────────────────────────────
# Hourly snapshots (keep 24)
0 * * * * root zfs snapshot -r datapool@auto-hourly-$(date +\%Y\%m\%d-\%H\%M)
# Daily cleanup -- keep 24 hourly, 7 daily, 4 weekly
15 0 * * * root /opt/scripts/zfs-prune.sh
# ZFS scrub (weekly, Sunday 03:00)
0 3 * * 0 root zpool scrub datapool
# ─────────────────────────────────────────────
# /opt/scripts/zfs-prune.sh
#!/bin/bash
set -euo pipefail

# Remove hourly snapshots older than 24 hours
zfs list -t snapshot -o name -H | \
  grep "auto-hourly" | \
  while read snap; do
    snap_date=$(echo "$snap" | grep -oP '\d{8}-\d{4}')
    snap_epoch=$(date -d "${snap_date:0:8} ${snap_date:9:2}:${snap_date:11:2}" +%s)
    now_epoch=$(date +%s)
    age=$(( (now_epoch - snap_epoch) / 3600 ))
    if [ "$age" -gt 24 ]; then
      zfs destroy "$snap"
    fi
  done

# Offsite Replication Script
# /opt/scripts/offsite-backup.sh
# ─────────────────────────────────────────────
#!/bin/bash
set -euo pipefail
REMOTE_HOST="uXXXXXX.your-storagebox.de"
REMOTE_PATH="/backups/proxmox"
PBS_DATASTORE="/mnt/pbs-datastore"
LOG="/var/log/offsite-backup.log"
echo "[$(date)] Starting offsite replication" >> "$LOG"
# Capture the exit status without tripping `set -e` on failure
STATUS=0
rsync -avz --delete \
    --bwlimit=50000 \
    -e "ssh -p 23 -i /root/.ssh/storagebox_ed25519" \
    "$PBS_DATASTORE/" \
    "$REMOTE_HOST:$REMOTE_PATH/" \
    >> "$LOG" 2>&1 || STATUS=$?

if [ $STATUS -eq 0 ]; then
  echo "[$(date)] Offsite replication completed successfully" >> "$LOG"
else
  echo "[$(date)] ERROR: Offsite replication failed (exit $STATUS)" >> "$LOG"
  # Trigger alert via Alertmanager webhook
  curl -s -X POST http://10.0.30.300:9093/api/v1/alerts \
    -H "Content-Type: application/json" \
    -d '[{"labels":{"alertname":"OffsiteBackupFailed","severity":"critical"}}]'
fi

Restore procedures are tested quarterly using a documented runbook. The test involves restoring a randomly selected container and VM to a temporary environment, verifying data integrity, and measuring actual recovery time against the RTO targets. Results are logged and any deviations trigger updates to the runbook or backup configuration.
# Restore Procedure Example (Container)
# ─────────────────────────────────────────────
# 1. List available backups
proxmox-backup-client list --repository pbs@localhost:datastore1
# 2. Restore container to temporary ID
pct restore 9101 \
pbs:backup/ct/101/2026-01-25T02:00:00Z \
--storage datapool \
--unique true \
--force true
# 3. Start restored container on isolated VLAN
pct set 9101 -net0 name=eth0,bridge=vmbr0,tag=99
pct start 9101
# 4. Verify application health
curl -f http://10.0.99.101:3000/health || echo "Health check failed"
# 5. Validate data integrity
pct exec 9101 -- pg_dump -U app appdb | md5sum
# Compare with production checksum
# 6. Document results and cleanup
pct stop 9101
pct destroy 9101

Security Hardening
Security on self-hosted infrastructure requires a defence-in-depth approach. There is no managed security group or cloud-native WAF operating by default. Every layer must be explicitly configured, audited, and maintained. The security model for this cluster operates across five layers: edge (Cloudflare), network (firewall + VLANs), host (OS hardening), container (isolation + AppArmor), and application (authentication + input validation).
# SSH Hardening (/etc/ssh/sshd_config)
# ─────────────────────────────────────────────
Port 22
Protocol 2
PermitRootLogin prohibit-password
PubkeyAuthentication yes
PasswordAuthentication no
ChallengeResponseAuthentication no
UsePAM yes
X11Forwarding no
MaxAuthTries 3
MaxSessions 5
ClientAliveInterval 300
ClientAliveCountMax 2
LoginGraceTime 30
# Only allow specific users
AllowUsers deploy admin
# Use strong key exchange and ciphers
KexAlgorithms curve25519-sha256@libssh.org,diffie-hellman-group16-sha512
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com

# Fail2ban Configuration
# /etc/fail2ban/jail.local
# ─────────────────────────────────────────────
[DEFAULT]
bantime = 3600
findtime = 600
maxretry = 3
backend = systemd
banaction = ufw
[sshd]
enabled = true
port = 22
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 86400
[proxmox]
enabled = true
port = 8006
filter = proxmox
logpath = /var/log/daemon.log
maxretry = 3
bantime = 3600
# Custom Proxmox filter
# /etc/fail2ban/filter.d/proxmox.conf
[Definition]
failregex = pvedaemon\[.*authentication failure; rhost=<HOST>
ignoreregex =

Cloudflare WAF rules provide the first line of defence against common web attacks. The rules are configured to block known attack patterns while allowing legitimate traffic through. Rate limiting at the Cloudflare edge prevents volumetric attacks from reaching the origin server.
# Cloudflare WAF Rules (via API / Dashboard)
# ─────────────────────────────────────────────
# Rule 1: Block known bad bots
# Expression: (cf.client.bot) and not (cf.bot_management.verified_bot)
# Action: Block
# Rule 2: Challenge suspicious regions
# Expression: (ip.geoip.country in {"CN" "RU" "KP"})
# and not (cf.bot_management.verified_bot)
# Action: Managed Challenge
# Rule 3: Rate limit API endpoints
# Expression: (http.request.uri.path contains "/api/")
# Rate: 100 requests per 10 seconds per IP
# Action: Block for 60 seconds
# Rule 4: Block SQL injection patterns
# Expression: (http.request.uri.query contains "UNION SELECT")
# or (http.request.uri.query contains "1=1")
# or (http.request.uri.query contains "DROP TABLE")
# Action: Block
# Rule 5: Require HTTPS
# Expression: (not ssl)
# Action: Redirect to HTTPS

Container Isolation Measures
- All LXC containers run unprivileged (user namespace mapping)
- AppArmor profiles enforce mandatory access control on each container
- Capability dropping removes unnecessary Linux capabilities (CAP_SYS_ADMIN, etc.)
- Seccomp profiles restrict available system calls to application requirements
- Read-only root filesystems where possible (mounted tmpfs for runtime data)
- Resource limits (cgroup v2) prevent CPU/memory exhaustion attacks
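A quick spot-check from the host confirms a container actually carries these settings (the expected values match the CT 101 template shown earlier):

```
pct config 101 | grep -E 'unprivileged|features'
# unprivileged: 1
# features: nesting=1
```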
# Automated Security Updates (unattended-upgrades)
# /etc/apt/apt.conf.d/50unattended-upgrades
# ─────────────────────────────────────────────
Unattended-Upgrade::Allowed-Origins {
"${distro_id}:${distro_codename}";
"${distro_id}:${distro_codename}-security";
"${distro_id}ESMApps:${distro_codename}-apps-security";
};
Unattended-Upgrade::AutoFixInterruptedDpkg "true";
Unattended-Upgrade::Remove-Unused-Kernel-Packages "true";
Unattended-Upgrade::Remove-Unused-Dependencies "true";
Unattended-Upgrade::Automatic-Reboot "false";
Unattended-Upgrade::Mail "admin@neurodatalab.ai";
Unattended-Upgrade::MailReport "on-change";
# Patching schedule (manual for Proxmox kernel updates)
# ─────────────────────────────────────────────
# Weekly: apt security updates (automated)
# Monthly: Proxmox VE updates (manual, rolling per node)
# Quarterly: Full system audit + kernel update if needed

Operational Results
After twelve months in production, the cluster has delivered on its core promise: enterprise-grade reliability at a fraction of cloud cost. The following metrics are derived from Prometheus data covering January 2025 through January 2026.
Uptime Metrics (12-Month Rolling)
Service              Uptime %    Downtime (total)
───────────────────────────────────────────────────────
Portfolio Site       99.997%     1m 34s
MLflow Server        99.993%     3m 41s
GeoServer            99.991%     4m 43s
Monitoring Stack     99.999%     0m 32s
AI Agent Fleet       99.982%     9m 28s
Open edX             99.989%     5m 47s
Booking System       99.996%     2m 06s
───────────────────────────────────────────────────────
Cluster Average      99.992%     3m 57s
Target               99.990%     5m 15s

Incidents (12 months):
  P1 (service down): 2 events
  P2 (degraded):     5 events
  P3 (minor):        11 events

Root causes:
  - Planned maintenance: 4 (rolling updates, zero downtime)
  - Hardware: 0
  - Software bug: 1 (Proxmox HA race condition, patched)
  - Network: 1 (Hetzner upstream, 4 minutes)
  - Human error: 1 (misconfigured VLAN tag, 5 minutes)
Cost Analysis (Annual)
Category Bare Metal Cloud Equivalent ─────────────────────────────────────────────────────── Compute (2 nodes) EUR 1,080 EUR 9,120 Storage (2 TB usable) included EUR 2,400 Bandwidth (10 TB/mo) included EUR 1,200 Backup storage EUR 120 EUR 480 Cloudflare Pro EUR 240 N/A (ALB: EUR 360) Domain + SSL EUR 12 EUR 12 ─────────────────────────────────────────────────────── Total Annual EUR 1,452 EUR 13,572 ─────────────────────────────────────────────────────── Annual Savings: EUR 12,120 (89.3% reduction) Note: Cloud estimate based on AWS eu-central-1 pricing for equivalent compute, storage, and transfer. Does not include managed Kubernetes or RDS surcharges that would be required for equivalent service isolation.
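The totals in the table above can be sanity-checked with a few lines of arithmetic. Figures are copied from the table (EUR, annual); items marked "included" count as zero:

```python
# Annual cost figures from the table above (EUR); "included" items count as 0.
bare_metal = {
    "compute (2 nodes)": 1080,
    "storage (2 TB usable)": 0,   # included
    "bandwidth (10 TB/mo)": 0,    # included
    "backup storage": 120,
    "Cloudflare Pro": 240,
    "domain + SSL": 12,
}
cloud = {
    "compute (2 nodes)": 9120,
    "storage (2 TB usable)": 2400,
    "bandwidth (10 TB/mo)": 1200,
    "backup storage": 480,
    "ALB (in lieu of Cloudflare Pro)": 360,
    "domain + SSL": 12,
}

bare_total = sum(bare_metal.values())   # EUR 1,452
cloud_total = sum(cloud.values())       # EUR 13,572
savings = cloud_total - bare_total      # EUR 12,120
reduction = savings / cloud_total       # ~0.893

print(f"Annual savings: EUR {savings:,} ({reduction:.1%} reduction)")
# → Annual savings: EUR 12,120 (89.3% reduction)
```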
Performance Benchmarks
Metric                    Bare Metal    AWS m5.xlarge
───────────────────────────────────────────────────────
Sequential Read           3,200 MB/s    250 MB/s (gp3)
Sequential Write          2,800 MB/s    250 MB/s (gp3)
Random 4K IOPS (read)     620,000       16,000 (gp3)
Random 4K IOPS (write)    540,000       16,000 (gp3)
P99 Latency (read)        0.12ms        1.8ms
Memory Bandwidth          38 GB/s       21 GB/s
Network Latency (local)   0.05ms        0.3ms
───────────────────────────────────────────────────────
Portfolio Site (p95)      42ms          68ms
MLflow API (p95)          18ms          31ms
GeoServer WMS (p95)       95ms          180ms

The benchmarks demonstrate that bare-metal hardware consistently outperforms equivalent cloud instances, particularly for IO-intensive workloads. NVMe storage on dedicated hardware delivers more than an order of magnitude higher IOPS than cloud-attached block storage at the same price point, and CPU performance is predictable, free of the noisy-neighbour effects common in shared cloud environments.
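Storage numbers of this kind are typically gathered with fio. A job file along these lines reproduces the sequential and random-4K tests -- the parameters and file path are illustrative, not necessarily the exact ones used for the table above:

```ini
# nvme-bench.fio -- run one section at a time:
#   fio --section=rand-read-4k nvme-bench.fio
[global]
ioengine=io_uring
direct=1
time_based
runtime=60
group_reporting
filename=/mnt/bench/fio-testfile
size=4G

[seq-read]
rw=read
bs=1M
iodepth=32

[rand-read-4k]
rw=randread
bs=4k
iodepth=64
numjobs=4
```

`direct=1` bypasses the page cache so the drive, not RAM, is being measured; `group_reporting` aggregates the four random-read jobs into a single IOPS figure.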
The operational overhead of self-managed infrastructure averages approximately four hours per month, broken down into: monitoring review (1 hour), security updates (1 hour), backup verification (30 minutes), capacity planning (30 minutes), and documentation updates (1 hour). This is manageable for a single operator and represents a small fraction of the cost savings compared to managed cloud services.
The infrastructure has successfully survived two unplanned events: a Hetzner network maintenance window that lasted four minutes (during which Cloudflare served cached content for the portfolio site), and a Proxmox HA race condition during a rolling update that was resolved by the HA manager within 90 seconds. Both incidents validated the architectural decisions around Cloudflare caching, HA configuration, and automated failover.
Lessons Learned
- --ZFS mirror is non-negotiable. The data integrity guarantees alone justify the storage overhead. Silent corruption on a single drive would have caused undetected data loss without checksumming.
- --Cloudflare Tunnel eliminates entire attack surface categories. No inbound ports means no port scanning, no direct DDoS, and no accidental exposure of management interfaces.
- --Monitoring must be on a separate VLAN. Early in the project, a runaway log volume from a misconfigured application saturated the monitoring container and caused alert blindness. VLAN isolation with QoS prevents this.
- --Test restores quarterly or they are worthless. A backup that has never been restored is a hypothesis, not a guarantee. Every quarterly test has revealed at least one minor improvement opportunity.
- --LXC for everything possible, KVM only when necessary. The resource savings from LXC containers compound across a cluster. Reserve KVM for workloads that genuinely require kernel isolation or Docker-in-Docker.
- --Document every runbook as if future-you has no memory. Incident response under pressure is not the time to figure out which PBS datastore contains the most recent backup of a specific container.
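As a concrete example of the last two lessons, the quarterly restore drill can be captured as a short runbook. The commands below are a sketch using standard Proxmox tooling; the scratch VMID (999), storage names, and service name are placeholders, not the cluster's actual values:

```
# Quarterly LXC restore drill (sketch -- VMID 999, storage names,
# and <placeholders> are illustrative)

# 1. List the PBS archives available for the container under test
pvesm list pbs-backup

# 2. Restore the newest archive to a scratch VMID (original untouched)
pct restore 999 pbs-backup:<archive-volid> --storage local-zfs

# 3. Boot it and verify the service actually answers
pct start 999
pct exec 999 -- systemctl is-active <service>

# 4. Record the result in the runbook, then clean up
pct stop 999 && pct destroy 999
```

A drill like this turns "the backup exists" into "the backup restores and the service starts", which is the only claim that matters during an incident.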
Self-hosted infrastructure is not for every workload or every team. It requires discipline in operations, security, and documentation that managed cloud services abstract away. But for stable, predictable workloads where the operator has the skills and commitment to maintain the platform, the economics are compelling and the control is liberating. This cluster proves that 99.99% uptime is achievable on consumer-grade hardware with open-source software -- at roughly one-tenth the cost of equivalent cloud infrastructure.