Why Self-Hosted Infrastructure
The prevailing narrative in infrastructure engineering defaults to cloud-first. AWS, Azure, and GCP offer undeniable convenience -- but that convenience comes at a cost that compounds rapidly once workloads stabilise. For predictable, long-running services the economics of bare-metal hosting are difficult to ignore: a dedicated Hetzner server with 64 GB RAM and 1 TB NVMe storage costs roughly EUR 45/month. The equivalent compute and storage on AWS EC2 with EBS would exceed EUR 350/month before factoring in data transfer fees.
The question is not whether bare metal is cheaper -- it plainly is for stable workloads -- but whether you can achieve enterprise-grade reliability without the managed services that cloud providers bundle into their pricing. This article documents exactly how: a Proxmox VE cluster running on Hetzner dedicated servers, delivering 99.99% measured uptime across seven production services over the past twelve months.
The architecture prioritises three principles: repeatability through infrastructure-as-code, defence-in-depth through layered security, and observability through comprehensive monitoring. Every configuration described below is version-controlled, every change is auditable, and every service is instrumented.
Cost Comparison: Bare Metal vs Cloud (Monthly)
- Hetzner AX41-NVMe (dedicated): EUR 45.00 -- 64 GB RAM, 2x 512 GB NVMe, Ryzen 5 3600
- AWS equivalent (m5.xlarge + 1 TB gp3 + transfer): EUR 380.00+
- Annual savings: approx. EUR 4,000 per server node
- Trade-off: self-managed updates, monitoring, and disaster recovery
The goal of this project was never to replicate every AWS service. It was to prove that a small, carefully designed bare-metal cluster can run a portfolio of production services with reliability metrics that rival managed cloud offerings -- at roughly 11% of the cost.
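The headline saving is simple arithmetic on the two monthly figures; a quick sanity check:

```shell
# Back-of-the-envelope check of the per-node saving (figures from the comparison above)
bare=45          # EUR/month, Hetzner AX41-NVMe
cloud=380        # EUR/month, AWS equivalent (lower bound)
annual=$(( (cloud - bare) * 12 ))
echo "annual savings per node: EUR ${annual}"
```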
Hardware Architecture
The cluster runs on Hetzner dedicated servers selected for their price-to-performance ratio and hardware reliability track record. Each node is provisioned with identical specifications to simplify live migration and capacity planning.
# Node Specifications (per server)
# ─────────────────────────────────────────────
Model: Hetzner AX41-NVMe
CPU: AMD Ryzen 5 3600 (6C/12T @ 3.6 GHz)
RAM: 64 GB DDR4 ECC
Storage: 2x 512 GB NVMe SSD (Samsung PM9A3)
Network: 1 Gbps dedicated uplink
Location: Falkenstein, DE (FSN1-DC14)
OS: Proxmox VE 8.1
Kernel: 6.5.13-3-pve
# Storage Layout
# ─────────────────────────────────────────────
# ZFS mirror across both NVMe drives
# - rpool: root filesystem (50 GB)
# - datapool: VM/CT storage (400 GB)
# - swap: 8 GB zvol on rpool

ZFS was chosen as the filesystem for its built-in data integrity verification, snapshot capabilities, and native integration with Proxmox. The mirror configuration across both NVMe drives provides redundancy against single-drive failure without sacrificing read performance. ZFS checksumming catches silent data corruption that would go undetected on ext4 or XFS.
# ZFS Pool Configuration
# ─────────────────────────────────────────────
zpool create -f -o ashift=12 \
    -O compression=lz4 \
    -O atime=off \
    -O xattr=sa \
    -O dnodesize=auto \
    -O normalization=formD \
    -O mountpoint=none \
    datapool mirror /dev/nvme0n1p3 /dev/nvme1n1p3
# Verify pool status
zpool status datapool
  pool: datapool
 state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        datapool       ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme0n1p3  ONLINE       0     0     0
            nvme1n1p3  ONLINE       0     0     0
# ZFS dataset hierarchy
# datapool/
# ├── ct/ -- LXC container rootfs
# ├── vm/ -- QEMU disk images
# ├── backups/ -- local backup staging
# └── templates/ -- ISO and CT templates

Network Topology
Internet
│
├── Cloudflare (DNS + WAF + CDN)
│ │
│ └── Cloudflare Tunnel (cloudflared)
│ │
├── vmbr0 (Public Bridge)
│ │ IP: 10.0.0.1/24
│ │
│ ├── VLAN 10: Production Services
│ │ ├── CT 100: nginx-proxy
│ │ ├── CT 101: portfolio-site
│ │ ├── CT 102: open-edx
│ │ └── CT 103: booking-system
│ │
│ ├── VLAN 20: Data + ML
│ │ ├── CT 200: mlflow-server
│ │ ├── CT 201: geoserver
│ │ └── VM 202: ai-agents
│ │
│ ├── VLAN 30: Monitoring
│ │ ├── CT 300: prometheus
│ │ ├── CT 301: grafana
│ │ └── CT 302: loki
│ │
│ └── VLAN 40: Management
│ ├── Proxmox WebUI (:8006)
│ └── PBS WebUI (:8007)
│
└── vmbr1 (Internal Bridge)
│ IP: 192.168.100.1/24
│ (inter-node cluster traffic)
        └── Corosync + migration traffic

Each VLAN is isolated at the bridge level within Proxmox. Traffic between VLANs passes through the nginx reverse proxy container on VLAN 10, which acts as the sole ingress point for external requests. Inter-node cluster communication (Corosync heartbeat and live migration) runs on a dedicated internal bridge that is never exposed to the public network.
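For reference, the bridge layout above corresponds to an ifupdown2 configuration along these lines -- a sketch only, since the physical NIC name (enp35s0 here) and the use of a Hetzner vSwitch VLAN for the internal bridge are assumptions:

```
# /etc/network/interfaces (sketch; NIC name and vSwitch VLAN are assumed)
auto vmbr0
iface vmbr0 inet static
    address 10.0.0.1/24
    bridge-ports enp35s0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 10,20,30,40

auto vmbr1
iface vmbr1 inet static
    address 192.168.100.1/24
    bridge-ports enp35s0.4000
    bridge-stp off
    bridge-fd 0
```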
Proxmox Cluster Design
The cluster is configured for high availability with automatic failover. Proxmox's built-in HA manager monitors service health through Corosync quorum and triggers live migration or restart on the surviving node when a failure is detected. Fencing ensures that a failed node is isolated before its workloads are restarted elsewhere, preventing split-brain scenarios.
# /etc/pve/corosync.conf (excerpt)
# ─────────────────────────────────────────────
totem {
  version: 2
  secauth: on
  cluster_name: prod-cluster
  transport: knet
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.100.0
    mcastport: 5405
  }
  crypto_cipher: aes256
  crypto_hash: sha256
}

nodelist {
  node {
    ring0_addr: 192.168.100.1
    name: pve-node-01
    nodeid: 1
  }
  node {
    ring0_addr: 192.168.100.2
    name: pve-node-02
    nodeid: 2
  }
}

quorum {
  provider: corosync_votequorum
  expected_votes: 2
  two_node: 1
}

Resource pools partition the cluster into logical groups that match the VLAN segmentation. Pools themselves are organisational; per-guest CPU and memory limits prevent a runaway container from starving other services. The HA group configuration assigns preferred nodes to each service while allowing migration to the partner node under failure conditions.
# HA Group Configuration
# ─────────────────────────────────────────────
# pvesh create /cluster/ha/groups
# --group production
# --nodes pve-node-01,pve-node-02
# --nofailback 0
# --restricted 0
# HA Resource Registration
pvesh create /cluster/ha/resources \
--sid ct:100 \
--group production \
--max_restart 3 \
--max_relocate 2 \
--state started
pvesh create /cluster/ha/resources \
--sid ct:101 \
--group production \
--max_restart 3 \
--max_relocate 2 \
--state started
# Verify HA status
pvesh get /cluster/ha/status/current
# ┌──────────┬────────┬────────────┬─────────┐
# │ sid │ state │ node │ request │
# ├──────────┼────────┼────────────┼─────────┤
# │ ct:100 │ started│ pve-node-01│ │
# │ ct:101 │ started│ pve-node-01│ │
# │ ct:102 │ started│ pve-node-02│ │
# │ ct:200 │ started│ pve-node-01│ │
# │ ct:201 │ started│ pve-node-02│ │
# │ vm:202 │ started│ pve-node-01│ │
# └──────────┴────────┴────────────┴─────────┘

Backup scheduling is handled by Proxmox Backup Server (PBS) running on a separate partition. PBS deduplicates at the chunk level, meaning incremental backups of large VM disks complete in seconds when only a small percentage of blocks have changed. Full backup verification runs weekly to confirm restore integrity.
# Backup Schedule (/etc/pve/jobs.cfg excerpt)
# ─────────────────────────────────────────────
vzdump: daily-backup
enabled 1
schedule daily 02:00
storage pbs-local
mode snapshot
compress zstd
mailnotification always
mailto admin@neurodatalab.ai
all 1
prune-backups keep-daily=7,keep-weekly=4,keep-monthly=6
# PBS Datastore Pruning Policy
# keep-last: 3
# keep-daily: 7
# keep-weekly: 4
# keep-monthly: 6
# keep-yearly: 1

Live Migration Procedure
Live migration moves a running VM between nodes with no user-visible downtime: memory pages are transferred iteratively until the remaining delta is small enough for a final switchover that typically takes under 100ms. LXC containers are moved in restart mode -- a brief stop, transfer, and start -- which for the small containers in this cluster completes in seconds.
- Pre-check: verify target node has sufficient resources
- Phase 1: bulk memory copy over internal bridge (vmbr1)
- Phase 2: iterative dirty-page sync (typically 2-3 rounds)
- Phase 3: final pause, last-page copy, resume on target (<100ms)
- Post-migration: ARP announcement to update network switches
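In day-to-day operation the whole procedure reduces to a single command per guest; a sketch using IDs and node names from this cluster:

```
# VM 202: live migration -- memory is streamed while the guest keeps running
qm migrate 202 pve-node-02 --online --with-local-disks

# CT 101: containers move in restart mode (brief stop, transfer, start)
pct migrate 101 pve-node-02 --restart
```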
Container and VM Strategy
Proxmox supports two virtualisation technologies: LXC containers (OS-level) and QEMU/KVM virtual machines (full hardware). The decision between them is driven by the workload's requirements for kernel isolation, resource overhead tolerance, and operational complexity.
LXC vs KVM Decision Matrix
Criterion              LXC Container      KVM Virtual Machine
───────────────────────────────────────────────────────────
Boot time              < 2 seconds        15-30 seconds
Memory overhead        ~20 MB             ~256 MB
Kernel isolation       Shared host        Full isolation
Storage efficiency     Thin provision     Thick/thin QCOW2
Live migration speed   Fast (no RAM)      Slower (RAM copy)
Docker inside          Nested (config)    Native support
Custom kernel          No                 Yes
GPU passthrough        No                 Yes (IOMMU)
───────────────────────────────────────────────────────────
Use case in cluster:
  LXC: nginx, web apps, databases, monitoring
  KVM: AI agents (Docker-in-Docker), GPU workloads
Six of the seven production services run as LXC containers. The exception is the AI agent service, which requires Docker Compose internally and therefore runs as a full KVM VM. LXC containers share the host kernel, eliminating the memory overhead of running separate kernels and reducing boot time to under two seconds.
# LXC Container Template (production base)
# /etc/pve/lxc/101.conf
# ─────────────────────────────────────────────
arch: amd64
cores: 2
memory: 2048
swap: 512
rootfs: datapool:ct/subvol-101-disk-0,size=20G
hostname: portfolio-site
nameserver: 1.1.1.1
searchdomain: internal.cluster
net0: name=eth0,bridge=vmbr0,tag=10,ip=10.0.10.101/24,gw=10.0.10.1
onboot: 1
startup: order=2,up=30,down=30
unprivileged: 1
features: nesting=1
lxc.apparmor.profile: generated
lxc.cap.drop:
# Resource Limits
lxc.cgroup2.memory.max: 2147483648
lxc.cgroup2.cpu.max: 200000 100000

# KVM VM Configuration (AI Agents)
# /etc/pve/qemu-server/202.conf
# ─────────────────────────────────────────────
agent: 1
balloon: 2048
boot: order=scsi0
cores: 4
cpu: host
memory: 8192
name: ai-agents
net0: virtio=XX:XX:XX:XX:XX:XX,bridge=vmbr0,tag=20
numa: 0
onboot: 1
ostype: l26
scsi0: datapool:vm-202-disk-0,iothread=1,size=80G
scsihw: virtio-scsi-single
serial0: socket
startup: order=5,up=60,down=60
vga: serial0
# Cloud-init drive for automated provisioning
ide2: datapool:vm-202-cloudinit,media=cdrom
ciuser: deploy
cipassword: [redacted]
sshkeys: ssh-ed25519%20AAAA...%20deploy@cluster
ipconfig0: ip=10.0.20.202/24,gw=10.0.20.1

Cloud-init integration allows new VMs to be provisioned from a template with a single command. The cloud-init drive injects SSH keys, network configuration, and an initial user -- eliminating manual setup entirely. Templates are updated monthly with the latest security patches and stored in the datapool templates dataset.
# Creating a VM from template with cloud-init
# ─────────────────────────────────────────────
qm clone 9000 203 --name new-service --full
qm set 203 --ciuser deploy
qm set 203 --sshkeys /root/.ssh/deploy_ed25519.pub
qm set 203 --ipconfig0 ip=10.0.20.203/24,gw=10.0.20.1
qm start 203
# Template creation workflow
qm create 9000 --memory 2048 --cores 2 --name ubuntu-template
qm importdisk 9000 ubuntu-24.04-cloudimg-amd64.img datapool
qm set 9000 --scsi0 datapool:vm-9000-disk-0
qm set 9000 --boot order=scsi0
qm set 9000 --ide2 datapool:cloudinit
qm set 9000 --agent enabled=1
qm template 9000

Network Architecture
External traffic never touches the server directly. All public requests are proxied through Cloudflare, which provides DNS resolution, DDoS mitigation, WAF rules, and CDN caching. The connection between Cloudflare and the cluster is secured via a Cloudflare Tunnel (formerly Argo Tunnel), which establishes an outbound-only encrypted connection from the server to Cloudflare's edge network. No inbound ports are opened on the host firewall for web traffic.
# Cloudflare Tunnel Configuration
# /etc/cloudflared/config.yml
# ─────────────────────────────────────────────
tunnel: a1b2c3d4-e5f6-7890-abcd-ef1234567890
credentials-file: /etc/cloudflared/credentials.json
ingress:
  - hostname: neurodatalab.ai
    service: http://10.0.10.101:3000
    originRequest:
      noTLSVerify: true
  - hostname: mlflow.neurodatalab.ai
    service: http://10.0.20.200:5000
    originRequest:
      noTLSVerify: true
  - hostname: geo.neurodatalab.ai
    service: http://10.0.20.201:8080
    originRequest:
      noTLSVerify: true
  - hostname: monitor.neurodatalab.ai
    service: http://10.0.30.301:3000
    originRequest:
      noTLSVerify: true
  - hostname: learn.neurodatalab.ai
    service: http://10.0.10.102:80
    originRequest:
      noTLSVerify: true
  - hostname: book.neurodatalab.ai
    service: http://10.0.10.103:8080
    originRequest:
      noTLSVerify: true
  # Catch-all rule (required)
  - service: http_status:404

Inside the cluster, Nginx acts as the reverse proxy for inter-service routing and TLS termination for internal HTTPS endpoints. Each service gets its own upstream block with health checks and connection limits.
# Nginx Reverse Proxy Configuration (excerpt)
# /etc/nginx/conf.d/services.conf
# ─────────────────────────────────────────────
# Rate limiting zone
limit_req_zone $binary_remote_addr zone=api:10m rate=30r/s;
limit_conn_zone $binary_remote_addr zone=conn:10m;

upstream portfolio {
    server 10.0.10.101:3000 max_fails=3 fail_timeout=30s;
    keepalive 16;
}

upstream mlflow {
    server 10.0.20.200:5000 max_fails=3 fail_timeout=30s;
    keepalive 8;
}

upstream geoserver {
    server 10.0.20.201:8080 max_fails=3 fail_timeout=30s;
    keepalive 8;
}

server {
    listen 80;
    server_name neurodatalab.ai;

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;
    add_header Referrer-Policy "strict-origin-when-cross-origin" always;
    add_header Content-Security-Policy "default-src 'self'" always;

    location / {
        proxy_pass http://portfolio;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        limit_req zone=api burst=50 nodelay;
        limit_conn conn 100;
    }

    # Health check endpoint
    location /health {
        access_log off;
        return 200 "ok";
        add_header Content-Type text/plain;
    }
}

Firewall Rules (Host Level)
# UFW rules on each Proxmox node
# ─────────────────────────────────────────────
ufw default deny incoming
ufw default allow outgoing

# SSH (key-only, rate-limited)
ufw limit 22/tcp

# Proxmox WebUI (restricted to admin VPN)
ufw allow from 10.10.0.0/24 to any port 8006

# Corosync cluster communication
ufw allow from 192.168.100.0/24 to any port 5405

# PBS (internal only)
ufw allow from 192.168.100.0/24 to any port 8007

# Cloudflare tunnel (outbound only -- no rule needed)
# All web traffic arrives via tunnel, not direct inbound
ufw enable
ufw status verbose
The VLAN segmentation ensures that even if one service is compromised, lateral movement is restricted. A container on VLAN 10 cannot initiate connections to VLAN 20 or 30 unless explicitly permitted by iptables rules on the host. This microsegmentation is the network equivalent of the principle of least privilege.
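To make that concrete, host rules of roughly this shape permit only the proxy's forward path and established replies, and drop everything else crossing segment boundaries (a sketch; the vmbr0.N subinterface naming is an assumption):

```
# Allow established return traffic first
iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
# Explicitly permit nginx (VLAN 10) to reach MLflow (VLAN 20) on its service port
iptables -A FORWARD -i vmbr0.10 -o vmbr0.20 -p tcp --dport 5000 -j ACCEPT
# Default-deny everything else between VLANs
iptables -A FORWARD -i vmbr0.10 -o vmbr0.20 -j DROP
iptables -A FORWARD -i vmbr0.10 -o vmbr0.30 -j DROP
```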
The Seven Mission-Critical Services
Each service is containerised and isolated within its assigned VLAN. The following documents the production configuration of all seven services, including their resource allocations and inter-service dependencies.
Service Registry
1. Portfolio Site -- Next.js 14 application serving neurodatalab.ai. Runs in LXC CT 101, VLAN 10. 2 cores, 2 GB RAM. Rebuilt on push via webhook.
2. MLflow Tracking Server -- Experiment tracking and model registry. LXC CT 200, VLAN 20. 2 cores, 4 GB RAM. PostgreSQL backend, S3-compatible artifact store.
3. GeoServer -- OGC-compliant geospatial data server. LXC CT 201, VLAN 20. 2 cores, 4 GB RAM. Serves WMS/WFS layers for mapping applications.
4. Monitoring Stack -- Prometheus, Grafana, and Loki. Distributed across CTs 300-302, VLAN 30. 4 cores total, 6 GB RAM total. 90-day retention.
5. AI Agent Fleet -- Python-based autonomous agents with Docker Compose orchestration. KVM VM 202, VLAN 20. 4 cores, 8 GB RAM. Runs 12 containerised agents.
6. Open edX -- Learning management system. LXC CT 102, VLAN 10. 4 cores, 8 GB RAM. Tutor-based deployment with MySQL and Elasticsearch.
7. Booking System -- Appointment scheduling service. LXC CT 103, VLAN 10. 1 core, 1 GB RAM. Node.js backend with PostgreSQL.
# Docker Compose for AI Agent Fleet (VM 202)
# /opt/agents/docker-compose.yml
# ─────────────────────────────────────────────
version: "3.9"
services:
  agent-orchestrator:
    image: ghcr.io/neurodatalab/agent-orchestrator:latest
    restart: unless-stopped
    environment:
      - REDIS_URL=redis://redis:6379/0
      - POSTGRES_URL=postgresql://agents:****@postgres:5432/agents
      - MLFLOW_TRACKING_URI=http://10.0.20.200:5000
    depends_on:
      - redis
      - postgres
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 1G
  sentinel-agent:
    image: ghcr.io/neurodatalab/sentinel-agent:latest
    restart: unless-stopped
    environment:
      - ORCHESTRATOR_URL=http://agent-orchestrator:8000
      - ALERT_WEBHOOK=https://hooks.internal/alerts
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 512M
  redis:
    image: redis:7-alpine
    restart: unless-stopped
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes --maxmemory 256mb
  postgres:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      - POSTGRES_DB=agents
      - POSTGRES_USER=agents
      - POSTGRES_PASSWORD_FILE=/run/secrets/db_password
    volumes:
      - pg-data:/var/lib/postgresql/data
    secrets:
      - db_password
volumes:
  redis-data:
  pg-data:
secrets:
  db_password:
    file: ./secrets/db_password.txt
networks:
  default:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16

The MLflow server uses a shared PostgreSQL backend for tracking metadata and a local MinIO instance (S3-compatible object storage) for artifacts. This configuration supports concurrent experiment tracking from multiple agents without contention, and artifacts are stored on the ZFS pool with automatic compression and snapshotting.
# MLflow Server Configuration (CT 200)
# /opt/mlflow/start.sh
# ─────────────────────────────────────────────
#!/bin/bash
# S3-compatible storage must be configured before the server starts
export MLFLOW_S3_ENDPOINT_URL=http://10.0.20.205:9000
export AWS_ACCESS_KEY_ID=mlflow-access
export AWS_SECRET_ACCESS_KEY=****

mlflow server \
    --backend-store-uri postgresql://mlflow:****@localhost:5432/mlflow \
    --default-artifact-root s3://mlflow-artifacts/ \
    --host 0.0.0.0 \
    --port 5000 \
    --workers 4 \
    --gunicorn-opts "--timeout 120 --keep-alive 5"

Monitoring and Observability
Observability is non-negotiable for self-hosted infrastructure. Without the managed monitoring that cloud providers offer, every failure mode must be anticipated and instrumented. The monitoring stack runs on dedicated containers in VLAN 30, isolated from production traffic to ensure that monitoring remains operational even during production incidents.
# Prometheus Configuration (CT 300)
# /etc/prometheus/prometheus.yml
# ─────────────────────────────────────────────
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

scrape_configs:
  # Proxmox node metrics
  - job_name: "proxmox-nodes"
    static_configs:
      - targets:
          - "192.168.100.1:9100"
          - "192.168.100.2:9100"
    relabel_configs:
      - source_labels: [__address__]
        regex: "192.168.100.1:.*"
        target_label: instance
        replacement: "pve-node-01"
      - source_labels: [__address__]
        regex: "192.168.100.2:.*"
        target_label: instance
        replacement: "pve-node-02"

  # Container metrics via cAdvisor
  - job_name: "containers"
    static_configs:
      - targets:
          - "10.0.10.100:8080"
    metrics_path: /metrics

  # Nginx metrics
  - job_name: "nginx"
    static_configs:
      - targets:
          - "10.0.10.100:9113"

  # ZFS metrics (custom exporter)
  - job_name: "zfs"
    static_configs:
      - targets:
          - "192.168.100.1:9134"
          - "192.168.100.2:9134"
    scrape_interval: 30s

  # Blackbox probes (endpoint availability)
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.neurodatalab.ai
          - https://mlflow.neurodatalab.ai
          - https://geo.neurodatalab.ai
          - https://learn.neurodatalab.ai
          - https://book.neurodatalab.ai
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.0.30.300:9115

# Alerting Rules
# /etc/prometheus/rules/critical.yml
# ─────────────────────────────────────────────
groups:
  - name: infrastructure
    rules:
      - alert: NodeDown
        expr: up{job="proxmox-nodes"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Proxmox node {{ $labels.instance }} is down"
          description: "Node has been unreachable for more than 2 minutes."

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% for 5 minutes."

      - alert: ZFSPoolDegraded
        expr: zfs_pool_health{state!="ONLINE"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is degraded"
          description: "Pool health state is {{ $labels.state }}."

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}:{{ $labels.mountpoint }}"

      - alert: ServiceDown
        expr: probe_success{job="blackbox-http"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "Blackbox probe has failed for 3 minutes."

      - alert: HighCPULoad
        # on(instance) is needed because node_load15 carries extra labels (e.g. job)
        expr: node_load15 / on(instance) count(node_cpu_seconds_total{mode="idle"}) by (instance) > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Sustained high CPU load on {{ $labels.instance }}"

      - alert: BackupFailed
        expr: time() - pbs_last_successful_backup_time > 172800
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Backup has not completed in 48 hours"

Log aggregation is handled by Loki, which receives logs from Promtail agents running on each container and node. Loki's label-based indexing keeps storage costs low while enabling fast querying through Grafana's Explore interface. Logs are retained for 90 days with automatic compaction.
# Promtail Configuration (deployed on each CT/VM)
# /etc/promtail/config.yml
# ─────────────────────────────────────────────
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://10.0.30.302:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          host: ${HOSTNAME}
          __path__: /var/log/syslog

  - job_name: containers
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          host: ${HOSTNAME}
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - docker: {}
      - timestamp:
          source: time
          format: RFC3339Nano

Grafana Dashboards
- Cluster Overview: node health, resource utilisation, HA status, migration events
- ZFS Health: pool status, scrub history, IO latency, compression ratios
- Service Uptime: blackbox probe results, response times, error rates per endpoint
- Nginx Traffic: requests/sec, latency percentiles, upstream health, error codes
- Backup Status: last backup time, size, dedup ratio, verification results
- Alert History: firing alerts, resolution times, escalation patterns
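Outside Grafana, the same Loki data can be queried ad hoc from a shell with logcli (a sketch; the address is CT 302 from the topology, label names match the Promtail config):

```
# Last hour of error lines from container logs
logcli --addr=http://10.0.30.302:3100 query --since=1h '{job="docker"} |= "error"'
```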
Backup and Disaster Recovery
The backup strategy follows the 3-2-1 rule: three copies of data, on two different media types, with one copy offsite. Proxmox Backup Server handles the first two requirements with on-node snapshots and deduplicated backups to the PBS datastore. Offsite replication pushes encrypted backups to a secondary Hetzner storage box via rsync over SSH.
Recovery Objectives
- RPO (Recovery Point Objective): 24 hours for all services, 1 hour for databases
- RTO (Recovery Time Objective): 15 minutes for container restore, 30 minutes for full VM
- Full disaster recovery (new node): 4 hours from bare metal to production
- Tested quarterly with documented runbooks for each scenario
# ZFS Snapshot Schedule (automated via cron)
# /etc/cron.d/zfs-snapshots
# ─────────────────────────────────────────────
# Hourly snapshots (keep 24)
0 * * * * root zfs snapshot -r datapool@auto-hourly-$(date +\%Y\%m\%d-\%H\%M)
# Daily cleanup -- keep 24 hourly, 7 daily, 4 weekly
15 0 * * * root /opt/scripts/zfs-prune.sh
# ZFS scrub (weekly, Sunday 03:00)
0 3 * * 0 root zpool scrub datapool
# ─────────────────────────────────────────────
# /opt/scripts/zfs-prune.sh
#!/bin/bash
set -euo pipefail

# Remove hourly snapshots older than 24 hours
zfs list -t snapshot -o name -H | \
  grep "auto-hourly" | \
  while read snap; do
    snap_date=$(echo "$snap" | grep -oP '\d{8}-\d{4}')
    snap_epoch=$(date -d "${snap_date:0:8} ${snap_date:9:2}:${snap_date:11:2}" +%s)
    now_epoch=$(date +%s)
    age=$(( (now_epoch - snap_epoch) / 3600 ))
    if [ "$age" -gt 24 ]; then
      zfs destroy "$snap"
    fi
  done

# Offsite Replication Script
# /opt/scripts/offsite-backup.sh
# ─────────────────────────────────────────────
#!/bin/bash
set -euo pipefail
REMOTE_HOST="uXXXXXX.your-storagebox.de"
REMOTE_PATH="/backups/proxmox"
PBS_DATASTORE="/mnt/pbs-datastore"
LOG="/var/log/offsite-backup.log"
echo "[$(date)] Starting offsite replication" >> "$LOG"
# Capture the exit status without tripping `set -e` on failure
STATUS=0
rsync -avz --delete \
    --bwlimit=50000 \
    -e "ssh -p 23 -i /root/.ssh/storagebox_ed25519" \
    "$PBS_DATASTORE/" \
    "$REMOTE_HOST:$REMOTE_PATH/" \
    >> "$LOG" 2>&1 || STATUS=$?

if [ $STATUS -eq 0 ]; then
  echo "[$(date)] Offsite replication completed successfully" >> "$LOG"
else
  echo "[$(date)] ERROR: Offsite replication failed (exit $STATUS)" >> "$LOG"
  # Trigger alert via Alertmanager webhook
  curl -s -X POST http://10.0.30.300:9093/api/v1/alerts \
    -H "Content-Type: application/json" \
    -d '[{"labels":{"alertname":"OffsiteBackupFailed","severity":"critical"}}]'
fi

Restore procedures are tested quarterly using a documented runbook. The test involves restoring a randomly selected container and VM to a temporary environment, verifying data integrity, and measuring actual recovery time against the RTO targets. Results are logged and any deviations trigger updates to the runbook or backup configuration.
# Restore Procedure Example (Container)
# ─────────────────────────────────────────────
# 1. List available backups
proxmox-backup-client list --repository pbs@localhost:datastore1
# 2. Restore container to temporary ID
pct restore 9101 \
pbs:backup/ct/101/2026-01-25T02:00:00Z \
--storage datapool \
--unique true \
--force true
# 3. Start restored container on isolated VLAN
pct set 9101 -net0 name=eth0,bridge=vmbr0,tag=99
pct start 9101
# 4. Verify application health
curl -f http://10.0.99.101:3000/health || echo "Health check failed"
# 5. Validate data integrity
pct exec 9101 -- pg_dump -U app appdb | md5sum
# Compare with production checksum
# 6. Document results and cleanup
pct stop 9101
pct destroy 9101

Security Hardening
Security on self-hosted infrastructure requires a defence-in-depth approach. There is no managed security group or cloud-native WAF operating by default. Every layer must be explicitly configured, audited, and maintained. The security model for this cluster operates across five layers: edge (Cloudflare), network (firewall + VLANs), host (OS hardening), container (isolation + AppArmor), and application (authentication + input validation).
# SSH Hardening (/etc/ssh/sshd_config)
# ─────────────────────────────────────────────
Port 22
Protocol 2
PermitRootLogin prohibit-password
PubkeyAuthentication yes
PasswordAuthentication no
ChallengeResponseAuthentication no
UsePAM yes
X11Forwarding no
MaxAuthTries 3
MaxSessions 5
ClientAliveInterval 300
ClientAliveCountMax 2
LoginGraceTime 30
# Only allow specific users
AllowUsers deploy admin
# Use strong key exchange and ciphers
KexAlgorithms curve25519-sha256@libssh.org,diffie-hellman-group16-sha512
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com

# Fail2ban Configuration
# /etc/fail2ban/jail.local
# ─────────────────────────────────────────────
[DEFAULT]
bantime = 3600
findtime = 600
maxretry = 3
backend = systemd
banaction = ufw
[sshd]
enabled = true
port = 22
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 86400
[proxmox]
enabled = true
port = 8006
filter = proxmox
logpath = /var/log/daemon.log
maxretry = 3
bantime = 3600
# Custom Proxmox filter
# /etc/fail2ban/filter.d/proxmox.conf
[Definition]
failregex = pvedaemon\[.*authentication failure; rhost=<HOST>
ignoreregex =

Cloudflare WAF rules provide the first line of defence against common web attacks. The rules are configured to block known attack patterns while allowing legitimate traffic through. Rate limiting at the Cloudflare edge prevents volumetric attacks from reaching the origin server.
# Cloudflare WAF Rules (via API / Dashboard)
# ─────────────────────────────────────────────
# Rule 1: Block known bad bots
# Expression: (cf.client.bot) and not (cf.bot_management.verified_bot)
# Action: Block
# Rule 2: Challenge suspicious regions
# Expression: (ip.geoip.country in {"CN" "RU" "KP"})
# and not (cf.bot_management.verified_bot)
# Action: Managed Challenge
# Rule 3: Rate limit API endpoints
# Expression: (http.request.uri.path contains "/api/")
# Rate: 100 requests per 10 seconds per IP
# Action: Block for 60 seconds
# Rule 4: Block SQL injection patterns
# Expression: (http.request.uri.query contains "UNION SELECT")
# or (http.request.uri.query contains "1=1")
# or (http.request.uri.query contains "DROP TABLE")
# Action: Block
# Rule 5: Require HTTPS
# Expression: (not ssl)
# Action: Redirect to HTTPS

Container Isolation Measures
- All LXC containers run unprivileged (user namespace mapping)
- AppArmor profiles enforce mandatory access control on each container
- Capability dropping removes unnecessary Linux capabilities (CAP_SYS_ADMIN, etc.)
- Seccomp profiles restrict available system calls to application requirements
- Read-only root filesystems where possible (mounted tmpfs for runtime data)
- Resource limits (cgroup v2) prevent CPU/memory exhaustion attacks
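A quick spot-check from the host confirms a container actually carries these settings (the expected values match the CT 101 template shown earlier):

```
pct config 101 | grep -E 'unprivileged|features'
# unprivileged: 1
# features: nesting=1
```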
# Automated Security Updates (unattended-upgrades)
# /etc/apt/apt.conf.d/50unattended-upgrades
# ─────────────────────────────────────────────
Unattended-Upgrade::Allowed-Origins {
"${distro_id}:${distro_codename}";
"${distro_id}:${distro_codename}-security";
"${distro_id}ESMApps:${distro_codename}-apps-security";
};
Unattended-Upgrade::AutoFixInterruptedDpkg "true";
Unattended-Upgrade::Remove-Unused-Kernel-Packages "true";
Unattended-Upgrade::Remove-Unused-Dependencies "true";
Unattended-Upgrade::Automatic-Reboot "false";
Unattended-Upgrade::Mail "admin@neurodatalab.ai";
Unattended-Upgrade::MailReport "on-change";
# Patching schedule (manual for Proxmox kernel updates)
# ─────────────────────────────────────────────
# Weekly: apt security updates (automated)
# Monthly: Proxmox VE updates (manual, rolling per node)
# Quarterly: Full system audit + kernel update if needed

Operational Results
After twelve months in production, the cluster has delivered on its core promise: enterprise-grade reliability at a fraction of cloud cost. The following metrics are derived from Prometheus data covering January 2025 through January 2026.
Uptime Metrics (12-Month Rolling)
Service              Uptime %    Downtime (total)
───────────────────────────────────────────────────────
Portfolio Site       99.997%     1m 34s
MLflow Server        99.993%     3m 41s
GeoServer            99.991%     4m 43s
Monitoring Stack     99.999%     0m 32s
AI Agent Fleet       99.982%     9m 28s
Open edX             99.989%     5m 47s
Booking System       99.996%     2m 06s
───────────────────────────────────────────────────────
Cluster Average      99.992%     3m 57s
Target               99.990%     5m 15s

Incidents (12 months):
  P1 (service down): 2 events
  P2 (degraded):     5 events
  P3 (minor):        11 events

Root causes:
  - Planned maintenance: 4 (rolling updates, zero downtime)
  - Hardware: 0
  - Software bug: 1 (Proxmox HA race condition, patched)
  - Network: 1 (Hetzner upstream, 4 minutes)
  - Human error: 1 (misconfigured VLAN tag, 5 minutes)
Cost Analysis (Annual)
Category Bare Metal Cloud Equivalent ─────────────────────────────────────────────────────── Compute (2 nodes) EUR 1,080 EUR 9,120 Storage (2 TB usable) included EUR 2,400 Bandwidth (10 TB/mo) included EUR 1,200 Backup storage EUR 120 EUR 480 Cloudflare Pro EUR 240 N/A (ALB: EUR 360) Domain + SSL EUR 12 EUR 12 ─────────────────────────────────────────────────────── Total Annual EUR 1,452 EUR 13,572 ─────────────────────────────────────────────────────── Annual Savings: EUR 12,120 (89.3% reduction) Note: Cloud estimate based on AWS eu-central-1 pricing for equivalent compute, storage, and transfer. Does not include managed Kubernetes or RDS surcharges that would be required for equivalent service isolation.
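The totals in the table above can be sanity-checked with a few lines of arithmetic. Figures are copied from the table (EUR, annual); items marked "included" count as zero:

```python
# Annual cost figures from the table above (EUR); "included" items count as 0.
bare_metal = {
    "compute (2 nodes)": 1080,
    "storage (2 TB usable)": 0,   # included
    "bandwidth (10 TB/mo)": 0,    # included
    "backup storage": 120,
    "Cloudflare Pro": 240,
    "domain + SSL": 12,
}
cloud = {
    "compute (2 nodes)": 9120,
    "storage (2 TB usable)": 2400,
    "bandwidth (10 TB/mo)": 1200,
    "backup storage": 480,
    "ALB (in lieu of Cloudflare Pro)": 360,
    "domain + SSL": 12,
}

bare_total = sum(bare_metal.values())   # EUR 1,452
cloud_total = sum(cloud.values())       # EUR 13,572
savings = cloud_total - bare_total      # EUR 12,120
reduction = savings / cloud_total       # ~0.893

print(f"Annual savings: EUR {savings:,} ({reduction:.1%} reduction)")
# → Annual savings: EUR 12,120 (89.3% reduction)
```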
Performance Benchmarks
Metric                    Bare Metal    AWS m5.xlarge
───────────────────────────────────────────────────────
Sequential Read           3,200 MB/s    250 MB/s (gp3)
Sequential Write          2,800 MB/s    250 MB/s (gp3)
Random 4K IOPS (read)     620,000       16,000 (gp3)
Random 4K IOPS (write)    540,000       16,000 (gp3)
P99 Latency (read)        0.12ms        1.8ms
Memory Bandwidth          38 GB/s       21 GB/s
Network Latency (local)   0.05ms        0.3ms
───────────────────────────────────────────────────────
Portfolio Site (p95)      42ms          68ms
MLflow API (p95)          18ms          31ms
GeoServer WMS (p95)       95ms          180ms

The benchmarks demonstrate that bare-metal hardware consistently outperforms equivalent cloud instances, particularly for IO-intensive workloads. NVMe storage on dedicated hardware delivers more than an order of magnitude higher IOPS than cloud-attached block storage at the same price point, and CPU performance is predictable, free of the noisy-neighbour effects common in shared cloud environments.
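Storage numbers of this kind are typically gathered with fio. A job file along these lines reproduces the sequential and random-4K tests -- the parameters and file path are illustrative, not necessarily the exact ones used for the table above:

```ini
# nvme-bench.fio -- run one section at a time:
#   fio --section=rand-read-4k nvme-bench.fio
[global]
ioengine=io_uring
direct=1
time_based
runtime=60
group_reporting
filename=/mnt/bench/fio-testfile
size=4G

[seq-read]
rw=read
bs=1M
iodepth=32

[rand-read-4k]
rw=randread
bs=4k
iodepth=64
numjobs=4
```

`direct=1` bypasses the page cache so the drive, not RAM, is being measured; `group_reporting` aggregates the four random-read jobs into a single IOPS figure.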
The operational overhead of self-managed infrastructure averages approximately four hours per month, broken down into: monitoring review (1 hour), security updates (1 hour), backup verification (30 minutes), capacity planning (30 minutes), and documentation updates (1 hour). This is manageable for a single operator and represents a small fraction of the cost savings compared to managed cloud services.
The infrastructure has successfully survived two unplanned events: a Hetzner network maintenance window that lasted four minutes (during which Cloudflare served cached content for the portfolio site), and a Proxmox HA race condition during a rolling update that was resolved by the HA manager within 90 seconds. Both incidents validated the architectural decisions around Cloudflare caching, HA configuration, and automated failover.
Lessons Learned
- --ZFS mirror is non-negotiable. The data integrity guarantees alone justify the storage overhead. Silent corruption on a single drive would have caused undetected data loss without checksumming.
- --Cloudflare Tunnel eliminates entire attack surface categories. No inbound ports means no port scanning, no direct DDoS, and no accidental exposure of management interfaces.
- --Monitoring must be on a separate VLAN. Early in the project, a runaway log volume from a misconfigured application saturated the monitoring container and caused alert blindness. VLAN isolation with QoS prevents this.
- --Test restores quarterly or they are worthless. A backup that has never been restored is a hypothesis, not a guarantee. Every quarterly test has revealed at least one minor improvement opportunity.
- --LXC for everything possible, KVM only when necessary. The resource savings from LXC containers compound across a cluster. Reserve KVM for workloads that genuinely require kernel isolation or Docker-in-Docker.
- --Document every runbook as if future-you has no memory. Incident response under pressure is not the time to figure out which PBS datastore contains the most recent backup of a specific container.
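As a concrete example of the last two lessons, the quarterly restore drill can be captured as a short runbook. The commands below are a sketch using standard Proxmox tooling; the scratch VMID (999), storage names, and service name are placeholders, not the cluster's actual values:

```
# Quarterly LXC restore drill (sketch -- VMID 999, storage names,
# and <placeholders> are illustrative)

# 1. List the PBS archives available for the container under test
pvesm list pbs-backup

# 2. Restore the newest archive to a scratch VMID (original untouched)
pct restore 999 pbs-backup:<archive-volid> --storage local-zfs

# 3. Boot it and verify the service actually answers
pct start 999
pct exec 999 -- systemctl is-active <service>

# 4. Record the result in the runbook, then clean up
pct stop 999 && pct destroy 999
```

A drill like this turns "the backup exists" into "the backup restores and the service starts", which is the only claim that matters during an incident.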
Self-hosted infrastructure is not for every workload or every team. It requires discipline in operations, security, and documentation that managed cloud services abstract away. But for stable, predictable workloads where the operator has the skills and commitment to maintain the platform, the economics are compelling and the control is liberating. This cluster proves that 99.99% uptime is achievable on consumer-grade hardware with open-source software -- at roughly one-tenth the cost of equivalent cloud infrastructure.