Architecture · White Paper

woodhead.tech homelab infrastructure

20+ production services on commodity x86 hardware. Fully automated with Terraform, Ansible, and a Makefile. Five-node Proxmox cluster with Ceph storage, Talos Kubernetes, and a single Traefik ingress for all traffic. Single operator.

Proxmox VE 8 · Ceph · Talos Linux · Traefik v3 · Authentik SSO · Ansible · Terraform · WireGuard
Executive Summary

A self-hosted homelab running 20+ production-grade services on commodity hardware, demonstrating Infrastructure-as-Code at every layer. Terraform provisions compute, Ansible configures services, a Makefile orchestrates the lifecycle. A five-node Proxmox VE cluster provides hypervisor compute; Ceph delivers distributed block storage; TrueNAS manages NAS/NFS; a Talos Linux Kubernetes cluster runs containerized workloads. All external traffic terminates at a single Traefik reverse proxy with automatic TLS via Let's Encrypt DNS-01 through Cloudflare. Authentik provides SSO for all admin interfaces. Single operator.

01 · Design Goals

Single-operator model

Every service must be deployable and maintainable without manual intervention beyond a single make command. No snowflake configuration.

Infrastructure-as-Code first

The entire cluster is reproducible from code. If all nodes were wiped, terraform apply + Ansible playbooks would restore every service.

Separation of concerns

Each service runs in its own LXC container or VM with dedicated disk, CPU, and memory. Services share a container boundary only where architecturally required.

Defense in depth

External traffic is gated through Cloudflare, then Traefik, then Authentik SSO for admin surfaces. Services without application auth sit behind Authentik.

Hardware realism

All nodes are commodity x86 mini PCs (~$100 used). The design absorbs the overcommit this hardware forces through memory ballooning and CPU weight scheduling.

02 · Compute & Virtualization

Cluster Topology

Five Proxmox VE 8.x nodes on a flat Layer 2 network (192.168.86.0/24), managed as a single cluster with shared corosync quorum and a distributed Ceph storage pool.

Node | IP | Primary Workloads
pve1 (thinkcentre1) | 192.168.86.29 | Primary management, Authentik, Traefik
pve2 (thinkcentre2) | 192.168.86.30 | ARR stack, SDR scanner, K8s worker-0
pve3 (thinkcentre3) | 192.168.86.31 | Pwnagotchi, K8s worker-1
tower1 | 192.168.86.130 | TrueNAS VM, Home Assistant, K8s control plane
zotac | 192.168.86.147 | Zigbee2MQTT, K8s worker-2

Workload Types

LXC containers are the primary unit of service isolation. Proxmox LXC uses Linux namespaces and cgroups with no hardware virtualization overhead, making them lighter than full VMs while still providing strong isolation. Most services — Traefik, Authentik, monitoring, the ARR stack — run as LXC containers.

Full VMs are used only where bare-metal isolation is required: TrueNAS (kernel-level ZFS and raw disk passthrough), Home Assistant OS (USB device passthrough for Zigbee), and Talos Kubernetes nodes (immutable OS that manages its own kernel).

Docker Compose inside LXC is used for multi-container services (ARR stack, Authentik, monitoring, Mailcow). This avoids nested VM overhead while preserving Docker's inter-container networking model.
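A minimal sketch of this pattern in Compose YAML (service names and images are illustrative placeholders, not the actual stack definitions):

    # docker-compose.yml: illustrative two-container service inside an LXC
    services:
      app:
        image: ghcr.io/example/app:latest   # placeholder image
        depends_on:
          - db
        ports:
          - "8080:8080"   # published on the LXC's LAN IP so Traefik can reach it
      db:
        image: postgres:16
        volumes:
          - db-data:/var/lib/postgresql/data
    volumes:
      db-data: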

Resource Allocation Philosophy

CFS CPU weight scheduling (cpuunits) expresses priority under contention. Traefik (weight 2048) gets twice the CPU time of the ARR stack (weight 1024) at saturation. When cores are idle, weights are irrelevant.

VMs support memory ballooning — the hypervisor reclaims idle RAM down to a guaranteed floor. LXC containers are hard-limited by cgroups. The ThinkCentre nodes run at roughly 2–3× CPU overcommit and 2× RAM overcommit, which works in practice because most services idle the majority of the time.

Service | Type | CPU Weight | Tier
Traefik | LXC | 2048 | Critical
TrueNAS | VM | 1500 | High
K8s Control Plane | VM | 1200 | High
Authentik | LXC | 1200 | High
K8s Workers, ARR, Plex, Jellyfin | mixed | 1024 | Normal
Monitoring, OpenClaw | LXC | 800 | Low
Recipe Site, WireGuard, Libby Alert | LXC | 512 | Minimal

03 · Storage Architecture

Storage is segmented into three tiers, each with a distinct purpose and failure mode.

01 · Local LVM (per-node, high-speed)

Each Proxmox node has a primary SSD formatted as LVM. All LXC container disks and VM OS disks live here. Fast. Not replicated — if a node dies, containers on it are offline until recovery.

02 · Ceph Pool (distributed, 3× replicated)

Six OSDs (three SSDs + three HDDs) across multiple nodes. Backs Kubernetes VM disks, enabling VM live migration. Mixed SSD/HDD causes OSD latency divergence under heavy write workloads — monitored via Grafana.

03 · TrueNAS Scale (NFS, ZFS, NAS)

Full VM on tower1 with raw disk passthrough. ZFS provides checksumming, transparent compression, and native snapshots. Exports NFS shares to the ARR stack, Plex, and Jellyfin.

Known failure mode: NFS cascade

Ceph slow ops under heavy write I/O → TrueNAS ZFS write stall → NFS hangs → Proxmox host NFS mount stalls → LXC processes block on file I/O. NFS is mounted at the Proxmox host level and bind-mounted into LXCs — recovery requires action on the host node, not inside the container. Grafana alerts on ceph health != HEALTH_OK.
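A hedged sketch of the corresponding alert rule, assuming the Ceph mgr Prometheus module's ceph_health_status metric (0 = HEALTH_OK) is scraped; rule names and thresholds are illustrative:

    # prometheus/rules/ceph.yml: illustrative alert on Ceph health
    groups:
      - name: ceph
        rules:
          - alert: CephHealthNotOK
            expr: ceph_health_status > 0   # 1 = HEALTH_WARN, 2 = HEALTH_ERR
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Ceph health is degraded ({{ $value }})"
              description: "Check OSD latency before the NFS cascade stalls LXC I/O."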

04 · Networking & Traffic Flow

All nodes connect to a flat 192.168.86.0/24 LAN via a managed switch behind a Google Nest WiFi Pro router. The Nest handles NAT, DHCP, DNS forwarding, and hairpin NAT. VLAN segmentation is deferred until VLAN-aware APs replace the Nest — the design is structured so VLANs can be added without touching service configs.

External Traffic Path

  1. External client: resolves *.woodhead.tech via Cloudflare DNS.
  2. Cloudflare DNS: returns the public WAN IP (updated every 5 minutes by a DDNS cron job).
  3. Google Nest (router): port forwards 80/443 → 192.168.86.20 and UDP 51820 → 192.168.86.39.
  4. Traefik LXC (192.168.86.20): TLS termination (wildcard *.woodhead.tech), SNI routing to the backend.
  5. Backend service: plaintext HTTP on the LAN, with Authentik forwardAuth where required.

Two DNS records are intentionally not proxied through Cloudflare's CDN:

  • *.woodhead.tech — DNS-only so Traefik terminates TLS and receives the client's real IP for logging.
  • wg.woodhead.tech — DNS-only; Cloudflare's proxy infrastructure does not forward UDP, which WireGuard requires.

The apex domain woodhead.tech and www.woodhead.tech are proxied — they point to Cloudflare Pages (this site), not the homelab.

TLS Certificates

Traefik obtains a wildcard certificate (*.woodhead.tech) from Let's Encrypt via the DNS-01 challenge. The flow: Traefik creates the _acme-challenge TXT record via the Cloudflare API → Let's Encrypt validates against public resolvers → certificate issued and stored in /etc/traefik/acme.json (mode 0600) → auto-renewed 30 days before expiry.

DNS-01 is preferred over HTTP-01 because it supports wildcard certificates and works regardless of whether ports 80/443 are currently reachable. One wildcard cert covers all subdomains at zero per-service overhead.
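In Traefik v3's static configuration the resolver looks roughly like this (the resolver name and email are placeholders; the Cloudflare API token is supplied at runtime via the CF_DNS_API_TOKEN environment variable):

    # traefik.yml (static config): illustrative ACME DNS-01 resolver
    certificatesResolvers:
      letsencrypt:
        acme:
          email: admin@example.com          # placeholder contact address
          storage: /etc/traefik/acme.json   # mode 0600
          dnsChallenge:
            provider: cloudflare            # reads CF_DNS_API_TOKEN from the environment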

Internal Routing

Traefik watches /etc/traefik/dynamic/*.yml and hot-reloads on file changes — no restart required. Each file defines a router (host-matching rule + middleware chain) and a service (backend IP:port). Routes requiring authentication include the authentik@file middleware, which implements Authentik's forward-auth protocol: every request is pre-authorized against the Authentik outpost before reaching the backend.
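A representative dynamic file might look like the following; hostnames, IPs, and the response-header list are illustrative, while the outpost path follows Authentik's documented Traefik integration:

    # /etc/traefik/dynamic/grafana.yml: illustrative router + forwardAuth middleware
    http:
      routers:
        grafana:
          rule: "Host(`grafana.woodhead.tech`)"
          entryPoints: ["websecure"]
          middlewares: ["authentik"]
          service: grafana
          tls:
            certResolver: letsencrypt
      middlewares:
        authentik:
          forwardAuth:
            address: "http://192.168.86.28:9000/outpost.goauthentik.io/auth/traefik"
            trustForwardHeader: true
            authResponseHeaders:
              - X-authentik-username
              - X-authentik-groups
      services:
        grafana:
          loadBalancer:
            servers:
              - url: "http://192.168.86.25:3000"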

05 · Security Model

1 · Cloudflare

DDoS protection, IP reputation filtering, rate limiting at the edge. Proxied records benefit from Cloudflare's anycast network.

2 · Google Nest NAT

Stateful NAT drops unsolicited inbound connections. Only explicitly port-forwarded ports are reachable from WAN.

3 · Traefik

TLS termination; all HTTP is redirected to HTTPS. Routes without a configured backend return 404. Dashboard is protected by Authentik.

4 · Authentik SSO

ForwardAuth middleware delegates authentication for all protected services. OIDC-compatible; backed by PostgreSQL + Redis.

WireGuard (LXC 208, 192.168.86.39) provides a VPN tunnel for remote management with split-tunnel routing — only 10.10.0.0/24 routes through the tunnel, avoiding conflicts on machines already on 192.168.86.0/24.

The ARR stack routes all download traffic through Gluetun (a WireGuard client to a commercial VPN provider). SABnzbd runs inside Gluetun's network namespace — if the VPN tunnel drops, SABnzbd loses connectivity entirely rather than leaking traffic over the plain WAN connection.
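The fail-closed behavior comes from Compose's container network mode; a minimal sketch, with image tags and environment variables illustrative and the provider-specific WireGuard keys omitted:

    # Illustrative Compose fragment: SABnzbd shares Gluetun's network namespace
    services:
      gluetun:
        image: qmcgaw/gluetun
        cap_add:
          - NET_ADMIN
        environment:
          VPN_SERVICE_PROVIDER: custom   # placeholder; WIREGUARD_* settings omitted
          VPN_TYPE: wireguard
      sabnzbd:
        image: lscr.io/linuxserver/sabnzbd
        network_mode: "service:gluetun"  # no tunnel, no connectivity: nothing leaks to WAN
        depends_on:
          - gluetun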

06 · Service Catalog

Service | LXC/VM | IP (192.168.86.x) | Auth | Purpose

Infrastructure
Traefik | LXC 200 | .20 | Authentik | Reverse proxy, TLS termination, ingress for all HTTPS
Authentik | LXC 207 | .28 | | SSO identity provider, OIDC, forwardAuth middleware
WireGuard | LXC 208 | .39 | | Remote management VPN tunnel (UDP 51820)
Mailcow | LXC 212 | .34 | Own auth | Email stack — Postfix, Dovecot, Rspamd, ClamAV, webmail
PXE Server | LXC 213 | .35 | | Network boot for bare-metal provisioning

Media Pipeline
Sonarr / Radarr / Prowlarr / Bazarr / Overseerr / SABnzbd | LXC 202 | .22 | Authentik | Full ARR stack — indexing, downloading (via VPN), library management
Plex | LXC 203 | .23 | Plex auth | Media server with iGPU hardware transcoding (Intel Quick Sync)
Jellyfin | LXC 204 | .24 | Own auth | Media server with iGPU hardware transcoding (VAAPI)
TrueNAS Scale | VM 300 | .40 | Authentik | NAS — ZFS pool, NFS exports for /media to ARR/Plex/Jellyfin

Observability
Prometheus + Grafana + Alertmanager | LXC 205 | .25 | Authentik | Metrics, dashboards, Discord + SMS alerts. Scrapes Proxmox, Traefik, Blackbox, Dexcom glucose
AlertMind | LXC 205 | .25 | | AI-powered alert triage — enriches Alertmanager webhooks via Claude API, posts to Discord
Piboard | Pi 3B | .131 | | Go dashboard on a Raspberry Pi with Waveshare 5" display; polls Prometheus via SSE

Smart Home
Home Assistant OS | VM 301 | .41 | Own auth | Smart home controller; USB passthrough, Zigbee, automations
Zigbee2MQTT | LXC 214 | .36 | | Zigbee coordinator on zotac, publishes MQTT to HAOS

Apps
Recipe Site | LXC 201 | .21 | | Go + SQLite recipe app with nginx frontend
Kanboard | LXC 211 | .33 | Authentik | Project management / task queue
SDR Scanner | LXC 210 | .32 | Authentik | Trunk Recorder + rdio-scanner, RTL-SDR V4, decodes SNO911 P25 Phase II radio

07 · Kubernetes Cluster

Talos Linux — an immutable, API-driven Kubernetes OS with no SSH, no shell, and no package manager. All management goes through talosctl and kubectl. This eliminates an entire class of configuration drift: there is no /etc to hand-edit, no packages to patch manually, no cron jobs to manage. An upgrade is: generate new machine configs → apply via talosctl → wait for rolling restart.

Role | IP | Host Node | CPU / RAM
API VIP | 192.168.86.100 | — (distributed) |
Control plane | 192.168.86.101 | tower1 | 2 cores / 4 GB
Worker-0 | 192.168.86.111 | thinkcentre2 | 4 cores / 8 GB
Worker-1 | 192.168.86.112 | thinkcentre3 | 4 cores / 8 GB
Worker-2 | 192.168.86.113 | zotac | 4 cores / 8 GB

The VIP (192.168.86.100) is managed by Talos's built-in virtual IP feature — control plane nodes hold an etcd-based election and exactly one holds the VIP at a time, failing over automatically. Workers always connect to the VIP, not the control plane's physical IP.
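In the control plane's Talos machine config, the VIP is a short network stanza; the interface name here is an illustrative assumption:

    # Talos machine config fragment (control plane): shared virtual IP
    machine:
      network:
        interfaces:
          - interface: eth0          # illustrative NIC name
            dhcp: true
            vip:
              ip: 192.168.86.100     # the API VIP that workers connect to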

MetalLB runs in L2 mode, advertising a pool of IPs (192.168.86.150–199) via ARP. Services of type LoadBalancer receive an IP from this pool and are directly reachable on the LAN without a separate load balancer appliance.
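In MetalLB's CRD form, the pool and its L2 advertisement are roughly the following (resource names are illustrative):

    # Illustrative MetalLB L2 configuration
    apiVersion: metallb.io/v1beta1
    kind: IPAddressPool
    metadata:
      name: lan-pool
      namespace: metallb-system
    spec:
      addresses:
        - 192.168.86.150-192.168.86.199
    ---
    apiVersion: metallb.io/v1beta1
    kind: L2Advertisement
    metadata:
      name: lan-l2
      namespace: metallb-system
    spec:
      ipAddressPools:
        - lan-pool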

K8s VM disks are on the Ceph pool (3× replicated), allowing VM live migration between Proxmox nodes without unmounting storage.

08 · Automation Toolchain

Terraform · Provision

Creates all VMs and LXC containers via the bpg/proxmox provider. Each LXC/VM is a separate .tf file. API token auth scoped to minimum required permissions.

Ansible · Configure

Idempotent playbooks install and configure services inside provisioned containers. Sensitive values are passed as environment variables at run time — nothing in the repo.
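A hedged sketch of the pattern; play, task, and file names and the environment variable are illustrative, not the actual playbooks:

    # Illustrative idempotent play: configure Traefik, secrets from the environment
    - name: Configure Traefik LXC
      hosts: traefik
      tasks:
        - name: Render static config
          ansible.builtin.template:
            src: traefik.yml.j2
            dest: /etc/traefik/traefik.yml
            mode: "0644"
          notify: Restart traefik
        - name: Write Cloudflare token (never committed to the repo)
          ansible.builtin.copy:
            content: "CF_DNS_API_TOKEN={{ lookup('ansible.builtin.env', 'CF_DNS_API_TOKEN') }}"
            dest: /etc/traefik/traefik.env
            mode: "0600"
      handlers:
        - name: Restart traefik
          ansible.builtin.service:
            name: traefik
            state: restarted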

Makefile · Orchestrate

Top-level Makefile wraps all Terraform and Ansible operations into named targets. The full deploy sequence for a new cluster is documented and executable start-to-finish.

talosctl / kubectl · Kubernetes

Cluster management via the Talos and Kubernetes APIs. Generated configs live in talos/_out/. No SSH, no shell access to K8s nodes.

Service group management organizes containers into dependency-aware groups (core, storage, security, media, observability, etc.). The group playbooks enforce the rules below (a hypothetical group map is sketched after this list):

  • always_on groups (core, storage) refuse stop operations unconditionally.
  • Dependency blocking: stopping storage while media is running fails with an error — the operator must stop dependents first.
  • Hardware-bound groups (special) are excluded from bulk operations; their members must be managed individually.
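One plausible data model for these rules, as group vars consumed by the playbooks; the structure and member names are entirely hypothetical:

    # Hypothetical group map enforcing always_on and dependency rules
    service_groups:
      core:
        always_on: true            # stop operations refused unconditionally
        members: [traefik, authentik, wireguard]
      storage:
        always_on: true
        members: [truenas]
      media:
        depends_on: [storage]      # stopping storage fails while media is running
        members: [arr, plex, jellyfin]
      special:
        hardware_bound: true       # excluded from bulk operations
        members: [zigbee2mqtt, sdr-scanner, home-assistant]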

09 · Operational Considerations

Known Failure Modes

NFS cascade

Ceph slow ops → TrueNAS write stall → NFS hang → LXC process freeze. Mitigated by Grafana alerting on ceph health and a documented OSD restart runbook.

DNS dependency

Internal clients depend on external DNS (Cloudflare) for name resolution. During internet outages, *.woodhead.tech is unreachable internally. Planned mitigation: AdGuard Home as an internal DNS resolver with split-horizon records.

No offsite backup

Proxmox backup jobs write to TrueNAS on the same physical site. A fire, flood, or theft loses both the cluster and the backups. B2/S3 offsite replication is planned.

Patching Strategy

Layer | Command | Safety
Proxmox nodes | make patch-proxmox | Serial — one node at a time
LXC containers | make patch-lxc | Parallel across all LXCs
Docker images | make patch-docker | Brief per-service restart
Talos OS | talosctl upgrade | Rolling, one node at a time
Raspberry Pi | make patch-pi | Parallel

10 · Roadmap

Item | Status | Notes
VLAN segmentation | Deferred | Requires replacing Google Nest with VLAN-aware APs
Velero K8s backup | Planned | etcd + persistent volume snapshots to object storage
Offsite backup (B2) | Planned | TrueNAS → Backblaze B2
Dedicated firewall (OPNsense) | Planned | IDS/IPS, inter-VLAN ACLs
K8s observability stack | Planned | kube-state-metrics, node-exporter DaemonSet, Loki
Certificate expiry alerts | Planned | Prometheus ssl_cert_not_after exporter
Terraform state — remaining VMs | Blocked | bpg/proxmox API timeout on complex disk configs

11 · Hardware Inventory

Device | CPU | RAM | Storage | Role
ThinkCentre Tiny M710q (×3) | i5-7500T (4c) | 8 GB | 256 GB SSD | Proxmox nodes pve1/2/3
Custom tower (tower1) | i7 (6c+) | 32 GB | 1 TB SSD + HDDs | TrueNAS, K8s control plane
Zotac ZBOX | Ryzen 5 | 16 GB | 500 GB SSD | K8s worker-2, Zigbee
Raspberry Pi 3B (×3) | ARM Cortex-A53 (4c) | 1 GB | 16–32 GB SD | Piboard dashboard, Klipper ×2

Total provisioned resources (all services running):
  • ~31 vCPUs allocated
  • ~42 GB RAM allocated
  • ~171 GB local-lvm
  • ~250 GB Ceph raw (3× replicated)