Self-Healing Infrastructure: How an AI Agent Manages My Home Server

Nathan Broadbent

03 Feb 2026 — 2 min read

I can't believe I have a "self-healing" server now. My AI agent can run SSH, Terraform, Ansible, and kubectl commands, and it can fix infrastructure issues, often before I even know there's a problem.

Here's how the stack works.

The Core Idea

Everything is code, and an AI agent watches over it all.

Infrastructure defined in Terraform and Ansible (no manual changes)
Apps run in Kubernetes (K3s)
An AI agent (OpenClaw) monitors health, reads logs, and can execute fixes
Problems often get resolved before I even notice them

The Stack

[Figure: architecture diagram. The OpenClaw AI agent sits at the top and connects to the monitoring layer (Gatus, Loki, Grafana), then to Infrastructure as Code (Terraform, Ansible, K3s), and finally to Proxmox on bare metal.]

Layer 1: Proxmox (Hypervisor)

The foundation. Proxmox runs on bare metal, hosting VMs and LXC containers. ZFS provides storage with snapshots and replication.

Layer 2: Infrastructure as Code

Terraform: Defines VMs, LXCs, DNS records, storage
Ansible: Configures everything inside the VMs (packages, services, settings)
Git repo: Single source of truth - no manual SSH changes allowed

Layer 3: Kubernetes (K3s)

A lightweight Kubernetes distribution running 40+ apps, including Home Assistant, Gitea, monitoring tools, and custom applications. ArgoCD handles GitOps deployments, and Traefik provides ingress with automatic SSL/TLS certificates.

Layer 4: Monitoring

Gatus: Health checks for all services (HTTP, TCP, DNS)
Loki: Centralized log aggregation
Grafana: Dashboards and visualization
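To make the idea of a health check concrete, here is a minimal Python sketch of a TCP-style check, the simplest of the three check types listed for Gatus. It is not how Gatus is implemented or configured; the function name `tcp_check` and the default timeout are assumptions for illustration only.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A real monitoring tool adds scheduling, retries, and alerting on top of a primitive like this, which is the part worth leaving to Gatus.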

Layer 5: OpenClaw (The Brain)

This is where it gets interesting. An AI agent running in an LXC container with:

SSH access to all infrastructure
Ability to run kubectl, terraform, ansible, gh commands
Scheduled health dashboard checks
Log reading when issues are detected
Ability to create PRs, apply fixes, and restart services

How Self-Healing Works

Detection: A Gatus check fails, or a scheduled audit finds an issue
Investigation: OpenClaw reads logs via Loki and checks pod status
Diagnosis: Identifies the root cause (OOM, config error, network issue, etc.)
Fix: Applies the appropriate remedy, such as restarting a pod, fixing config, or applying Terraform changes
Verification: Confirms the fix worked
Documentation: Logs the incident and resolution
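The six steps above can be sketched as a small Python loop. This is an illustration of the flow, not OpenClaw's actual code: the `Incident` record, the `heal` function, and the log patterns in `ROOT_CAUSE_PATTERNS` are all hypothetical, and the health check, log fetcher, and remedies are passed in as plain callables so the logic stays independent of any real tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical log patterns per root cause; the real rules are not part of this post.
ROOT_CAUSE_PATTERNS: Dict[str, Tuple[str, ...]] = {
    "oom": ("oomkilled", "out of memory"),
    "config_error": ("invalid configuration", "failed to parse"),
    "network_issue": ("connection refused", "no route to host", "i/o timeout"),
}


@dataclass
class Incident:
    service: str
    root_cause: str = "unknown"
    action: str = "none"
    resolved: bool = False
    notes: List[str] = field(default_factory=list)


def diagnose(log_lines: List[str]) -> str:
    """Return the first root cause (in table order) with a matching log line."""
    lowered = [line.lower() for line in log_lines]
    for cause, patterns in ROOT_CAUSE_PATTERNS.items():
        if any(p in line for line in lowered for p in patterns):
            return cause
    return "unknown"


def heal(
    service: str,
    is_healthy: Callable[[str], bool],
    fetch_logs: Callable[[str], List[str]],
    remedies: Dict[str, Callable[[str], str]],
) -> Incident:
    incident = Incident(service)
    if is_healthy(service):  # Detection
        incident.resolved = True
        incident.notes.append("healthy, nothing to do")
        return incident
    # Investigation and diagnosis
    incident.root_cause = diagnose(fetch_logs(service))
    remedy = remedies.get(incident.root_cause)
    if remedy is None:
        incident.notes.append("no known remedy, escalating to a human")
        return incident
    incident.action = remedy(service)  # Fix
    incident.resolved = is_healthy(service)  # Verification
    # Documentation
    incident.notes.append(
        f"{incident.root_cause}: {incident.action}, resolved={incident.resolved}"
    )
    return incident
```

Unknown root causes fall through to a human instead of triggering a guess, which keeps the autonomous path limited to patterns that have a known remedy.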

Example Fixes

Pod crash loop → Check logs → Fix config → Restart
Certificate expiring → Trigger cert-manager renewal
Disk filling up → Clean old backups → Add alert threshold
Service unreachable → Check ingress → Fix routing
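One way to represent fixes like these is a table of ordered steps keyed by symptom. The sketch below is a guess at a reasonable structure, not the format OpenClaw uses: the `PLAYBOOKS` keys, the `run_playbook` helper, and the "escalate to human" fallback are made up for illustration, and the step names are copied from the list above.

```python
from typing import Callable, Dict, List, Tuple

# Symptom -> ordered remediation steps, taken from the examples above.
PLAYBOOKS: Dict[str, List[str]] = {
    "pod_crash_loop": ["check logs", "fix config", "restart"],
    "certificate_expiring": ["trigger cert-manager renewal"],
    "disk_filling_up": ["clean old backups", "add alert threshold"],
    "service_unreachable": ["check ingress", "fix routing"],
}


def run_playbook(
    symptom: str, execute: Callable[[str], bool]
) -> List[Tuple[str, bool]]:
    """Run each step in order and stop at the first step that fails."""
    results: List[Tuple[str, bool]] = []
    # Unknown symptoms get a single fallback step (an assumption of this sketch).
    for step in PLAYBOOKS.get(symptom, ["escalate to human"]):
        ok = execute(step)
        results.append((step, ok))
        if not ok:
            break
    return results
```

Stopping at the first failed step matters here: restarting a pod after a config fix failed would just restart the crash loop.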

Key Design Principles

1. Everything is Code

No manual changes via SSH or web UIs. If it's not in Git, it doesn't exist. This means a full audit trail of every change, easy rollback via git revert, and an environment that is reproducible from scratch.

2. AI as Operator, Not Owner

OpenClaw has access but follows strict rules: it fixes known issue patterns autonomously, asks before making significant changes, and documents everything it does. A human remains in control.
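These rules amount to a small approval gate. Here is a minimal Python sketch under stated assumptions: the action names in `KNOWN_SAFE_ACTIONS`, the `decide` and `run_action` helpers, and the idea of passing an `approve` callback for the human are all hypothetical, since the post does not describe how the rules are enforced in practice.

```python
from enum import Enum
from typing import Callable, List


class Decision(Enum):
    AUTO = "auto"  # known issue pattern, fix without asking
    ASK = "ask"    # significant change, needs human approval


# Hypothetical set of actions considered safe to run autonomously.
KNOWN_SAFE_ACTIONS = {"restart_pod", "renew_certificate", "clean_old_backups"}


def decide(action: str) -> Decision:
    return Decision.AUTO if action in KNOWN_SAFE_ACTIONS else Decision.ASK


def run_action(
    action: str,
    apply: Callable[[str], None],
    approve: Callable[[str], bool],
    audit_log: List[str],
) -> bool:
    """Apply an action if policy allows it, recording the outcome either way."""
    decision = decide(action)
    if decision is Decision.ASK and not approve(action):
        audit_log.append(f"{action}: declined by human")
        return False
    apply(action)
    how = "autonomously" if decision is Decision.AUTO else "after approval"
    audit_log.append(f"{action}: applied {how}")
    return True
```

Anything not on the safe list defaults to asking, so a new kind of action never runs unattended just because nobody thought to classify it.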

3. Defense in Depth

Health checks catch issues early. Logs provide investigation context. Alerts go out over multiple channels (Telegram, email). Scheduled audits catch drift.
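For the drift audit, one common approach is to run `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The Python sketch below wraps that; it is an assumption about how such an audit could work, not a description of the audit used here, and the function names and status strings are invented for the example. Note that exit code 2 means "code and real infrastructure differ", which covers both drift and changes in Git that haven't been applied yet.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a status string."""
    if code == 0:
        return "in_sync"  # plan succeeded, empty diff
    if code == 2:
        return "drift"    # plan succeeded, non-empty diff
    return "error"        # exit code 1 or anything unexpected


def check_drift(workdir: str) -> str:
    """Run a read-only plan in workdir and report whether anything would change."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A scheduled job could call `check_drift` for each Terraform directory and raise an alert on anything other than `"in_sync"`.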

4. Fail Safe, Not Fail Secure

Services should degrade gracefully. Prefer availability over perfect consistency. The AI can restart things but can't delete data.
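The "can restart, can't delete" rule can be approximated with a default-deny command filter. The following Python sketch is one possible guardrail, not the mechanism actually in place: the `ALLOWED_PREFIXES` list and the `is_allowed` helper are hypothetical, and it assumes commands are later executed as an argument list without a shell.

```python
import shlex

# Hypothetical allowlist: a command must start with one of these token sequences.
ALLOWED_PREFIXES = [
    ("kubectl", "get"),
    ("kubectl", "logs"),
    ("kubectl", "rollout", "restart"),
    ("systemctl", "restart"),
]

# Characters that could chain or redirect commands if a shell were involved.
SHELL_METACHARACTERS = set(";|&$`><")


def is_allowed(command: str) -> bool:
    """Return True only for commands that match the allowlist (default deny)."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes or similar parse errors are rejected outright.
        return False
    return any(
        tuple(tokens[: len(prefix)]) == prefix for prefix in ALLOWED_PREFIXES
    )
```

An allowlist is deliberately stricter than a list of banned commands: `zfs destroy` or `terraform destroy` are rejected because nobody allowed them, not because somebody remembered to ban them. Prefix matching is still naive, so a filter like this belongs alongside restricted credentials, not in place of them.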

Public Repository

I've published a sanitized version of my Infrastructure as Code setup:

GitHub: ndbroadbent/homeserver-terraform-ansible-public

It includes Terraform modules for Proxmox VMs/LXCs, Ansible roles for common services, K3s application manifests, and example configurations.

Getting Started

If you want to build something similar:

Start with IaC: Get Terraform/Ansible managing your infra first
Add monitoring: Gatus is simple and effective for health checks
Centralize logs: Loki + Promtail is lightweight
Add the AI layer: OpenClaw connects everything together

The AI layer is the force multiplier: it turns your monitoring from "alert and wait for a human" into "detect, diagnose, and fix."