Self-Healing Infrastructure: How an AI Agent Manages My Home Server

I can't believe I have a "self-healing" server now. My AI agent can run SSH, Terraform, Ansible, and kubectl commands, and it often fixes infrastructure issues before I even know there's a problem.

Here's how the stack works.

The Core Idea

Everything is code, and an AI agent watches over it all.

  • Infrastructure defined in Terraform and Ansible (no manual changes)
  • Apps run in Kubernetes (K3s)
  • An AI agent (OpenClaw) monitors health, reads logs, and can execute fixes
  • Problems often get resolved before I even notice them

The Stack

┌─────────────────────────────────────────────────────────┐
│                      OpenClaw (AI Agent)                │
│   Monitors health, reads logs, runs commands, fixes     │
└─────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
   ┌─────────┐        ┌──────────┐        ┌─────────┐
   │  Gatus  │        │   Loki   │        │ Grafana │
   │ Health  │        │   Logs   │        │  Dash   │
   └─────────┘        └──────────┘        └─────────┘
                            │
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
   ┌─────────┐        ┌──────────┐        ┌─────────┐
   │Terraform│        │ Ansible  │        │   K3s   │
   │  Infra  │        │  Config  │        │  Apps   │
   └─────────┘        └──────────┘        └─────────┘
                            │
                            ▼
                    ┌──────────────┐
                    │   Proxmox    │
                    │  (Bare Metal)│
                    └──────────────┘

Layer 1: Proxmox (Hypervisor)

The foundation. Proxmox runs on bare metal, hosting VMs and LXC containers. ZFS provides storage with snapshots and replication.

Layer 2: Infrastructure as Code

  • Terraform: Defines VMs, LXCs, DNS records, storage
  • Ansible: Configures everything inside the VMs (packages, services, settings)
  • Git repo: Single source of truth - no manual SSH changes allowed

Layer 3: Kubernetes (K3s)

Lightweight Kubernetes running 40+ apps: Home Assistant, Gitea, monitoring tools, custom applications. ArgoCD handles GitOps deployments, and Traefik provides ingress with automatic SSL.

Layer 4: Monitoring

  • Gatus: Health checks for all services (HTTP, TCP, DNS)
  • Loki: Centralized log aggregation
  • Grafana: Dashboards and visualization
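
As a rough illustration of how an agent might consume these health checks: Gatus serves endpoint statuses as JSON over its HTTP API (at a path like /api/v1/endpoints/statuses; verify this against your version). The sketch below assumes each endpoint carries a `name`, a `group`, and a `results` list whose entries have a `success` flag. The sample payload and the `failing_endpoints` helper are made up for illustration.

```python
import json

# Hypothetical sample of a Gatus-style status payload (field names are assumed).
SAMPLE = """
[
  {"name": "gitea", "group": "apps",
   "results": [{"success": true}, {"success": true}]},
  {"name": "home-assistant", "group": "apps",
   "results": [{"success": true}, {"success": false}]},
  {"name": "dns", "group": "core", "results": []}
]
"""


def failing_endpoints(payload: str) -> list[str]:
    """Return 'group/name' for every endpoint whose most recent result failed."""
    failing = []
    for endpoint in json.loads(payload):
        results = endpoint.get("results") or []
        if not results:
            continue  # no data yet: treat as unknown rather than failing
        latest = results[-1]  # assumes results are ordered oldest to newest
        if not latest.get("success", False):
            failing.append(f"{endpoint.get('group', '')}/{endpoint['name']}")
    return failing


if __name__ == "__main__":
    print(failing_endpoints(SAMPLE))
```

Only the latest result is considered here; a real check would probably require several consecutive failures before waking the agent, to avoid reacting to a single blip.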

Layer 5: OpenClaw (The Brain)

This is where it gets interesting. OpenClaw is an AI agent running in an LXC container with:

  • SSH access to all infrastructure
  • Ability to run kubectl, terraform, ansible, gh commands
  • Scheduled health dashboard checks
  • Log reading when issues are detected
  • Ability to create PRs, apply fixes, and restart services

How Self-Healing Works

  1. Detection: Gatus checks fail, or a scheduled audit finds an issue
  2. Investigation: OpenClaw reads logs via Loki, checks pod status
  3. Diagnosis: Identifies root cause (OOM, config error, network issue, etc.)
  4. Fix: Applies the appropriate remedy, such as restarting a pod, fixing config, or applying Terraform changes
  5. Verification: Confirms the fix worked
  6. Documentation: Logs the incident and resolution
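
The six steps above can be sketched as a small control loop. This is a minimal sketch and not how OpenClaw is actually implemented: the `Incident` record, the `heal` function, and the injected callables (`read_logs`, `diagnose`, `fixes`, `is_healthy`) are all hypothetical names, with the real log reading and fixing left to whatever tools the agent uses.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Incident:
    """Record of one healing attempt (step 6: documentation)."""
    service: str
    diagnosis: str = "unknown"
    fixed: bool = False
    notes: list[str] = field(default_factory=list)


def heal(
    service: str,
    read_logs: Callable[[str], list[str]],
    diagnose: Callable[[list[str]], str],
    fixes: dict[str, Callable[[str], None]],
    is_healthy: Callable[[str], bool],
) -> Incident:
    """Run investigation, diagnosis, fix, and verification for a failed service."""
    incident = Incident(service)
    logs = read_logs(service)                    # 2. investigation
    incident.diagnosis = diagnose(logs)          # 3. diagnosis
    incident.notes.append(f"diagnosis: {incident.diagnosis}")

    fix = fixes.get(incident.diagnosis)
    if fix is None:
        incident.notes.append("no known fix; escalating to human")
        return incident

    fix(service)                                 # 4. fix
    incident.fixed = is_healthy(service)         # 5. verification
    incident.notes.append(
        "verified" if incident.fixed else "fix did not help; escalating to human"
    )
    return incident


if __name__ == "__main__":
    # Fake environment standing in for the cluster (step 1 has already fired).
    state = {"gitea": "crashed"}
    result = heal(
        "gitea",
        read_logs=lambda s: ["container terminated, reason: OOMKilled"],
        diagnose=lambda logs: "oom" if any("OOMKilled" in l for l in logs) else "unknown",
        fixes={"oom": lambda s: state.update({s: "running"})},
        is_healthy=lambda s: state[s] == "running",
    )
    print(result)
```

Passing the steps in as functions keeps the loop testable without a cluster, and the returned record doubles as the incident log.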

Example Fixes

  • Pod crash loop → Check logs → Fix config → Restart
  • Certificate expiring → Trigger cert-manager renewal
  • Disk filling up → Clean old backups → Add alert threshold
  • Service unreachable → Check ingress → Fix routing

Key Design Principles

1. Everything is Code

No manual changes via SSH or web UIs. If it's not in Git, it doesn't exist. This gives a full audit trail of every change, easy rollback via git revert, and a setup that can be rebuilt from scratch.

2. AI as Operator, Not Owner

OpenClaw has access but follows strict rules: it can fix known issue patterns autonomously, it asks before making significant changes, and it documents everything it does. The human remains in control.
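
These rules could be expressed as a small decision gate. The sketch below is an assumption about how such a gate might look, not OpenClaw's real mechanism: the action names in `KNOWN_PATTERNS`, the `significant` flag, and the `decide` function are all hypothetical.

```python
from enum import Enum


class Decision(Enum):
    AUTO = "apply autonomously"
    ASK = "ask the human first"


# Hypothetical allowlist of issue patterns the agent may fix on its own.
KNOWN_PATTERNS = {"restart_pod", "renew_certificate", "clean_old_backups"}

# Every decision is recorded, so the human can review what happened and why.
audit_log: list[tuple[str, Decision]] = []


def decide(action: str, significant: bool = False) -> Decision:
    """Allow only known, non-significant actions without approval."""
    if action in KNOWN_PATTERNS and not significant:
        decision = Decision.AUTO
    else:
        decision = Decision.ASK
    audit_log.append((action, decision))
    return decision


if __name__ == "__main__":
    print(decide("restart_pod").value)
    print(decide("resize_vm").value)
    print(decide("restart_pod", significant=True).value)
```

The default is to ask: anything not explicitly on the allowlist falls through to the human.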

3. Defense in Depth

Health checks catch issues early. Logs provide investigation context. Alerts go out over multiple channels (Telegram, email). Scheduled audits catch drift.
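
A scheduled drift audit can lean on Terraform itself: `terraform plan -detailed-exitcode` exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The sketch below wraps that in Python; the function names and the status strings are my own, and `workdir` is assumed to be an already initialized Terraform directory.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in_sync"  # no changes: real infrastructure matches the code
    if code == 2:
        return "drift"    # plan succeeded and contains changes
    return "error"        # 1, or anything unexpected


def check_drift(workdir: str) -> str:
    """Run a read-only plan in workdir and report whether anything differs."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)


if __name__ == "__main__":
    for code in (0, 1, 2):
        print(code, interpret_plan_exit_code(code))
```

Note that a non-empty plan does not say which side moved: it can mean someone changed the infrastructure by hand, or that a commit has not been applied yet.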

4. Fail Safe, Not Fail Secure

Services should degrade gracefully. Prefer availability over perfect consistency. The AI can restart things but can't delete data.
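
The "restart but not delete" rule could be approximated with a command guard like the one below. It is a deliberately coarse sketch: the `DENY` table and `is_allowed` are hypothetical, and string matching like this is easy to bypass, so a real setup should also enforce the limit where it cannot be talked around (Kubernetes RBAC, restricted SSH users, read-only credentials).

```python
import shlex

# Hypothetical denylist: binary -> subcommands that can destroy data.
DENY = {
    "kubectl": {"delete"},
    "terraform": {"destroy"},
    "zfs": {"destroy"},
}


def is_allowed(command: str) -> bool:
    """Reject empty commands and any command using a denied subcommand."""
    tokens = shlex.split(command)
    if not tokens:
        return False
    binary, args = tokens[0], tokens[1:]
    blocked = DENY.get(binary, set())
    # Look at every argument, so flags placed before the verb don't hide it.
    return not any(arg in blocked for arg in args)


if __name__ == "__main__":
    for cmd in (
        "kubectl rollout restart deployment/gitea",
        "kubectl -n apps delete pvc data",
        "terraform destroy -auto-approve",
    ):
        print(f"{cmd!r}: {'allowed' if is_allowed(cmd) else 'blocked'}")
```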

Public Repository

I've published a sanitized version of my Infrastructure as Code setup:

GitHub: ndbroadbent/homeserver-terraform-ansible-public

It includes Terraform modules for Proxmox VMs/LXCs, Ansible roles for common services, K3s application manifests, and example configurations.

Getting Started

If you want to build something similar:

  1. Start with IaC: Get Terraform/Ansible managing your infra first
  2. Add monitoring: Gatus is simple and effective for health checks
  3. Centralize logs: Loki + Promtail is lightweight
  4. Add the AI layer: OpenClaw connects everything together

The AI layer is the force multiplier - it turns your monitoring from "alert and wait for human" to "detect, diagnose, and fix."