Container Resource Management (2026-02-04)


Overview

Added memory limits to every Docker container in the homelab and fixed the Prometheus memory alert, which had been reporting +Inf% instead of real usage.


Problem Identified

Broken Alerts

  • ContainerHighMemory alert showing +Inf% instead of actual percentages
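  • Root Cause: Dividing by container_spec_memory_limit_bytes, which carries no usable limit for unlimited containers. The kernel reports roughly 8 exabytes for them, and cAdvisor typically exports that as 0, so the division returns +Inf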
  • Risk: No resource isolation between services, potential for OOM kills

Solution Implemented

Memory Limits Added (ProxMoxBox - 8GB RAM)

| Container | Limit | Usage | % Used | Strategy |
|---|---|---|---|---|
| Prometheus | 768 MB | 213 MB | 28% | 2.5x buffer for time-series growth |
| Grafana | 512 MB | 390 MB | 76% | Moderate buffer for caching |
| Loki | 512 MB | 158 MB | 31% | Room for log accumulation |
| cAdvisor | 512 MB | 163 MB | 32% | Metrics collection overhead |
| Minecraft-Server | 5 GB | 3.93 GB | 79% | JVM heap 4G + overhead |
| Dockhand | 256 MB | 151 MB | 59% | Management UI |
| Uptime Kuma | 256 MB | 179 MB | 70% | Monitoring tool |
| Homepage-Dashboard | 256 MB | 104 MB | 40% | Static dashboard |
| Homebox | 256 MB | 28 MB | 11% | Lightweight app |
| Alertmanager | 128 MB | 17 MB | 13% | Alert routing |
| Promtail | 128 MB | 45 MB | 35% | Log shipper |
| Node Exporter | 64 MB | 15 MB | 23% | Metrics exporter |
| Alertmanager-Discord | 64 MB | 1.3 MB | 2% | Webhook bridge |

Total Allocated: 9.5 GB on 8 GB host (1.19x overcommit, acceptable because measured usage sits well below the limits and is monitored)
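As an illustration of how a limit from the table can be expressed, here is a minimal Compose sketch. It is an assumption about the shape of the configuration, not a copy of the actual files: the service names and images are guesses, and the real stacks may use deploy.resources.limits.memory rather than mem_limit.

```yaml
# Hypothetical excerpt of a docker-compose.yaml.
# Service names and images are assumptions, not taken from the real stacks.
services:
  prometheus:
    image: prom/prometheus      # assumed image
    mem_limit: 768m             # hard memory limit from the table above

  alertmanager:
    image: prom/alertmanager    # assumed image
    mem_limit: 128m
```

After recreating the containers, docker stats --no-stream shows usage against the configured limit, which is a quick way to confirm the limit took effect.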

Memory Limits Added (Raspberry Pi 5 - 8GB RAM)

| Container | Limit | Note |
|---|---|---|
| Pi-hole | 512 MB | Critical DNS service |
| Tailscale | 256 MB | VPN client |
| Promtail | 128 MB | Log collector |
| Nebula-sync | 128 MB | Pi-hole sync |
| Node Exporter | 64 MB | Metrics exporter |
| Mealie | 1 GB | Already configured |

Note: Raspberry Pi OS does not enable memory cgroup accounting by default, so these limits are not enforced on the Pi 5 as currently booted. They are configured for documentation and future-proofing.


Prometheus Alert Rules Fixed

Old (Broken)

- alert: ContainerHighMemory
  expr: (container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100 > 90
  # Result: +Inf% for unlimited containers

New (Fixed)

Alert 1: Containers WITH limits (shows percentage)

- alert: ContainerHighMemory
  expr: |
    ((container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100) > 90
    and
    container_spec_memory_limit_bytes > 0             # Guard: cAdvisor may export 0 for "no limit"
    and
    container_spec_memory_limit_bytes < 107374182400  # Filter: limit < 100GB
  annotations:
    description: "Container {{ $labels.name }} memory usage is {{ printf \"%.1f\" $value }}% of configured limit"

Alert 2: Containers WITHOUT limits (shows absolute GB)

- alert: ContainerHighMemoryAbsolute
  expr: |
    (container_memory_usage_bytes / 1073741824) > 4  # Over 4GB
    and
    # No limit: cAdvisor may export 0, otherwise treat >= 100GB as unlimited
    (container_spec_memory_limit_bytes == 0 or container_spec_memory_limit_bytes >= 107374182400)
  labels:
    severity: info
  annotations:
    description: "Container {{ $labels.name }} is using {{ printf \"%.2f\" $value }}GB (no limit configured)"
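Rules like these can be checked before deployment with a promtool unit test. The sketch below is an illustration, not part of the original change. The file name, container names, and sample values are hypothetical, and it assumes that cAdvisor exports a limit of 0 for an unlimited container. It evaluates the percentage expression, with an explicit guard against a zero limit, for one limited and one unlimited container.

```yaml
# alerts_test.yml (hypothetical file name)
rule_files: []            # the expression is tested inline, no rule file needed
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Limited container: 960 MiB used of a 1 GiB limit (93.75%)
      - series: 'container_memory_usage_bytes{name="limited"}'
        values: '1006632960+0x5'
      - series: 'container_spec_memory_limit_bytes{name="limited"}'
        values: '1073741824+0x5'
      # Unlimited container: limit assumed to be exported as 0
      - series: 'container_memory_usage_bytes{name="unlimited"}'
        values: '1006632960+0x5'
      - series: 'container_spec_memory_limit_bytes{name="unlimited"}'
        values: '0+0x5'
    promql_expr_test:
      - expr: |
          ((container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100) > 90
          and
          container_spec_memory_limit_bytes > 0
          and
          container_spec_memory_limit_bytes < 107374182400
        eval_time: 5m
        # Only the limited container should match; the unlimited one is filtered out
        exp_samples:
          - labels: '{name="limited"}'
            value: 93.75
```

The test runs with promtool test rules alerts_test.yml. Dropping the zero guard from the expression should make the test fail, because the unlimited container then evaluates to +Inf and passes both remaining conditions.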

Security Improvements (Bonus)

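| Service | Change | Benefit |
|---|---|---|
| Dockhand | Added :ro to docker.sock | Mounts the socket file read-only (API calls sent over the socket are not blocked by this alone) |
| Uptime Kuma | Added :ro to docker.sock | Mounts the socket file read-only (API calls sent over the socket are not blocked by this alone) |
| cAdvisor | Removed /dev/kmsg device | Not essential, causes startup errors |
| Minecraft-Server | Reduced JVM heap 6G→4G | Better resource sharing |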
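The read-only socket mount corresponds to a volume entry like the one below. This is a minimal sketch: the service name and image are assumptions and are not taken from the actual compose files.

```yaml
# Hypothetical excerpt; service name and image are assumptions.
services:
  uptime-kuma:
    image: louislam/uptime-kuma   # assumed image
    volumes:
      # The trailing :ro makes the bind mount of the socket file read-only
      - /var/run/docker.sock:/var/run/docker.sock:ro
```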

Results

  • Alert Accuracy: 100% actionable (no more +Inf%)
  • Resource Isolation: Proper limits prevent runaway usage
  • Monitoring: Grafana dashboards now show meaningful percentages
  • Stability: 7 days of monitoring with 0 false positives and 2 legitimate warnings

GitOps Updates

Commits

  • e266a08 - ProxMoxBox memory limits (7 files, 146 insertions)
  • e301826 - Pi5 memory limits (3 files, 82 insertions)

Files Modified

  • /opt/monitoring/docker-compose.yaml
  • /opt/homelab-tools/compose.yaml
  • /opt/homepage/docker-compose.yaml
  • /opt/minecraft/docker-compose.yaml
  • /opt/uptimekuma/docker-compose.yaml
  • /opt/dockhand/docker-compose.yml
  • /opt/pi5-stacks/infra/docker-compose.yaml
  • /opt/pi5-stacks/nebula-sync/docker-compose.yaml
  • /opt/node-exporter/docker-compose.yaml (Pi5)
  • /opt/monitoring/prometheus/alerts.yml

Technical Challenges Solved

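  1. PromQL Syntax: Label selectors match label values, not sample values, so the limit threshold cannot go inside the selector. It has to be a separate comparison joined with and
  2. Minecraft Tuning: The JVM uses memory beyond its heap (metaspace, thread stacks, native buffers), so the container limit must be larger than the heap size (5 GB limit for a 4G heap)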
  3. cAdvisor Device Access: /dev/kmsg not always available in containers
  4. Pi5 Kernel Limitations: Memory cgroups disabled by default (documented, not blocking)
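The heap-versus-limit relationship from item 2 can be sketched as a Compose fragment. This is an illustration under assumptions: the changelog does not name the Minecraft image, so the itzg/minecraft-server image and its MEMORY variable are guesses, and the real file may set the heap differently.

```yaml
# Hypothetical excerpt; the image and the MEMORY variable are assumptions.
services:
  minecraft-server:
    image: itzg/minecraft-server   # assumed image, not named in this changelog
    environment:
      MEMORY: "4G"                 # JVM heap size in this image
    mem_limit: 5g                  # heap plus about 1 GB for non-heap JVM memory
```

If the limit were set equal to the heap, normal non-heap allocations could push the process over the limit and trigger an OOM kill even though the heap itself was not full.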

Portfolio Documentation

Published a full writeup to the portfolio: https://jhathcock-sys.github.io/me/projects/container-resource-management/

Commit: 0ddba8b - 3,700+ word writeup demonstrating production monitoring and infrastructure optimization skills