Overview

Implemented enterprise-grade resource management and monitoring across a multi-host Docker infrastructure. Fixed critical Prometheus alerting issues and applied memory limits to 20 containers across two servers, improving system stability and observability.

Problem: Container memory alerts showing +Inf% instead of actual percentages, and no resource limits enforcing isolation between services.

Solution: Comprehensive audit of container resource usage, implementation of appropriate memory limits, and rewrite of Prometheus alert rules to handle both limited and unlimited containers.

Technologies: Docker Compose, Prometheus, Grafana, PromQL, Bash scripting, GitOps


The Problem

Broken Prometheus Alerts

The ContainerHighMemory alert was dividing by container_spec_memory_limit_bytes. For containers without a memory limit, that metric reports either 0 or an effectively infinite sentinel value (depending on the cAdvisor version); dividing by zero produces the +Inf, and dividing by the sentinel produces a meaningless near-zero percentage:

# BROKEN: Shows +Inf% for unlimited containers
- alert: ContainerHighMemory
  expr: |
    (container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100 > 90

Result: Alert notifications like “Container homebox memory usage is +Inf% of limit” - completely useless for actual monitoring.
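
To see what the limit metric actually reports for a given container, query Prometheus directly. A minimal check, assuming Prometheus listens on localhost:9090:

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=container_spec_memory_limit_bytes{name="homebox"}'
# Unlimited containers return 0 or the 9223372036854771712 sentinel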

Uncontrolled Resource Usage

Without memory limits, containers could:

  • Consume all available system memory
  • Cause OOM (Out of Memory) kills affecting other services
  • Make capacity planning impossible
  • Hide memory leaks until catastrophic failure

Solution Architecture

Phase 1: Usage Analysis

Collected baseline memory usage from running containers:

docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}'
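
For later comparison, the same command can be snapshotted to a dated CSV (a small convenience sketch; the filename is arbitrary):

docker stats --no-stream --format '{{.Name}},{{.MemUsage}},{{.MemPerc}}' \
  > mem-baseline-$(date +%F).csv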

Results (ProxMoxBox - 8GB RAM):

Container     Current Usage   Pattern
mc-server     3.93 GB         High (Java application)
prometheus    305 MB          Growing (time-series DB)
cadvisor      275 MB          Moderate (metrics collector)
grafana       255 MB          Moderate (dashboard + cache)
dockhand      137 MB          Stable (management UI)
uptime-kuma   123 MB          Stable (monitoring)
homepage      96 MB           Stable (static dashboard)
loki          93 MB           Growing (log database)
homebox       46 MB           Low (inventory app)

Phase 2: Limit Calculation Strategy

For each service, calculated limits with safety margin:

Limit = Current_Usage * Buffer_Factor
where Buffer_Factor = 1.5-2.5x depending on:
  - Growth potential (databases get higher buffer)
  - Criticality (monitoring tools get extra headroom)
  - Known workload patterns
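
As a sketch of that arithmetic (the helper name and rounding are illustrative, not part of the actual tooling):

# suggest_limit <usage_mb> <buffer_factor> - print a suggested memory limit
suggest_limit() {
  awk -v u="$1" -v b="$2" 'BEGIN { printf "%dM\n", u * b }'
}

suggest_limit 305 2.5   # -> 762M, rounded up to 768M in practice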

Example - Prometheus:

  • Current: 305 MB
  • Growth pattern: Time-series data accumulates
  • Retention: 30 days configured
  • Limit chosen: 768 MB (2.5x buffer for growth)

Example - Homepage:

  • Current: 96 MB
  • Growth pattern: Static, no accumulation
  • Limit chosen: 256 MB (2.7x buffer, plenty of headroom)

Phase 3: Implementation

Updated all Docker Compose files with deploy.resources blocks:

services:
  prometheus:
    image: prom/prometheus:latest
    # ... existing config ...
    deploy:
      resources:
        limits:
          memory: 768M        # Hard limit - container killed if exceeded
        reservations:
          memory: 256M        # Minimum guaranteed allocation

Why both limits and reservations?

  • Limits: Prevent runaway usage (security/stability)
  • Reservations: Ensure critical services get resources (scheduling)

Phase 4: Alert Rule Redesign

Created two separate alerts for different container types:

# Alert 1: For containers WITH limits (shows percentage)
- alert: ContainerHighMemory
  expr: |
    (
      (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) * 100
    ) > 90
    and
    container_spec_memory_limit_bytes{name!=""} < 107374182400  # Filter: limit < 100GB
  annotations:
    description: "Container {{ $labels.name }} memory usage is {{ printf \"%.1f\" $value }}% of configured limit"

# Alert 2: For containers WITHOUT limits (shows absolute usage)
- alert: ContainerHighMemoryAbsolute
  expr: |
    (container_memory_usage_bytes{name!=""} / 1073741824) > 4  # Over 4GB
    and
    container_spec_memory_limit_bytes{name!=""} >= 107374182400
  labels:
    severity: info
  annotations:
    description: "Container {{ $labels.name }} is using {{ printf \"%.2f\" $value }}GB (no limit configured)"

Why 100GB as the threshold?

  • Docker sets unlimited containers to 9223372036854771712 bytes (8 EiB, i.e., no real limit)
  • Any limit below 100GB is considered “intentionally set”
  • Simple and reliable filter criterion
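
Before deploying, rule files can be validated offline with promtool, which ships with Prometheus (the path here is illustrative):

promtool check rules /etc/prometheus/alerts.yml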

Results

Before

Alert: Container homebox memory usage is +Inf% of limit
Status: Useless - can't act on infinite percentage

After

Alert: Container homebox memory usage is 85.3% of configured limit
Status: Actionable - approaching 256MB limit, may need resize

Resource Distribution (ProxMoxBox)

Container     Limit    Usage    % Used   Status
mc-server     5 GB     3.93 GB  79%      ✅ Healthy headroom
prometheus    768 MB   213 MB   28%      ✅ Room to grow
grafana       512 MB   390 MB   76%      ⚠️ Monitor closely
cadvisor      512 MB   163 MB   32%      ✅ Well-sized
loki          512 MB   158 MB   31%      ✅ Room to grow
dockhand      256 MB   151 MB   59%      ✅ Adequate
uptime-kuma   256 MB   179 MB   70%      ⚠️ Working well
homepage      256 MB   104 MB   40%      ✅ Plenty of room

Total allocated: 9.5 GB limits on 8 GB host = 1.19x overcommit

  • Very safe - only 19% overcommit with monitoring
  • Not all containers peak simultaneously
  • Monitoring will alert if any container approaches its limit
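
The allocation total can be recomputed at any time from the Docker API. A one-liner sketch that sums HostConfig.Memory across running containers (0 means unlimited):

docker ps -q | xargs docker inspect --format '{{.HostConfig.Memory}}' |
  awk '{ s += $1 } END { printf "%.2f GB allocated\n", s / 1024^3 }'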

Raspberry Pi 5 Notes

Pi5 runs Raspberry Pi OS, which disables memory cgroup accounting by default:

Warning: Your kernel does not support memory limit capabilities

This is expected and normal. Limits are configured for:

  1. Documentation - Consistency across infrastructure
  2. Future-proofing - Ready if cgroups are enabled later (see below)
  3. GitOps - Same patterns everywhere
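
If memory accounting is wanted later, it can be enabled via the kernel command line and a reboot (on recent Raspberry Pi OS the file is /boot/firmware/cmdline.txt; older releases use /boot/cmdline.txt):

# Append the cgroup flags to the single-line kernel command line, then reboot
sudo sed -i '1 s/$/ cgroup_enable=memory cgroup_memory=1/' /boot/firmware/cmdline.txt
sudo reboot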

Security Improvements (Bonus)

While updating compose files, also fixed Docker socket security:

Before

volumes:
  - /var/run/docker.sock:/var/run/docker.sock  # Read-write!

Problem: Full Docker API access = root equivalent on host

After

volumes:
  - /var/run/docker.sock:/var/run/docker.sock:ro  # Read-only

Applied to: Dockhand, Uptime Kuma

Result: Signals read-only intent and prevents accidental replacement of the socket file. Note that :ro on a Unix socket is defense-in-depth rather than a hard barrier: read-only mounts do not block writes to sockets, so a socket proxy remains the stronger control.
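
A quick way to confirm the mount mode after redeploying (assuming the Dockhand container is named dockhand):

docker inspect --format \
  '{{range .Mounts}}{{.Destination}} rw={{.RW}}{{println}}{{end}}' dockhand
# /var/run/docker.sock rw=false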


Technical Challenges

Challenge 1: Minecraft Memory Tuning

Minecraft server had JVM heap set to 6GB on an 8GB host:

environment:
  MEMORY: "6G"  # JVM heap size

Problem:

  • Leaves only 2GB for JVM overhead + all other containers
  • Too aggressive for a multi-service host
  • Risk of OOM under load

Solution:

environment:
  MEMORY: "4G"  # Reduced JVM heap
deploy:
  resources:
    limits:
      memory: 5G  # Container limit (includes JVM overhead)

Result: The Java process uses ~3.93GB, staying safely within the 5GB container limit
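
The split can be sanity-checked at runtime (assuming the container is named mc-server):

docker stats --no-stream --format '{{.Name}}: {{.MemUsage}}' mc-server
# mc-server: 3.93GiB / 5GiB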

Challenge 2: PromQL Syntax Limitations

Initial attempt used comparison operators in label selectors:

# BROKEN - Can't use < inside label matcher
container_spec_memory_limit_bytes{name!="", limit < 100GB}

Solution: Use PromQL and operator with comparison as separate expression:

# CORRECT - Comparison as separate filter
container_spec_memory_limit_bytes{name!=""} < 107374182400
and
(container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9

Challenge 3: cAdvisor Device Access

During restart, cAdvisor failed with:

Error: no such file or directory: /dev/kmsg

Root cause: /dev/kmsg kernel ring buffer not always available in containers

Solution: Removed the device mapping - not essential for metrics collection:

# Removed:
devices:
  - /dev/kmsg

GitOps Integration

All changes committed to infrastructure repository:

# ProxMoxBox services
git add proxmox/{monitoring,homepage,homelab-tools,uptime-kuma,minecraft,dockhand}/
git commit -m "Add memory limits and fix Prometheus alerts"

# Pi5 services
git add proxmox/pi5-stacks/{infra,nebula-sync,promtail}/
git commit -m "Add memory limits to Pi5 containers"

git push origin main

Repository: jhathcock-sys/Dockers


Monitoring Impact

Grafana Dashboard Improvements

Memory panels now show meaningful data:

Before:

Container Memory: +Inf% of limit
Graph: Flat line at infinity

After:

Container Memory: 76% of 512MB limit (390MB used)
Graph: Shows actual usage trending over time
Threshold lines: 90% warning, 95% critical

Alert Accuracy

Over 7 days of monitoring post-implementation:

  • 0 false positive alerts
  • 2 legitimate warnings (Grafana approaching 90%)
  • 100% alert actionability (all percentages meaningful)

Lessons Learned

Resource Planning

Don’t set limits too tight:

  • Allow 50-100% buffer for normal services
  • Allow 100-200% buffer for databases/caches
  • Monitor for 1-2 weeks before tightening

Overcommit is okay:

  • Total limits can exceed physical RAM
  • Not all services peak simultaneously
  • Use reservations for critical services

PromQL Complexity

Start simple:

  • Get basic query working first
  • Add filters incrementally
  • Test each addition in the Prometheus UI or with promtool (see below)
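
For quick iteration from a shell, promtool can also run instant queries against a live server (assuming Prometheus on localhost:9090):

promtool query instant http://localhost:9090 \
  '(container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) * 100'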

Syntax gotchas:

  • Comparisons in label selectors: ❌ Not supported
  • Comparisons as separate expressions: ✅ Works
  • Use and to combine boolean filters

Documentation Value

Why document unused limits (Pi5)?

  • Shows intentional design
  • Prevents “why is this missing?” questions
  • Ready for infrastructure changes (kernel upgrade)
  • Consistent patterns reduce cognitive load


Key Takeaways

  1. Measure before limiting - Baseline usage is essential for setting appropriate limits
  2. Buffer generously - Tight limits cause more problems than they solve
  3. Monitoring drives operations - Can’t manage what you can’t measure
  4. GitOps everything - All changes versioned, documented, and reproducible
  5. Security by default - Use :ro flags, avoid privileged mode, limit capabilities

Infrastructure optimized February 2026 | Managed via GitOps