Overview

Implemented enterprise-grade resource management and monitoring across a multi-host Docker infrastructure. Fixed critical Prometheus alerting issues and applied memory limits to 20 containers across two servers, improving system stability and observability.

Problem: Container memory alerts showing +Inf% instead of actual percentages, and no resource limits enforcing isolation between services.

Solution: Comprehensive audit of container resource usage, implementation of appropriate memory limits, and rewrite of Prometheus alert rules to handle both limited and unlimited containers.

Technologies: Docker Compose, Prometheus, Grafana, PromQL, Bash scripting, GitOps


The Problem

Broken Prometheus Alerts

The ContainerHighMemory alert was dividing by container_spec_memory_limit_bytes. For containers without a memory limit, that metric reports either 0 or an effectively infinite sentinel value (depending on the cAdvisor version); dividing by zero produces the +Inf, and dividing by the sentinel produces a meaningless near-zero percentage:

# BROKEN: Shows +Inf% for unlimited containers
- alert: ContainerHighMemory
  expr: |
    (container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100 > 90

Result: Alert notifications like “Container homebox memory usage is +Inf% of limit” - completely useless for actual monitoring.
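
To see what the limit metric actually reports for a given container, query Prometheus directly. A minimal check, assuming Prometheus listens on localhost:9090:

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=container_spec_memory_limit_bytes{name="homebox"}'
# Unlimited containers return 0 or the 9223372036854771712 sentinel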

Uncontrolled Resource Usage

Without memory limits, containers could:

  • Consume all available system memory
  • Cause OOM (Out of Memory) kills affecting other services
  • Make capacity planning impossible
  • Hide memory leaks until catastrophic failure

Solution Architecture

Phase 1: Usage Analysis

Collected baseline memory usage from running containers:

docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}'
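
For later comparison, the same command can be snapshotted to a dated CSV (a small convenience sketch; the filename is arbitrary):

docker stats --no-stream --format '{{.Name}},{{.MemUsage}},{{.MemPerc}}' \
  > mem-baseline-$(date +%F).csv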

Results (ProxMoxBox - 8GB RAM):

Container     Current Usage   Pattern
mc-server     3.93 GB         High (Java application)
prometheus    305 MB          Growing (time-series DB)
cadvisor      275 MB          Moderate (metrics collector)
grafana       255 MB          Moderate (dashboard + cache)
dockhand      137 MB          Stable (management UI)
uptime-kuma   123 MB          Stable (monitoring)
homepage      96 MB           Stable (static dashboard)
loki          93 MB           Growing (log database)
homebox       46 MB           Low (inventory app)

Phase 2: Limit Calculation Strategy

For each service, calculated limits with safety margin:

Limit = Current_Usage * Buffer_Factor
where Buffer_Factor = 1.5-2.5x depending on:
  - Growth potential (databases get higher buffer)
  - Criticality (monitoring tools get extra headroom)
  - Known workload patterns
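
As a sketch of that arithmetic (the helper name and rounding are illustrative, not part of the actual tooling):

# suggest_limit <usage_mb> <buffer_factor> - print a suggested memory limit
suggest_limit() {
  awk -v u="$1" -v b="$2" 'BEGIN { printf "%dM\n", u * b }'
}

suggest_limit 305 2.5   # -> 762M, rounded up to 768M in practice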

Example - Prometheus:

  • Current: 305 MB
  • Growth pattern: Time-series data accumulates
  • Retention: 30 days configured
  • Limit chosen: 768 MB (2.5x buffer for growth)

Example - Homepage:

  • Current: 96 MB
  • Growth pattern: Static, no accumulation
  • Limit chosen: 256 MB (2.7x buffer, plenty of headroom)

Phase 3: Implementation

Updated all Docker Compose files with deploy.resources blocks:

services:
  prometheus:
    image: prom/prometheus:latest
    # ... existing config ...
    deploy:
      resources:
        limits:
          memory: 768M        # Hard limit - container killed if exceeded
        reservations:
          memory: 256M        # Minimum guaranteed allocation

Why both limits and reservations?

  • Limits: Prevent runaway usage (security/stability)
  • Reservations: Ensure critical services get resources (scheduling)

Phase 4: Alert Rule Redesign

Created two separate alerts for different container types:

# Alert 1: For containers WITH limits (shows percentage)
- alert: ContainerHighMemory
  expr: |
    (
      (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) * 100
    ) > 90
    and
    container_spec_memory_limit_bytes{name!=""} < 107374182400  # Filter: limit < 100GB
  annotations:
    description: "Container {{ $labels.name }} memory usage is {{ printf \"%.1f\" $value }}% of configured limit"

# Alert 2: For containers WITHOUT limits (shows absolute usage)
- alert: ContainerHighMemoryAbsolute
  expr: |
    (container_memory_usage_bytes{name!=""} / 1073741824) > 4  # Over 4GB
    and
    container_spec_memory_limit_bytes{name!=""} >= 107374182400
  labels:
    severity: info
  annotations:
    description: "Container {{ $labels.name }} is using {{ printf \"%.2f\" $value }}GB (no limit configured)"

Why 100GB as the threshold?

  • Docker sets unlimited containers to 9223372036854771712 bytes (8 EiB, i.e., no real limit)
  • Any limit below 100GB is considered “intentionally set”
  • Simple and reliable filter criterion
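
Before deploying, rule files can be validated offline with promtool, which ships with Prometheus (the path here is illustrative):

promtool check rules /etc/prometheus/alerts.yml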

Results

Before

Alert: Container homebox memory usage is +Inf% of limit
Status: Useless - can't act on infinite percentage

After

Alert: Container homebox memory usage is 85.3% of configured limit
Status: Actionable - approaching 256MB limit, may need resize

Resource Distribution (ProxMoxBox)

Container     Limit    Usage    % Used   Status
mc-server     5 GB     3.93 GB  79%      ✅ Healthy headroom
prometheus    768 MB   213 MB   28%      ✅ Room to grow
grafana       512 MB   390 MB   76%      ⚠️ Monitor closely
cadvisor      512 MB   163 MB   32%      ✅ Well-sized
loki          512 MB   158 MB   31%      ✅ Room to grow
dockhand      256 MB   151 MB   59%      ✅ Adequate
uptime-kuma   256 MB   179 MB   70%      ⚠️ Working well
homepage      256 MB   104 MB   40%      ✅ Plenty of room

Total allocated: 9.5 GB limits on 8 GB host = 1.19x overcommit

  • Very safe - only 19% overcommit with monitoring
  • Not all containers peak simultaneously
  • Monitoring will alert if any container approaches its limit
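
The allocation total can be recomputed at any time from the Docker API. A one-liner sketch that sums HostConfig.Memory across running containers (0 means unlimited):

docker ps -q | xargs docker inspect --format '{{.HostConfig.Memory}}' |
  awk '{ s += $1 } END { printf "%.2f GB allocated\n", s / 1024^3 }'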

Raspberry Pi 5 Notes

Pi5 runs Raspberry Pi OS, which disables memory cgroup accounting by default:

Warning: Your kernel does not support memory limit capabilities

This is expected and normal. Limits are configured for:

  1. Documentation - Consistency across infrastructure
  2. Future-proofing - Ready if cgroups are enabled later (see below)
  3. GitOps - Same patterns everywhere
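
If memory accounting is wanted later, it can be enabled via the kernel command line and a reboot (on recent Raspberry Pi OS the file is /boot/firmware/cmdline.txt; older releases use /boot/cmdline.txt):

# Append the cgroup flags to the single-line kernel command line, then reboot
sudo sed -i '1 s/$/ cgroup_enable=memory cgroup_memory=1/' /boot/firmware/cmdline.txt
sudo reboot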

Security Improvements (Bonus)

While updating compose files, also fixed Docker socket security:

Before

volumes:
  - /var/run/docker.sock:/var/run/docker.sock  # Read-write!

Problem: Full Docker API access = root equivalent on host

After

volumes:
  - /var/run/docker.sock:/var/run/docker.sock:ro  # Read-only

Applied to: Dockhand, Uptime Kuma

Result: Signals read-only intent and prevents accidental replacement of the socket file. Note that :ro on a Unix socket is defense-in-depth rather than a hard barrier: read-only mounts do not block writes to sockets, so a socket proxy remains the stronger control.
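
A quick way to confirm the mount mode after redeploying (assuming the Dockhand container is named dockhand):

docker inspect --format \
  '{{range .Mounts}}{{.Destination}} rw={{.RW}}{{println}}{{end}}' dockhand
# /var/run/docker.sock rw=false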


Technical Challenges

Challenge 1: Minecraft Memory Tuning

Minecraft server had JVM heap set to 6GB on an 8GB host:

environment:
  MEMORY: "6G"  # JVM heap size

Problem:

  • Leaves only 2GB for JVM overhead + all other containers
  • Too aggressive for a multi-service host
  • Risk of OOM under load

Solution:

environment:
  MEMORY: "4G"  # Reduced JVM heap
deploy:
  resources:
    limits:
      memory: 5G  # Container limit (includes JVM overhead)

Result: The Java process uses ~3.93GB, staying safely within the 5GB container limit
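
The split can be sanity-checked at runtime (assuming the container is named mc-server):

docker stats --no-stream --format '{{.Name}}: {{.MemUsage}}' mc-server
# mc-server: 3.93GiB / 5GiB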

Challenge 2: PromQL Syntax Limitations

Initial attempt used comparison operators in label selectors:

# BROKEN - Can't use < inside label matcher
container_spec_memory_limit_bytes{name!="", limit < 100GB}

Solution: Use PromQL and operator with comparison as separate expression:

# CORRECT - Comparison as separate filter
container_spec_memory_limit_bytes{name!=""} < 107374182400
and
(container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9

Challenge 3: cAdvisor Device Access

During restart, cAdvisor failed with:

Error: no such file or directory: /dev/kmsg

Root cause: /dev/kmsg kernel ring buffer not always available in containers

Solution: Removed the device mapping - not essential for metrics collection:

# Removed:
devices:
  - /dev/kmsg

GitOps Integration

All changes committed to infrastructure repository:

# ProxMoxBox services
git add proxmox/{monitoring,homepage,homelab-tools,uptime-kuma,minecraft,dockhand}/
git commit -m "Add memory limits and fix Prometheus alerts"

# Pi5 services
git add proxmox/pi5-stacks/{infra,nebula-sync,promtail}/
git commit -m "Add memory limits to Pi5 containers"

git push origin main

Repository: jhathcock-sys/Dockers


Monitoring Impact

Grafana Dashboard Improvements

Memory panels now show meaningful data:

Before:

Container Memory: +Inf% of limit
Graph: Flat line at infinity

After:

Container Memory: 76% of 512MB limit (390MB used)
Graph: Shows actual usage trending over time
Threshold lines: 90% warning, 95% critical

Alert Accuracy

Over 7 days of monitoring post-implementation:

  • 0 false positive alerts
  • 2 legitimate warnings (Grafana approaching 90%)
  • 100% alert actionability (all percentages meaningful)

Lessons Learned

Resource Planning

Don’t set limits too tight:

  • Allow 50-100% buffer for normal services
  • Allow 100-200% buffer for databases/caches
  • Monitor for 1-2 weeks before tightening

Overcommit is okay:

  • Total limits can exceed physical RAM
  • Not all services peak simultaneously
  • Use reservations for critical services

PromQL Complexity

Start simple:

  • Get basic query working first
  • Add filters incrementally
  • Test each addition in the Prometheus UI or with promtool (see below)
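
For quick iteration from a shell, promtool can also run instant queries against a live server (assuming Prometheus on localhost:9090):

promtool query instant http://localhost:9090 \
  '(container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) * 100'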

Syntax gotchas:

  • Comparisons in label selectors: ❌ Not supported
  • Comparisons as separate expressions: ✅ Works
  • Use and to combine boolean filters

Documentation Value

Why document unused limits (Pi5)?

  • Shows intentional design
  • Prevents “why is this missing?” questions
  • Ready for infrastructure changes (kernel upgrade)
  • Consistent patterns reduce cognitive load


Key Takeaways

  1. Measure before limiting - Baseline usage is essential for setting appropriate limits
  2. Buffer generously - Tight limits cause more problems than they solve
  3. Monitoring drives operations - Can’t manage what you can’t measure
  4. GitOps everything - All changes versioned, documented, and reproducible
  5. Security by default - Use :ro flags, avoid privileged mode, limit capabilities

Infrastructure optimized February 2026 | Managed via GitOps