
Container Observability: Logging and Monitoring

Once your application is running in production (File 18), your focus shifts from deployment to observability. You need to know:

  1. Logging: What is the application doing right now? (Debug/Event data)

  2. Monitoring: Is the application healthy and performant? (Time-series metrics)

In a containerized world, you cannot SSH into a server and tail -f /var/log/syslog. Pods are ephemeral and distributed across many nodes. You need a centralized strategy.

1. Container Logging: The "Stdout" Standard

The Golden Rule: An application running in a container should always write its logs to Standard Output (stdout) and Standard Error (stderr).

  • Do not write to a file inside the container (e.g., /var/log/myapp.log). That data is hard to extract and is lost when the container is removed.

  • Do configure your logger (Log4j, Winston, Python logging) to print to the console.

How Docker Handles Logs

When an app writes to stdout, the Docker daemon captures it. By default, Docker uses the json-file logging driver. It writes these streams to a JSON file on the host disk.
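Concretely, each stdout line ends up on the host under /var/lib/docker/containers/&lt;container-id&gt;/&lt;container-id&gt;-json.log as one JSON object per line (the log content here is illustrative):

```json
{"log": "GET /api/users 200 12ms\n", "stream": "stdout", "time": "2024-05-14T09:26:03.123456789Z"}
```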

Viewing Logs:

docker logs my-container
docker logs -f my-container  # Follow (tail)

Log Rotation (Crucial Configuration)

The default Docker configuration does not rotate logs. If your app is chatty, the log file on the host can grow until it fills the entire disk, crashing the server.

Best Practice: Configure log rotation in /etc/docker/daemon.json (restart the Docker daemon afterwards; the settings apply only to newly created containers).

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

2. Centralized Logging Architecture

In a cluster (Swarm or Kubernetes), inspecting containers one by one is impractical: a single request may touch containers spread across many nodes. You need to aggregate the logs.

The Node-Agent (DaemonSet) Pattern:

  1. Collector: A logging agent runs on every node (as a DaemonSet in K8s). Common agents are Fluentd, Fluent Bit, or Filebeat.

  2. Scrape: The agent mounts the host's log directory (/var/lib/docker/containers) and "tails" all the log files.

  3. Forward: It enriches the logs with metadata (Pod name, Namespace, Node IP) and forwards them to a central backend.

  4. Backend: The logs are stored in a searchable database.

    • ELK Stack: Elasticsearch (Store), Logstash (Process), Kibana (Visualize).

    • EFK Stack: Elasticsearch, Fluentd, Kibana.

    • Cloud Native: Grafana Loki (optimized for logs).
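The collect/enrich/forward steps above can be sketched with a minimal Fluent Bit configuration. This is a sketch only: the Elasticsearch hostname (elasticsearch.logging.svc) is an assumed in-cluster service name, and paths and tags should be adjusted to your cluster.

```
[INPUT]
    Name   tail
    Path   /var/lib/docker/containers/*/*.log
    Parser docker
    Tag    kube.*

[FILTER]
    # Enriches each record with Pod name, Namespace, labels, etc.
    Name  kubernetes
    Match kube.*

[OUTPUT]
    Name  es
    Match *
    Host  elasticsearch.logging.svc
    Port  9200
```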

3. Monitoring: Metrics and Time-Series

Logging tells you what happened. Monitoring tells you how the system is performing.

The Standard: Prometheus

Prometheus is the de facto standard for container monitoring. It uses a pull model.

  1. Instrumentation: Your application exposes a /metrics HTTP endpoint. This endpoint returns data in a simple text format (e.g., http_requests_total 1024).

  2. Scraping: The Prometheus server is configured to "scrape" (poll) this endpoint every few seconds (e.g., every 15s).

  3. Storage: Prometheus stores this time-series data efficiently.

  4. Alerting: You define rules (e.g., "If http_500_errors > 5% for 5 minutes"). If triggered, Prometheus sends an alert to PagerDuty, Slack, or Email.
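The instrumentation step above can be sketched with the Python standard library alone. A real service would normally use an official client library (e.g., prometheus_client) rather than hand-rolling the format; the metric name and the self-scrape at the end are illustrative.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

_requests_total = 0  # hypothetical app counter

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global _requests_total
        _requests_total += 1
        if self.path == "/metrics":
            # Prometheus text exposition format: "# HELP" and "# TYPE"
            # lines, then one "name value" sample per line.
            body = (
                "# HELP http_requests_total Total HTTP requests served.\n"
                "# TYPE http_requests_total counter\n"
                f"http_requests_total {_requests_total}\n"
            ).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

# Bind to an ephemeral port and serve in the background, the way an
# app exposes /metrics alongside its real traffic port.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Simulate one Prometheus scrape of the endpoint.
url = f"http://127.0.0.1:{server.server_port}/metrics"
scrape = urllib.request.urlopen(url).read().decode()
print(scrape)
server.shutdown()
```

Prometheus would then be pointed at this port via a scrape_configs entry (or discovered automatically via Kubernetes service discovery).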

cAdvisor (Container Advisor)

For container-level metrics (CPU usage, Memory usage, Network I/O), you don't need to instrument your code. cAdvisor is a tool (often built into the Kubelet) that auto-discovers all containers on a node and exposes their resource usage metrics to Prometheus.
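Assuming cAdvisor's standard metric names, typical resource queries look like:

```promql
# Per-container CPU usage (in cores) averaged over the last 5 minutes
rate(container_cpu_usage_seconds_total{container!=""}[5m])

# Current memory working set, summed per pod
sum(container_memory_working_set_bytes{container!=""}) by (pod)
```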

Visualization: Grafana

Prometheus collects data, but Grafana visualizes it. You connect Grafana to Prometheus and build dashboards.

  • Cluster Level: "How many nodes are up? What is the total CPU load?"

  • Pod Level: "Is this specific pod restarting frequently?"

  • App Level: "What is the 95th percentile latency of my API?"
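The app-level question above is typically answered with a PromQL histogram query. This sketch assumes your application exports a latency histogram named http_request_duration_seconds (a common convention, not a given):

```promql
# 95th percentile request latency over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```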

4. Health Checks (Liveness vs. Readiness)

Orchestrators need to know the status of your application to make decisions. You provide this via Probes.

Liveness Probe ("Am I alive?")

  • Question: Is the application crashed or deadlocked?

  • Action: If this fails, Kubernetes restarts the container.

  • Example: A simple HTTP GET to /healthz.

Readiness Probe ("Am I ready to work?")

  • Question: Is the application ready to accept traffic? (e.g., Is the database connection established? Is the cache warmed up?)

  • Action: If this fails, Kubernetes stops sending traffic to this pod (removes it from the Service load balancer), but does not restart it.

  • Benefit: Prevents 502 errors during startup and during rolling updates, when new pods exist but are not yet ready to serve.

Example in Kubernetes YAML:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
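On the application side, the two endpoints those probes hit can be sketched in plain Python. The readiness flag gated on "startup work finished" is illustrative; a real app would check its actual dependencies (database connection, cache warm-up).

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag, set once startup work completes.
ready = threading.Event()

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: if the process can answer at all, it is alive.
            self.send_response(200)
        elif self.path == "/ready":
            # Readiness: Kubernetes treats any non-2xx as "not ready".
            self.send_response(200 if ready.is_set() else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), ProbeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

def probe(path):
    """Return the HTTP status code a probe would see."""
    try:
        return urllib.request.urlopen(
            f"http://127.0.0.1:{server.server_port}{path}").status
    except urllib.error.HTTPError as e:
        return e.code

before = probe("/ready")   # 503: alive, but not yet ready for traffic
ready.set()                # startup finished (e.g., DB connected)
after = probe("/ready")    # 200: pod may now receive traffic
alive = probe("/healthz")  # 200 throughout, so no restart is triggered
print(before, after, alive)
server.shutdown()
```

Note how the pod is "alive but not ready" during the window before ready.set(): the liveness probe passes the whole time (no restart), while the readiness probe keeps the pod out of the Service until startup completes.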