Container orchestration

Monitor Kubernetes

External monitoring for Kubernetes ingresses, services, and workloads

Monitor types: HTTP · TCP · SSL · Heartbeat. Check interval: 1 minute

Why monitor Kubernetes

Kubernetes gives you liveness and readiness probes, which tell you whether a pod thinks it's healthy from the inside. That's necessary but not sufficient. An internal probe cannot see a misconfigured ingress, an expired TLS certificate, a DNS record pointing at a dead load balancer, or a CDN caching a 500. Post-mortems of Kubernetes production outages tend to land on the same lesson: you need a check from outside the cluster, hitting the real user-facing URL, at the frequency your SLA requires.

For services without public endpoints (workers, schedulers, batch jobs), the pattern is heartbeats. A CronJob or workload posts to a heartbeat URL on a schedule; if Hyperping doesn't receive the expected ping within the window, it alerts. This catches stuck workers that liveness probes can't: a worker process that's running but not processing the queue passes its internal health check but fails the heartbeat.
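The heartbeat pattern above can be sketched as a standard Kubernetes CronJob. This is a minimal example, not a prescribed setup: the schedule, image tag, and heartbeat URL placeholder are assumptions — copy the real URL from your Hyperping dashboard.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: heartbeat-ping
spec:
  schedule: "*/5 * * * *"   # every 5 minutes, matching the heartbeat window
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: ping
              image: curlimages/curl:8.7.1   # entrypoint is curl, so args are curl flags
              args:
                - "-fsS"                     # fail on HTTP errors so the Job is marked failed
                - "https://REPLACE-WITH-YOUR-HEARTBEAT-URL"
```

Because `-f` makes curl exit non-zero on an HTTP error, a failed ping also surfaces as a failed Job inside the cluster, while the missed heartbeat alerts you externally.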

Setting up Kubernetes monitoring

  1. Expose a /healthz path on the ingress that returns 200 only when the pod can reach its own dependencies (DB, queue, downstream services).
  2. Monitor https://api.example.com/healthz with an HTTP monitor at 1-minute intervals from at least 3 regions.
  3. Add an SSL monitor on the same hostname. cert-manager occasionally fails to renew: DNS-01 propagation issues, HTTP-01 challenge reachability, or a deactivated ACME account are common culprits.
  4. For internal services without public ingresses, use Hyperping heartbeat monitors: have a CronJob POST to the heartbeat URL every 5 minutes.
  5. Alert on failure from 2 of 3 regions to avoid paging on single-region network blips between Hyperping and your cluster.

Example health endpoint

service-and-ingress.yaml
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts: [api.example.com]
      secretName: api-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /healthz
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
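
The Service and Ingress above only route traffic; the pods behind them still need probes wired to the same /healthz path so Kubernetes stops sending traffic to unhealthy replicas. A minimal Deployment sketch — the image name and probe timings are illustrative assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:              # gates traffic from the Service
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
          livenessProbe:               # restarts the container if it wedges
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 20
```

With this in place, the internal probes and the external Hyperping monitor exercise the same endpoint, so a discrepancy between them points at the ingress, TLS, or DNS layer rather than the application.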

What typically goes wrong with Kubernetes

Pod OOMKilled under real traffic
Memory limits that passed load tests fail under production traffic patterns. Pods restart, readiness probes fail briefly, and external requests hit 502s. Internal dashboards recover fast; external monitoring shows the user-visible outage duration.
CrashLoopBackOff cascading from a config change
A bad ConfigMap or Secret rollout puts every replica in backoff. Liveness probes mark pods unhealthy, the deployment never converges, and the ingress returns 503 until the rollback. External uptime catches this in seconds; kubectl rollout status takes 5+ minutes.
cert-manager failing to renew TLS
Renewal fails silently when a DNS-01 solver breaks, an HTTP-01 challenge becomes unreachable, the ACME account is deactivated, or a CRD changes during an upgrade. Let's Encrypt certificates currently expire every 90 days, and industry rules are scheduled to shrink maximum certificate lifetimes to 47 days by 2029. SSL monitoring is the only reliable catch.
Node pressure causing pod eviction
Disk or memory pressure on a node triggers eviction. The scheduler takes time to reschedule; traffic to that replica's share fails during the gap.
Service mesh sidecar not ready
Istio/Linkerd proxies start after the app container. If the readiness probe doesn't account for the sidecar, traffic gets sent to a replica that can't actually route it.
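
For Istio specifically, one common mitigation is to hold the application container until the sidecar proxy is ready. A sketch of the per-pod annotation (Istio supports `holdApplicationUntilProxyStarts` in its proxy config; verify against your Istio version, and note Linkerd handles startup ordering differently):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  # ...selector and replicas omitted for brevity
  template:
    metadata:
      annotations:
        proxy.istio.io/config: |
          holdApplicationUntilProxyStarts: true
```

This keeps the app from receiving (or initiating) traffic before the proxy can route it, closing the startup gap described above.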

Frequently asked questions

Do I still need external monitoring if I have Prometheus?
Yes. Prometheus scrapes metrics from inside the cluster. If the ingress controller or the cluster network is broken, Prometheus may not even fire an alert. External monitoring proves the path from your users to your service is working.
How do I monitor a Kubernetes CronJob?
Use a heartbeat monitor. Add a curl to a Hyperping heartbeat URL at the end of the job script. If the job fails or never runs, the heartbeat is missed and you get paged.
Can Hyperping reach services inside a private cluster?
Only via a public ingress or a VPN tunnel. For fully private workloads, use heartbeat monitors so the workload pushes to Hyperping instead of Hyperping pulling from it.
Should I monitor every ingress or just the main API?
Every ingress with a distinct SLA. A status page being down damages user trust; an internal admin panel being down is annoying but not urgent. Separate monitors let you route alerts accordingly.
Get started

Start monitoring in the next 5 minutes.

Stop letting customers discover your outages first. Set up monitoring, status pages, on-call, and alerts before your next coffee break.

14-day free trial. No card required.