Database

Monitor PostgreSQL

Monitor PostgreSQL availability, replication lag, and connection health

Check types: TCP, HTTP (via health endpoint), Heartbeat. Interval: 1 minute

Why monitor PostgreSQL

Databases have more silent failure modes than web apps. Nothing crashes, but writes start failing, or reads get slower, or replication drifts. Most of these never surface as "down" in any dashboard; they show up as user complaints about latency or lost data. The right model is not to monitor the database directly (exposing Postgres to a monitoring service is a security liability) but to monitor a dedicated /db-health endpoint from inside your app. That endpoint opens a real connection, runs a trivial query, and reports pass/fail. This monitors the actual path your traffic takes, including the connection pooler, SSL negotiation, auth, and DNS, which is usually where the outage actually happens. Pair this with heartbeats on backup jobs and replica sync jobs. A silent backup failure discovered during a restore is how small companies die.

Setting up PostgreSQL monitoring

  1. Do not expose Postgres directly to the public internet, even for monitoring. Instead, expose a thin /db-health endpoint from your app that opens a real connection.
  2. The endpoint should run SELECT 1 with a 3-second timeout. Return 503 on any failure so monitors flag it correctly.
  3. Monitor the endpoint URL with an HTTP check at 1-minute intervals, with a 10-second request timeout.
  4. For managed Postgres (RDS, Supabase, Neon), add a second monitor on the provider status page. Provider-side incidents often precede your own errors.
  5. Add a heartbeat monitor on your backup job so a failed pg_dump pages you instead of being discovered during a restore.
  6. Alert on 2 consecutive failures to absorb brief connection blips during failover or autovacuum.

Example health endpoint

health/db.py
# Expose a /db-health endpoint that your app already serves.
# Monitor that URL, not the DB directly, so the check sees
# the real connection path (pooler, SSL, auth, DNS).

import os

from fastapi import APIRouter, Response
import psycopg

DATABASE_URL = os.environ["DATABASE_URL"]

router = APIRouter()

@router.get("/db-health")
async def db_health(response: Response):
    try:
        async with await psycopg.AsyncConnection.connect(
            DATABASE_URL, connect_timeout=3
        ) as conn:
            async with conn.cursor() as cur:
                await cur.execute("SELECT 1, pg_is_in_recovery()")
                row = await cur.fetchone()
        return {"status": "ok", "replica": row[1]}
    except Exception as e:
        response.status_code = 503
        return {"status": "down", "error": str(e)[:120]}

What typically goes wrong with PostgreSQL

Connection pool exhausted
A slow query pins connections; new requests queue until they time out. From inside, the DB's metrics look fine (the database is busy, not down). From outside, every request 500s. This is the #1 symptom most teams ship Postgres monitoring to catch.
Idle-in-transaction bloat
An app that forgets to commit leaves transactions open. Vacuum cannot reclaim tuples, table bloat grows, queries slow down, p99 latency climbs over days. A latency monitor on the health endpoint catches the degradation weeks before users complain.
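To catch the culprit sessions directly, pg_stat_activity exposes how long each transaction has been open. A hedged sketch (query and helper names are illustrative):

```python
from datetime import timedelta

# Sessions stuck "idle in transaction"; xact_start marks when the tx opened.
IDLE_TX_QUERY = """
SELECT pid, now() - xact_start AS open_for
FROM pg_stat_activity
WHERE state = 'idle in transaction';
"""

def flag_idle_transactions(rows, max_open=timedelta(minutes=5)):
    """rows: (pid, open_for) tuples as returned by the query above."""
    return [pid for pid, open_for in rows if open_for > max_open]
```

Postgres can also enforce this server-side via the idle_in_transaction_session_timeout setting, which kills offending sessions after the configured duration.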
Replication lag exceeding RPO
Logical or streaming replication falls behind. If you fail over, you lose the lag window of data. Most managed providers expose lag in metrics; external heartbeats on both primary and replica catch drift the metrics miss.
Disk full
WAL growth during high write load, failed archive command, or a runaway query creating temp files fills the disk. The DB refuses writes, the app 500s. This is rarely graceful and almost always at 3am.
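A simple guard is to track both the WAL directory size (from SQL) and free space on the data volume (from the host). A sketch under stated assumptions: pg_ls_waldir() requires PostgreSQL 10+ and a role with pg_monitor-level access, and the 15% headroom threshold is illustrative.

```python
import shutil

# WAL size in bytes, from SQL (PG 10+, needs pg_monitor or superuser).
WAL_SIZE_QUERY = "SELECT coalesce(sum(size), 0) FROM pg_ls_waldir();"

def headroom_ok(free_bytes: int, total_bytes: int, min_free_ratio: float = 0.15) -> bool:
    """True while at least min_free_ratio of the volume is still free."""
    return free_bytes >= min_free_ratio * total_bytes

def data_volume_ok(path: str, min_free_ratio: float = 0.15) -> bool:
    usage = shutil.disk_usage(path)  # host view of the data directory's volume
    return headroom_ok(usage.free, usage.total, min_free_ratio)
```

Alert on the trend, not just the threshold: a disk filling at a steady rate gives you hours of warning before writes fail.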
Failover partial success
Managed Postgres fails over; DNS updates, but the old connection pool in your app keeps hitting the old primary (now read-only). Writes fail with "cannot execute ... in a read-only transaction". External monitoring sees 503s immediately. Internal metrics may still look healthy for a minute.
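The app-side fix is to treat the read-only error as a signal to recycle the pool so connections re-resolve DNS to the new primary. A sketch: psycopg3 errors expose the SQLSTATE on a .sqlstate attribute, and 25006 is the standard code for "cannot execute ... in a read-only transaction"; the function name here is illustrative.

```python
READ_ONLY_SQLSTATE = "25006"  # "cannot execute ... in a read-only transaction"

def should_recycle_pool(exc: Exception) -> bool:
    """On 25006, the pool is likely pinned to the demoted primary;
    tearing it down forces fresh connections to the new one."""
    return getattr(exc, "sqlstate", None) == READ_ONLY_SQLSTATE
```

Call this in your error handler and, when it returns True, close the pool rather than retrying the write against the same stale connection.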

Frequently asked questions

Can I monitor Postgres directly over TCP?
You can monitor that port 5432 accepts connections, but that does not verify auth, SSL, or query execution works. A /db-health HTTP endpoint is a stronger signal and does not require exposing Postgres publicly.
How do I monitor replication lag?
Expose a second endpoint (/replica-health). For byte lag, compare pg_last_wal_replay_lsn() on the replica against pg_current_wal_lsn() on the primary using pg_wal_lsn_diff(). For time lag, compute now() - pg_last_xact_replay_timestamp() on the replica. Alert if lag exceeds your RPO.
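A sketch of the replica-side logic (query and function names are illustrative; run_query stands in for whatever driver call executes SQL on the replica). The CASE guard avoids false alarms on an idle primary, where the replay timestamp stops advancing even though the replica is fully caught up.

```python
# Seconds of apparent lag on the replica. When receive and replay LSNs
# match, the replica is caught up, so report 0 regardless of timestamp age.
REPLICA_LAG_QUERY = """
SELECT CASE
         WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
         ELSE extract(epoch FROM now() - pg_last_xact_replay_timestamp())
       END;
"""

def check_replica_lag(run_query, rpo_seconds: float = 30.0) -> dict:
    """run_query executes REPLICA_LAG_QUERY against the replica."""
    lag = float(run_query(REPLICA_LAG_QUERY))
    return {"status": "ok" if lag <= rpo_seconds else "lagging",
            "lag_seconds": lag}
```

Serve the result from /replica-health with a 503 on "lagging", mirroring the /db-health pattern above, so the same HTTP monitor handles both.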
Should I monitor RDS even though AWS does?
Yes. RDS monitors the database instance; it does not monitor the path from your app to the database (VPC, security groups, IAM auth for RDS, pooler). The failures you care about live on that path.
What interval is appropriate for a DB health check?
60 seconds. Shorter intervals create connection churn without catching issues any faster. Longer intervals let a connection pool stay exhausted for too long.

Related monitoring guides

Monitor Django
Monitor Django apps, Celery workers, and management commands
Get started

Start monitoring in the next 5 minutes.

Stop letting customers discover your outages first. Set up monitoring, status pages, on-call, and alerts before your next coffee break.

14-day free trial. No card required.