Serverless

Monitor AWS Lambda

Monitor Lambda functions fronted by API Gateway, CloudFront, or Function URLs

Monitor types: HTTP, SSL, Keyword, Heartbeat. Interval: 30 seconds

Why monitor AWS Lambda

Serverless shifts the failure surface. There's no server to SSH into and check, but there are still cold starts, concurrency limits, IAM drift, account-level quotas, integration timeouts, and the gateway layer between your users and your code. CloudWatch alarms fire on metrics Lambda emits; they don't fire on metrics Lambda can't emit because Lambda never ran. An API Gateway 5xx that never invokes the function shows zero Lambda errors. External uptime monitoring hits the same URL users hit. It sees cold starts, throttling, gateway errors, and DNS problems uniformly. The 30-second interval has a second benefit: it keeps the function warm, so real users don't pay for the cold start you just measured.

Setting up AWS Lambda monitoring

  1. Deploy the handler below as a dedicated /health function. Keep it separate from business logic so the health check survives main-app deploys.
  2. Expose it via a Function URL or API Gateway. Function URLs are cheaper and skip API Gateway's 29-second integration timeout.
  3. Configure provisioned concurrency (1 to 2) if cold starts matter for your use case. Otherwise expect sub-second cold starts for a simple handler, longer if initialization is heavy (large deps, Java/.NET, many security groups).
  4. Add https://[url]/health in Hyperping at 30-second intervals. This also keeps the function warm for free.
  5. Set the alert threshold to 2 consecutive failures. Single failures are often throttling blips, not outages.
  6. For background Lambda jobs (SQS, scheduled), use heartbeat monitors: the handler POSTs to a Hyperping heartbeat URL at the end of each successful run.

Example health endpoint

handler.mjs
export const handler = async () => {
  const start = Date.now();

  // Read the cold-start flag inside the handler, then mark the sandbox warm.
  // Setting it at module scope would run during init, so every invocation
  // (including the first) would report cold: false.
  const cold = !global.__warm;
  global.__warm = true;

  try {
    // verify downstream dependencies
    await Promise.all([checkDynamo(), checkSecretsManager()]);
  } catch (err) {
    return {
      statusCode: 503,
      body: JSON.stringify({ status: 'degraded', error: err.message }),
    };
  }

  return {
    statusCode: 200,
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      status: 'ok',
      cold,
      latency: Date.now() - start,
    }),
  };
};
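A slow downstream dependency can make the health check itself ride out the gateway's 29-second timeout. One way to fail fast is to bound each check with a timeout wrapper; this is a sketch, and `withTimeout` and the 2000 ms budget are illustrative, not part of the handler above:

```javascript
// Reject if a dependency check takes longer than `ms`, so the health
// endpoint returns 503 quickly instead of hitting the gateway timeout.
const withTimeout = (promise, ms, label = 'dependency') => {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} check timed out after ${ms}ms`)),
      ms,
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
};

// Usage inside the handler's try block:
// await Promise.all([
//   withTimeout(checkDynamo(), 2000, 'dynamo'),
//   withTimeout(checkSecretsManager(), 2000, 'secrets'),
// ]);
```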

What typically goes wrong with AWS Lambda

Reserved concurrency exhausted
A traffic spike exceeds the function's reserved concurrency. Excess invocations throttle and return 429. Only external monitoring hitting real traffic patterns catches this in real time. CloudWatch alarms on throttling fire after the fact.
VPC cold-start outliers
Since AWS Hyperplane (2019-2020) VPC cold starts are usually sub-second. But heavy init (large deps, Java/.NET, many security groups) still spikes p99 into the multi-second range, which eats into API Gateway's default 29s timeout. Monitor p99 latency, not just 2xx/5xx.
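A low-tech way to see why p99 matters: compute it from latency samples your monitor already collects. This is an illustrative sketch (the nearest-rank percentile helper and the sample data are made up), showing how a handful of cold-start outliers hide in the average but dominate the p99:

```javascript
// Nearest-rank percentile over an array of latency samples (ms).
const percentile = (samples, p) => {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
};

// Mostly-warm invocations plus one heavy cold start:
const latencies = [80, 95, 90, 85, 110, 100, 88, 92, 4200, 87];

// Every request here returned 2xx, yet the tail is multi-second.
console.log(`p50: ${percentile(latencies, 50)}ms, p99: ${percentile(latencies, 99)}ms`);
```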
IAM role changes breaking production silently
Removing an unused-looking permission (like secretsmanager:GetSecretValue) breaks the first invocation that exercises the removed permission. The function returns 500, CloudWatch logs it, but no one notices until a user complains.
API Gateway 5xx without Lambda 5xx
Integration timeouts, WAF rejections, and misconfigured authorizers return 5xx at the gateway without ever invoking Lambda. CloudWatch Lambda metrics show zero errors while your API is down.
Cost-scaling caps triggered
Account-level concurrency limits or budget-enforced throttles cap invocations during a traffic spike. Looks identical to an outage from the outside.

Frequently asked questions

Does polling my Lambda every 30s cost a lot?
2,880 invocations per day, about 86K per month. At a 100ms duration that is negligible (Lambda free tier alone covers it for most accounts). The invocation cost is usually lower than the SLA cost of an undetected outage.
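The arithmetic above, worked out (assuming a 100 ms, 128 MB handler; the free-tier allowances cited in the comment are AWS's published 1M requests and 400,000 GB-seconds per month):

```javascript
const perDay = (24 * 60 * 60) / 30;              // one probe every 30s -> 2,880/day
const perMonth = perDay * 30;                    // ~86,400 invocations/month
const gbSeconds = perMonth * 0.1 * (128 / 1024); // 100ms at 128MB -> 1,080 GB-s/month

// Both figures sit far inside the Lambda free tier
// (1M requests, 400,000 GB-seconds per month).
console.log({ perDay, perMonth, gbSeconds });
```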
Should I use CloudWatch Synthetics instead?
CloudWatch Synthetics works but is AWS-internal. If the issue is an AWS regional problem, the monitor may fail for the same reason the service fails, and you never get alerted. External monitoring gives you an independent failure domain.
How do I monitor a Lambda triggered by SQS or EventBridge?
Use a heartbeat monitor. The handler ends with a POST to a Hyperping heartbeat URL. If the function stops being invoked (queue drained unexpectedly, schedule paused), the heartbeat is missed.
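A minimal sketch of that pattern: wrap the background handler so the heartbeat POST fires only after a successful run. The `withHeartbeat` name and `HEARTBEAT_URL` variable are illustrative; substitute the URL Hyperping generates for your heartbeat monitor:

```javascript
// Hypothetical env var holding the Hyperping heartbeat URL for this job.
const HEARTBEAT_URL = process.env.HEARTBEAT_URL;

const pingHeartbeat = () => fetch(HEARTBEAT_URL, { method: 'POST' });

// The ping runs only after the handler resolves, so a thrown error or a
// job that never fires both show up as a missed heartbeat.
const withHeartbeat = (handler, ping = pingHeartbeat) =>
  async (event, context) => {
    const result = await handler(event, context);
    await ping();
    return result;
  };

// Usage:
// export const handler = withHeartbeat(async (event) => {
//   // ...process the SQS batch or scheduled task...
// });
```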
What alert threshold should I use?
Two consecutive failures with a 60 to 90 second confirmation window. Single failures are frequently transient (throttling, warm-up, network). Two failures inside two minutes is real.
Get started

Start monitoring in the next 5 minutes.

Stop letting customers discover your outages first. Set up monitoring, status pages, on-call, and alerts before your next coffee break.

14-day free trial. No card required.