Why Webhooks Silently Fail (And How to Stop It)
Most webhook infrastructure has no retry, no alerting, and no recovery path. Here's what goes wrong and how to build a system that never drops an event.
Webhooks are the backbone of modern SaaS integration. When Stripe confirms a payment, when GitHub pushes a commit, when Clerk creates a user — all of it arrives at your server via a POST request. But most teams discover the hard way that webhooks fail far more often than expected, and most of those failures go unnoticed.
The silent failure problem
When a webhook delivery fails, the sending provider usually retries a handful of times over a few hours, then gives up. If your server was down for a deployment, rate-limited, or returned a 5xx due to a bug, those events are gone. No alert fires. No ticket is created. The failure is invisible until a customer reports a problem — or until you notice stale data in your database days later.
The most dangerous category is what we call "partial delivery windows": your server is up but a downstream dependency (a database write, a third-party API call) fails mid-handler. The provider sees a 200 and considers the event delivered. You never retried it.
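The fix for partial delivery windows is to separate capture from processing: persist the raw event first, acknowledge immediately, and do all downstream work from the stored copy. Here's a minimal sketch of that ordering in Python, using an in-memory SQLite table as a stand-in for durable storage (the `ingest` function and schema are illustrative, not a specific framework's API):

```python
import json
import sqlite3
import uuid

# Stand-in for durable storage; in production this would be a real
# database or queue with persistence.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE events (
    id      TEXT PRIMARY KEY,
    payload TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'pending')""")

def ingest(raw_body: bytes) -> int:
    """Durably capture an inbound webhook, then acknowledge with 202.

    Processing happens later, from the stored copy. A failed write
    here is the only case where the provider should see a 5xx --
    that way *it* retries, and nothing is silently lost mid-handler.
    """
    event_id = str(uuid.uuid4())
    try:
        db.execute("INSERT INTO events (id, payload) VALUES (?, ?)",
                   (event_id, raw_body.decode("utf-8")))
        db.commit()
    except sqlite3.Error:
        return 500  # nothing stored; let the provider retry
    return 202      # stored; downstream failures are now recoverable

status = ingest(json.dumps({"type": "payment.succeeded"}).encode())
```

The key property: by the time the provider sees a 2xx, the event is on disk, so a database outage or third-party failure during processing can be retried from your own copy.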
Common failure modes
1. Destination downtime
Deployments, database maintenance, and infrastructure restarts all create windows where your endpoint returns 5xx or times out. Providers retry aggressively at first but typically give up within 24–72 hours.
2. Handler exceptions after a 200
Your framework returns 200 OK before the handler finishes processing. An exception thrown after the response is sent is invisible to the provider — they consider the event successfully delivered.
3. Signature verification failures
A misconfigured secret, clock skew on signed timestamps, or middleware that consumes the raw request body before your webhook handler all cause 400s that look like "bad events" but are actually infrastructure bugs.
4. Rate limiting your own endpoints
When providers retry failed events, they often bunch retries together. If your endpoint has a rate limiter protecting it, the burst of retries causes further failures — a failure cascade.
5. Payload size surprises
Stripe's checkout.session.completed event can contain several kilobytes of nested data. If your endpoint enforces a body size limit that's too small, the event is rejected before your handler ever runs — and unless you watch the provider's delivery logs, the failure is silent.
What "webhook reliability" actually means
A reliable webhook system has four properties:
Durable capture: every inbound event is written to persistent storage before any processing attempt begins. Acknowledgement (202) is sent immediately, independent of whether delivery to your downstream system succeeds.
Retries with backoff: failed deliveries are retried automatically on a schedule that doesn't overwhelm the destination. Exponential backoff with jitter prevents thundering herd.
Dead-letter queue (DLQ): events that exhaust all retry attempts are held — not discarded — in a DLQ. You can inspect the failure reason and replay the event once the root cause is fixed.
Alerting and observability: when events enter the DLQ, you hear about it. Slack notifications, email alerts, and an operator dashboard let teams respond before customers notice.
Building vs buying
You can build this yourself. You'll need a queue (BullMQ, SQS, or similar), a database to track event state, retry logic with backoff, DLQ storage, and alerting. Most teams that build this end up with a bespoke system that's poorly monitored and hard to operate.
Alternatively, a webhook reliability platform handles all of this. Your providers send to a stable ingest URL. The platform captures, verifies, retries, and alerts. Your application only receives events that have been successfully forwarded — from a single, observable pipeline.
Charon Gate is built exactly for this: stable ingest URLs, durable capture, retries with jitter, DLQ with reasons, and Slack/email alerts — so that when a webhook fails, you know before your customers do.