
Dead-Letter Queues for Webhooks: What They Are and Why You Need One

A dead-letter queue captures webhook events that have exhausted all retries. Here's how the pattern works, what to look for in a DLQ system, and how to recover from failures.

If you've worked with message queues, you've encountered the dead-letter queue (DLQ) pattern. A DLQ holds messages that couldn't be processed successfully — giving operators time to investigate and reprocess them safely. The same pattern applies to webhooks, but most teams don't realise they need it until after their first silent data loss incident.

What is a webhook DLQ?

When a webhook provider delivers an event to your endpoint and the delivery fails (timeout, 5xx response, connection refused), the provider typically retries on a schedule. The exact policy varies by provider, but after a fixed number of retries, usually 3–10 spread over hours or days, the provider gives up.
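To make "retries on a schedule" concrete, here is a minimal sketch of one common schedule shape: exponential backoff with "full jitter", where each delay is drawn at random between zero and an exponentially growing ceiling. The function name `backoff_delays` and the `base` and `cap` defaults are assumptions for illustration; real providers publish their own schedules, and many use fixed intervals instead.

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 3600.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one delay in seconds per retry, using exponential backoff with full jitter.

    base and cap are illustrative defaults, not any provider's real values.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # The ceiling doubles on every attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random delay anywhere between 0 and the ceiling.
        delays.append(rng.uniform(0, ceiling))
    return delays


schedule = backoff_delays(max_retries=5, base=2.0, cap=600.0)
```

The jitter spreads retries out so that many failed deliveries do not all hit a recovering endpoint at the same moment.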

Without a DLQ, that event is permanently lost. With a DLQ, events that exhaust all retry attempts are captured in a holding area rather than discarded. You can inspect them, understand why they failed, and replay them once the underlying issue is resolved.

Why retries alone aren't enough

Retries handle transient failures — a brief deployment window, a temporary database blip. But some failures are persistent:

  • Your endpoint has a bug that causes it to crash on a specific event shape
  • Your database migration broke a foreign key constraint your handler relies on
  • A third-party API your handler calls has been down for six hours

In these cases, the event keeps failing on every retry. Without a DLQ, the event eventually falls off the retry schedule and is lost. With a DLQ, it's held safely while you fix the root cause — then replayed.
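The difference between "lost" and "held" comes down to what happens after the last retry. The following is a minimal sketch under stated assumptions: `DeadLetterQueue` is a hypothetical in-memory stand-in for a durable store, `send` is whatever function delivers the event, and the waits between attempts are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DeadLetterQueue:
    """Hypothetical in-memory stand-in for a durable DLQ store."""
    entries: List[dict] = field(default_factory=list)

    def add(self, event: dict, reason: str, attempts: int) -> None:
        self.entries.append({"event": event, "reason": reason, "attempts": attempts})


def deliver_with_dlq(event: dict, send: Callable[[dict], None],
                     dlq: DeadLetterQueue, max_attempts: int = 5) -> bool:
    """Try send up to max_attempts times; dead-letter the event if every attempt fails."""
    last_error = "unknown"
    for _ in range(max_attempts):
        try:
            send(event)
            return True
        except Exception as exc:  # a real system would catch narrower error types
            last_error = f"{type(exc).__name__}: {exc}"
            # A real sender would wait here before the next attempt.
    # Retries are exhausted: keep the event instead of discarding it.
    dlq.add(event, reason=last_error, attempts=max_attempts)
    return False
```

A persistent failure, such as a handler bug triggered by one event shape, ends up as an entry in `dlq.entries` with its last error attached, ready to be replayed once the fix is deployed.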

What a good webhook DLQ looks like

1. Failure reason taxonomy

Not all failures are equal. A connection_refused is different from an http_4xx_client_error, which is different from a signature_verification_failed. A good DLQ tags each event with a structured failure reason so you can triage at a glance — and write automation rules on specific failure classes.
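As a rough illustration, a classifier can map a delivery outcome to one of these reason strings. This sketch assumes the delivery code surfaces either an HTTP status or a Python exception; the function name `classify_failure` and the `unknown_error` fallback are hypothetical. Signature failures are not covered here because they are detected at verification time, before delivery.

```python
import socket
import ssl
from typing import Optional


def classify_failure(status: Optional[int] = None,
                     error: Optional[BaseException] = None) -> str:
    """Map a failed delivery (HTTP status or exception) to a structured reason."""
    if error is not None:
        # Check the specific exception types before falling back.
        if isinstance(error, ssl.SSLError):
            return "tls_error"
        if isinstance(error, ConnectionRefusedError):
            return "connection_refused"
        if isinstance(error, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unknown_error"  # hypothetical catch-all, not from the article
    if status is not None:
        if 500 <= status <= 599:
            return "http_5xx_server_error"
        if 400 <= status <= 499:
            return "http_4xx_client_error"
    return "unknown_error"
```

With reasons stored as plain strings, an automation rule can be as simple as "auto-replay `connection_refused`, but page a human for `http_4xx_client_error`".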

2. Replay with original payload

Replaying an event should re-run the original payload through the full pipeline: re-verify the signature, re-forward to the destination, and track the result as a new delivery attempt. It should not modify the event, bypass verification, or silently swallow replay failures.
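Here is a sketch of what that replay contract might look like. It assumes the DLQ stored the raw request body, the headers, and a hex-encoded HMAC-SHA256 signature computed over the body alone; real provider schemes differ (many also sign a timestamp), and the names `replay`, `verify_signature`, and `forward` are hypothetical.

```python
import hashlib
import hmac
from typing import Callable


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Assumed scheme: hex HMAC-SHA256 over the raw body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


def replay(entry: dict, secret: bytes,
           forward: Callable[[bytes, dict], int]) -> dict:
    """Replay a stored event unchanged and record the outcome as a new attempt."""
    body = entry["body"]  # original raw bytes, never re-serialized or edited
    if not verify_signature(secret, body, entry["signature"]):
        result = {"ok": False, "reason": "signature_verification_failed"}
    else:
        try:
            status = forward(body, entry["headers"])  # returns the HTTP status
            result = {"ok": 200 <= status < 300, "status": status}
        except Exception as exc:
            result = {"ok": False, "reason": type(exc).__name__}
    # Every replay, successful or not, is appended to the attempt history.
    entry.setdefault("attempts", []).append(result)
    return result
```

Working from the stored bytes rather than a re-serialized object matters because any change in key order or whitespace would invalidate the signature.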

3. Alerting on DLQ entry

A DLQ that only works when someone remembers to check it isn't reliable. Your DLQ should emit alerts (Slack notifications, email digests, or webhook calls to PagerDuty) when events enter it. Configurable thresholds (alert after 1 failure vs. alert after 10) prevent noise while still catching real incidents.
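One way to implement a configurable threshold is a sliding window: alert once a given number of events have entered the DLQ within a given time span. This is a minimal sketch; the class name `DlqAlerter` and the `notify` callback (which would post to Slack, email, or PagerDuty in practice) are assumptions.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class DlqAlerter:
    """Call notify once threshold DLQ entries land within window_seconds."""

    def __init__(self, threshold: int, window_seconds: float,
                 notify: Callable[[str], None]) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.notify = notify
        self._timestamps: Deque[float] = deque()

    def record_entry(self, now: Optional[float] = None) -> bool:
        """Record one event entering the DLQ; return True if an alert fired."""
        now = time.time() if now is None else now
        self._timestamps.append(now)
        # Drop entries that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.threshold:
            self.notify(f"{len(self._timestamps)} events entered the DLQ "
                        f"within {self.window_seconds:g}s")
            self._timestamps.clear()  # reset so one burst yields one alert
            return True
        return False
```

Setting `threshold=1` gives "alert on the first failure"; a higher value trades detection speed for less noise.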

4. Batch replay

When a destination recovers after a 4-hour outage, you may have hundreds of events in the DLQ. A batch replay capability — filtered by time window, endpoint, or failure reason — lets you recover without re-triggering each event individually.
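The filtering half of batch replay can be sketched as a plain selection function. The entry fields used here (`failed_at`, `endpoint`, `reason`) and the function name `select_for_replay` are assumptions about how a DLQ might store its records, and replaying oldest first is a design choice, not a requirement.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional


def select_for_replay(entries: Iterable[dict], *,
                      since: Optional[datetime] = None,
                      until: Optional[datetime] = None,
                      endpoint: Optional[str] = None,
                      reason: Optional[str] = None) -> List[dict]:
    """Return DLQ entries matching every supplied filter, oldest first."""
    selected = []
    for entry in entries:
        if since is not None and entry["failed_at"] < since:
            continue
        if until is not None and entry["failed_at"] > until:
            continue
        if endpoint is not None and entry["endpoint"] != endpoint:
            continue
        if reason is not None and entry["reason"] != reason:
            continue
        selected.append(entry)
    return sorted(selected, key=lambda e: e["failed_at"])


# Example: everything that failed against one endpoint during an outage window.
outage_start = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
outage_end = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
dlq_entries = [
    {"id": "evt_1", "endpoint": "/orders", "reason": "connection_refused",
     "failed_at": datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)},
    {"id": "evt_2", "endpoint": "/orders", "reason": "http_4xx_client_error",
     "failed_at": datetime(2024, 1, 1, 7, 0, tzinfo=timezone.utc)},
]
batch = select_for_replay(dlq_entries, since=outage_start, until=outage_end,
                          endpoint="/orders")
```

Replaying in original order helps when later events depend on earlier ones, for example an update that assumes the matching create was already processed.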

5. Idempotency support

Replay means sending an event that your destination may have already received, for example when the handler processed the original delivery but timed out before returning a success response. Your DLQ should propagate idempotency keys and provider event IDs so that your destination handler can safely deduplicate.
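On the destination side, deduplication can be as simple as remembering which event IDs have already been processed. This is a minimal sketch: `IdempotentHandler` is a hypothetical name, and the in-memory set stands in for a durable store (in practice, often a database table with a unique constraint on the event ID).

```python
from typing import Callable, Set


class IdempotentHandler:
    """Process each event ID at most once, even if it is delivered again."""

    def __init__(self, process: Callable[[dict], None]) -> None:
        self._process = process
        self._seen: Set[str] = set()  # stand-in for a durable store

    def handle(self, event_id: str, payload: dict) -> str:
        if event_id in self._seen:
            return "duplicate"  # acknowledge without reprocessing
        self._process(payload)
        # Mark as seen only after processing succeeds, so a failed
        # attempt can still be retried or replayed later.
        self._seen.add(event_id)
        return "processed"
```

In a real handler, the processing step and the "mark as seen" step would ideally commit in the same transaction, so a crash between them cannot cause a double-apply.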

Common DLQ failure reasons and what they mean

| Reason | What happened | What to check |
| --- | --- | --- |
| connection_refused | Destination was unreachable | Is the destination up? Are firewall rules correct? |
| http_5xx_server_error | Destination returned 5xx | Check destination logs for exceptions |
| http_4xx_client_error | Destination returned 4xx | Check payload shape, auth headers |
| timeout | Delivery timed out | Handler taking too long; check for blocking I/O |
| signature_verification_failed | Signature check rejected the event | Verify signing secret is correct on both sides |
| tls_error | TLS handshake failed | Check certificate validity, cipher compatibility |

Getting started with Charon Gate's DLQ

Charon Gate captures every inbound webhook, retries failed deliveries with exponential backoff and jitter, and routes exhausted events to a DLQ automatically. The DLQ view in the dashboard shows:

  • Failure reason, attempt count, and last error body for each event
  • Grouping by endpoint and failure class for faster triage
  • One-click or bulk replay with delivery tracking
  • Slack or email alerts when events enter the DLQ

You don't build or operate the queue infrastructure — you get a stable ingest URL, connect your providers, and focus on your application logic.
