Error Handling

All pipeline components share a common error handling module located in src/lib/error-handling/.

Overview

When an error occurs during message processing, the error handling policy classifies it and takes one of three actions:

  1. Retry — delay the message via me-pubsub-delayer for later redelivery

  2. Escalate — publish to the sending-mg-errors topic for investigation

  3. Drop — log and acknowledge silently (for known non-actionable errors)

For a visual flow, see the error handling diagram.

Retry with Exponential Backoff

Transient errors trigger a retry via the Pub/Sub Delay Service (me-pubsub-delayer). The message is republished with an incremented retry count and a delay taken from the following schedule:

  Retry #   Delay
  1         1 second
  2         1 second
  3         2 seconds
  4         3 seconds
  5         7 seconds
  6         30 seconds

Configuration: RETRY_MESSAGE_DELAYS_SECONDS (default: [1, 1, 2, 3, 7, 30])
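
The schedule lookup can be sketched as follows. This is a minimal illustration, not the module's actual code: parseRetryDelays and delayForRetry are hypothetical names, and the assumption that RETRY_MESSAGE_DELAYS_SECONDS is encoded as a JSON array is inferred only from the way the default is written above.

```typescript
// Default schedule, copied from the documentation above (seconds).
const DEFAULT_RETRY_DELAYS_SECONDS: readonly number[] = [1, 1, 2, 3, 7, 30];

// Hypothetical helper: reads a JSON array such as "[1, 1, 2, 3, 7, 30]".
// The real encoding of RETRY_MESSAGE_DELAYS_SECONDS is an assumption.
function parseRetryDelays(raw: string | undefined): readonly number[] {
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_RETRY_DELAYS_SECONDS;
  }
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed) || parsed.length === 0) {
    throw new Error(`Invalid retry delay schedule: ${raw}`);
  }
  const delays: number[] = [];
  for (const value of parsed) {
    if (typeof value !== "number" || !Number.isFinite(value) || value < 0) {
      throw new Error(`Invalid retry delay: ${String(value)}`);
    }
    delays.push(value);
  }
  return delays;
}

// Returns the delay for a 1-based retry number, or undefined once the
// schedule is exhausted.
function delayForRetry(
  retryNumber: number,
  delays: readonly number[] = DEFAULT_RETRY_DELAYS_SECONDS,
): number | undefined {
  if (!Number.isInteger(retryNumber) || retryNumber < 1) {
    throw new RangeError(`retryNumber must be a positive integer, got ${retryNumber}`);
  }
  return delays[retryNumber - 1];
}
```

With the default schedule, delayForRetry(5) yields 7 and delayForRetry(7) yields undefined. The six default delays add up to 44 seconds of waiting, not counting processing time.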

Service-Retryable Errors (Infinite Retry)

For infrastructure-level errors (HTTP 423, 429, 500, 502, 503, 504), the retry count is reset to 0 once the delay schedule is exhausted, so the schedule starts over. These errors are therefore never dropped for exceeding the retry count, because they represent transient infrastructure problems that are expected to resolve. The age-based limit described under Message Expiry still applies.
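
A minimal sketch of the reset behaviour, under stated assumptions: planRetry and RetryPlan are hypothetical names, and the sketch assumes that after the reset to 0 the count is incremented as usual, so the next delivery carries retry count 1 and the first delay in the schedule. The actual module may order these steps differently.

```typescript
// Status codes listed above as service-retryable.
const SERVICE_RETRYABLE_STATUS_CODES: ReadonlySet<number> = new Set([
  423, 429, 500, 502, 503, 504,
]);

// Hypothetical result type: the retry count to republish with, and the delay.
interface RetryPlan {
  retryCount: number;
  delaySeconds: number;
}

// Returns the next retry plan, or undefined when the schedule is exhausted
// and the error is not service-retryable.
function planRetry(
  currentRetryCount: number,
  httpStatus: number | undefined,
  delays: readonly number[],
): RetryPlan | undefined {
  let nextRetryCount = currentRetryCount + 1;
  if (nextRetryCount > delays.length) {
    const serviceRetryable =
      httpStatus !== undefined && SERVICE_RETRYABLE_STATUS_CODES.has(httpStatus);
    if (!serviceRetryable) {
      return undefined;
    }
    // Assumption: reset to 0, then increment as for any other retry.
    nextRetryCount = 1;
  }
  const delaySeconds = delays[nextRetryCount - 1];
  if (delaySeconds === undefined) {
    return undefined; // empty schedule: nothing to retry with
  }
  return { retryCount: nextRetryCount, delaySeconds };
}
```

For example, with the default schedule a 503 on a message whose retry count is already 6 produces a plan with retry count 1 and a 1 second delay, while an error without a service-retryable status at the same count produces no plan.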

Message Expiry

Messages are considered expired when:

  • The retry count exceeds the number of configured delays and the error is not service-retryable, OR

  • The time since the original eventTime exceeds 36 hours (configurable via EXPIRED_MESSAGE_AFTER_HOURS)

Expired messages are logged and dropped. If the error being handled when the message expired is neither a permanent error nor an internal normal behaviour error, it is also published to sending-mg-errors.
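
The two expiry conditions and the escalation rule can be sketched as below. All names here (ExpiryInput, ExpiryConfig, isExpired, shouldEscalateOnExpiry, and the category strings) are hypothetical; only the conditions themselves come from the text above.

```typescript
type ErrorCategory = "retriable" | "permanent" | "internalNormal" | "unknown";

// Hypothetical shape of the data needed to decide expiry.
interface ExpiryInput {
  retryCount: number;
  eventTime: Date; // original event time of the message
  serviceRetryable: boolean; // true for HTTP 423, 429, 500, 502, 503, 504
}

interface ExpiryConfig {
  delays: readonly number[]; // RETRY_MESSAGE_DELAYS_SECONDS
  expiredAfterHours: number; // EXPIRED_MESSAGE_AFTER_HOURS, 36 by default
}

const MS_PER_HOUR = 60 * 60 * 1000;

// A message is expired when either condition from the list above holds.
function isExpired(input: ExpiryInput, config: ExpiryConfig, now: Date = new Date()): boolean {
  const retriesExhausted =
    input.retryCount > config.delays.length && !input.serviceRetryable;
  const ageMs = now.getTime() - input.eventTime.getTime();
  const tooOld = ageMs > config.expiredAfterHours * MS_PER_HOUR;
  return retriesExhausted || tooOld;
}

// Expired messages are always logged and dropped; they are additionally
// published to the error topic unless the error is permanent or internal normal.
function shouldEscalateOnExpiry(category: ErrorCategory): boolean {
  return category !== "permanent" && category !== "internalNormal";
}
```

Passing now explicitly keeps the check deterministic in tests. Note that in this sketch the age limit applies to service-retryable errors as well, which follows from the second condition being independent of the first.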

Error Classification

Retriable Errors

Errors that trigger the retry policy:

  • HTTP status codes: 423 (Locked), 429 (Too Many Requests), 500 (Internal Server Error), 502 (Bad Gateway), 503 (Service Unavailable), 504 (Gateway Timeout)

  • TopicPublishError — Pub/Sub publish failures

  • ClientClosedError — gRPC client closed

Permanent Errors

Errors that are logged and dropped silently (no retry, no escalation):

  • HTTP status codes: 404 (Not Found), 410 (Gone)

Internal Normal Behaviour Errors

Known non-actionable errors that are dropped silently:

  • EmptyContactListError — contact list is empty

  • MessageExpiredError — message exceeded its TTL

  • CampaignNotFoundErrors — campaign no longer exists

  • CampaignAbortedErrors — campaign was aborted

Unknown Errors

Any error not classified above is published to the sending-mg-errors topic for manual investigation.
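
The classification above maps onto the three actions from the Overview roughly as in the sketch below. It assumes, purely for illustration, that errors expose a name string and an optional httpStatus number; ClassifiableError, classifyError, and actionFor are hypothetical names, and the real module may identify errors by class rather than by name.

```typescript
type ErrorCategory = "retriable" | "permanent" | "internalNormal" | "unknown";
type ErrorAction = "retry" | "drop" | "escalate";

// Hypothetical error shape used only by this sketch.
interface ClassifiableError {
  name: string;
  httpStatus?: number;
}

const RETRIABLE_STATUS_CODES: ReadonlySet<number> = new Set([423, 429, 500, 502, 503, 504]);
const PERMANENT_STATUS_CODES: ReadonlySet<number> = new Set([404, 410]);
const RETRIABLE_ERROR_NAMES: ReadonlySet<string> = new Set([
  "TopicPublishError",
  "ClientClosedError",
]);
const INTERNAL_NORMAL_ERROR_NAMES: ReadonlySet<string> = new Set([
  "EmptyContactListError",
  "MessageExpiredError",
  "CampaignNotFoundErrors",
  "CampaignAbortedErrors",
]);

function classifyError(err: ClassifiableError): ErrorCategory {
  if (err.httpStatus !== undefined) {
    if (RETRIABLE_STATUS_CODES.has(err.httpStatus)) return "retriable";
    if (PERMANENT_STATUS_CODES.has(err.httpStatus)) return "permanent";
  }
  if (RETRIABLE_ERROR_NAMES.has(err.name)) return "retriable";
  if (INTERNAL_NORMAL_ERROR_NAMES.has(err.name)) return "internalNormal";
  return "unknown"; // anything unclassified is escalated
}

function actionFor(category: ErrorCategory): ErrorAction {
  switch (category) {
    case "retriable":
      return "retry";
    case "permanent":
    case "internalNormal":
      return "drop";
    case "unknown":
      return "escalate";
  }
}
```

Keeping classification separate from the action lookup means a new error type only needs to be added to one of the sets. The sketch does not cover expiry, which can turn a retry into a drop or an escalation.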

Error Topic

All escalated errors are published to: sending-mg-errors (configurable via ERROR_TOPIC)