Sending Chain Pub/Sub Alerts
subscription sendingchain.pubsub-.*-sub.(length|age)
The subscription has too many items; most likely the worker is stuck or too slow to keep up. Depending on the scenario, this needs either scaling out the workers or a more in-depth look for the root cause.
An age alert means that for at least the specified time some messages were not consumed. This can indicate that one of the workers is unable to process certain messages due to a syntax error, type error, or other bug. Ideally such errors should cause the message to be error queued, so this should be rare. The logs (LaaS) should provide more information.
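The error-queue behaviour described above can be sketched as follows. This is a minimal sketch, not the actual worker code: the handler shape, the injected publishToErrorTopic function, and the 'ValidationError' name are assumptions for illustration.

```javascript
// Sketch: route fatal (non-retryable) errors to an error topic instead of
// letting the message age out on the subscription. Transient errors are
// nacked so Pub/Sub redelivers them.
const FATAL_ERRORS = [SyntaxError, TypeError];

function isFatal(err) {
  // 'ValidationError' is a hypothetical name; add other known bug classes here.
  return FATAL_ERRORS.some((E) => err instanceof E) || err.name === 'ValidationError';
}

async function handleMessage(message, process, publishToErrorTopic) {
  try {
    await process(message);
    message.ack();
  } catch (err) {
    if (isFatal(err)) {
      // Fatal: error-queue the message, then ack the original.
      await publishToErrorTopic({ original: message.data, error: err.message });
      message.ack();
    } else {
      // Transient: nack so Pub/Sub redelivers it.
      message.nack();
    }
  }
}
```

With this pattern, a fatal error surfaces on the error subscription (where it trips the mg-errors alert below) instead of inflating the age of the main subscription.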
subscription sendingchain-mg-errors-sub.(length|age)
The message generator worker publishes a message to the error subscription if it cannot process it due to a fatal error (a TypeError, a validation error, or other bugs can cause this). Each message needs to be investigated individually. If you determine that the error is transient, you can retry the message by using the me-cli tool to automatically re-publish it to the original topic.
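Conceptually, the manual retry amounts to the following. This is only a sketch: the envelope field names (originalTopic, payload) are hypothetical, and me-cli's actual message format may differ.

```javascript
// Sketch: turn an error-queue message back into a publish request against
// the original topic. The envelope shape here is an assumption.
function buildRetry(errorMessage) {
  const envelope = JSON.parse(errorMessage);
  if (!envelope.originalTopic || !envelope.payload) {
    throw new Error('error-queue message is missing retry metadata');
  }
  return { topic: envelope.originalTopic, data: JSON.stringify(envelope.payload) };
}
```

me-cli automates this step; the sketch only shows why a retry is safe for transient errors: the original payload is preserved unchanged in the error-queue message.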
not-sent-for-token-status has items older than an hour
This error indicates an issue with the token-status-updater in the me-delivery service. If you search LaaS for "gap-me-delivery" entries with labels.k8s-pod/app = me-delivery-token-status-updater, you will very probably find an error like this: UnhandledPromiseRejectionWarning: Error: Failed to "modifyAckDeadline" for 125 message(s). Reason: 8 RESOURCE_EXHAUSTED: Bandwidth exhausted
A restart of the me-delivery-token-status-updater deployment (for example, kubectl rollout restart deployment/me-delivery-token-status-updater) usually resolves the problem.
You can verify that the restart resolved the problem by watching the subscription metric graphs.
ME Scheduler error detected
err: {
  "data": {
    "data": {
      "run_id": "20190729161000-9z7DhWpV-suite6web4-31861-s6",
      "status": "error"
    },
    "replyCode": 0,
    "replyText": "OK"
  },
  "message": "segment-run-failed",
  "stack": "Error: segment-run-failed
    at LaunchesQueueHandler._apiV2ContactListIdOrThrow (/app/server/lib/launches-queue-handler/index.js:130:21)
    at process._tickCallback (internal/process/next_tick.js:68:7)",
  "type": "Error"
}
This alert is triggered by a LaaS ElastAlert rule and means a segment run failed. The run_id can be used to search the logs for the reason. If no reason is found and only a single campaign is impacted, TCS should be notified of the failure on Slack with the campaign details. Failed segment runs should be retried up to 5 times.

If more segments are failing, an incident must be created (if there is none reported by the segmentation team already in #war-room) and handled accordingly. Since segmentation now uses the USS, which is maintained by the segmentation team, such issues are most likely going to need their involvement (#war-room, #team-segmentation; if there is no response on Slack, PagerDuty). We must support any investigation and contribute to the impact analysis. In any case it should be clear who is doing the investigation; an incident must not be abandoned on the assumption that the segmentation team will pick it up.
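The bounded retry mentioned above can be sketched like this. It is a sketch only: runSegment and notifyFailure are hypothetical stand-ins for the scheduler internals, not the actual API.

```javascript
// Sketch: re-run a failed segment run at most MAX_RETRIES times before
// giving up and alerting, as the runbook describes.
const MAX_RETRIES = 5;

async function runWithRetries(runId, runSegment, notifyFailure) {
  let lastError;
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    try {
      return await runSegment(runId);
    } catch (err) {
      lastError = err;
    }
  }
  // All attempts exhausted: surface the failure (Slack/alerting) and rethrow.
  await notifyFailure(runId, lastError);
  throw lastError;
}
```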
To help the investigation, we can address the following questions:
- Are all segments affected, or only specific ones?
- What is the timeframe of the failures?
Failed scheduler jobs that are not retryable are marked as completed on failure (due to missing support for failing jobs in bull v2) and are removed after completion, so the only place to get the job data is the logs.
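The completed-on-failure workaround can be sketched as a processor wrapper. This is a hedged sketch of the pattern, not the service's code: wrapProcessor and the logger interface are hypothetical.

```javascript
// Sketch: since failing jobs are not supported here in bull v2, the
// processor logs the full job data on error and resolves anyway, so bull
// marks the job completed (and later removes it). The log entry is then
// the only surviving record of the job.
function wrapProcessor(processJob, logger) {
  return async (job) => {
    try {
      await processJob(job);
    } catch (err) {
      // Not retryable: log everything we will ever know about this job,
      // then resolve so the job is marked completed instead of stalled.
      logger.error({ jobData: job.data, err: err.message }, 'job failed, marking completed');
    }
  };
}
```

This is why the runbook points you at the logs: once the job completes and is removed from the queue, the logged jobData is the only place to recover what the job was.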