Error Handling

Requirements

Install the me-cli tool and configure it.

Subscriptions

Subscription Length

`sending-mg-errors-sub`

If the msg-generator reports errors, you can dump them from this subscription. The errName attribute of each message identifies the error type. For a RedisError, remove the Redis key named in the redisKey attribute from the corresponding Redis instance (either me-msg-generator or me-delivery).

$  me-cli pubsub subscribe dump -a -s sending-mg-errors-sub

The previous command dumps all the messages to a file.
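
To get an overview of a dump before deciding what to do, the error names can be tallied with a short script. This is only a sketch and not part of me-cli: it assumes the dump file is a JSON array of message objects, each with an `attributes` object that holds `errName` and, for Redis errors, `redisKey`. The names `summarize_dump` and `load_dump` and that file layout are assumptions; adjust them to the actual dump format.

```python
import json
import sys
from collections import Counter


def summarize_dump(messages):
    """Count errors by errName and collect the redisKey of every RedisError.

    ASSUMPTION: each message is a dict with an "attributes" dict. The real
    layout of the file written by me-cli may differ.
    """
    counts = Counter()
    redis_keys = []
    for message in messages:
        attributes = message.get("attributes", {})
        err_name = attributes.get("errName", "<missing errName>")
        counts[err_name] += 1
        if err_name == "RedisError" and "redisKey" in attributes:
            redis_keys.append(attributes["redisKey"])
    return counts, redis_keys


def load_dump(path):
    """Load a dump file, ASSUMING it contains a JSON array of messages."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__" and len(sys.argv) > 1:
    counts, redis_keys = summarize_dump(load_dump(sys.argv[1]))
    for err_name, count in counts.most_common():
        print(f"{count:6d}  {err_name}")
    for key in redis_keys:
        print(f"redis key to remove: {key}")
```

Run it with the path of the dumped file as the only argument. If every entry is a transient error, the dump is a candidate for re-publishing; any listed Redis keys still have to be removed by hand.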

If all the errors are transient, or if you are in doubt, re-publish the messages to the right topic:

$  me-cli pubsub publish -a -f <FILE>

<FILE> is the path to the JSON file. If no path is given, you are asked for it.

The above command uses the originTopic attribute to publish each message to the right topic.

`sending-tx-devicemapper-*` & `sending-batch-devicemapper-*`

In case of an error on a devicemapper topic where the oldest unacked message age exceeds the threshold, check whether the messages are stuck in the queue or are just being processed slowly due to heavy workload. The error message should look something like:
- oldest unacked message age for ems-mobile-engage sending-tx-devicemapper-default-sub is above the threshold of 600.000 with a value of 1526.000
After making sure the messages are stuck, take the following steps to get the queue unstuck:
- Disable the aggregator workers for whichever generator is down by scaling them down to 0 (e.g. me-msg-generator-aggregator-default).
- Wait until the messages are consumed, i.e. until the message count in the devicemapper subscription is low.
- Check how many workers/deployments are available and note the number.
- Disable the worker for the me-msg-generator-tx-devicemapper-* deployment (whichever is used) by scaling it down to 0.
- Dump the messages with me-cli: me-cli pubsub subscribe dump -a -s <subscription name> (e.g. sending-tx-devicemapper-default-sub).
- Restart the sending and aggregating workers you previously disabled, using the same number of deployments as before.
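
The steps above can be written out as an ordered command plan, which is easier to review before anything is scaled down. This is a sketch under a loud assumption: it treats the workers as Kubernetes deployments that are scaled with `kubectl scale`, which this runbook does not state. The function `build_unstick_plan` and the deployment name `me-msg-generator-tx-devicemapper-default` in the usage example are hypothetical. The function only builds strings and runs nothing.

```python
def build_unstick_plan(aggregator, devicemapper, subscription,
                       aggregator_replicas, devicemapper_replicas):
    """Return the ordered commands for getting a devicemapper queue unstuck.

    ASSUMPTION: the workers are Kubernetes deployments scaled via kubectl.
    The replica counts must be recorded before anything is scaled down,
    so that the last two commands restore the original state.
    """
    return [
        f"kubectl scale deployment {aggregator} --replicas=0",
        "# wait until the message count in the devicemapper subscription is low",
        f"kubectl scale deployment {devicemapper} --replicas=0",
        f"me-cli pubsub subscribe dump -a -s {subscription}",
        f"kubectl scale deployment {devicemapper} --replicas={devicemapper_replicas}",
        f"kubectl scale deployment {aggregator} --replicas={aggregator_replicas}",
    ]


if __name__ == "__main__":
    plan = build_unstick_plan(
        aggregator="me-msg-generator-aggregator-default",
        devicemapper="me-msg-generator-tx-devicemapper-default",  # hypothetical
        subscription="sending-tx-devicemapper-default-sub",
        aggregator_replicas=2,    # example value, use the real count
        devicemapper_replicas=2,  # example value, use the real count
    )
    print("\n".join(plan))
```

Printing the plan instead of executing it keeps a human in the loop for the waiting step, which depends on watching the subscription's message count.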
If you want to dive deeper into what exactly happened, further investigation is needed. Things that can help:
- Look into the Google Cloud Pub/Sub metrics.
- Republish the dumped messages slowly, one by one or in batches of 10, to see whether any of them get stuck. You can freely use tx-devicemapper-dg03 for this, as it is the worker for QA and won’t affect our other customers.
- Look for clues, i.e. things the messages have in common (contactIds, campaignId, etc.).
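
Two of the suggestions above, republishing in small batches and looking for shared values, can be prepared offline from the dump. The sketch below assumes the dump is a JSON array of message objects and that fields such as `campaignId` or `contactIds` sit in each message's `attributes` object; both are assumptions about the file layout, and the helper names are hypothetical.

```python
import json
from collections import Counter


def split_into_batches(messages, batch_size=10):
    """Split messages into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [messages[i:i + batch_size]
            for i in range(0, len(messages), batch_size)]


def write_batches(messages, prefix, batch_size=10):
    """Write each batch to its own JSON file and return the file paths."""
    paths = []
    for index, batch in enumerate(split_into_batches(messages, batch_size)):
        path = f"{prefix}-{index:04d}.json"
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(batch, handle)
        paths.append(path)
    return paths


def common_values(messages, field):
    """Count how often each value of a field occurs across messages.

    ASSUMPTION: the field lives in the "attributes" dict and holds either
    a scalar or a list of scalars (e.g. a list of contact ids).
    """
    counts = Counter()
    for message in messages:
        value = message.get("attributes", {}).get(field)
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return counts
```

Each batch file could then be passed to `me-cli pubsub publish -a -f <FILE>`, provided the publish command accepts the same layout as the dump. A value that dominates `common_values(messages, "campaignId").most_common(5)` is a natural place to start looking.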