Error Handling
Requirements
Install the me-cli tool and configure it.
Subscriptions
Subscription Length
sending-mg-errors-sub
If errors occur in the msg-generator, you can dump them with the command below. In the message attributes you can check the errName. For a RedisError, you need to remove the Redis key named in the redisKey attribute from the relevant Redis instance (either me-msg-generator or me-delivery).
$ me-cli pubsub subscribe dump -a -s sending-mg-errors-sub
The previous command dumps all the messages to a file.
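To triage the dumped file, a small script can count the messages per errName and collect the redisKey values of the RedisError messages. This is a minimal sketch, not part of me-cli: it assumes the dump is a JSON array of objects that each carry an `attributes` dictionary, which may not match the actual file layout. The sample data, including the `TimeoutError` name and the key values, is made up for illustration.

```python
import json
from collections import Counter


def load_dump(path):
    """Read a dump file. Assumption: it holds a JSON array of message objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_errors(messages):
    """Count messages per errName and collect redisKey values of RedisError messages."""
    counts = Counter()
    redis_keys = []
    for message in messages:
        # Assumption: errName and redisKey live under an "attributes" dictionary.
        attributes = message.get("attributes", {})
        err_name = attributes.get("errName", "<missing>")
        counts[err_name] += 1
        if err_name == "RedisError" and "redisKey" in attributes:
            redis_keys.append(attributes["redisKey"])
    return counts, redis_keys


# Hypothetical sample standing in for load_dump("<FILE>").
sample = [
    {"attributes": {"errName": "RedisError", "redisKey": "example:key:1"}},
    {"attributes": {"errName": "TimeoutError"}},
    {"attributes": {"errName": "RedisError", "redisKey": "example:key:2"}},
]

counts, redis_keys = summarize_errors(sample)
print(dict(counts))
print(redis_keys)
```

The collected keys are the ones to remove from the Redis instance; the script only lists them and deletes nothing.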
If all the errors are transient, or if you are in doubt, re-publish the messages to the right topic:
$ me-cli pubsub publish -a -f <FILE>
FILE is the path to the JSON file; if you omit it, you are asked for it.
The above command uses the originTopic attribute to publish each message to the right topic.
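Before re-publishing, it can help to preview where the messages would go. The sketch below groups dumped messages by their originTopic attribute and flags those without one. It assumes the same hypothetical dump layout (a JSON array of objects with an `attributes` dictionary), and the topic names are placeholders.

```python
from collections import defaultdict


def group_by_origin_topic(messages):
    """Group messages by attributes.originTopic; messages without it land under None."""
    grouped = defaultdict(list)
    for message in messages:
        topic = message.get("attributes", {}).get("originTopic")
        grouped[topic].append(message)
    return dict(grouped)


# Hypothetical sample; "topic-a" and "topic-b" are placeholder names.
sample = [
    {"attributes": {"originTopic": "topic-a"}},
    {"attributes": {"originTopic": "topic-b"}},
    {"attributes": {"originTopic": "topic-a"}},
    {"attributes": {}},
]

grouped = group_by_origin_topic(sample)
for topic, items in grouped.items():
    label = topic if topic is not None else "<no originTopic>"
    print(f"{label}: {len(items)} message(s)")
```

Messages listed under `<no originTopic>` deserve a closer look before publishing, since the publish command relies on that attribute.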
sending-tx-devicemapper-* & sending-batch-devicemapper-*
In case of an error in a devicemapper topic where the threshold on unacked message age is exceeded, check whether the messages are stuck in the queue or just being processed slowly due to heavy workload. Your error message should look something like:
oldest unacked message age for ems-mobile-engage sending-tx-devicemapper-default-sub is above the threshold of 600.000 with a value of 1526.000
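If you handle such alerts in a script, the affected subscription and the two numbers can be pulled out of the message text. This is a sketch that assumes the alert always follows the wording of the example above; the group name `scope` for the first token is a guess at its meaning, and the numbers are presumably seconds.

```python
import re

# Pattern modelled on the single example alert; other alert wordings will not match.
ALERT_PATTERN = re.compile(
    r"oldest unacked message age for (?P<scope>\S+) (?P<subscription>\S+) "
    r"is above the threshold of (?P<threshold>\d+(?:\.\d+)?) "
    r"with a value of (?P<value>\d+(?:\.\d+)?)"
)


def parse_alert(text):
    """Return the alert fields as a dictionary, or None if the text does not match."""
    match = ALERT_PATTERN.search(text)
    if match is None:
        return None
    return {
        "scope": match["scope"],
        "subscription": match["subscription"],
        "threshold": float(match["threshold"]),
        "value": float(match["value"]),
    }


alert = (
    "oldest unacked message age for ems-mobile-engage "
    "sending-tx-devicemapper-default-sub is above the threshold of 600.000 "
    "with a value of 1526.000"
)
parsed = parse_alert(alert)
print(parsed["subscription"], parsed["value"] - parsed["threshold"])
```

The difference between value and threshold gives a rough idea of how long the oldest message has been waiting beyond the alert limit.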
After making sure the messages are stuck, do the following steps to get the queue unstuck:
- Disable the aggregator workers for whichever generator is down (e.g. scale me-msg-generator-aggregator-default down to 0).
- Wait until the messages are consumed, i.e. the message count in the device mapper subscription is low.
- Check how many workers/deployments are available and note the number.
- Disable the worker for the me-msg-generator-tx-devicemapper-* deployment, whichever is used (scale down to 0).
- Dump the messages with the me-cli command (e.g. with sending-tx-devicemapper-default-sub as the subscription name):
$ me-cli pubsub subscribe dump -a -s <subscription name>
- Restart the workers for sending and aggregating that you previously disabled (use the same number of deployments as before).
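The steps above can be written down as an ordered plan so that the original worker counts are not forgotten. The sketch below only builds command strings and executes nothing. It assumes the workers are Kubernetes deployments scaled with `kubectl scale`, which this runbook does not state; the device mapper deployment name and the replica counts are illustrative.

```python
def build_unstick_plan(aggregator, devicemapper, subscription, replicas):
    """Return the procedure as ordered (description, command) pairs; nothing is run."""

    def scale(deployment, count):
        # Assumption: deployments are managed with kubectl.
        return f"kubectl scale deployment/{deployment} --replicas={count}"

    return [
        ("Disable the aggregator workers", scale(aggregator, 0)),
        ("Wait until the device mapper subscription's message count is low", None),
        ("Disable the device mapper workers", scale(devicemapper, 0)),
        ("Dump the stuck messages",
         f"me-cli pubsub subscribe dump -a -s {subscription}"),
        ("Restore the device mapper workers",
         scale(devicemapper, replicas[devicemapper])),
        ("Restore the aggregator workers",
         scale(aggregator, replicas[aggregator])),
    ]


# Replica counts noted before scaling down (illustrative numbers).
noted_replicas = {
    "me-msg-generator-aggregator-default": 3,
    "me-msg-generator-tx-devicemapper-default": 5,
}

plan = build_unstick_plan(
    aggregator="me-msg-generator-aggregator-default",
    devicemapper="me-msg-generator-tx-devicemapper-default",  # illustrative name
    subscription="sending-tx-devicemapper-default-sub",
    replicas=noted_replicas,
)
for description, command in plan:
    print(f"{description}: {command or '(manual check)'}")
```

Keeping the noted replica counts in one dictionary makes the final restore step use exactly the numbers that were in place before the intervention.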
If you want to dive in deeper on what exactly happened, some further investigation is needed. Things you can do to help:
- Look into the Google Cloud Pub/Sub metrics.
- Republish the dumped messages slowly, one by one or in batches of 10, to see if any of them get stuck (you can freely use tx-devicemapper-dg03, as this is the worker for QA and won’t affect our other customers).
- Look for clues, i.e. things the messages have in common (contactIds, campaignId, etc.).
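Two of these investigation steps are easy to script: splitting the dumped messages into batches of 10 for slow republishing, and counting which field values the messages share. This is a sketch under assumptions: the payload is taken to sit under a hypothetical `data` dictionary with `campaignId` and `contactIds` fields, which may differ from the real message layout.

```python
from collections import Counter


def batches(messages, size=10):
    """Yield consecutive slices of at most `size` messages."""
    for start in range(0, len(messages), size):
        yield messages[start:start + size]


def common_values(messages, field):
    """Count how often each value of `field` occurs, most frequent first."""
    counts = Counter()
    for message in messages:
        # Assumption: the payload fields live under a "data" dictionary.
        value = message.get("data", {}).get(field)
        values = value if isinstance(value, list) else [value]
        for item in values:
            if item is not None:
                counts[item] += 1
    return counts.most_common()


# Hypothetical sample of 23 dumped messages.
sample = [
    {"data": {"campaignId": 100 + (i % 2), "contactIds": [i]}}
    for i in range(23)
]

batch_sizes = [len(batch) for batch in batches(sample)]
print(batch_sizes)
print(common_values(sample, "campaignId"))
```

Each batch can be written to its own JSON file and republished separately, so a batch that gets stuck narrows the search to at most 10 messages.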