Operations & Error Handling

This page collects the information and references needed to operate the entire sending chain, that is the message generator (MG) and delivery, with a focus on the message generator.

Prerequisites

Some of the following descriptions assume that you have installed the Google Cloud SDK. Instructions for installing it and for its basic usage can be found here.

Keep your Google Cloud SDK installation up to date by regularly running the following command:

gcloud components update

Pub/Sub

This chapter describes methods that help when you need to handle issues related to Google Pub/Sub.

Select the Project

Select or change the active project with the following command:

gcloud config set project <ems-mobile-engage|ems-mobile-engage-staging>

Most Cloud SDK commands also support the --project flag, which sets the project for that single command.

List Topics or Subscriptions

Topics and subscriptions can be listed in different formats (the --format and --filter options are optional):

gcloud pubsub <topics|subscriptions> list --format="<format>" --filter="<filter>"

For example, to list all subscriptions whose name contains the string batch in JSON format, run the following:

gcloud pubsub subscriptions list --format="json" --filter="name:batch"

If you want the result as a table, replace --format="json" with --format="table".

If you only need part of the information, name the desired fields in the format option, e.g. --format="json(name)".

Dumping the Content of a Subscription

Sometimes it is helpful to dump all or some of the messages contained in a subscription. Because the messages may be republished later, dump them in JSON format and include at least the data and attributes.

When messages are pulled in JSON format, the data is base64 encoded. Only the table format shows the decoded string.
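Because the JSON output carries the payload base64 encoded, a dump has to be decoded before it can be read. The following Python sketch shows one way to do that. The entry layout (message.data, message.attributes), the attribute name and the sample payload are assumptions for illustration only; check them against your actual dump.

```python
import base64
import json


def decode_data(encoded: str) -> str:
    """Decode the base64 payload of a pulled message into a UTF-8 string."""
    return base64.b64decode(encoded).decode("utf-8")


# Hypothetical dump entry. The real key layout of a dump may differ.
sample_entry = {
    "message": {
        "data": base64.b64encode(b'{"campaignId": 42}').decode("ascii"),
        "attributes": {"customerId": "123"},
    }
}

# Decode the payload and parse it, assuming the payload itself is JSON.
payload = json.loads(decode_data(sample_entry["message"]["data"]))
print(payload)
```

Keep the dump file itself unchanged if the messages may be republished later; decode only a copy or decode in memory as shown here.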

The basic command to dump messages of a subscription looks like this:

gcloud pubsub subscriptions pull <subscription-name> --format="json(DATA,message.attributes)" --limit=<number-of-messages> --filter="<filter>" --auto-ack

Be careful with the --auto-ack option: every message you pull is acknowledged and is no longer available in the subscription after the command has completed!

Example: The following command pulls up to 100 messages with a publish time at or after 2021-05-30T13:26:00 from the subscription named ak0001-sub and acknowledges the pulled messages.

gcloud pubsub subscriptions pull ak0001-sub --format="json(DATA,message.attributes)" --limit=100 --filter="message.publishTime>=2021-05-30T13:26:00" --auto-ack

To dump the messages to a file, append > <filename> to your command, e.g.:

gcloud pubsub subscriptions pull ak0001-sub --format="json(DATA,message.attributes)" --limit=100 --filter="message.publishTime>=2021-05-30T13:26:00" --auto-ack > my-dump.json
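If dumps are taken regularly, assembling the command in a small script reduces the risk of acknowledging messages by accident. The following Python sketch only builds the argument list from the flags described above; the function name and its parameters are hypothetical, and --auto-ack is deliberately off unless requested.

```python
import shlex
from typing import List, Optional


def build_pull_command(subscription: str, limit: int,
                       message_filter: Optional[str] = None,
                       auto_ack: bool = False) -> List[str]:
    """Build the argument list for a 'gcloud pubsub subscriptions pull' dump."""
    args = [
        "gcloud", "pubsub", "subscriptions", "pull", subscription,
        "--format=json(DATA,message.attributes)",
        f"--limit={limit}",
    ]
    if message_filter:
        args.append(f"--filter={message_filter}")
    if auto_ack:
        # Acknowledged messages are gone from the subscription afterwards.
        args.append("--auto-ack")
    return args


cmd = build_pull_command(
    "ak0001-sub", 100, "message.publishTime>=2021-05-30T13:26:00")
# Print a shell-quoted form of the command for review before running it.
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run together with an open file as stdout to write the dump to disk.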

Create Topics & Subscriptions

The Cloud SDK can create topics and their subscriptions with gcloud pubsub topics create and gcloud pubsub subscriptions create.

The settings of an existing subscription can be changed with gcloud pubsub subscriptions update.

Republishing Dumped Messages

Dumping messages is described above. To republish the dumped messages to a specific topic, use our CLI tool. For further information refer to its README file.

Shoveling of Messages

In the case of Pub/Sub, "shoveling" means publishing the messages contained in a subscription to another topic. A template for this already exists in the Dataflow jobs.
To set up a "shovel", do the following:

Open the Dataflow Jobs page and select the desired project from the dropdown list (e.g. Mobile Engage or Mobile Engage Staging).
Click on "CREATE JOB FROM TEMPLATE"
Give the "shovel" a (job) name
Select one of the Europe endpoints (preferred Frankfurt) as the "Regional Endpoint", e.g. europe-west3 (Frankfurt)
Select "Pub/Sub to Pub/Sub" from the "Dataflow template" list
Set the full names of the subscription and the topic, and the path to the bucket folder that can be used for temporary storage of data.
If no such bucket exists, create one in the same region as the endpoint (so europe-west3) and set a data retention of 5 days.
Optionally click "SHOW OPTIONAL PARAMETERS" to set further parameters, for example an attribute-based filter that selects which messages are shoveled.
Click on "Run job" and you are done.

Retriggering a Batch Campaign

In general, retriggering a batch campaign means dumping the messages contained in the related subscription, reducing them to the ones that shall be republished, and then republishing them.

Every topic named sending-batch-<dg> (where <dg> is the distribution group name) is connected to two subscriptions. The one called sending-batch-<dg>-sub is consumed by the Chunker. The one called sending-batch-<dg>-sub-backup contains a copy of every request the Chunker receives (including redelivered messages).
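The naming scheme can be captured in a small helper so that scripts do not spell the names by hand. This is only a sketch of the convention described above; the function, the tuple type and the distribution group name dg1 are hypothetical.

```python
from typing import NamedTuple


class BatchResources(NamedTuple):
    topic: str
    chunker_subscription: str
    backup_subscription: str


def batch_resources(dg: str) -> BatchResources:
    """Derive the topic and subscription names for a distribution group."""
    topic = f"sending-batch-{dg}"
    return BatchResources(
        topic=topic,
        chunker_subscription=f"{topic}-sub",        # consumed by the Chunker
        backup_subscription=f"{topic}-sub-backup",  # copy of every request
    )


# "dg1" is a made-up distribution group name.
resources = batch_resources("dg1")
print(resources.backup_subscription)
```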

In detail, do the following steps:

Dump the messages that are relevant to you from the backup subscription:

gcloud pubsub subscriptions pull sending-batch-<dg>-sub-backup --format="json(DATA,message.attributes)" --limit=<number-of-messages> --project="<ems-mobile-engage|ems-mobile-engage-staging>" --filter="NOT message.attributes.chunker-state:*" > batch-requests.json

The filter ensures that the result contains only the initial messages. You can also extend the filter with a date/time condition for the time frame you are interested in.

Optionally edit the JSON file and remove the entries you do not want to republish.
Trigger the republishing as described in the chapter Republishing Dumped Messages.
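For larger dumps, removing entries by hand is error-prone, so the optional editing step can be scripted. The following Python sketch assumes that the dump is a JSON array and that each entry keeps its attributes under message.attributes; both are assumptions, as is the attribute name campaign_id used in the usage note.

```python
import json
from typing import Any, Callable, Dict, List

Entry = Dict[str, Any]


def get_attributes(entry: Entry) -> Dict[str, str]:
    """Return the attributes of a dump entry (assumed layout), or {}."""
    return entry.get("message", {}).get("attributes") or {}


def select_entries(entries: List[Entry],
                   keep: Callable[[Dict[str, str]], bool]) -> List[Entry]:
    """Keep only the entries whose attributes satisfy the predicate."""
    return [entry for entry in entries if keep(get_attributes(entry))]


def filter_dump(src: str, dst: str,
                keep: Callable[[Dict[str, str]], bool]) -> int:
    """Write the selected entries of src to dst and return their count."""
    with open(src, encoding="utf-8") as infile:
        entries = json.load(infile)
    kept = select_entries(entries, keep)
    # Write to a separate file so the original dump stays untouched.
    with open(dst, "w", encoding="utf-8") as outfile:
        json.dump(kept, outfile, indent=2)
    return len(kept)
```

A call such as filter_dump("batch-requests.json", "to-republish.json", lambda a: a.get("campaign_id") == "42") would keep only the requests of one campaign, provided such an attribute exists in your messages.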

The backup subscriptions keep messages for only 24 hours. This should be sufficient, because retriggering a batch after more than a few hours is usually too late.

Batch Throttled Campaign Runtime Exceeded Without Consuming Audience

A batch throttled campaign can be set up for push or web push campaigns. Data about the BullMQ jobs that are created is stored in the message generator PostgreSQL database, more specifically in the table campaign_audience_inventory.

For each run of a throttled campaign, the audience is stored in a dedicated table named campaign_audience_<customerID>_<campaignID>_<campaign_audience_inventoryID>. When a chunk of the audience is consumed, that chunk is stored in the table campaign_utilized_audience_<customerID>_<campaignID>_<campaign_audience_inventoryID>.
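When investigating a run, the two table names have to be assembled from the IDs. A minimal Python sketch of the naming scheme described above follows; the function names and the sample IDs are hypothetical, and the IDs are assumed to be integers.

```python
def _suffix(customer_id: int, campaign_id: int, inventory_id: int) -> str:
    # int() rejects anything that is not a plain number, which keeps the
    # resulting name safe to paste into an SQL statement.
    return f"{int(customer_id)}_{int(campaign_id)}_{int(inventory_id)}"


def audience_table(customer_id: int, campaign_id: int,
                   inventory_id: int) -> str:
    """Name of the table that holds the audience of one run."""
    return "campaign_audience_" + _suffix(
        customer_id, campaign_id, inventory_id)


def utilized_audience_table(customer_id: int, campaign_id: int,
                            inventory_id: int) -> str:
    """Name of the table that holds the already consumed chunks."""
    return "campaign_utilized_audience_" + _suffix(
        customer_id, campaign_id, inventory_id)


# Made-up IDs for illustration.
print(audience_table(123, 456, 7))
print(utilized_audience_table(123, 456, 7))
```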

A cleanup job runs every day and removes the tables that are already fully consumed. However, if a campaign exceeds the runtime deadline, which is controlled by the environment variable SLICER_THROTTLED_MAX_CAMPAIGN_RUNTIME, and the utilized audience does not match audience_size in the inventory table, an alert is raised.
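The alert condition can be read as two checks combined. The following Python sketch illustrates that reading and is not the actual implementation. It assumes that the deadline is counted from the inventory's created_at, and it takes the maximum runtime as a timedelta because the unit of SLICER_THROTTLED_MAX_CAMPAIGN_RUNTIME is not stated here. All function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def runtime_exceeded(created_at: datetime, max_runtime: timedelta,
                     now: datetime) -> bool:
    """True if the run has been active longer than the allowed runtime."""
    return now - created_at > max_runtime


def should_alert(created_at: datetime, max_runtime: timedelta,
                 audience_size: int, utilized_count: int,
                 now: datetime) -> bool:
    """True if the deadline has passed and the audience is not fully used."""
    return (runtime_exceeded(created_at, max_runtime, now)
            and utilized_count != audience_size)


# Made-up values: a run started 30 hours ago with a 24 hour limit.
started = datetime(2021, 5, 30, 8, 0, tzinfo=timezone.utc)
checked = started + timedelta(hours=30)
print(should_alert(started, timedelta(hours=24), 1000, 800, checked))
```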

To start investigating the root cause, look into the inventory table:

SELECT * FROM campaign_audience_inventory WHERE id = <ID>;

Compare audience_size with the row count of the table named in the utilized_audience_table column. There should be a difference.

SELECT COUNT(1) FROM campaign_utilized_audience_<customerID>_<campaignID>_<campaign_audience_inventoryID>;

Look at created_at for this inventory; that is when the first job started. What do the logs show from this point in time onward? Was the message generator Redis up, or was it restarted? Is there anything in the Pub/Sub error subscriptions regarding this campaign?

In general, apart from checking for Redis-related issues, run the checks you would expect for a normal batch run:

Are there any issues causing a slow-down?

First check in the Slack channel #war-room whether an issue has been reported.
In general, check whether one of the services used in our sending chain has issues. To do so, check the depth of our Pub/Sub subscriptions. If you see a lot of messages and possibly related alerts, check the logs of the message generator and of delivery in LaaS. When a transient error occurs, we log the reason, so you can find out what is currently blocking us.

Report your findings to TCS (see here), contact the team responsible for the service, and also mention it in the Slack channel #war-room.

Is there a backlog in any of the queues the sending chain is using?

To find the distribution group for a customer:

The metadata query result shows the distribution group.
Check the subscriptions which are related to this distribution group.

Marking inventory as done

If there is nothing more to do for this particular run, you can mark the inventory as done by executing the statement below. Before doing so, notify the team responsible for the service so that they are aware of the manual cleanup, for example in the #team-mobile-dev channel, where we usually post such notifications.

UPDATE campaign_audience_inventory SET is_done = true WHERE id = <ID>;