Monitoring

The message generator has a separate part for collecting and monitoring data related to the completion and processing of campaigns. Monitoring exists for transactional campaigns as well as for batch (ad-hoc and AC) campaigns.

Batch Campaign Monitoring

The following figure gives an overview of the added parts (shown in blue) that are used to monitor the campaigns.

Batch Campaign Monitoring

For the most part, it was sufficient to extend the topics of the existing Pub/Sub infrastructure with additional subscriptions, and the database of the message generator with additional tables. The only exception is the scheduler, which, like AC, uses the generator’s API; a new topic and a subscription had to be added for it. The names of the added subscriptions begin with "monitoring" instead of "sending". Whenever one part of the sending chain transmits a message to the next part, Pub/Sub now duplicates this message, and the so-called collector uses the copy to create "raw" log entries. The slicer part of the chunker sends the trigger for aggregating a campaign/batch run once it has finished creating all slices.

batch-metrics-collector

The batch-metrics-collector is a process that consumes multiple subscriptions at the same time. It selects the subscriptions by the prefix of the subscription name, with the exception of the one assigned to the scheduler. If new subscriptions become necessary, there is therefore no need to extend the configuration or adapt the code.
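The prefix-based selection can be sketched as follows. This is a minimal illustration in Python, not the collector's actual code: the function name, the exact prefix string, and the sample subscription names are assumptions, and how the scheduler's subscription is attached separately is not shown.

```python
from typing import Iterable, List

# Assumption: the text only says the names begin with "monitoring";
# the real prefix may be longer.
MONITORING_PREFIX = "monitoring"


def select_subscriptions(
    names: Iterable[str],
    scheduler_subscription: str,
    prefix: str = MONITORING_PREFIX,
) -> List[str]:
    """Return every subscription matching the prefix, except the scheduler's."""
    return [
        name
        for name in names
        if name.startswith(prefix) and name != scheduler_subscription
    ]


# Hypothetical subscription names, for illustration only.
available = [
    "sending-chunker-sub",
    "monitoring-chunker-sub",
    "monitoring-renderer-sub",
    "monitoring-scheduler-sub",
]
selected = select_subscriptions(available, "monitoring-scheduler-sub")
```

With this approach, a newly created subscription is picked up as soon as its name carries the prefix, which is why no configuration change is needed.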

The collector consumes the messages that are exchanged between the individual parts of the sending chain and uses them to create so-called raw log entries, which are published to the sending-monitoring-batch-run-log Pub/Sub topic. Each message in this topic is then inserted into the batch_run_log GBQ table in the dataset sending_monitoring. Each entry provides information about its source, the event time, the campaign (ID, run ID, ...) and possibly the ID of the associated contact.
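As an illustration of what such a raw log entry might look like, the sketch below models the fields named above and serializes them as JSON. The field names, the class name, and the JSON encoding are assumptions made for this example; the actual schema of batch_run_log is not given here.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RawLogEntry:
    """Hypothetical shape of one raw log entry."""

    source: str        # part of the sending chain that emitted the message
    event_time: str    # ISO 8601 timestamp of the event
    campaign_id: str
    run_id: str
    contact_id: Optional[str] = None  # only set for contact-level messages


def to_payload(entry: RawLogEntry) -> bytes:
    """Serialize an entry as UTF-8 JSON, suitable as a message body."""
    return json.dumps(asdict(entry), sort_keys=True).encode("utf-8")


entry = RawLogEntry(
    source="chunker",
    event_time=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc).isoformat(),
    campaign_id="campaign-1",
    run_id="run-1",
)
payload = to_payload(entry)
```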

batch-run-monitoring

The batch-run-monitoring is triggered by the slicer part of the chunker, which publishes a message to the corresponding Pub/Sub topic once it has finished creating all slices. The batch-run-monitoring consumes the message from the subscription (monitoring-batch-metrics-aggregator-sub) that is connected to that topic. Based on the information in the message, it runs queries on the raw batch_run_log table in GBQ as well as on the GBQ tables that contain information about sent/not sent messages. It checks whether the batch run is finished or expired, raises an alert if needed, and publishes to batch_run_meta for finished campaigns.

If an error occurs, the batch-run-monitoring simply NACKs the message, since all errors that can occur are transient. The messages are redelivered with an exponential backoff delay between 10s and 600s.
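To give a feel for the redelivery window, the following sketch computes an illustrative delay schedule bounded by 10s and 600s. The doubling factor and the function name are assumptions; the actual delay between redeliveries is chosen by Pub/Sub within the configured bounds and may differ from this schedule.

```python
MIN_BACKOFF_S = 10.0   # lower bound from the text
MAX_BACKOFF_S = 600.0  # upper bound from the text


def backoff_delay(attempt: int, factor: float = 2.0) -> float:
    """Illustrative delay in seconds before redelivery `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Cap the exponent so very large attempt numbers cannot overflow a float.
    exponent = min(attempt - 1, 32)
    return min(MIN_BACKOFF_S * factor ** exponent, MAX_BACKOFF_S)


delays = [backoff_delay(n) for n in range(1, 9)]
```

Under these assumptions the delay doubles from 10s until it reaches the 600s ceiling, where it stays for all further redeliveries.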

When the monitoring detects that the batch run is finished (the number of unique contact IDs in the sent/not sent data in the data platform equals the number of contacts in the contact list), it marks the batch run in the batch_run_meta table as "finished" by publishing the finished_at date to the respective Pub/Sub topic. When it detects that the campaign has expired (the contact ID count differs and no new event has arrived within a configured time), it sets the expired_at column to the current date to mark the batch run as "expired".
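The finished/expired decision described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the monitoring's actual code: the names classify_run, RunState, and expiry_window are hypothetical, and the inputs stand in for the results of the GBQ queries.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class RunState(Enum):
    RUNNING = "running"
    FINISHED = "finished"
    EXPIRED = "expired"


def classify_run(
    unique_contact_ids: int,
    contact_list_size: int,
    last_event_at: datetime,
    now: datetime,
    expiry_window: timedelta,
) -> RunState:
    """Classify a batch run from its contact counts and latest event time."""
    # Finished: every contact in the list has a sent or not sent event.
    if unique_contact_ids == contact_list_size:
        return RunState.FINISHED
    # Counts differ: expired only if no new event arrived within the window.
    if now - last_event_at > expiry_window:
        return RunState.EXPIRED
    return RunState.RUNNING


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
window = timedelta(hours=1)  # hypothetical configured time
finished = classify_run(100, 100, now - timedelta(hours=5), now, window)
expired = classify_run(90, 100, now - timedelta(hours=2), now, window)
running = classify_run(90, 100, now - timedelta(minutes=5), now, window)
```

Checking the finished condition first means a complete run is never reported as expired, even if its last event is old.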

Tasks with target === notificationinbox are ignored. The aggregator just marks the related batch run as "finished" if data exists in the batch_run_meta table.

When the aggregator detects that a batch run is finished, it continues to summarize until the most recent event in the data platform is older than a configured backoff. This ensures that delayed data is also taken into account (although this only affects endedAt).
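A minimal sketch of this settling check, assuming the most recent event time and the backoff are available as plain values; the function name keep_summarizing and the five-minute backoff are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def keep_summarizing(
    latest_event_at: datetime, now: datetime, backoff: timedelta
) -> bool:
    """True while the most recent event is still within the backoff."""
    return now - latest_event_at <= backoff


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
backoff = timedelta(minutes=5)  # hypothetical configured backoff
still_settling = keep_summarizing(now - timedelta(minutes=2), now, backoff)
settled = not keep_summarizing(now - timedelta(minutes=10), now, backoff)
```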

Transactional Campaign Monitoring

tx-metrics-collector-worker

The collector consumes sent and not sent events from the subscriptions me-metric-sent and me-metric-not-sent, respectively, and writes them to the partitioned table sending-times.

TX Campaign Monitor Overview