Reusing Idle Distribution Groups
Note: This idea is not yet finalized and is subject to change.

Background
Currently, most distribution groups (DGs) are not fully utilized: while the default DG serves most customers, the other DGs remain largely inactive. As a result, some campaigns experience delivery delays due to the first-in-first-out (FIFO) processing of GCP Pub/Sub.
Requirements
- It is essential to reserve some dedicated distribution groups for Runtastic and Betclic(?).
- We should be able to easily allocate a dedicated distribution group to a particular customer, for example for performance testing purposes.
- We can already allocate Pub/Sub topics and subscriptions in our own projects easily with Terraform, but we still depend on other teams for topics and subscriptions in their projects (e.g. personalization).
- We are able to add new deployments to a new distribution group if needed, even though it is a manual process.
- Monitoring capabilities should be in place to keep track of the performance and status of the distribution groups.
  - We lost this capability when we reverted reporting the customer ID in our metrics as a cost-saving measure.
  - We need dedicated time to implement me-metrics-collector.
Solution
To address this issue, we need to distribute campaigns across various DGs while keeping some DGs dedicated to Runtastic and Betclic. We also need an easy way to dedicate a DG to a specific customer, plus a monitoring system.

Solutions
To distribute campaigns across various DGs, we can use different approaches such as round-robin distribution and dynamic distribution.
- With round-robin distribution, campaigns are distributed evenly between queues in a cyclic manner.
- With dynamic distribution, we assign a score to each DG and adjust the distribution based on that score. The score can be calculated in different ways, for example from the raw queue size or a weighted queue size.
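Both strategies can be sketched in a few lines of JavaScript (the function names and DG identifiers below are illustrative, not from our codebase):

```javascript
// Round-robin: cycle through the available DGs evenly.
function makeRoundRobin(dgIds) {
  let next = 0;
  return () => {
    const dg = dgIds[next];
    next = (next + 1) % dgIds.length;
    return dg;
  };
}

// Dynamic: pick the DG with the lowest score (lower score = smaller backlog).
function pickByScore(scores) {
  return Object.entries(scores)
    .reduce((best, cur) => (cur[1] < best[1] ? cur : best))[0];
}

const nextDg = makeRoundRobin(['dg01', 'dg02', 'dg03']);
console.log(nextDg()); // dg01
console.log(nextDg()); // dg02
console.log(pickByScore({ dg01: 120, dg02: 35, dg03: 80 })); // dg02
```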
Score Calculation
The distribution group consists of four queues, but the two most influential queues are the personalization and device mapper queues. When assigning a score to the distribution group, the absolute size of each queue can be considered. However, it is worth noting that the device mapper queue is slower than the personalization queue.
To obtain an overall score for the distribution group, we can assign a weight to each queue based on processing latency. For instance, the personalizer queue could be assigned a weight of 0.2 while the device mapper queue could be given a weight of 0.8.
Based on the scores obtained, we select the distribution group with the lowest score.
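As a minimal sketch of the weighted score, using the example weights above (0.2 for the personalizer queue, 0.8 for the device mapper queue; the exact weights would still have to be tuned):

```javascript
// Example weights from the text above; these would be tuned in practice.
const WEIGHTS = { personalizer: 0.2, deviceMapper: 0.8 };

// Score a DG from the absolute backlog sizes of its two dominant queues.
function dgScore(backlogs) {
  return (
    backlogs.personalizer * WEIGHTS.personalizer +
    backlogs.deviceMapper * WEIGHTS.deviceMapper
  );
}

const dg01 = dgScore({ personalizer: 1000, deviceMapper: 50 }); // 240
const dg02 = dgScore({ personalizer: 200, deviceMapper: 300 }); // 280
// dg01 wins despite its larger personalizer backlog, because the slower
// device mapper queue dominates the score.
console.log(dg01 < dg02 ? 'dg01' : 'dg02'); // dg01
```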
Implementation
Initially, the focus of the implementation will be on selecting the distribution groups at the start of the pipeline. However, in the future, we can optimize the selection process further by choosing the most empty distribution group for each step of the pipeline.
Step 1 - Collect backlog size
First, we need to determine the latency of querying the Google Cloud Monitoring API directly. If the latency is too high, we can instead maintain a counter in Redis, incremented and decremented with the INCR and DECR commands.
Step 2 - Use Redis to store the backlog size
The Redis key is created using the format: <DGID>-<WORKER>.
For instance, if the personalizer for dg01 receives a message, the key dg01-personalizer is incremented using INCR. Once the message is acknowledged, a DECR operation is performed on the dg01-personalizer key.
To ensure accuracy, we can initialize the counters from Cloud Monitoring API data before storing them in Redis.
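The counter protocol above can be sketched as follows. A plain Map stands in for Redis here so the snippet is self-contained; in the real worker the same incr/decr calls would go through a Redis client (e.g. node-redis), using the <DGID>-<WORKER> key format:

```javascript
// In-memory stand-in for Redis INCR/DECR.
const store = new Map();
const incr = (key) => store.set(key, (store.get(key) ?? 0) + 1);
const decr = (key) => store.set(key, (store.get(key) ?? 0) - 1);

// dg01's personalizer receives two messages and acknowledges one:
incr('dg01-personalizer');
incr('dg01-personalizer');
decr('dg01-personalizer');

// The remaining count is the current backlog for that DG/worker pair.
console.log(store.get('dg01-personalizer')); // 1
```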
Next Steps
A similar approach could be used for selecting the delivery subscription in the device mapper. Additionally, we observed that the inbox subscriptions are heavily underutilized (except for the default group), and we could consider using them for performance testing via the Google Cloud Monitoring API approach described above.
We can create a small service that checks the subscription metrics based on the provided configuration. It then calculates the best topic to publish to, based on the number of undelivered messages in the subscription and any other signals we consider valid. This recommendation is then set in Redis with a short TTL (the TTL should be similar to the frequency at which we poll Cloud Monitoring), in a very simple format, for example: inbox:topic-id. The device mapper then checks Redis for the topic to use and, if none is set, falls back to the default topic. Alternatively, instead of falling back to the default topic, we could use a round-robin approach to select the topic.
As an experiment, this would show whether Cloud Monitoring is fast enough and usable for this purpose, or whether we need Redis for the backlog size. It would also help us get rid of the frequent oldest-message alerts in the inbox service. Another benefit is that we could already implement a small fallback mechanism such as round-robin topic selection.
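The recommendation-plus-fallback flow on the device mapper side could look roughly like this. A Map with explicit expiry stands in for Redis and its TTL; the key name and topic IDs are made up for the example:

```javascript
const cache = new Map(); // stands in for Redis; entries carry their own expiry

// The recommender service writes its pick with a short TTL.
function setRecommendation(key, topic, ttlMs) {
  cache.set(key, { topic, expiresAt: Date.now() + ttlMs });
}

// Fallback topics, cycled round-robin when no valid recommendation exists.
const fallbackTopics = ['inbox-topic-a', 'inbox-topic-b'];
let rrIndex = 0;

function pickTopic(key) {
  const entry = cache.get(key);
  if (entry && entry.expiresAt > Date.now()) return entry.topic;
  const topic = fallbackTopics[rrIndex];
  rrIndex = (rrIndex + 1) % fallbackTopics.length;
  return topic;
}

setRecommendation('inbox', 'inbox-topic-b', 30_000);
console.log(pickTopic('inbox')); // inbox-topic-b (recommendation still valid)
cache.delete('inbox');
console.log(pickTopic('inbox')); // inbox-topic-a (round-robin fallback)
```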
Examples
Fetch the metrics using the Google Cloud Monitoring API:

```javascript
const { MetricServiceClient } = require('@google-cloud/monitoring');

const client = new MetricServiceClient();
const projectId = 'your-project-id';
const subscriptionId = 'your-subscription-id';

// Restrict the query to the backlog metric of a single subscription.
const filter =
  'metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages"' +
  ` AND resource.labels.subscription_id="${subscriptionId}"`;

const startTime = new Date();
startTime.setDate(startTime.getDate() - 1); // look back one day

const request = {
  // listTimeSeries expects the project resource name, not the subscription name.
  name: `projects/${projectId}`,
  filter,
  interval: {
    startTime: { seconds: Math.floor(startTime.getTime() / 1000) },
    endTime: { seconds: Math.floor(Date.now() / 1000) },
  },
  view: 'FULL',
};

client
  .listTimeSeries(request)
  .then(([timeSeries]) => {
    for (const series of timeSeries) {
      for (const point of series.points) {
        // Each point carries the backlog size as an int64 value.
        console.log(point.interval.endTime, point.value.int64Value);
      }
    }
  })
  .catch(console.error);
```