MAU Anomaly Detection

Model training

Model training process is a scheduled job which runs on the 1st of every month in GAP (me-push-service repo). It must run at least some time (preferably an hour) before the Anomaly Detection process. It uses BigQuery machine learning to train a model (ARIMA PLUS) on MAU data for the last 2 years. Every month we recreate (rewrite) the the model, we do not keep old models around. This model is then used throughout the month in the Anomaly Detection process. We will have an alert if model training fails. We can reschedule it in GAP (eg. using K9S).

-- Last day of the previous month. Format YYYY-MM-DD
DECLARE prev_month_end DEFAULT DATE_SUB(DATE_TRUNC(CURRENT_DATE(), MONTH), INTERVAL 1 DAY);

CREATE OR REPLACE MODEL `ems-mobile-engage.mobile_subscription.mau_prediction_model`
  OPTIONS(
    model_type='ARIMA_PLUS',
    time_series_timestamp_col='day',
    time_series_data_col='mau',
    time_series_id_col='customer_id',
    adjust_step_changes=false,
    clean_spikes_and_dips=false,
    data_frequency='DAILY',
    horizon=100 /* predict 100 days into the future */
  ) AS

SELECT
  DATE(aggregated_at) AS day,
  customer_id,
  SUM(value) AS mau
FROM
  `ems-mobile-engage.mobile_subscription.kpi_summary`
WHERE
  kpi_name = 'monthly_active_users'
  AND DATE(aggregated_at) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR) AND CURRENT_DATE()
  AND customer_id IN (
    /* ignore small customers */
    SELECT
      customer_id
    FROM (
      SELECT
        DATE(aggregated_at) AS day,
        customer_id,
        SUM(value) AS mau
      FROM
        `ems-mobile-engage.mobile_subscription.kpi_summary`
      WHERE
        kpi_name = 'monthly_active_users'
        AND DATE(aggregated_at) = prev_month_end
      GROUP BY
        1, 2
    )
    WHERE mau > 10000 /* can be changed in env var */
  )
GROUP BY
  1, 2

Anomaly Detection

Anomaly detection process is a scheduled job which runs weekly (on Tuesdays) in GAP. It uses Machine Learning model to detect anomalies in MAU data in BigQuery on the day of execution. It triggers a PagerDuty alert if any of the following are true: * Percentage of customers whose MAU value is too low is greater than 30% (Default; Can be changed in env var). * Percentage of customers whose MAU value is too high is greater than 50% (Default; Can be changed in env var). Here we only check customers whose MAU value on the day when the model was trained was greater than 10000 (Default; Can be changed in env var).

WITH
customers_in_model AS (
  SELECT
    DISTINCT customer_id
  FROM
    ML.EVALUATE(
      MODEL `ems-mobile-engage.mobile_subscription.mau_prediction_model`
    )
),
current_maus AS (
  SELECT
    DATE(aggregated_at) AS day,
    customer_id,
    SUM(value) AS mau
  FROM
    `ems-mobile-engage.mobile_subscription.kpi_summary`
  WHERE
    kpi_name = 'monthly_active_users'
    AND DATE(aggregated_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY
    1, 2
),
anomalies AS (
  SELECT
    *
  FROM
    ML.DETECT_ANOMALIES(
      MODEL `ems-mobile-engage.mobile_subscription.mau_prediction_model`,
      STRUCT(0.99 AS anomaly_prob_threshold),
      (SELECT * FROM current_maus)
    )
),
stats AS (
  SELECT
    COUNTIF(mau < lower_bound) AS mau_low,
    COUNTIF(mau > higher_bound) AS mau_high,
    COUNT(1) AS total
  FROM
    anomalies
)
SELECT
  *,
  (mau_low / total * 100) AS perc_mau_low,
  (mau_high / total * 100) AS perc_mau_high
FROM
  stats

Alert Handling

TODO: Have a meeting with team leads to discuss what to do when on-call person gets this alert.