MAU Anomaly Detection
Model training
Model training process is a scheduled job which runs on the 1st of every month in GAP (me-push-service repo). It must run at least some time (preferably an hour) before the Anomaly Detection process. It uses BigQuery machine learning to train a model (ARIMA PLUS) on MAU data for the last 2 years. Every month we recreate (rewrite) the the model, we do not keep old models around. This model is then used throughout the month in the Anomaly Detection process. We will have an alert if model training fails. We can reschedule it in GAP (eg. using K9S).
-- Last day of the previous month. Format YYYY-MM-DD
DECLARE prev_month_end DEFAULT DATE_SUB(DATE_TRUNC(CURRENT_DATE(), MONTH), INTERVAL 1 DAY);
CREATE OR REPLACE MODEL `ems-mobile-engage.mobile_subscription.mau_prediction_model`
OPTIONS(
model_type='ARIMA_PLUS',
time_series_timestamp_col='day',
time_series_data_col='mau',
time_series_id_col='customer_id',
adjust_step_changes=false,
clean_spikes_and_dips=false,
data_frequency='DAILY',
horizon=100 /* predict 100 days into the future */
) AS
SELECT
DATE(aggregated_at) AS day,
customer_id,
SUM(value) AS mau
FROM
`ems-mobile-engage.mobile_subscription.kpi_summary`
WHERE
kpi_name = 'monthly_active_users'
AND DATE(aggregated_at) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR) AND CURRENT_DATE()
AND customer_id IN (
/* ignore small customers */
SELECT
customer_id
FROM (
SELECT
DATE(aggregated_at) AS day,
customer_id,
SUM(value) AS mau
FROM
`ems-mobile-engage.mobile_subscription.kpi_summary`
WHERE
kpi_name = 'monthly_active_users'
AND DATE(aggregated_at) = prev_month_end
GROUP BY
1, 2
)
WHERE mau > 10000 /* can be changed in env var */
)
GROUP BY
1, 2
Anomaly Detection
Anomaly detection process is a scheduled job which runs weekly (on Tuesdays) in GAP. It uses Machine Learning model to detect anomalies in MAU data in BigQuery on the day of execution. It triggers a PagerDuty alert if any of the following are true: * Percentage of customers whose MAU value is too low is greater than 30% (Default; Can be changed in env var). * Percentage of customers whose MAU value is too high is greater than 50% (Default; Can be changed in env var). Here we only check customers whose MAU value on the day when the model was trained was greater than 10000 (Default; Can be changed in env var).
WITH
customers_in_model AS (
SELECT
DISTINCT customer_id
FROM
ML.EVALUATE(
MODEL `ems-mobile-engage.mobile_subscription.mau_prediction_model`
)
),
current_maus AS (
SELECT
DATE(aggregated_at) AS day,
customer_id,
SUM(value) AS mau
FROM
`ems-mobile-engage.mobile_subscription.kpi_summary`
WHERE
kpi_name = 'monthly_active_users'
AND DATE(aggregated_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY
1, 2
),
anomalies AS (
SELECT
*
FROM
ML.DETECT_ANOMALIES(
MODEL `ems-mobile-engage.mobile_subscription.mau_prediction_model`,
STRUCT(0.99 AS anomaly_prob_threshold),
(SELECT * FROM current_maus)
)
),
stats AS (
SELECT
COUNTIF(mau < lower_bound) AS mau_low,
COUNTIF(mau > higher_bound) AS mau_high,
COUNT(1) AS total
FROM
anomalies
)
SELECT
*,
(mau_low / total * 100) AS perc_mau_low,
(mau_high / total * 100) AS perc_mau_high
FROM
stats