Dataflow Alerts

A Mobile Engage pubsub-to-bigquery dataflow job is writing to the errors table

If you received this alert, one of the Dataflow jobs in GCP failed to process some reporting data, and the failed records are being written to the errors table of the affected target dataset in BigQuery. It is highly likely that nothing is wrong with the dataflow itself; the incoming data simply could not reach its intended destination, for example because of validation issues. First, identify which dataset(s) are affected. The errors table lives inside the affected dataset under the name errors, e.g. ems-mobile-engage.mobile_push_campaigns_raw.errors. You can query the latest errors from the table, for example:

SELECT * FROM `ems-mobile-engage.mobile_push_campaigns_raw.errors` WHERE _PARTITIONTIME IS NULL ORDER BY timestamp DESC

(On an ingestion-time partitioned table, rows still in the streaming buffer have a NULL _PARTITIONTIME, so this filter returns the most recently streamed errors.)
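If it is not obvious which dataset is affected, one way to enumerate candidate errors tables is a region-scoped INFORMATION_SCHEMA query. A sketch; the `region-eu` qualifier is an assumption, replace it with the region the Mobile Engage datasets actually live in:

```sql
-- List every table named `errors` across datasets in the project.
-- The region qualifier (`region-eu`) is an assumption; use the region
-- of the affected datasets.
SELECT table_schema, table_name
FROM `ems-mobile-engage`.`region-eu`.INFORMATION_SCHEMA.TABLES
WHERE table_name = 'errors';
```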

Note that errors from all time and all customers end up in this table. The occurrence of the alert should roughly correspond to the timestamp of the errors.
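To correlate errors with a specific alert, it can help to restrict the query to a window around the time the alert fired. A sketch, assuming the table has a `timestamp` column (as in the query above); the example times are placeholders:

```sql
-- Errors in the hour leading up to the (example) alert firing time.
SELECT *
FROM `ems-mobile-engage.mobile_push_campaigns_raw.errors`
WHERE timestamp BETWEEN TIMESTAMP('2024-01-01 12:00:00 UTC')
                    AND TIMESTAMP('2024-01-01 13:00:00 UTC')
ORDER BY timestamp DESC
LIMIT 100;
```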

The possible impact is inaccurate segmentation and reporting data that is off (lower counts) or missing entirely. Sometimes errors occur because customers insert invalid values into certain fields; in that case we only need to notify TCS. But the errors could also point to a bug in the code (missing fields, data not validated appropriately, etc.), in which case we need to fix the bug and then either notify TCS about the inaccurate reporting, re-send the data to the dataflow manually, or insert the records into the BigQuery table manually.
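If the fix requires re-sending data to the dataflow manually, one possible approach is to re-publish the corrected payloads to the job's input Pub/Sub topic. This is a hypothetical sketch: the topic name and message shape are assumptions, so verify both against the affected dataflow job's configuration before publishing anything:

```shell
# Hypothetical: re-publish one corrected record to the job's input topic.
# Topic name (mobile-push-campaigns-input) and message payload are
# assumptions -- check the affected job's actual input topic and schema.
gcloud pubsub topics publish mobile-push-campaigns-input \
  --project=ems-mobile-engage \
  --message='{"campaign_id": "12345", "event": "send", "timestamp": "2024-01-01T12:00:00Z"}'
```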