Contact Data Deletion
Contact Data Deletion for Mobile Engage is configured in the contact-data-deletion-me repo. It also holds the configurations for account deletion cleanups. The repo is private and can be accessed by the Mobile Engage team. It utilises the contact-data-deletion service (aka CDDS), which is a shared service for all Emarsys products, developed by the former Data Platform team. It runs the given cleanup configurations using cronjobs, and deletes data from the Mobile Engage BigQuery (BQ) tables.
Configurations
The configurations for the deletion jobs are held in the gap_patch_cron_job.yaml file where tables and deletion parameters are set for each job. These jobs are then executed using cronjobs configured in the gap.yaml of the service.
Product specific deletions
- push
- inapp
- inbox
- embedded-messaging
- web-push
These are run on the 5th and the 20th of each month, with the exception of push which runs on the 6th and 21st. The jobs are staggered throughout the day to avoid running out of BQ slots: push at 10:00 UTC (on 6th & 21st), inapp at 12:00 UTC, inbox at 14:00 UTC, web-push at 16:00 UTC, and embedded-messaging at 18:00 UTC.
Push is offset by one day because it uses the bq_delete_template_ranged_with_offset.sql template, which includes an interval_days_upper_bound parameter to handle late-arriving session events. The additional 24-hour offset on the client_changes table ensures that the session_events streaming buffer has cleared before we attempt to delete from it; previously, entries still in the streaming buffer caused partial job failures.
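As a rough illustration of the windowed deletion idea (this is not the actual bq_delete_template_ranged_with_offset.sql template; the table name, the one-day bound, and the parameter name are all stand-ins):

```shell
# Sketch of the ranged-delete-with-offset idea: only partitions older than the
# upper bound are eligible, so rows that may still sit in the session_events
# streaming buffer are never touched. All names and values are illustrative.
interval_days_upper_bound=1
cutoff=$(date -u -d "-${interval_days_upper_bound} day" +%Y-%m-%d)
cat <<SQL
DELETE FROM \`project.dataset.session_events\`
WHERE contact_id IN UNNEST(@contact_ids)
  AND DATE(_PARTITIONTIME) < '${cutoff}'
SQL
```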
Mobile subscription deletions
- subscription-conact-states
These are run on the 5th and the 20th of each month at 10:00 UTC. They delete the underlying data related to mobile subscriptions and KPI data.
Client State deletions
- client-state-client-updates
- client-state-push-t-updates
- client-state-contact-update
- client-state-snapshots
The contact-update and snapshots jobs are run on the 5th and the 20th of each month at 08:00 UTC. The client-updates and push-token-updates jobs (push-t-updates, due to naming restrictions) are run on the 7th and the 22nd of each month at 10:00 UTC. The latter two are offset because their tables do not contain contact_ids, so they must be run after the contact’s data has been cleaned up (here two days later, allowing for the timeout of the deletions they depend on).
Account deletions
- account-deletion
- account-del-wo-pt
These jobs run on the 15th of every month, offset from each other at 11:00 UTC and 13:00 UTC respectively. They delete from all the Mobile Engage related tables we have in BQ. The difference between the two is that account-del-wo-pt (meaning: account deletion without partitiontime) deletes from those tables that do not use the _PARTITIONTIME column.
Anonless deletions
- anonless-del
- anonless-snapsh-del
These jobs also run on the 15th of every month, at 14:00 UTC and 15:00 UTC respectively. They delete all client data in client-state that has neither an anonymous nor an identified contact attached to it. They are intended for customers that have disabled anonymous contact creation (aka the anonless flow).
Alert Handling
If you get an alert for this service it is essential you check the logs in LaaS to find which deletions failed.
In general, the WARN level is used to point out issues in the queries themselves, whereas the ERROR level is used when the CDDS framework encounters issues interacting with BigQuery.
You can use the datasetId and targetTable fields to identify which deletions failed.
Depending on where the deletion failed, we might have a BigQuery job id in the logs. The job id should look something like these two examples:
ems-dp-contact-deletion:EU.contact-deletion-me-ce8ac610-3124-4fd6-8cb9-e970d6456334
contact-deletion-me-198e9482-a989-484f-9285-31806bfb8ba3
The job id can appear either in its own jobId field or inside the stack_trace.
You can use this to check the status of the job in the BigQuery UI or CLI using the following command:
bq --project_id <projectId> show -j --format=prettyjson <jobId>
Keep in mind that the project in which the queries are executed might differ from the project of the targeted table(s). You can sometimes find the projectId under project in the logs. If in doubt, you can check the target_project_id field in the yaml where the job is configured in the repo (e.g. ems-dp-contact-deletion).
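The JSON printed by the bq show -j command is a standard BigQuery Job resource, so the failure reason and the original query can be read from fixed paths. A small self-contained sketch (the JSON below is a trimmed, made-up example, not real job output):

```shell
# A `bq show -j --format=prettyjson` dump is a standard BigQuery Job resource.
# Given output like this (trimmed, illustrative) saved as job.json ...
cat > job.json <<'JSON'
{
  "status": {"errorResult": {"message": "Quota exceeded: too many concurrent DML statements"}},
  "configuration": {"query": {"query": "DELETE FROM `ds.t` WHERE contact_id IN ('c1','c2')"}}
}
JSON
# ... the failure reason and the exact SQL to re-run live at these paths:
python3 - <<'PY'
import json
job = json.load(open("job.json"))
print(job["status"]["errorResult"]["message"])   # why the job failed
print(job["configuration"]["query"]["query"])    # the query, ready to re-run
PY
```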
If you find that the query in the job has failed, you should be able to re-run it manually. There are multiple ways of getting hold of the query text: the command suggested above prints the failed query, you can copy it from the errored job in the BigQuery UI, and sometimes the service itself prints the queries of failed jobs in its logs. Then re-run it manually in the BigQuery UI or CLI.
If there are too many failed jobs, there is no point in trying to rerun hundreds of queries manually; besides, it likely means there is some greater underlying issue with BQ at that moment. In this case, simply rerun the entire cronjob, either using k9s or the GAP CLI, once the issue has subsided. Should you still encounter problems, contact the Mobile Engage Platform Team.
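If you are not using the GAP CLI, one way to retrigger a cronjob by hand is with plain kubectl (a sketch only; the namespace and job names are illustrative, and k9s offers the equivalent trigger action in its cronjob view):

```shell
# Create a one-off job from the cronjob's template (names are examples).
kubectl -n <namespace> create job \
  --from=cronjob/contact-data-deletion-me-push \
  contact-data-deletion-me-push-manual-rerun
# Follow its progress:
kubectl -n <namespace> get jobs -w
```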
Browsing the LaaS logs: Hints and Tips
You can find the logs in the gap-contact-data-deletion-me index in LaaS.
Use the @gap.cluster_name or @gap.resource.labels.project_id fields to distinguish between staging and production logs.
Use the labels.k8s-pod/app field to find logs for a specific job. The value of this field should be the same as the name field in the gap_patch_cron_job.yaml file for each job.
Usually, the errors you are looking for are right at the end of a run. So after you have applied the above filters, use the timeline graph at the top to home in on the end of the run. All the needed info (such as BigQuery job ids, datasetIds, target tables, etc.) should be within the last 5-6 log messages.
Note that error logs can also appear mid-run (e.g. if a backup fails, the job continues with the deletion of the remaining configured datasets). In this case it is best to filter by the level field, and then home in on the logs around the time the error occurred.
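Assuming the LaaS frontend accepts standard Kibana (KQL) queries, a filter combining the fields above might look like this (the job name is illustrative, and the quoting of the slashed field name may need adjusting):

```
labels.k8s-pod/app : "contact-data-deletion-me-push" and level : "ERROR"
```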
Here are some useful links which may or may not work:
- LaaS ERROR logs
- LaaS WARN logs
If they do not work, you should be able to use this link to access the LaaS UI and apply the filters described above manually:
- https://kibana.service.emarsys.net/s/google-cloud/app/discover#
Known errors
java.lang.OutOfMemoryError: Java heap space
This means that the job has run out of memory. The default memory settings for the job in k8s are 256 MB (request) and 1 GB (limit). In most cases, this should be enough. However, depending on the number of deletion queries running and the state of the BQ quota in the target project, our job may have to retry a lot of queries. We have observed in the past that an excessive number of retried queries caused the Java heap in this service to run out of memory.
In this case, k8s will retry the entire job, up to 3 times (configured in the gap.yaml file of the service).
THERE IS AN EXCEPTION FOR THIS:
contact-data-deletion-me-client-state-contact-update
contact-data-deletion-me-client-state-snapshots
These jobs are never automatically retried this way by k8s, as they are too expensive (~$2k for a successful run at the time of writing). If you see this error for these jobs, you should manually rerun the job using k9s or the GAP CLI, at a time when it is more likely to succeed. This is usually later in the afternoon on days when the bulk of the deletions run (the 5th and 20th of each month); rerunning it the next day is also fine.
Failed to execute backup for dataset xxx, deletion will not be executed.
This means that, for whatever reason, the backup for a targeted dataset has failed in this deletion job. The deletion will not be executed for this dataset, as we do not want to risk losing data. We can retry it manually; however, the job will continue with the deletion of the other datasets configured in it, so we first want to wait for the job to finish.
After the job is finished, we want to go to LaaS2 and find out which parts of the job (i.e. which datasets) failed the backup step, as there may be multiple by this point. We should then manually rerun the deletion only for the datasets that failed to back up, instead of the entire job, to avoid incurring unnecessary costs and load on BigQuery. (This does not apply to the client-state related jobs, as they each target only a single dataset.)
To do this, first find out which job was affected. The alert should contain the job’s name; for example, contact-data-deletion-me-push-gap-production means the contact-data-deletion-me-push job on production.
In LaaS2, find all the failed backups for this job and note down the datasetId field of each.
Find the cronjob(s) in k9s (:cronjobs and then /deletion). Alternatively you can find the job’s configuration in the CDDS-ME repo in gap/production/gap_patch_cron_job.yaml. You can find the targeted datasets in the dataset_delete_configs section/array of the job.
In k9s, find the cronjob and edit it. Remove all of the job’s datasets except the one(s) that failed the backup step, then save and exit. You can then manually retrigger the job. Once it has completed, revert the changes you made to the cronjob; the recommended way is to re-sync the service in ArgoCD, to avoid any mistakes.
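For reference, the same flow with plain kubectl would look roughly like this (a sketch; the namespace and names are illustrative):

```shell
# 1. Edit the cronjob and remove the healthy entries from
#    dataset_delete_configs, keeping only the dataset(s) that failed backup.
kubectl -n <namespace> edit cronjob contact-data-deletion-me-push
# 2. Trigger a one-off run from the edited template.
kubectl -n <namespace> create job \
  --from=cronjob/contact-data-deletion-me-push \
  contact-data-deletion-me-push-backup-retry
# 3. After it completes, revert by re-syncing the service in ArgoCD.
```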