# Database
The service uses PostgreSQL to store snapshots of segments as well as the calculated diffs. The service never stores any sensitive recipient data; only the internal contact ids are stored locally.
This document only contains an overview of the table structure; for details, please refer to the migrations. Changes to the database structure are automatically applied on [deployment](Deployment).
## Structure
### Registered Segments
All segments which have been queried through the service are available in the `segments` table. Among other things, this table contains the current version of each segment and when the segment was last used.
### Segment Snapshots
In order to calculate the differences of a segment over time, snapshots are stored locally. Each snapshot is stored in its own table, which consists of a single column containing the contact ids.
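As a rough sketch of what such a table looks like (the actual DDL lives in the migrations; the table name pattern and column type here are assumptions):

```sql
-- Hypothetical per-snapshot table: one row per contact that was in the
-- segment when the snapshot was taken. The name pattern is an assumption.
CREATE TABLE segment_snapshot_42_v7 (
    contact_id bigint NOT NULL
);
```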
## Clean Up Segments
It is not straightforward to clean up segments (e.g. from an inactive account) since multiple tables are involved, as explained above. The maintenance job, which runs daily, makes this easier:

- Set `last_used_at` in the `segments` table to more than 32 days into the past: `update segments set last_used_at = now() - interval '35 days' where customer_id = CUSTOMER_ID;`
- Wait till maintenance is triggered by Google Cloud, or run it manually
## Autovacuum
## Reindex
If indexes become bloated, it is good to occasionally reindex them. There are several ways to achieve this; see the Resources section for more details.
### `segment_diffs` table index
#### Using the REINDEX command
```sql
REINDEX INDEX segment_diffs_segment_id_version_contact_id_key;
```
Reindexing took 9 minutes on production with ~120 million rows in the table.
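To get a quick sense of how large the index is before and after reindexing, the standard PostgreSQL size functions can be used (a sketch; the index name is taken from the command above):

```sql
-- Reports the on-disk size of the index in a human-readable form.
SELECT pg_size_pretty(
    pg_relation_size('segment_diffs_segment_id_version_contact_id_key')
);
```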
The `REINDEX` command locks the table for writing and obtains an exclusive lock on the index, which will block reads that attempt to use that index.
Therefore:

- `REINDEX` should not be run while SDS maintenance or consumers are operating
- `sync_worker`s in `ems-audience-exchange` should be stopped while `REINDEX` is running
- `REINDEX` will block SELECT queries which try to fetch added and removed contacts (segment diffs) from the `segment_diffs` table
If this is not acceptable, the approach with concurrent indexes should be used instead. See Resources for more details.
#### Using CREATE INDEX CONCURRENTLY
The following steps are needed:

- Create the new index concurrently, i.e. no reads/writes/updates are blocked by the index creation
- Create a constraint from the new index ⇒ now the table has two constraints on the same columns
- Drop the old constraint
- Rename the new constraint to the old name
##### 2019-01-24
Bloat of the index for the `segment_diffs` table is ~9 GB:
```
$ heroku pg:bloat HEROKU_POSTGRESQL_YELLOW --app ems-segment-diff
 type  | schemaname |                          object_name                           | bloat |   waste
-------+------------+----------------------------------------------------------------+-------+------------
 index | public     | segment_diffs::segment_diffs_segment_id_version_contact_id_key |   2.0 |    9174 MB
 table | public     | segment_diffs                                                  |   1.0 |     345 MB
```
374,534,455 rows in the table (`select count(*) from segment_diffs`).
Turn off consumers so that the concurrent index creation can be faster.
```sql
create unique index concurrently segment_diffs_segment_id_version_contact_id_key_20190123
    on segment_diffs (segment_id, version, contact_id);

alter table segment_diffs
    add constraint segment_diffs_segment_id_version_contact_id_key_20190123
    unique using index segment_diffs_segment_id_version_contact_id_key_20190123;

alter table segment_diffs
    drop constraint segment_diffs_segment_id_version_contact_id_key;

alter table segment_diffs
    rename constraint segment_diffs_segment_id_version_contact_id_key_20190123
    to segment_diffs_segment_id_version_contact_id_key;
```
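If `CREATE INDEX CONCURRENTLY` fails partway through, it can leave an invalid index behind, so it is worth verifying the new index before dropping the old constraint (a sketch using the standard `pg_index` catalog):

```sql
-- indisvalid must be true before the old constraint is dropped
SELECT indexrelid::regclass AS index_name, indisvalid
FROM pg_index
WHERE indexrelid = 'segment_diffs_segment_id_version_contact_id_key_20190123'::regclass;
```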
Bloat of the index afterwards:
```
 type  | schemaname |                          object_name                           | bloat |   waste
-------+------------+----------------------------------------------------------------+-------+------------
 index | public     | segment_diffs::segment_diffs_segment_id_version_contact_id_key |   1.3 |    1086 MB
```
## Issues
- [SDS: REINDEX "segment_diffs" table](https://emarsys.jira.com/browse/EDREI-34984)
- [SDS alerts](https://emarsys.jira.com/browse/SUITEDEV-16160)