Audience Contact References Table Structure

The table structure used for audiences in In-app is very simple: a single large table containing the campaign_id, the contact_reference, and a list of RTI and AC program IDs.

This structure has worked quite well so far, but recently we’ve had new requirements related to customer deletion and to the 10x goal mentioned in the kickoff.

Objectives

The table structure must efficiently support the following:

  • 10x growth

  • Deletion of a single audience (campaign)

  • Deletion of all audiences (campaigns) of a specific customer

  • Querying a campaign membership given a contact reference (DES)
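
The last objective, the DES membership lookup, amounts to a query along the lines of the sketch below; the campaign IDs and contact reference value are illustrative:

```sql
-- Which of the given campaigns contain this contact?
SELECT campaign_id
FROM campaign_contact_reference
WHERE campaign_id = ANY(ARRAY[123, 456])   -- candidate campaigns (example IDs)
  AND contact_reference = 'cr:12345678';   -- example contact reference
```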

Current Status

Table name: campaign_contact_reference
Table size: 63 GB
Index size: 70 GB
Row count: 861,564,210
Dead rows: 53,493,181 (~6%)
Vacuum count: 0
Vacuum config: PostgreSQL default

Table stats query
SELECT
  *
FROM
  pg_stat_user_tables
WHERE
  relname = 'campaign_contact_reference';

SELECT
  pg_size_pretty(pg_table_size('campaign_contact_reference')) AS table_size,
  pg_size_pretty(pg_indexes_size('campaign_contact_reference')) AS indexes_size;

Proposals

1. One Big Table

Keep the current campaign_contact_reference table as it is, but configure the PostgreSQL autovacuum settings to more aggressive values, as was done in Segment Diff Service (see here). This is necessary to reduce the table bloat that will appear after regular audience deletions. It will also be necessary to recreate indexes when they become bloated (see here).

SDS segment_diffs table autovacuum settings
SELECT reloptions from pg_class WHERE relname = 'segment_diffs';

-- {autovacuum_vacuum_scale_factor=0.001,autovacuum_analyze_scale_factor=0.005}
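
Applying similarly aggressive settings to our table could be sketched as below; the exact thresholds would need tuning for our workload, and the index name is an assumption (REINDEX ... CONCURRENTLY requires PostgreSQL 12+):

```sql
-- Trigger autovacuum after ~0.1% of rows are dead instead of the 20% default
ALTER TABLE campaign_contact_reference SET (
  autovacuum_vacuum_scale_factor = 0.001,
  autovacuum_analyze_scale_factor = 0.005
);

-- Rebuild a bloated index without blocking reads/writes (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY campaign_contact_reference_pkey;  -- hypothetical index name
```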

Pros

  • Simple to start with

Cons

  • Potential recurring struggle with table/index bloat

  • Lower query performance in DES if table gets huge and index becomes bloated

2. Table Partitioning

Make use of PostgreSQL declarative table partitioning to split this large table into many smaller tables. We could decide to partition by customer ID or even more granular by campaign (audience) ID.

To migrate to this new table structure, we can follow one of the guides available online.

This might mean that we need to shut down workers which edit audiences (init-audience, add-contacts, remove-contacts), which would result in a small delay for audience updates, but we wouldn’t need a general database downtime.

Pros

  • Smaller indexes if partitioning by campaign ID - index by (contact_reference) instead of (campaign_id, contact_reference)

  • Less table/index bloat

  • Very easy deletion with no bloat on customer/campaign removal - detach and delete partition

  • Better query performance in DES - need to query only several partitions instead of whole table

Cons

  • Need to explicitly create each partition for each new customer/audience - might be automatically solved with PostgreSQL rules

  • Could lead to a higher DB memory usage if we end up with several thousand tables - should not happen if we delete old audiences regularly

  • Could be hard to change if we choose wrong partitioning level (customer/campaign)

  • Needs migration of data to partitioned tables

Table partitioning SQL code
-- Partition table by campaign ID
CREATE TABLE campaign_contact_reference (campaign_id INTEGER, contact_reference TEXT, ...) PARTITION BY LIST(campaign_id);

-- Create a specific partition when creating an audience (campaign launch)
CREATE TABLE campaign_contact_reference_123 PARTITION OF campaign_contact_reference FOR VALUES IN (123);

-- Or create a rule to automatically create a missing partition (see https://stackoverflow.com/a/53600145/1709131)
-- NB: in the linked answer the rule is defined on a separate entry table and inserts into the
-- partitioned master; a DO INSTEAD rule that re-inserts into the same table it is defined on would recurse
CREATE OR REPLACE RULE autoCall_createPartitionIfNotExists AS ON INSERT
TO campaign_contact_reference
DO INSTEAD (
  SELECT createPartitionIfNotExists(NEW.campaign_id);
  INSERT INTO campaign_contact_reference (campaign_id, contact_reference) VALUES (NEW.campaign_id, NEW.contact_reference);
);

-- Partition pruning when querying
SELECT
  campaign_id, ...
FROM
  campaign_contact_reference
WHERE
  campaign_id = ANY(ARRAY[123, 456, ...]) -- PostgreSQL can prune to only the matching partitions
  AND contact_reference = 'cr:12345678';
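
With campaign-level partitions, deleting a single audience, or all audiences of a customer, reduces to dropping partitions. A sketch, assuming the partition naming used above and a campaign table with a customer_id column:

```sql
-- Delete a single audience: detach first so the parent is locked only briefly,
-- then drop the partition (leaves no bloat behind)
ALTER TABLE campaign_contact_reference DETACH PARTITION campaign_contact_reference_123;
DROP TABLE campaign_contact_reference_123;

-- Delete all audiences of a customer (example customer_id 42) by dropping each campaign partition
DO $$
DECLARE
  cid INTEGER;
BEGIN
  FOR cid IN SELECT id FROM campaign WHERE customer_id = 42 LOOP
    EXECUTE format('ALTER TABLE campaign_contact_reference DETACH PARTITION campaign_contact_reference_%s', cid);
    EXECUTE format('DROP TABLE campaign_contact_reference_%s', cid);
  END LOOP;
END $$;
```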

Rollout

We want to roll out the change customer by customer, without downtime.

To achieve this, we will introduce flippers:

  1. One that, when active, tells the workers to start writing to the new table.

  2. One that, when active, tells the workers to stop writing to the old table.

To do this, we will:

  1. Create a new partitioned master table

  2. Create a rule which automatically creates a missing partition on insert

  3. Create a view which combines the new and old tables, to be used by DES

  4. Start writing data for selected customer(s) to both tables (existing one and a new one)

  5. Migrate data from old table to new table for selected customer(s)

  6. Validate that everything is ok

  7. Stop writing to old table for selected customer(s)

At this point, both reads and writes will be using the new partitioned table for selected customer(s).
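
The combined old+new view that DES reads from during the migration might be sketched as follows; the view name is an assumption, and plain UNION (rather than UNION ALL) deduplicates the rows that are dual-written to both tables:

```sql
-- DES reads from this view during the migration, seeing each customer's rows
-- regardless of which table they currently live in
CREATE VIEW campaign_contact_reference_combined AS  -- hypothetical name
  SELECT campaign_id, contact_reference, rti_program_ids, ac_program_ids
  FROM campaign_contact_reference
  UNION  -- deduplicates rows present in both tables during dual-write
  SELECT campaign_id, contact_reference, rti_program_ids, ac_program_ids
  FROM campaign_contact_reference_partition;
```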

Global

We’ll create a migration script which will move all audiences to the partitioned table. It could be as simple as the query below.

INSERT INTO campaign_contact_reference_partition (campaign_id, contact_reference, rti_program_ids, ac_program_ids)
  SELECT campaign_id, contact_reference, rti_program_ids, ac_program_ids
  FROM campaign_contact_reference
  WHERE campaign_id IN (SELECT id FROM campaign WHERE customer_id NOT IN (:already_migrated_customers));
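
Validating the migration (step 6 of the rollout) could start with a per-campaign row-count comparison between the two tables; a minimal sketch:

```sql
-- Any row returned indicates a count mismatch for that campaign
SELECT COALESCE(o.campaign_id, n.campaign_id) AS campaign_id,
       o.cnt AS old_count,
       n.cnt AS new_count
FROM (SELECT campaign_id, COUNT(*) AS cnt
      FROM campaign_contact_reference GROUP BY campaign_id) o
FULL OUTER JOIN
     (SELECT campaign_id, COUNT(*) AS cnt
      FROM campaign_contact_reference_partition GROUP BY campaign_id) n
  ON o.campaign_id = n.campaign_id
WHERE o.cnt IS DISTINCT FROM n.cnt;
```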
Cleanup

Once everything is done, we should clean up the DB schema:

  • Change the view to query only the new partitioned table

  • Remove the old table

Since the migration duplicates quite a bit of data, we might need to move to a new DB instance once it is fully done, to reduce the database size.