Audience Contact References Table Structure
The table structure used for audiences in In-app is a very simple one. We have a huge single table which contains the campaign_id, contact_reference, and a list of RTI and AC program-ids.
This structure has worked quite well so far. Recently we’ve had new requirements related to customer deletion, and the 10x goal mentioned in the kickoff.
Objectives
Table structure must efficiently support the following:
-
10x growth
-
Deletion of a single audience (campaign)
-
Deletion of all audiences (campaigns) of a specific customer
-
Querying a campaign membership given a contact reference (DES)
Current Status
Table name |
campaign_contact_reference |
|---|---|
Table size |
63 GB |
Index size |
70 GB |
Rows count |
861 564 210 |
Dead rows |
53 493 181 (6%) |
Vacuum count |
0 |
Vacuum config |
PostgreSQL default |
SELECT
*
FROM
pg_stat_user_tables
WHERE
relname = 'campaign_contact_reference';
SELECT
pg_size_pretty(pg_table_size('campaign_contact_reference')) AS table_size,
pg_size_pretty(pg_indexes_size('campaign_contact_reference')) AS indexes_size;
Proposals
1. One Big Table
Keep the current campaign_contact_reference table as it is, but configure PostgreSQL autovacuum settings to more
aggressive values, e.g. like it was done in Segment Diff Service
(see here). This will be necessary to reduce the
table bloat which will appear after regular audience deletions. Also, it will be necessary to recreate
indexes when they become bloated, too (see here).
SELECT reloptions from pg_class WHERE relname = 'segment_diffs';
-- {autovacuum_vacuum_scale_factor=0.001,autovacuum_analyze_scale_factor=0.005}
2. Table Partitioning
Make use of PostgreSQL declarative table partitioning to split this large table into many smaller tables. We could decide to partition by customer ID or even more granular by campaign (audience) ID.
To migrate to this new table structure we can follow one of guides available online:
This might mean that we need to shut down workers which edit audiences (init-audience, add-contacts, remove-contacts), which would result in a small delay for audience updates, but we wouldn’t need a general database downtime.
Pros
-
Smaller indexes if partitioning by campaign ID - index by
(contact_reference)instead of(campaign_id, contact_reference) -
Less table/index bloat
-
Very easy deletion with no bloat on customer/campaign removal - detach and delete partition
-
Better query performance in DES - need to query only several partitions instead of whole table
Cons
-
Need to explicitly create each partition for each new customer/audience - might be automatically solved with PostgreSQL rules
-
Could lead to a higher DB memory usage if we end up with several thousand tables - should not happen if we delete old audiences regularly
-
Could be hard to change if we choose wrong partitioning level (customer/campaign)
-
Needs migration of data to partitioned tables
-- Partition table by campaign ID
CREATE TABLE campaign_contact_reference (campaign_id INTEGER, contact_reference TEXT, ...) PARTITION BY LIST(campaign_id);
-- Create a specific partition when creating an audience (campaign launch)
CREATE TABLE campaign_contact_reference_123 OF campaign_contact_reference FOR VALUES IN (123);
-- Or create a rule to automatically create a missing partition (see https://stackoverflow.com/a/53600145/1709131)
CREATE OR REPLACE RULE autoCall_createPartitionIfNotExists AS ON INSERT
TO campaign_contact_reference
DO INSTEAD (
SELECT createPartitionIfNotExists(NEW.campaign_id);
INSERT INTO campaign_contact_reference (campaign_id, contact_reference) values (NEW.campaign_id, NEW.contact_reference);
);
-- Partition pruning when querying
SELECT
campaign_id, ...
FROM
campaign_contact_reference
WHERE
campaign_id = ANY([123, 456, ...]) -- PostgreSQL can pick only specific partitions
AND contact_reference = 'cr:12345678'
Rollout
We want to roll out the change customer by customer, without downtime.
To achieve this without downtime, we will introduce flippers:
-
One when active, tells worker to start writing to new table.
-
One when active, tells worker to stop writing to old table.
To do this, we will:
-
Create a view which combines both new and old table to be used by DES
-
Create a rule which automatically creates the partition if it does not exist on insert
-
Create a new partitioned master table
-
Start writing data for selected customer(s) to both tables (existing one and a new one)
-
Migrate data from old table to new table for selected customer(s)
-
Validate that everything is ok
-
Stop writing to old table for selected customer(s)
At this point, both reads and writes will be using the new partitioned table for selected customer(s).
Global
We’ll create a migration script which will move all audiences to the partioned table. It could be as simply as below.
insert into campaign_contact_reference_partition (campaign_id, contact_reference, rti_program_ids, ac_program_ids)
select campaign_id, contact_reference, rti_program_ids, ac_program_ids
from campaign_contact_reference
where campaign_id in (select id from campaign where customer_id not in (:already_migrated_customers));