Upgrading PostgreSQL
Using Google's Database Migration Service it is possible to set up streaming replication to an instance running a newer PostgreSQL version. All the required steps are documented in Google's article Upgrade Postgres with pglogical and Database Migration Service.
Using the migration service we create a new instance that continuously replicates the original DB; once we are done testing, we can simply promote the new instance and drop the old one.
Steps
- Create a new migration process.
- Ask the tech lead to create the destination instance with the same configuration as the current PG instance.
- The database should be named in this format: me-inapp-<staging|production>-pg<version>
Prepare Source DB
- Activate the required database flags:
  - cloudsql.logical_decoding
  - cloudsql.enable_pglogical
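To confirm the flags took effect after the instance restarts, you can run a quick spot check from a psql session (an optional check, not part of the original runbook; it assumes the logical-decoding flag maps to wal_level = logical, as documented for Cloud SQL):

```sql
-- wal_level must be 'logical' for pglogical replication to work
SHOW wal_level;

-- pglogical should now be available for CREATE EXTENSION
SELECT name, default_version
FROM pg_available_extensions
WHERE name = 'pglogical';
```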
- Create a new database user with the username migration; generate and store the password.
- Log in to the source database using the postgres user.
- Run the following code on the postgres database:
CREATE EXTENSION IF NOT EXISTS pglogical;
ALTER USER migration WITH REPLICATION;
GRANT USAGE ON SCHEMA pglogical TO migration;
GRANT USAGE ON SCHEMA public TO migration;
GRANT SELECT ON ALL TABLES IN SCHEMA pglogical TO migration;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO migration;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON SEQUENCES TO migration;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO migration;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO migration;
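As a sanity check (optional, not in the original article), you can verify that the migration user actually received the REPLICATION attribute:

```sql
-- rolreplication should be true for the migration user
SELECT rolname, rolreplication
FROM pg_roles
WHERE rolname = 'migration';
```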
Run the following code on the inapp database:
CREATE EXTENSION IF NOT EXISTS pglogical;
ALTER USER migration WITH REPLICATION;
GRANT USAGE ON SCHEMA pglogical TO migration;
GRANT SELECT ON ALL TABLES IN SCHEMA pglogical TO migration;
GRANT USAGE ON SCHEMA pglogical TO "me-inapp@ems-mobile-engage.iam";
GRANT USAGE ON SCHEMA public TO migration;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO migration;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON SEQUENCES TO migration;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO migration;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO migration;
GRANT USAGE ON SCHEMA audience_data TO migration;
GRANT SELECT ON ALL TABLES IN SCHEMA audience_data TO migration;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA audience_data TO migration;
ALTER DEFAULT PRIVILEGES IN SCHEMA audience_data GRANT SELECT ON TABLES TO migration;
ALTER DEFAULT PRIVILEGES IN SCHEMA audience_data GRANT SELECT ON SEQUENCES TO migration;
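To spot-check that the grants landed in all three schemas, a query like the following can be run as the postgres user (an optional verification, not part of the original steps):

```sql
-- Count of tables readable by the migration user, per schema
SELECT table_schema, count(*) AS tables_readable
FROM information_schema.role_table_grants
WHERE grantee = 'migration'
  AND privilege_type = 'SELECT'
GROUP BY table_schema;
```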
Make schema compatible
Drop the campaign_priority constraint, since it uses a deferrable index, which is not supported:
ALTER TABLE campaign DROP CONSTRAINT campaign_priority;
The auto-created knex_migrations_lock table has no primary key. If the DB instance has not been migrated before, you need to create a PK to avoid warnings:
ALTER TABLE knex_migrations_lock ADD COLUMN dbid BIGSERIAL PRIMARY KEY;
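Before starting the job it can be worth scanning for any other objects that would trip up the migration. This helper query (an assumption-laden addition, not from the original doc) lists tables without a primary key and any deferrable constraints:

```sql
-- Tables in user schemas without a primary key (pglogical cannot replicate these)
SELECT n.nspname AS schema, c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pglogical')
  AND NOT EXISTS (
    SELECT 1 FROM pg_constraint p
    WHERE p.conrelid = c.oid AND p.contype = 'p'
  );

-- Deferrable constraints (not supported during the migration)
SELECT conrelid::regclass AS table_name, conname
FROM pg_constraint
WHERE condeferrable;
```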
Start migration process
1. Create connection profile
In the Database Migration Service UI, create a new connection profile for the source database instance (Cloud SQL for PostgreSQL).
Use the migration user for the connection. The connection must be secured with the server-client encryption type. In the Cloud SQL UI for the source database instance, create a new client certificate, and download the certificate and private key. Also download the server certificate. Specify them in the connection profile settings.
2. Create migration job
In the Database Migration Service UI, create a new migration job. Use the previously created connection profile for the source database instance. For the target database instance, specify the target instance that you already created. For the connectivity method, use VPC peering. For the parallelism level, choose Maximum to speed up the migration.
Test the migration job to verify that all the prerequisites were met. Once the job is ready, create it, but do not start it yet!
Prepare destination DB
Staging
Add the new users in the Cloud Console.
Built-in
- bigquery - the password is stored in Secret Manager under the database-password-inapp-bigquery key. Either use that password or generate a new one and create a new version of the secret.
Grant permissions to the new users in the destination database using the postgres user.
Use the CLI tool from the me-validate-database-migration repository.
.env file
# Connection options for source and migrated databases (see README)
# ...
ROLES_MIGRATOR_USERS='[{"username": "me-inapp@ems-mobile-engage-staging.iam", "roleType": "write"}, {"username": "me-device-event@ems-mobile-engage-staging.iam", "roleType": "read"}, {"username": "bigquery", "roleType": "read"}]'
Production
Built-in
- bigquery - the password is stored in Secret Manager under the database-password-inapp-bigquery key. Either use that password or generate a new one and create a new version of the secret.
Grant permissions to the new users in the destination database using the postgres user.
Use the CLI tool from the me-validate-database-migration repository.
.env file
# Connection options for source and migrated databases (see README)
# ...
ROLES_MIGRATOR_USERS='[{"username": "me-inapp@ems-mobile-engage.iam", "roleType": "write"}, {"username": "me-device-event@ems-mobile-engage.iam", "roleType": "read"}, {"username": "bigquery", "roleType": "read"}]'
Make sure our custom pg-types are registered
Look up the OIDs used for our custom types using the following query
select typarray as oid, typname from pg_type where typname in ('test_contact', 'program_run');
Then update the config in the inapp-service to use the new OIDs:
- PG_PROGRAM_RUN_ARRAY_OIDS - a comma-separated list of OIDs; keep the old one in case you need to roll back.
- PG_TEST_CONTACT_ARRAY_OIDS - a comma-separated list of OIDs; keep the old one in case you need to roll back.
Wait for database to be in sync
When the migration job status in the Database Migration Service shows CDC, the initial transfer has finished and the job is continuously syncing data.
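While waiting, you can also gauge replication lag directly on the source by inspecting the replication slot that the migration created (an optional check, not in the original runbook; slot names vary per job):

```sql
-- Bytes of WAL the subscriber still has to consume; near 0 means in sync
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots;
```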
Switch to new instance
- Prepare the instance connection name for inapp-service as well as for device-event-service.
- Turn off the inapp workers to avoid DB writes.
- Turn off the inapp-web deployment.
- Validate the migration using the me-validate-database-migration tool:
  npm run start:validator
- Change PG_INSTANCE_CONNECTION_NAME in the inapp-service config.
- Switch the connection string in device-event-service.
- Promote the destination database to primary in the Database Migration Service.
Cleanup
- Remove the cloudsqlexternalsync user. Use the tool in me-validate-database-migration to transfer the ownership of database objects and remove the user:
  npm run start:ownership-migrator
- Add back the campaign_priority constraint:
  ALTER TABLE campaign ADD CONSTRAINT campaign_priority UNIQUE (customer_id, application_code, priority) DEFERRABLE INITIALLY DEFERRED;
- Configure backups for the new database.
- Enable HA by selecting "Multiple zones" under "Zonal availability".
- Import the new instance into me-infrastructure and remove the old one. See this pull request for an example of how to do this.
Make sure to turn off the ASYNC_AUDIENCE_CREATION_ENABLED flipper once the migration is done, and scale the audience-partition-creation worker back down to 0.
Limitations
pglogical does not support deferrable indexes
In in-app we use a single deferrable index, for the campaign_priority constraint. This constraint will be removed during the migration and added back once everything is done.
pglogical only migrates tables that have a primary key
Not a problem for inapp: all tables have primary keys except the knex_migrations_lock table, which is automatically created by knex.