Database Migration

Table of Contents

Adapt Project
Prepare Source Database
Create Destination Postgres Instance
- 1. me-infrastructure
Run Migration Job
Prepare Destination Database
- 1. Create a user
- 2. Configure user permissions
Verify the migration
- 1. Verify permissions
- 2. Verify data
Switchover
Cleanup

Adapt Project

If you migrate staging replace ems-mobile-engage with ems-mobile-engage-staging.

Prepare Source Database

1. Add database flags

Add following lines to me-infrastructure for gcp/resources/sql_instances.staging.yaml or gcp/resources/sql_instances.production.yaml. This will cause a downtime of about 30 seconds:

    - name: cloudsql.logical_decoding
      value: 'on'
    - name: cloudsql.enable_pglogical
      value: 'on'

2. Create migration user

Create a new database user with the username migration, generate and store the password.

3. Configure `postgres` system database

Login to the source database using the postgres user. If you do not know the password you have to change it. Run the following code on the postgres database:

\c postgres

CREATE EXTENSION IF NOT EXISTS pglogical;

ALTER USER migration with REPLICATION;

GRANT USAGE on SCHEMA pglogical to migration;
GRANT SELECT on ALL TABLES in SCHEMA pglogical to migration;

GRANT USAGE on SCHEMA public to migration;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO migration;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON SEQUENCES TO migration;
GRANT SELECT on ALL TABLES in SCHEMA public to migration;
GRANT SELECT on ALL SEQUENCES in SCHEMA public to migration;

4. Configure `segment-diff` database

Login to the source database using the me-segment-diff@ems-mobile-engage.iam user. Run the following code on the segment-diff database:

\c segment-diff

CREATE EXTENSION IF NOT EXISTS pglogical;

ALTER USER migration with REPLICATION;

GRANT USAGE on SCHEMA pglogical to migration;
GRANT SELECT on ALL TABLES in SCHEMA pglogical to migration;

GRANT USAGE on SCHEMA public to migration;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO migration;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON SEQUENCES TO migration;
GRANT SELECT on ALL TABLES in SCHEMA public to migration;
GRANT SELECT on ALL SEQUENCES in SCHEMA public to migration;

If the statement involving pglogical fails:

-- Run this as the 'postgres' user
GRANT cloudsqlsuperuser TO "me-segment-diff@ems-mobile-engage.iam";

5. Add primary key to `segment_diffs` table

In order to support CDC migration, all tables must have a primary key. Only table that does not use a primary key is segment_diffs table. Since this table is huge we need to add a primary key in a way that it does not block write operations, so it needs to be done in a few steps.

All the necessary steps are demonstrated in this DB fiddle. Some of the steps are quick to execute (seconds), while the others take a long time (10-60 minutes). Thus, these statements must not be run from a local machine, but must be executed by launching a pod in the GAP cluster.

Create a Kubernetes resources YAML file with the below content. Replace $INSTANCE_CONNECTION_NAME` and $DATABASE_USER with correct values based on the environment in which you are running the migration (staging/production). When running gcloud sql instances describe $DATABASE_NAME --project=$PROJECT the instance connection name is shown. For example:

$INSTANCE_CONNECTION_NAME = ems-gap-me-segment-diff-p:europe-west3:me-segment-diff-production-pg13 ()
$DATABASE_USER = me-segment-diff@ems-mobile-engage-staging.iam or me-segment-diff@ems-mobile-engage.iam

Create PR with the file at gap/production or gap/staging folder.

apiVersion: v1
kind: ConfigMap
metadata:
  name: me-segment-diff-pg-migration
  namespace: mobile-engage
data:
  query: |
    \echo Adding id column to segment_diffs table

    ALTER TABLE segment_diffs ADD COLUMN id BIGINT;
    CREATE SEQUENCE segment_diffs_id_seq;
    ALTER SEQUENCE segment_diffs_id_seq OWNED BY segment_diffs.id;
    ALTER TABLE segment_diffs ALTER COLUMN id SET DEFAULT nextval('segment_diffs_id_seq');

    \echo Added id column to segment_diffs table

    \echo Populating id column in segment_diffs table

    DO $do$
    DECLARE
    _current INTEGER := 0;
    _step INTEGER := 1000;
    _last INTEGER := (SELECT MAX(segment_id) FROM segment_diffs);
    BEGIN
    LOOP
        RAISE NOTICE 'Updating rows starting from segment_id %', _current;

        UPDATE segment_diffs
        SET
        id = nextval('segment_diffs_id_seq')
        WHERE
        segment_id BETWEEN _current AND (_current + _step)
        AND id IS NULL;

        EXIT WHEN NOT FOUND AND _current + _step > _last;

        COMMIT;

        PERFORM pg_sleep(10);
        _current := _current + _step;

    END LOOP;
    END
    $do$;

    \echo Populated id column in segment_diffs table

    \echo Creating unique index on id column in segment_diffs table

    CREATE UNIQUE INDEX CONCURRENTLY segment_diffs_id_key ON segment_diffs (id);

    \echo Created unique index on id column in segment_diffs table

    \echo Adding primary key to segment_diffs table

    ALTER TABLE segment_diffs ADD CONSTRAINT segment_diffs_pkey PRIMARY KEY USING INDEX segment_diffs_id_key;

    \echo Added primary key to segment_diffs table
---
apiVersion: v1
kind: Pod
metadata:
  name: me-segment-diff-pg-migration
  namespace: mobile-engage
spec:
  serviceAccountName: me-segment-diff
  restartPolicy: Never
  securityContext:
    runAsUser: 1000
  volumes:
    - configMap:
        items:
        - key: query
          path: query.sql
        name: me-segment-diff-pg-migration
      name: query
  containers:
    - command:
        - /cloud_sql_proxy
        - -enable_iam_login
        - -ip_address_types=PRIVATE
        - -instances=$INSTANCE_CONNECTION_NAME=tcp:5432,
        - -term_timeout=30s
        - -structured_logs
      image: gcr.io/cloudsql-docker/gce-proxy:1.30.1-alpine
      imagePullPolicy: IfNotPresent
      name: cloud-sql-proxy
      resources:
        limits:
          cpu: 500m
          memory: 500Mi
        requests:
          cpu: 20m
          memory: 50Mi
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
          - ALL
        privileged: false
    - command:
        - sh
        - -c
        - "sleep 5; psql -h 127.0.0.1 -U $DATABASE_USER -d segment-diff < /tmp/query.sql"
      image: postgres:13
      name: psql
      resources:
        limits:
          cpu: 500m
          memory: 500Mi
        requests:
          cpu: 20m
          memory: 50Mi
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
          - ALL
        privileged: false
      volumeMounts:
        - mountPath: /tmp
          name: query

Once finished remove the file from the repository.

6. Run maintenance job

It is important to run the maintenance job shortly before starting the migration job in order to reduce the source database size, which will speed up the migration job.

Use k9s CLI tool to manually trigger the maintenance job:

k9s -n mobile-engage
Open cron jobs page: :cronjobs
Search for the job: / me-segment-diff-maintenance
Trigger the job: t
Open pods page :pods
Search for the pod: / me-segment-diff-maintenance
Check the logs: l
Wait for the job to finish
Delete the pod: ctrl-d

Create Destination Postgres Instance

1. me-infrastructure

In me-infrastructure copy definition of source Postgres instance and adapt it. Files are located at gcp/resources/sql_instances.staging.yaml or gcp/resources/sql_instances.production.yaml.

Adapt the name
Adapt the Postgres version
Adapt the disk_gb to the usage of the source database
Remove flags cloudsql.logical_decoding and cloudsql.enable_pglogical because these were just added for the migration
Remove segment-diff database from the new instance (Database Migration Service expects an empty Postgres instance)
Remove postgres system user from the new instance if present (postgres system user is always present and cannot be deleted in GCP, so it is better to leave out from our IaC)

Run Migration Job

1. Create connection profile

In Database Migration Service UI, create a new connection profile for source database instance (Cloud SQL for PostgreSQL).

Use the migration user for connection. Connection must be secured with the server-client encryption type. In Cloud SQL UI for source database instance, create a new client certificate, and download the certificate and private key (Connections | Security tab). Also download the server certificate. Specify them in connection profile settings.

2. Create migration job

In Database Migration Service UI, create a new migration job. Use the previously created connection profile for the source database instance. For the target database instance, specify the target instance that you already created. For connectivity method use VPC-peering. For parallelism level choose Maximum, to speed up the migration.

Test the migration job to verify that all the prerequisites were met. Once the job is ready, create it, but do not start it yet!

3. Stop the workers

Please inform Web Channel team that the workers will be stopped and no updates to segment snapshots and diffs will be done in the next eight hours. It might take longer.

We must stop all the background workers to reduce the load on the database and to prevent workers from creating/dropping segment snapshot tables, which interferes with the migration process, since in the first pass all the tables are added to the replication set.

We can keep the web worker running, which will allow the SDS service to serve existing segment versions.

Use k9s CLI tool to stop all deployments for background workers (everything except the web deployment).

k9s -n mobile-engage
Open deployments page: :deployments
Search for the deployments: / me-segment-diff
Scale deployments to 0: s
Open pods page :pods
Wait until all pods except web are terminated

4. Start the migration job

Go to the Database Migration Service UI and start the previously created migration job.

Prepare Destination Database

While migration job is running, we can prepare the destination database.

1. Create a user

Double-check if user me-segment-diff@ems-mobile-engage.iam exists in the destination database. Otherwise create an me-segment-diff@ems-mobile-engage.iam IAM database user.

2. Configure user permissions

\c segment-diff

GRANT ALL PRIVILEGES ON DATABASE "segment-diff" TO "me-segment-diff@ems-mobile-engage.iam";

GRANT ALL PRIVILEGES ON SCHEMA public TO "me-segment-diff@ems-mobile-engage.iam";

ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL PRIVILEGES ON TABLES TO "me-segment-diff@ems-mobile-engage.iam";
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL PRIVILEGES ON SEQUENCES TO "me-segment-diff@ems-mobile-engage.iam";

GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public to "me-segment-diff@ems-mobile-engage.iam";
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public to "me-segment-diff@ems-mobile-engage.iam";

You can use permissions-migrator in me-validate-database-migration.

Verify the migration

Monitor the migration job in Database Migration Service UI. Once the initial dump is complete, and CDC migration starts, monitor the replication delay. It should be close to 0. Now is a good time to verify the data in the destination database, before proceeding with the switchover.

1. Verify permissions

\c segment-diff

-- List privileges for schemas
\dn
-- List default privileges for tables/sequences
\ddp
-- List privileges for tables/sequences
\dp

2. Verify data

-- Verify number of tables is the same in source/target database
SELECT COUNT(1) FROM pg_tables WHERE schemaname = 'public';

-- Verify number of rows is the same in source/target tables
SELECT COUNT(1) FROM segments;
SELECT COUNT(1) FROM segment_snapshots;
SELECT COUNT(1) FROM segment_diffs;

You can use me-validate-database-migration to verify the data. Note it takes a long time for prod because there are many tables.

Switchover

Once we are confident that the migration was successful, we can proceed with the switchover. To perform the switchover, we first need to stop all writes to the source database, wait for the target database to fully catch up with the source (replication delay must be 0), and then promote the target database as the new primary.

1. Stop the `web` worker

At this point, only web worker is running and making writes to the database. We need to stop it, which means that SDS service will be completely unavailable from this point.

Use k9s CLI tool to stop the web deployment:

k9s -n mobile-engage
Open deployments page: :deployments
Search for the web deployment: / me-segment-diff-web
Scale the deployment to 0: s
Open pods page :pods
Wait until all web pods are terminated

2. Wait for the replication delay to be 0

Monitor the replication delay in Database Migration Service UI. Wait for the line in the chart to reach 0.

3. Promote the migration job

In Database Migration Service UI, click Promote on the migration job. This will disconnect the source instance from the target instance, and the target instance will become the primary instance.

4. Enable Point-in-time recovery

In Cloud SQL UI enable the point-in-time recovery for the new database instance. This update takes about 10 minutes.

5. Update the connection string

Use gap-cli to update the database connection string for the me-segment-diff service, to refer to the new database instance.

6. Start the workers

Use k9s CLI tool to scale all the deployments back to the original number of replicas.

Cleanup

1. Remove `cloudsqlexternalsync` user

We want to remove the user which was used for the migration job. To do so, we first need to transfer the ownership of all the objects owned by this user to the me-segment-diff@ems-mobile-engage.iam user.

\c segment-diff
GRANT cloudsqlexternalsync TO "me-segment-diff@ems-mobile-engage.iam";

\c segment-diff
REASSIGN OWNED BY cloudsqlexternalsync TO "me-segment-diff@ems-mobile-engage.iam";

\c postgres
REASSIGN OWNED BY cloudsqlexternalsync TO "me-segment-diff@ems-mobile-engage.iam";

You can use ownership-migrator in me-validate-database-migration.

Remove the cloudsqlexternalsync user in Cloud SQL UI.

2. Remove primary key on `segment_diffs` table

ALTER TABLE segment_diffs DROP CONSTRAINT segment_diffs_pkey;
ALTER TABLE segment_diffs DROP COLUMN id;

These statements execute quickly and can be run from a local machine.

3. Delete Migration Job

In the console delete the migration job created for the migration.

4. Update me-infrastructure

Import database

Import the segment-diff database of the new Postgres instance into me-infrastructure, e.g.

import {
  to = module.sql_instances.module.sql_instance["me-segment-diff-production-pg18"].google_sql_database.database["segment-diff"]
  id = "ems-gap-me-segment-diff-p/me-segment-diff-production-pg18/segment-diff"
}

Also, add the segment-diff database to the sql instance definition (it was not yet added because Database Migration Service expects an empty Postgres instance).

After it is imported please remove the import again.

Remove old instance

Update old Postgres instance with deletion_protection: false in me-infrastructure, plan and apply
Remove old Postgres instance from me-infrastructure, plan and apply