BigQuery Non-Breaking Schema Changes

This document describes the manual process for making non-breaking schema changes to BigQuery tables in Mobile Engage projects on Google Cloud.

Workflow

1. Update table schema file used by Dataflow job

If we are altering the schema of a BigQuery table that is populated by a Dataflow job, we first need to update the table's schema file (a JSON file) in Google Cloud Storage. This file is used by the Dataflow job, and possibly by other workers, to create new customer-specific tables on the fly for new customers.

Before altering existing tables, we need to make sure that new tables will be created with the new schema. Only after that can we alter the old tables; otherwise, we might end up in a state where some new tables are created with the old schema.

Example PR which does both of these steps: https://github.com/emartech/me-infrastructure/pull/610

1.1. Change table JSON schema file in Google Cloud Storage

In the me-infrastructure repository, in the gcp/resources/storage_files.common folder, identify the JSON schema file of a table which needs to be updated. Update the schema and commit the changes.
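The schema files follow the BigQuery JSON schema format (see the example in section 2.2). Adding a new repeated STRING column, for instance, means appending an object like this to the file's JSON array (the column name and description here are illustrative):

```json
{
  "name": "my_string_array",
  "type": "STRING",
  "mode": "REPEATED",
  "description": "Illustrative new column; REPEATED mode makes it an ARRAY<STRING>"
}
```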

1.2. Redeploy Dataflow job

The Dataflow job reads and caches the JSON schema definitions for BigQuery tables. Therefore, after uploading the new schema file, we need to redeploy the job so it picks up the changes.

To redeploy the Dataflow job, bump the version number of the job in the me-infrastructure repository in the gcp/resources/dataflow_jobs.common.yaml file, as part of the same PR as the table schema file change.
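The version bump itself is a one-line change. A hypothetical sketch of what it might look like in dataflow_jobs.common.yaml (the field names and structure here are illustrative; check the actual file in the repository):

```yaml
dataflow_jobs:
  my_dataflow_job:      # illustrative job name
    version: 1.2.4      # bumped from 1.2.3 to force a redeploy
```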

2. Update existing tables' schemas

Use the following script(s) to update the schemas of existing tables in BigQuery.

2.1. Add a new column to all tables in a dataset

The script first finds all the table IDs in the dataset, and then goes through all of those tables and adds a column to each table.

Update the <PROJECT_ID> and <DATASET_ID> placeholders, as well as the column name and column type definition, to suit your needs.

Also, if necessary, you can further filter the table IDs selected from the __TABLES_SUMMARY__ view by uncommenting and adjusting the WHERE clause.

DECLARE i INT64 DEFAULT 0;
DECLARE table_ids ARRAY<STRING>;

-- Select all tables that need to be updated
SET table_ids = (
  SELECT
    ARRAY_AGG(DISTINCT table_id)
  FROM
    `<PROJECT_ID>.<DATASET_ID>.__TABLES_SUMMARY__`
  -- WHERE
  --   table_id LIKE '%something%'
);

-- Update each table
WHILE i < ARRAY_LENGTH(table_ids) DO
  EXECUTE IMMEDIATE
    FORMAT(
      '''
        ALTER TABLE
          `<PROJECT_ID>.<DATASET_ID>.%s`
        ADD COLUMN IF NOT EXISTS
          my_string_array ARRAY<STRING>; -- Update column name and definition
      ''',
      table_ids[OFFSET(i)]
    );

  SET i = i + 1;
END WHILE;
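After the script finishes, you can verify that the column was added everywhere by querying INFORMATION_SCHEMA (this assumes the my_string_array column name from the script above; any table missing from the result was not updated):

```sql
SELECT table_name
FROM `<PROJECT_ID>.<DATASET_ID>.INFORMATION_SCHEMA.COLUMNS`
WHERE column_name = 'my_string_array'
ORDER BY table_name;
```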

2.2. Modify an existing column in all tables in a dataset

BigQuery does not allow these kinds of modifications to existing columns with SQL; instead, we have to use the BigQuery web UI or the CLI. The script below uses the CLI, so having the bq command-line tool installed and configured is a prerequisite.

Update the PROJECT_ID, DATASET_ID, and SCHEMA_FILE variables in the script below to suit your needs. The SCHEMA_FILE should point to a local JSON file which contains the updated schema for the tables.

Also note the --max_results=5000 flag on the bq ls command. This avoids hitting the command's default limit of 1000 returned tables. If the dataset ever grows beyond 5000 tables, increase this value accordingly.

The script below will update ALL tables in the specified dataset. Currently only the errors table is excluded; if there are other tables that should not be updated, add additional filtering logic.

#!/bin/bash
PROJECT_ID="<PROJECT_ID>"
DATASET_ID="<DATASET_ID>"
SCHEMA_FILE="/Users/inumber/path/to/your/schema/your_schema.json"

for table in $(bq ls --max_results=5000 "$PROJECT_ID:$DATASET_ID" | awk 'NR>2 {print $1}'); do
  if [[ $table == "errors" ]]; then
    echo "Skipping: $table"
    continue
  fi
  echo "Updating: $table"
  bq update "$PROJECT_ID:$DATASET_ID.$table" "$SCHEMA_FILE"
done

Make sure the schema file contains the full schema of the table, not just the changed column. It has to be a JSON array of objects, where each object represents a column in the table with its name and type (and mode, if necessary). Here is an example of what the file should look like:

[
  {
    "name": "customer_id",
    "type": "INT64",
    "mode": "REQUIRED",
    "description": "The customer unique identifier"
  },
  {
    "name": "campaign_id",
    "type": "INT64",
    "mode": "REQUIRED",
    "description": "The campaign unique identifier per customer"
  },
  {
    "name": "contact_id",
    "type": "INT64",
    "mode": "REQUIRED",
    "description": "The contact unique identifier"
  },
  {
    "name": "domain_code",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "The domain unique code"
  },
  ...
]

You should now be able to run the script locally to update all tables in the specified dataset with the new schema. Consider doing a dry run first, by prefixing the bq update command with echo or by commenting it out.
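A minimal dry-run sketch of the loop, using a hard-coded table list in place of the bq ls output so it runs without BigQuery access (the table names are illustrative):

```shell
#!/bin/bash
# Dry-run sketch: iterate a sample table list and print the commands
# the real script would execute, without calling bq.
SCHEMA_FILE="schema.json"
DRY_RUN_LOG=""
for table in client_updates errors client_opens; do
  if [[ $table == "errors" ]]; then
    echo "Skipping: $table"
    continue
  fi
  echo "Would run: bq update PROJECT_ID:DATASET_ID.$table $SCHEMA_FILE"
  DRY_RUN_LOG="$DRY_RUN_LOG $table"  # record which tables would be touched
done
```

Once the printed commands look right, swap the hard-coded list back for the bq ls invocation and drop the echo prefix.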

Automating the process

It would be possible to write a tool that performs these actions instead of requiring us to execute them manually. Such a tool would have to accept the parameters from the table below, all of which are required.

The Google Cloud Storage file path of the table's JSON schema can be derived from the environment, dataset ID, and table ID parameters by maintaining a mapping inside the tool.

Parameter      Values                 Example
---------      ------                 -------
Environment    staging, production    staging
Dataset ID     STRING                 client_state
Table ID       STRING                 client_updates
Column name    STRING                 my_string_array
Column type    STRING                 ARRAY<STRING>
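A hypothetical invocation of such a tool, using the example values from the table above (the tool name and flags do not exist yet; this only sketches the interface):

```
schema-tool add-column \
  --environment staging \
  --dataset-id client_state \
  --table-id client_updates \
  --column-name my_string_array \
  --column-type 'ARRAY<STRING>'
```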