GCP Dataplex Universal Catalog for Data Management

Overview

Google Cloud Dataplex Universal Catalog is the next-generation metadata management and data discovery platform that replaces the legacy Google Cloud Data Catalog. It provides unified metadata management across BigQuery datasets/tables, Pub/Sub topics and subscriptions, and Dataflow jobs, enabling comprehensive data discovery, governance, and lineage tracking across your entire data ecosystem.

Metadata Management and Catalog Searching

Discovery and Search Capabilities

The Dataplex Universal Catalog enables team members to easily discover and search for data resources across the organization:

  • Unified Search Interface: Search across all data assets including BigQuery datasets/tables, Pub/Sub topics, and Dataflow jobs from a single interface

  • Rich Metadata Display: View comprehensive metadata including schema information, data quality metrics, and business context

  • Tag-based Organization: Organize resources using custom tags for team ownership, data classification, and usage patterns

  • Advanced Filtering: Filter search results by resource type, project, dataset, team ownership, or custom attributes
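Search filters like these can be expressed in the catalog's query syntax. Below is a minimal sketch that composes such a query string, assuming the Data Catalog search qualifiers (`type=`, `system=`, `tag:key=value`), which Dataplex Universal Catalog search also accepts; the exact qualifier names should be checked against the search reference.

```python
def build_search_query(resource_type=None, system=None, tags=None, keywords=""):
    """Compose a catalog search query string.

    Assumes the Data Catalog search syntax (`type=`, `system=`,
    `tag:key=value`); verify the qualifiers against the official
    search reference before relying on them.
    """
    parts = []
    if resource_type:
        parts.append(f"type={resource_type}")
    if system:
        parts.append(f"system={system}")
    for key, value in (tags or {}).items():
        parts.append(f"tag:{key}={value}")
    if keywords:
        parts.append(keywords)
    return " ".join(parts)

# Example: find BigQuery tables owned by the me-push team that contain PII.
query = build_search_query(
    resource_type="table",
    system="bigquery",
    tags={"team_owner": "me-push", "contains_pii": "true"},
)
```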

Essential Metadata Attributes

To enable effective data governance and discovery, the following metadata should be consistently applied to all data resources:

  • Data Owner: Identify the responsible team (me-push, me-inapp)

  • PII Classification: Mark resources containing personally identifiable information

    • contains-pii: Boolean flag for PII presence

    • pii-level: Classification level (high, medium, low)

  • Data Sensitivity: Classification of data sensitivity (public, internal, confidential, restricted)

  • Data Type: Classification of data processing level

    • real-time: Raw, unprocessed streaming data

    • derived: Calculated, transformed or aggregated datasets

  • Retention Policy: Data retention period and cleanup schedules

  • GDPR Cleanup: Whether data is affected by periodic GDPR cleanup scripts

  • Data Freshness: Update frequency and last refresh timestamp

  • Purpose: Business purpose and use cases for the data

  • SLO: Service level objectives and availability requirements

  • Dependencies: Upstream and downstream data dependencies

  • Documentation Links: References to detailed documentation and runbooks
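A simple compliance check can enforce the attributes above before resources are registered. The sketch below validates a tag dictionary against this document's attribute names and allowed values; the keys and value sets mirror this page, not any GCP API.

```python
# Hypothetical validator for the metadata attributes listed above; the
# attribute keys and allowed values come from this document, not a GCP API.
ALLOWED = {
    "data-owner": {"me-push", "me-inapp"},
    "pii-level": {"high", "medium", "low"},
    "data-sensitivity": {"public", "internal", "confidential", "restricted"},
    "data-type": {"real-time", "derived"},
}
REQUIRED = {"data-owner", "contains-pii", "data-sensitivity", "data-type"}

def validate_metadata(tags: dict) -> list:
    """Return a list of problems; an empty list means the tags are compliant."""
    problems = [f"missing required attribute: {key}"
                for key in sorted(REQUIRED - tags.keys())]
    for key, allowed in ALLOWED.items():
        if key in tags and tags[key] not in allowed:
            problems.append(f"invalid value for {key}: {tags[key]!r}")
    # pii-level is only mandatory when the resource actually contains PII
    if tags.get("contains-pii") is True and "pii-level" not in tags:
        problems.append("contains-pii is true but pii-level is not set")
    return problems
```

A check like this could run in CI on the me-infrastructure repository so that non-compliant resources fail the build instead of landing untagged.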

Implementation

Implement automated tagging for each supported GCP resource by defining it in the me-infrastructure repository using Terraform, for example:

# Example tag template for BigQuery datasets
bigquery_dataset_tags:
  team_owner: "${var.team_name}"
  data_classification: "${var.sensitivity_level}"
  contains_pii: "${var.has_pii}"
  gdpr_scope: "${var.gdpr_applicable}"
  data_type: "real-time"
  retention_days: "${var.retention_period}"

Data Lineage

Data Lineage Support

Dataplex Universal Catalog provides different levels of lineage support across GCP services:

BigQuery

BigQuery has the most complete lineage support of any product in Dataplex Universal Catalog, including column-level lineage, cross-project tracking, real-time updates, and historical lineage information. Lineage is captured automatically once the Data Lineage API is enabled in the project, and scheduled queries are supported out of the box.

Dataflow

Data Lineage for Dataflow is supported, but must be enabled via an experimental flag on a per-job basis. Both Pub/Sub and BigQuery sources/sinks are fully supported.

Pub/Sub

Data Lineage for Pub/Sub is not yet supported: there is no lineage connection between a Pub/Sub topic and its subscriptions. This can be implemented manually using the Data Lineage API, by creating a Pub/Sub process and sending Data Lineage events for each message published or consumed. Events can also be sampled for cost/performance optimization.

A Data Lineage process and run can be created via the me-infrastructure repository using Terraform's local-exec provisioner, since these resources are not natively supported by Terraform. The Data Lineage events themselves could be implemented in the @emartech/pubsub-client library.
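The per-message event emission described above can be sketched as follows. Sampling is made deterministic on the message ID so that redeliveries of the same message make the same decision; the 1% default rate and the `pubsub:` FQN scheme are assumptions (Pub/Sub has no official lineage integration yet), while the payload fields follow the Data Lineage API's `lineageEvents` resource as I understand it.

```python
import hashlib
from datetime import datetime, timezone

def should_sample(message_id: str, rate: float = 0.01) -> bool:
    """Deterministic sampling: hash the message ID so redeliveries of the
    same message always make the same decision. The 1% rate is an assumption."""
    digest = hashlib.sha256(message_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2**32 < rate

def lineage_event(topic_fqn: str, subscription_fqn: str) -> dict:
    """Build a LineageEvent body linking a topic to a subscription.

    The `pubsub:` FQN scheme is a custom convention, since Pub/Sub has no
    official lineage integration yet; the field names follow the Data
    Lineage API's lineageEvents resource."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "links": [{"source": {"fullyQualifiedName": topic_fqn},
                   "target": {"fullyQualifiedName": subscription_fqn}}],
        "startTime": now,
        "endTime": now,
    }
```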

I have also opened tickets with the GCP team to support this natively in GCP.

GAP workers

Data Lineage for GAP workers could be implemented with the Data Lineage API by creating a Data Lineage process/run for each GAP deployment (e.g. on startup), then sending Data Lineage events when publishing or consuming Pub/Sub messages, or when reading from or writing to BigQuery tables.
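The process-per-worker, run-per-deployment scheme above amounts to constructing Data Lineage resource names. The sketch below builds them following the API's `projects/{project}/locations/{location}/processes/{process}/runs/{run}` pattern; deriving the process ID from a slug of the worker name is a simplifying assumption (the API can also assign IDs on creation).

```python
import re

def process_name(project: str, location: str, worker: str) -> str:
    """Lineage process resource name for a GAP worker.

    Deriving the process ID from a slug of the worker name is an
    assumption for illustration; the Data Lineage API can also assign
    process IDs on creation."""
    slug = re.sub(r"[^a-z0-9-]", "-", worker.lower())
    return f"projects/{project}/locations/{location}/processes/{slug}"

def run_name(process: str, deployment_id: str) -> str:
    """One lineage run per GAP deployment, e.g. created on startup."""
    return f"{process}/runs/{deployment_id}"
```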

End-to-End Lineage Visualization

The complete data lineage flow would track:

  1. Worker in GAP

  2. Pub/Sub topic

  3. Pub/Sub subscription

  4. Dataflow job

  5. BigQuery dataset/table

  6. Scheduled BigQuery job
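The six stages above can be expressed as a chain of lineage links, each consecutive pair becoming one source-to-target link. In this sketch the project, topic, and table names are hypothetical, and only the `bigquery:` FQN scheme follows the documented Data Lineage convention; the `custom:`, `pubsub:`, and `dataflow:` schemes are assumed conventions for the resources without native lineage support.

```python
# Hypothetical FQNs for the six stages of the flow; only the bigquery:
# scheme follows the documented Data Lineage convention, the rest are
# assumed custom schemes.
flow = [
    "custom:gap/push-worker",                     # 1. worker in GAP
    "pubsub:projects/p/topics/events",            # 2. Pub/Sub topic
    "pubsub:projects/p/subscriptions/events-df",  # 3. Pub/Sub subscription
    "dataflow:projects/p/jobs/events-to-bq",      # 4. Dataflow job
    "bigquery:p.analytics.events_raw",            # 5. BigQuery dataset/table
    "bigquery:p.analytics.events_daily",          # 6. scheduled query output
]

# Each consecutive pair becomes one lineage link (source -> target):
links = list(zip(flow, flow[1:]))
```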