GCP Dataplex Universal Catalog for Data Management
Overview
Google Cloud Dataplex Universal Catalog is the next-generation metadata management and data discovery platform that replaces the legacy Google Cloud Data Catalog. It provides unified metadata management across BigQuery datasets/tables, Pub/Sub topics and subscriptions, and Dataflow jobs, enabling comprehensive data discovery, governance, and lineage tracking across your entire data ecosystem.
Metadata Management and Catalog Searching
Discovery and Search Capabilities
The Dataplex Universal Catalog enables team members to easily discover and search for data resources across the organization:
- Unified Search Interface: Search across all data assets, including BigQuery datasets/tables, Pub/Sub topics, and Dataflow jobs, from a single interface
- Rich Metadata Display: View comprehensive metadata, including schema information, data quality metrics, and business context
- Tag-based Organization: Organize resources using custom tags for team ownership, data classification, and usage patterns
- Advanced Filtering: Filter search results by resource type, project, dataset, team ownership, or custom attributes
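Combining these filters usually amounts to composing a search query string. Below is a minimal sketch of such a helper; the predicate syntax (`system=`, `type=`, `tag:`) follows the catalog search query language, but `build_search_query` itself and the `team_owner` tag field are illustrative assumptions, not part of any SDK.

```python
# Sketch: compose a catalog search query from common filters.
# The helper is hypothetical; only the predicate syntax is taken
# from the catalog's search query language.

def build_search_query(system=None, resource_type=None,
                       team=None, free_text=None):
    """Join individual search predicates into a single query string."""
    predicates = []
    if system:
        predicates.append(f"system={system}")
    if resource_type:
        predicates.append(f"type={resource_type}")
    if team:
        # Assumes a custom 'team_owner' tag field on the resource.
        predicates.append(f"tag:team_owner:{team}")
    if free_text:
        predicates.append(free_text)
    return " ".join(predicates)

query = build_search_query(system="bigquery", resource_type="table",
                           team="me-push", free_text="events")
print(query)  # system=bigquery type=table tag:team_owner:me-push events
```

The resulting string can be pasted into the catalog search box or passed to the search API.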
Essential Metadata Attributes
To enable effective data governance and discovery, the following metadata should be consistently applied to all data resources:
- Data Owner: Identify the responsible team (me-push, me-inapp)
- PII Classification: Mark resources containing personally identifiable information
  - contains-pii: Boolean flag for PII presence
  - pii-level: Classification level (high, medium, low)
- Data Sensitivity: Classification of data sensitivity (public, internal, confidential, restricted)
- Data Type: Classification of data processing level
  - real-time: Raw, unprocessed streaming data
  - derived: Calculated, transformed, or aggregated datasets
- Retention Policy: Data retention period and cleanup schedules
- GDPR Cleanup: Whether data is affected by periodic GDPR cleanup scripts
- Data Freshness: Update frequency and last refresh timestamp
- Purpose: Business purpose and use cases for the data
- SLO: Service level objectives and availability requirements
- Dependencies: Upstream and downstream data dependencies
- Documentation Links: References to detailed documentation and runbooks
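A lightweight validation step can enforce these attributes before a resource is registered. The sketch below mirrors the attribute names and allowed values listed above (normalized to underscore form, as in the Terraform example in the next section); the validator itself is a hypothetical helper, not a Dataplex API.

```python
# Sketch: validate a resource's tags against the essential metadata
# attributes listed in this document. Names/values mirror the document;
# the validator is illustrative, not part of any GCP SDK.

ALLOWED = {
    "data_owner": {"me-push", "me-inapp"},
    "pii_level": {"high", "medium", "low"},
    "data_sensitivity": {"public", "internal", "confidential", "restricted"},
    "data_type": {"real-time", "derived"},
}

REQUIRED = ["data_owner", "contains_pii", "data_sensitivity", "data_type"]

def validate_metadata(tags):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing attribute: {key}" for key in REQUIRED if key not in tags]
    for key, allowed in ALLOWED.items():
        if key in tags and tags[key] not in allowed:
            problems.append(f"invalid value for {key}: {tags[key]!r}")
    # PII classification requires a level once the flag is set.
    if tags.get("contains_pii") is True and "pii_level" not in tags:
        problems.append("contains_pii is true but pii_level is not set")
    return problems

tags = {"data_owner": "me-push", "contains_pii": True,
        "data_sensitivity": "internal", "data_type": "derived"}
print(validate_metadata(tags))  # ['contains_pii is true but pii_level is not set']
```

Such a check could run in CI against the Terraform-defined tags so that incomplete metadata never reaches the catalog.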
Implementation
Implement automated tagging for each supported GCP resource by defining it in the me-infrastructure repository using Terraform, e.g.:
```yaml
# Example tag template for BigQuery datasets
bigquery_dataset_tags:
  team_owner: "${var.team_name}"
  data_classification: "${var.sensitivity_level}"
  contains_pii: "${var.has_pii}"
  gdpr_scope: "${var.gdpr_applicable}"
  data_type: "real-time"
  retention_days: "${var.retention_period}"
```
Data Lineage
Data Lineage Support
Dataplex Universal Catalog provides different levels of lineage support across GCP services:
BigQuery
BigQuery is by far the best-supported product in Dataplex Universal Catalog, including column-level lineage, cross-project tracking, real-time updates, and historical lineage information. Lineage is captured by default once the Data Lineage API is enabled in the project. Scheduled queries are also supported out of the box.
Dataflow
Data Lineage for Dataflow is supported but must be enabled via an experimental flag on a job-by-job basis. Both Pub/Sub and BigQuery sources/sinks are fully supported.
Pub/Sub
Data Lineage for Pub/Sub is not supported yet: there is no lineage connection between a Pub/Sub topic and its subscriptions. This can be implemented manually using the Data Lineage API, by creating a Pub/Sub process and sending Data Lineage events for each Pub/Sub message published or consumed. Lineage events can also be sampled for cost/performance optimization.
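The sampling mentioned above can be done deterministically so that the decision is stable per message. A minimal sketch, assuming messages carry a string ID; the helper name and the 1% default rate are illustrative:

```python
# Sketch: hash-based sampling of Data Lineage events, so only a fraction
# of Pub/Sub messages produce a lineage event. Deterministic per message ID.
import hashlib

def should_emit_lineage_event(message_id: str, sample_rate: float = 0.01) -> bool:
    """Emit an event for roughly `sample_rate` of messages, deterministically."""
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

sampled = sum(should_emit_lineage_event(str(i), 0.1) for i in range(10_000))
print(sampled)  # roughly 1000 of the 10000 messages are sampled
```

Hash-based sampling (as opposed to random sampling) means retries of the same message make the same decision, which avoids duplicate lineage events.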
The Data Lineage process and run can be created via the me-infrastructure repository using Terraform with a local-exec provisioner, since these resources are not natively supported by Terraform. The Data Lineage events themselves could be implemented in the @emartech/pubsub-client library.
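To make the local-exec approach concrete, the sketch below builds the REST payloads that a provisioner could POST with curl to datalineage.googleapis.com. The v1 endpoint path for processes is taken from the Data Lineage API; the fully-qualified-name scheme used for the Pub/Sub entities, the display-name convention, and the example project/topic names are assumptions for illustration only.

```python
# Sketch: build Data Lineage API request payloads (plain dicts) linking a
# Pub/Sub topic to one of its subscriptions. No network calls are made here;
# FQN scheme and naming are assumptions -- verify against the API reference.

def lineage_payloads(project, location, topic, subscription):
    parent = f"projects/{project}/locations/{location}"
    process = {
        "displayName": f"pubsub-delivery-{topic}",  # hypothetical naming convention
    }
    event = {
        "links": [{
            # Assumed FQN scheme for Pub/Sub entities.
            "source": {"fullyQualifiedName": f"pubsub:{project}.{topic}"},
            "target": {"fullyQualifiedName": f"pubsub:{project}.{subscription}"},
        }],
    }
    endpoints = {
        "create_process": f"https://datalineage.googleapis.com/v1/{parent}/processes",
    }
    return process, event, endpoints

process, event, endpoints = lineage_payloads("me-prod", "europe-west1",
                                             "user-events", "user-events-push")
print(endpoints["create_process"])
```

A local-exec provisioner would serialize these dicts to JSON and POST them with an OAuth token (e.g. from `gcloud auth print-access-token`), while per-message events would be emitted from the client library at publish/consume time.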
However, I have opened tickets with the GCP team to request native support for this: