# Redis/Memorystore Alerts
This section covers handling alerts and issues related to Google Cloud Memorystore for Redis instances.
## Delivery Redis High System Memory Usage
This alert monitors the delivery Redis instance used for idempotency/aborted campaigns.
### OOM (Out of Memory) Error

Error message:

```
-OOM command not allowed under OOM prevention.
```
Root cause:
This error originates from the System Memory protection mechanism within Memorystore, not from the standard Redis key limit. The instance is physically out of RAM, likely due to fragmentation caused by many small keys. This condition blocks all write operations to prevent a server crash.
Standard eviction processes cannot function because the system is too saturated to process the necessary background tasks required to free memory.
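To confirm that fragmentation is the culprit, the ratio Redis itself reports can be checked. A minimal sketch, assuming shell access from a VM in the same VPC and a `REDIS_HOST` variable holding the instance's private IP; the 1.5 cutoff is an illustrative assumption, not a documented Memorystore limit:

```shell
#!/bin/sh
# Hypothetical helper: flag a fragmentation ratio above a chosen cutoff.
# (1.5 is an illustrative threshold, not a documented limit.)
frag_is_high() {
  awk -v r="$1" 'BEGIN { exit !(r > 1.5) }'
}

# On a VM in the same VPC (Memorystore has no public endpoint), the ratio
# comes from INFO memory, e.g.:
#   redis-cli -h "$REDIS_HOST" INFO memory | tr -d '\r' \
#     | awk -F: '$1 == "mem_fragmentation_ratio" { print $2 }'
if frag_is_high "1.8"; then
  echo "fragmentation high: RSS well above logical usage"
fi
```

A ratio well above 1.0 means the resident set size greatly exceeds logical usage, consistent with many small keys fragmenting the allocator.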
### Quick Recovery: Manual Failover (Standard Tier only)

For Standard Tier instances, the fastest recovery method (~30 seconds) is a manual failover using `force-data-loss` mode. This bypasses data synchronization checks.
Command:

```shell
gcloud redis instances failover INSTANCE_NAME --region=REGION --data-protection-mode=force-data-loss
```
Using `force-data-loss` mode does not check the offset delta between primary and replicas before initiating failover. You can potentially lose more than 30MB of data changes.
For minimal data loss (if bytes pending replication < 30MB):

```shell
gcloud redis instances failover INSTANCE_NAME --region=REGION --data-protection-mode=limited-data-loss
```
Memorystore instances do not have a "Restart" button, and standard Redis commands like SHUTDOWN are disabled. Manual failover is the fastest way to recover from an OOM condition on Standard Tier instances.
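The "bytes pending replication" check above can be scripted from `INFO replication` output. A sketch assuming the same in-VPC access; `offset_delta_ok` is a hypothetical helper and the sample offsets are placeholders:

```shell
#!/bin/sh
# Check whether bytes pending replication are under the 30 MB threshold
# described above, i.e. whether limited-data-loss failover is viable.
LIMIT=$((30 * 1024 * 1024))

# $1 = master_repl_offset, $2 = replica offset (both from INFO replication)
offset_delta_ok() {
  [ $(( $1 - $2 )) -lt "$LIMIT" ]
}

# On a VM in the same VPC, the two offsets come from:
#   redis-cli -h "$REDIS_HOST" INFO replication | tr -d '\r'
# (master_repl_offset on its own line, offset=... inside the slave0 line).
# The values below are hypothetical:
if offset_delta_ok 123456789 123400000; then
  echo "limited-data-loss failover viable"
fi
```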
### Important: BullMQ Considerations

Some Redis instances are used by BullMQ for job queues. Services like `scheduler` and `msg-generator` use BullMQ to manage job processing.
Before performing any failover with force-data-loss mode, a DevOps engineer should:
- Identify what the Redis instance is used for - Check if it stores critical job queue data
- Assess the impact - Understand what jobs/data could be lost
- Coordinate with the team - Ensure stakeholders are aware of potential data loss
- Consider alternatives - If job data is critical, wait for `limited-data-loss` mode to become viable or explore other recovery options
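Identifying whether the instance holds job queue data can be partly automated by scanning for BullMQ's default key prefix. A sketch; `REDIS_HOST` and the queue name are placeholders, and `bull_keys` is a hypothetical helper for filtering an already-saved key dump:

```shell
#!/bin/sh
# List BullMQ keys to gauge what job data a failover could lose. BullMQ
# keeps queue state under a "bull:" prefix by default (configurable).
# --scan iterates with SCAN, so it is safe on a busy instance:
#   redis-cli -h "$REDIS_HOST" --scan --pattern 'bull:*'
#   redis-cli -h "$REDIS_HOST" LLEN 'bull:scheduler:wait'  # queue name hypothetical

# The same prefix filter, applied to a saved key dump on stdin:
bull_keys() { grep '^bull:' || true; }

printf 'bull:scheduler:wait\nsess:abc123\nbull:msg-generator:active\n' | bull_keys
```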
Losing BullMQ job data can result in:
- Scheduled campaigns not being sent
- Messages stuck in processing queues
- Jobs that need to be manually re-triggered
When in doubt, prefer `limited-data-loss` mode or consult with the team before proceeding with `force-data-loss`.
### Why Scaling May Take a Long Time
Scaling a Memorystore instance requires it to "fork" (create a background copy) to synchronize data to the new, larger node. If the instance has 0% free memory, this fork operation fails repeatedly, causing the system to enter a retry loop.
A high write load or high memory pressure can cause a scaling operation to take significantly longer and can cause the operation to fail.
Recommendation: Before scaling, reduce write pressure by scaling down dependent services to 0 pods if possible. However, note that even with no new writes, if memory is already highly utilized, scaling can still be delayed.
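Reducing write pressure before scaling might look like the following, assuming the writers run as Kubernetes Deployments; the deployment names, namespace, replica count, and target size are all placeholders to substitute for your environment:

```shell
# Stop the services that write to this instance (names are hypothetical;
# see the BullMQ note above for likely candidates in this stack).
kubectl scale deployment scheduler msg-generator --replicas=0 -n production

# Start the Memorystore resize (size in GiB).
gcloud redis instances update INSTANCE_NAME --region=REGION --size=10

# Restore the writers once the operation completes.
kubectl scale deployment scheduler msg-generator --replicas=2 -n production
```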
The `maxmemory-gb` flag can help in future incidents by providing headroom for such operations.
### Preventive Measures and Best Practices
To prevent OOM conditions, implement the following:
- Set up memory utilization alerts - Alert when the system memory usage ratio exceeds 80%. See: Setting up alerts
- Enable `activedefrag` (Redis 4.0+) - Helps mitigate memory fragmentation:

  ```shell
  gcloud redis instances update INSTANCE_NAME --region=REGION --update-redis-config=activedefrag=yes
  ```

- Lower the `maxmemory-gb` limit - Provides headroom for overhead and memory-intensive operations:

  ```shell
  gcloud redis instances update INSTANCE_NAME --region=REGION --update-redis-config=maxmemory-gb=VALUE
  ```

- Choose an appropriate eviction policy - Use `allkeys-lru` if storing non-volatile data
- Set TTL values on volatile keys - Ensure keys can be evicted when using `volatile-*` eviction policies
- Manually delete unused keys - When immediate relief is needed
- Scale up the instance - If memory pressure persists, scale during low-traffic periods
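The "manually delete unused keys" step above can be done without blocking the server. A sketch; the key pattern is hypothetical and must be verified as safe to delete before running anything like this:

```shell
# Targeted relief when the instance is full: delete keys matching a
# known-unused pattern (pattern shown is a placeholder). --scan iterates
# with SCAN rather than KEYS, and UNLINK (Redis 4.0+) reclaims memory in
# a background thread instead of blocking like DEL.
redis-cli -h "$REDIS_HOST" --scan --pattern 'cache:stale:*' \
  | xargs -r -n 100 redis-cli -h "$REDIS_HOST" UNLINK
```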
### Key Metrics to Monitor
| Metric | Description | Alert Threshold |
|---|---|---|
| System Memory Usage Ratio | Memory usage relative to system memory. Critical metric for OOM prevention. | > 80% |
| Memory Usage Ratio | How close the working set is to the `maxmemory-gb` limit. | Monitor for trends |
| System Memory Overload Duration | Time the instance has been blocking writes due to OOM protection. | > 0 |
| Cache Hit Ratio | Percentage of successful cache hits. | Monitor for drops after config changes |
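For ad-hoc checks outside the console, the system memory usage ratio can be read from the Cloud Monitoring API. A sketch assuming GNU `date`, authenticated `gcloud` credentials, and a `PROJECT_ID` placeholder:

```shell
#!/bin/sh
# Query the last 10 minutes of the OOM-critical metric from the table above.
START=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# -G turns the --data-urlencode pairs into GET query parameters.
curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries" \
  --data-urlencode 'filter=metric.type="redis.googleapis.com/stats/memory/system_memory_usage_ratio"' \
  --data-urlencode "interval.startTime=$START" \
  --data-urlencode "interval.endTime=$END"
```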