# Redis/Memorystore Alerts
This section covers handling alerts and issues related to Google Cloud Memorystore for Redis instances.
## Delivery Redis High System Memory Usage
This alert monitors the delivery Redis instance used for idempotency/aborted campaigns.
### OOM (Out of Memory) Error

Error message:

```
-OOM command not allowed under OOM prevention.
```
Root cause:
This error originates from the System Memory protection mechanism within Memorystore, not from the standard Redis key limit. The instance is physically out of RAM, likely due to fragmentation caused by many small keys. This condition blocks all write operations to prevent a server crash.
Standard eviction processes cannot function because the system is too saturated to process the necessary background tasks required to free memory.
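To confirm that fragmentation is the culprit, the ratio Redis itself reports can be checked. A minimal sketch, assuming shell access from a VM in the same VPC and a `REDIS_HOST` variable holding the instance's private IP; the 1.5 cutoff is an illustrative assumption, not a documented Memorystore limit:

```shell
#!/bin/sh
# Hypothetical helper: flag a fragmentation ratio above a chosen cutoff.
# (1.5 is an illustrative threshold, not a documented limit.)
frag_is_high() {
  awk -v r="$1" 'BEGIN { exit !(r > 1.5) }'
}

# On a VM in the same VPC (Memorystore has no public endpoint), the ratio
# comes from INFO memory, e.g.:
#   redis-cli -h "$REDIS_HOST" INFO memory | tr -d '\r' \
#     | awk -F: '$1 == "mem_fragmentation_ratio" { print $2 }'
if frag_is_high "1.8"; then
  echo "fragmentation high: RSS well above logical usage"
fi
```

A ratio well above 1.0 means the resident set size greatly exceeds logical usage, consistent with many small keys fragmenting the allocator.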
### Quick Recovery: Manual Failover (Standard Tier only)

For Standard Tier instances, the fastest recovery method (~30 seconds) is a manual failover using `force-data-loss` mode. This bypasses data synchronization checks.
Command:

```shell
gcloud redis instances failover INSTANCE_NAME --region=REGION --data-protection-mode=force-data-loss
```
Using `force-data-loss` mode does not check the offset delta between primary and replicas before initiating failover. You can potentially lose more than 30MB of data changes.
For minimal data loss (if bytes pending replication < 30MB):

```shell
gcloud redis instances failover INSTANCE_NAME --region=REGION --data-protection-mode=limited-data-loss
```
Memorystore instances do not have a "Restart" button, and standard Redis commands like SHUTDOWN are disabled. Manual failover is the fastest way to recover from an OOM condition on Standard Tier instances.
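The "bytes pending replication" check above can be scripted from `INFO replication` output. A sketch assuming the same in-VPC access; `offset_delta_ok` is a hypothetical helper and the sample offsets are placeholders:

```shell
#!/bin/sh
# Check whether bytes pending replication are under the 30 MB threshold
# described above, i.e. whether limited-data-loss failover is viable.
LIMIT=$((30 * 1024 * 1024))

# $1 = master_repl_offset, $2 = replica offset (both from INFO replication)
offset_delta_ok() {
  [ $(( $1 - $2 )) -lt "$LIMIT" ]
}

# On a VM in the same VPC, the two offsets come from:
#   redis-cli -h "$REDIS_HOST" INFO replication | tr -d '\r'
# (master_repl_offset on its own line, offset=... inside the slave0 line).
# The values below are hypothetical:
if offset_delta_ok 123456789 123400000; then
  echo "limited-data-loss failover viable"
fi
```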
### Important: BullMQ Considerations

Some Redis instances are used by BullMQ for job queues. Services like `scheduler` and `msg-generator` use BullMQ to manage job processing.
Before performing any failover with force-data-loss mode, a DevOps engineer should:
- Identify what the Redis instance is used for - Check if it stores critical job queue data
- Assess the impact - Understand what jobs/data could be lost
- Coordinate with the team - Ensure stakeholders are aware of potential data loss
- Consider alternatives - If job data is critical, wait for `limited-data-loss` mode to become viable or explore other recovery options
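Identifying whether the instance holds job queue data can be partly automated by scanning for BullMQ's default key prefix. A sketch; `REDIS_HOST` and the queue name are placeholders, and `bull_keys` is a hypothetical helper for filtering an already-saved key dump:

```shell
#!/bin/sh
# List BullMQ keys to gauge what job data a failover could lose. BullMQ
# keeps queue state under a "bull:" prefix by default (configurable).
# --scan iterates with SCAN, so it is safe on a busy instance:
#   redis-cli -h "$REDIS_HOST" --scan --pattern 'bull:*'
#   redis-cli -h "$REDIS_HOST" LLEN 'bull:scheduler:wait'  # queue name hypothetical

# The same prefix filter, applied to a saved key dump on stdin:
bull_keys() { grep '^bull:' || true; }

printf 'bull:scheduler:wait\nsess:abc123\nbull:msg-generator:active\n' | bull_keys
```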
Losing BullMQ job data can result in:
- Scheduled campaigns not being sent
- Messages stuck in processing queues
- Jobs that need to be manually re-triggered
When in doubt, prefer `limited-data-loss` mode or consult with the team before proceeding with `force-data-loss`.
### Why Scaling May Take a Long Time
Scaling a Memorystore instance requires it to "fork" (create a background copy) to synchronize data to the new, larger node. If the instance has 0% free memory, this fork operation fails repeatedly, causing the system to enter a retry loop.
A high write load or high memory pressure can cause a scaling operation to take significantly longer and can cause the operation to fail.
Recommendation: Before scaling, reduce write pressure by scaling down dependent services to 0 pods if possible. However, note that even with no new writes, if memory is already highly utilized, scaling can still be delayed.
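Reducing write pressure before scaling might look like the following, assuming the writers run as Kubernetes Deployments; the deployment names, namespace, replica count, and target size are all placeholders to substitute for your environment:

```shell
# Stop the services that write to this instance (names are hypothetical;
# see the BullMQ note above for likely candidates in this stack).
kubectl scale deployment scheduler msg-generator --replicas=0 -n production

# Start the Memorystore resize (size in GiB).
gcloud redis instances update INSTANCE_NAME --region=REGION --size=10

# Restore the writers once the operation completes.
kubectl scale deployment scheduler msg-generator --replicas=2 -n production
```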
The `maxmemory-gb` flag can help in future incidents by providing headroom for such operations.
### Preventive Measures and Best Practices
To prevent OOM conditions, implement the following:
- Set up memory utilization alerts - Alert when the system memory usage ratio exceeds 80%. See: Setting up alerts
- Enable `activedefrag` (Redis 4.0+) - Helps mitigate memory fragmentation:

  ```shell
  gcloud redis instances update INSTANCE_NAME --region=REGION --update-redis-config=activedefrag=yes
  ```

- Lower the `maxmemory-gb` limit - Provides headroom for overhead and memory-intensive operations:

  ```shell
  gcloud redis instances update INSTANCE_NAME --region=REGION --update-redis-config=maxmemory-gb=VALUE
  ```

- Choose an appropriate eviction policy - Use `allkeys-lru` if storing non-volatile data
- Set TTL values on volatile keys - Ensure keys can be evicted when using `volatile-*` eviction policies
- Manually delete unused keys - When immediate relief is needed
- Scale up the instance - If memory pressure persists, scale during low-traffic periods
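The "manually delete unused keys" step above can be done without blocking the server. A sketch; the key pattern is hypothetical and must be verified as safe to delete before running anything like this:

```shell
# Targeted relief when the instance is full: delete keys matching a
# known-unused pattern (pattern shown is a placeholder). --scan iterates
# with SCAN rather than KEYS, and UNLINK (Redis 4.0+) reclaims memory in
# a background thread instead of blocking like DEL.
redis-cli -h "$REDIS_HOST" --scan --pattern 'cache:stale:*' \
  | xargs -r -n 100 redis-cli -h "$REDIS_HOST" UNLINK
```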
### Key Metrics to Monitor
| Metric | Description | Alert Threshold |
|---|---|---|
| System Memory Usage Ratio | Memory usage relative to system memory. Critical metric for OOM prevention. | > 80% |
| Memory Usage Ratio | How close the working set is to the `maxmemory-gb` limit. | Monitor for trends |
| System Memory Overload Duration | Time the instance has been blocking writes due to OOM protection. | > 0 |
| Cache Hit Ratio | Percentage of successful cache hits. | Monitor for drops after config changes |
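For ad-hoc checks outside the console, the system memory usage ratio can be read from the Cloud Monitoring API. A sketch assuming GNU `date`, authenticated `gcloud` credentials, and a `PROJECT_ID` placeholder:

```shell
#!/bin/sh
# Query the last 10 minutes of the OOM-critical metric from the table above.
START=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# -G turns the --data-urlencode pairs into GET query parameters.
curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries" \
  --data-urlencode 'filter=metric.type="redis.googleapis.com/stats/memory/system_memory_usage_ratio"' \
  --data-urlencode "interval.startTime=$START" \
  --data-urlencode "interval.endTime=$END"
```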