We have a cluster that started hitting out-of-memory (OOM) errors after a stress test with a large number of materialized views (MVs). The compute pods have a memory limit of 32 GB, but they began getting OOM-killed and the cluster can no longer recover. We're looking for a way to recover without wiping the state, and we'd also like to understand why memory usage grows without bound. Is there a way to set limits to prevent this?
Here's an example of the error we're seeing:
...
And the compute nodes get OOM killed on startup:
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Restart Count: 9
Bahador Nooraei
Asked on Aug 08, 2023
To prevent OOM issues in a cluster with high memory usage due to numerous materialized views, we can consider the following strategies:
Use Windowing Functions: Use windowing functions so that each query keeps state only for a bounded window of data. This can help prevent internal state from growing without limit.
Scale Out Pods: Temporarily scale out the pods and increase their memory to handle the increased load. This is an immediate fix, not a long-term solution.
Proactive Memory Control: Implement memory control mechanisms that periodically check memory usage and, when it exceeds a threshold, evict cache entries to bring it back down.
Expose Memory Limit Config: Have the system expose a memory limit setting so that users can cap its maximum memory usage.
Reduce Concurrent Backfilling: Limit the number of concurrent operations, such as backfilling materialized views, to avoid stressing the system.
Implement Resiliency Tactics: Employ tactics such as backpressure, throttling, rate limiting, and timeouts to manage resource consumption and prevent OOM.
Manual Scaling: Manually adjust the configurations and scale the resources as needed, especially in on-prem Kubernetes clusters.
Auto-Scaling: Develop and utilize auto-scaling features to dynamically adjust resources based on the workload.
Update to Latest Version: Upgrade to the latest version of the system, which may contain fixes and improvements that address such issues.
Sequential Operations: Perform operations like creating materialized views sequentially rather than in parallel to manage resource usage more effectively.
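To make the "Proactive Memory Control" item above more concrete, here is a minimal sketch of a cache that evicts least recently used entries once an estimated byte budget is exceeded. It is only an illustration and not how any particular system implements it: the class name BoundedCache, the low_watermark parameter, and the caller-supplied size estimates are all assumptions. It also assumes evicted entries can be reloaded from durable storage when needed.

```python
from collections import OrderedDict


class BoundedCache:
    """LRU cache that evicts entries once an estimated byte budget is exceeded.

    Hypothetical sketch: sizes are estimates supplied by the caller, and
    evicted entries are assumed to be recoverable from durable storage.
    """

    def __init__(self, max_bytes, low_watermark=0.7):
        self.max_bytes = max_bytes
        # Evict down to this level so eviction does not run on every insert.
        self.target_bytes = int(max_bytes * low_watermark)
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> (value, size), oldest first

    def put(self, key, value, size):
        # Replacing an existing key first releases its old size.
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        self._entries[key] = (value, size)
        self.used_bytes += size
        if self.used_bytes > self.max_bytes:
            self.evict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def evict(self):
        """Drop least recently used entries until usage is at or below the target."""
        evicted = []
        while self._entries and self.used_bytes > self.target_bytes:
            key, (_, size) = self._entries.popitem(last=False)
            self.used_bytes -= size
            evicted.append(key)
        return evicted
```

In this sketch the check runs on every insert. A background timer could also call evict() periodically, which is closer to the "periodically check memory usage" wording above.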
It's important to monitor the system and adjust these strategies as needed to ensure stability and performance.
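On the client side, the "Reduce Concurrent Backfilling" and "Sequential Operations" suggestions amount to capping how many creation statements are in flight at once. The sketch below shows one way to do that with a bounded thread pool. The helper run_with_limit and the placeholder create_mv are hypothetical names used for illustration. A real create_mv would issue the statement through a database client and wait for the backfill to finish.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(tasks, max_concurrent=1):
    """Run zero-argument callables with at most max_concurrent in flight.

    Results are returned in the same order as tasks. With max_concurrent=1
    the tasks run strictly one after another.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


def create_mv(name):
    # Hypothetical placeholder: a real version would send a
    # CREATE MATERIALIZED VIEW statement and block until it completes.
    return f"created {name}"


names = ["mv_a", "mv_b", "mv_c"]
# n=n binds each name at definition time instead of at call time.
results = run_with_limit([lambda n=n: create_mv(n) for n in names], max_concurrent=1)
```

Starting with max_concurrent=1 and raising it only while memory stays comfortably below the pod limit keeps the backfill load predictable.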