I keep seeing a 'lagging epoch' alert in the risingwave_dashboard in Grafana. The alert comes and goes, and the info button suggests checking the Hummock Manager section of the dev dashboard for insights, but I didn't find anything illuminating there. I'm unsure whether I should be concerned about this alert, and which resources I would need to provision more of to resolve it. Do I need more memory or CPU on the compactor or compute nodes, or do I need to refactor my materialized views? The dev dashboard hasn't given me clear insight into the bottleneck behind this alert.
Rick Otten
Asked on Dec 11, 2023
The 'lagging epoch' alert indicates that the pinned or safe epoch is lagging behind the current max committed epoch. This can be a temporary situation caused by workload fluctuations in your running cluster. If the alert clears on its own, it's usually not a concern. If it persists, you should investigate further, starting with the Hummock Manager section in the dev dashboard.

If you see this alert frequently, allocating more CPU and memory to your compute and compactor nodes may help. You can also adjust the alert thresholds in Grafana to better suit your workload, especially if you're moving into load testing soon, and it's worth considering horizontal scaling to handle peak loads efficiently. If increasing the compactor node to 2 cores doesn't resolve the alert, you can try bumping it up to 4 cores. More compactor resources are likely to help in this case.
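
To make the difference between a transient and a persistent lag concrete, here is a minimal Python sketch. It is not RisingWave's or Grafana's actual alert rule: the function names, the 16-bit shift used to extract a millisecond timestamp from an epoch, and the threshold values are all assumptions for illustration, so check them against your version and dashboard before relying on them.

```python
from typing import Iterable

# Assumption: an epoch is an integer whose upper bits hold a millisecond
# timestamp. The 16-bit shift is an assumed layout, not a confirmed one.
ASSUMED_SHIFT_BITS = 16


def epoch_lag_seconds(max_committed_epoch: int,
                      lagging_epoch: int,
                      shift_bits: int = ASSUMED_SHIFT_BITS) -> float:
    """Return how far lagging_epoch trails max_committed_epoch, in seconds.

    lagging_epoch stands for the pinned or safe epoch (hypothetical input).
    A negative difference is clamped to zero.
    """
    lag_ms = (max_committed_epoch >> shift_bits) - (lagging_epoch >> shift_bits)
    return max(lag_ms, 0) / 1000.0


def is_persistent(lags_s: Iterable[float],
                  threshold_s: float,
                  min_consecutive: int) -> bool:
    """True if the lag exceeds threshold_s for min_consecutive samples in a row.

    A short spike that clears by itself resets the counter and is ignored.
    """
    run = 0
    for lag in lags_s:
        run = run + 1 if lag > threshold_s else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical samples: one spike during a burst, then recovery.
    spike = [1.0, 400.0, 2.0, 3.0]
    # Hypothetical samples: lag stays high, worth investigating.
    sustained = [400.0, 500.0, 600.0]
    print(is_persistent(spike, threshold_s=300.0, min_consecutive=3))
    print(is_persistent(sustained, threshold_s=300.0, min_consecutive=3))
```

Raising `threshold_s` or `min_consecutive` here plays the same role as loosening the alert threshold in Grafana: brief spikes under load stop firing, while a lag that never recovers still does.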