I've noticed that when our RisingWave instance, which consumes from PostgreSQL CDC, sits idle (no database updates or queries) for an extended period, such as over a weekend, the first query afterwards can take over 30 seconds to respond. Subsequent queries return to normal response times of around 100 ms. We suspect that the cache is expiring and that the slow query is pulling data from S3. However, if there's no pressure on the cache, why would it expire? Do we need to implement a background job that keeps the cache warm to prevent this?
Rick Otten
Asked on Jan 17, 2024
The slow responses after idle periods appear to be related to the behavior of the container service (ECS) rather than to RisingWave itself. ECS might be swapping out containers after a certain period, which correlates with the roughly 12 hours of idle time observed before the cold start occurs. When a new compute node joins the cluster, its cache starts empty, so queries that hit it are slow until it warms up. One possible mitigation is to warm up the cache on a new compute node before it is marked as 'healthy' in the ECS coordinator. However, it's unclear what the container start script should do to warm the cache, because there's no direct way to target a specific compute node with a query. Further discussion and investigation are needed to find a way to avoid the performance hit when a new compute node first comes online.
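As a rough illustration of the warm-up idea, here is a minimal Python sketch that re-runs a set of representative queries until a full round completes quickly. It is not an established RisingWave or ECS mechanism. The function name `warm_up`, its parameters, and the thresholds are all hypothetical, and `run_query` stands in for whatever client call you use to send SQL to RisingWave. Because a query cannot be pinned to a specific compute node, repeating rounds only makes it more likely that the new node gets touched; it does not guarantee it.

```python
import time
from typing import Callable, Iterable


def warm_up(
    run_query: Callable[[str], object],
    queries: Iterable[str],
    target_latency_s: float = 0.5,   # assumed threshold, tune for your workload
    max_rounds: int = 5,             # assumed cap on warm-up attempts
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Re-run every query until one full round finishes within target_latency_s.

    Returns True as soon as the slowest query in a round is fast enough,
    or False if max_rounds rounds were executed without reaching the target.
    """
    queries = list(queries)
    for _ in range(max_rounds):
        slowest = 0.0
        for sql in queries:
            start = clock()
            run_query(sql)  # hypothetical: sends the SQL to RisingWave
            slowest = max(slowest, clock() - start)
        if slowest <= target_latency_s:
            return True
    return False
```

In a container start script, the result could decide whether the process reports itself as ready (for example, exiting non-zero when `warm_up` returns False). The same function could also be called periodically from a background job if keeping the cache warm during idle periods turns out to help.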