Why are long barrier times and etcd errors occurring in RisingWave?
I'm seeing long barrier times and etcd errors in my RisingWave logs. Barrier times are unusually long, over a minute, and the etcd errors relate to timeouts and stream errors. I recently added large ClickHouse sinks and increased the request size, which may be related. In addition, compaction/vacuum has failed and there are too many barriers. Materialized views still seem to update, but the object count and size remain unchanged. I suspect the problems started after a large amount of work caused the MinIO S3 instance to run out of space. That has since been resolved, but RisingWave hasn't recovered. Here are some log excerpts:
2024-02-21T11:46:12.301144232Z WARN rw-streaming actor{...}: risingwave_stream::executor::source::source_executor: source Source 1B5E000029B8 paused, wait barrier for 66.640378724s
...
2024-02-21T11:49:04.292805236Z ERROR rw-main risingwave_meta::rpc::election::etcd: lease keeper failed error=grpc request error: status: Unavailable, message: "etcdserver: request timed out, waiting for the applied index took too long", details: [], metadata: MetadataMap { headers: {} }
...
Why are these long barrier times and etcd errors happening, and how can they be resolved?
Kai
Asked on Feb 21, 2024
It looks like an etcd problem is blocking the meta node from proceeding with the barrier loop. The source executor warnings indicate that the meta node isn't issuing barriers within the expected time frame. One possible cause is that the large metadata from the ClickHouse sinks is degrading etcd's performance. etcd itself seems fine based on the logs you provided, but the warnings about the election request taking a long time are consistent with the errors RisingWave is logging. Running the deployment in standalone mode could also be affecting performance, but that's not the main concern here. It is troubling that RisingWave hangs on FLUSH commands while reporting no data loss.

After resetting RisingWave, the system is working as expected, but the issue may recur if the underlying problems aren't addressed. I'd recommend investigating etcd's health thoroughly and considering how large metadata operations affect its performance.
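To check whether slow barriers and etcd errors cluster around the same times, something like the following could be run over the collected logs. This is only a rough sketch, not a RisingWave tool: the function name `summarize` and the 60-second threshold are assumptions made for illustration, and the patterns are based solely on the two log lines quoted in the question (durations printed in seconds, such as `66.640378724s`).

```python
import re

# Captures the barrier wait, in seconds, from the source executor warning.
# Only handles the "<number>s" form seen in the quoted log line.
BARRIER_WAIT = re.compile(r"wait barrier for (\d+(?:\.\d+)?)s")
# Matches ERROR lines logged by the meta node's etcd election module.
ETCD_ERROR = re.compile(r"ERROR .*risingwave_meta::rpc::election::etcd")


def summarize(lines, threshold_secs=60.0):
    """Return (slow_barriers, etcd_errors) found in an iterable of log lines.

    slow_barriers: list of (timestamp, wait_seconds) at or above the threshold.
    etcd_errors:   list of timestamps of etcd election errors.
    """
    slow_barriers = []
    etcd_errors = []
    for line in lines:
        # The timestamp is the first whitespace-separated field of each line.
        timestamp = line.split(" ", 1)[0]
        match = BARRIER_WAIT.search(line)
        if match and float(match.group(1)) >= threshold_secs:
            slow_barriers.append((timestamp, float(match.group(1))))
        if ETCD_ERROR.search(line):
            etcd_errors.append(timestamp)
    return slow_barriers, etcd_errors


# The two excerpts from the question, shortened where the original is elided.
sample = [
    "2024-02-21T11:46:12.301144232Z WARN rw-streaming actor{...}: "
    "risingwave_stream::executor::source::source_executor: "
    "source Source 1B5E000029B8 paused, wait barrier for 66.640378724s",
    "2024-02-21T11:49:04.292805236Z ERROR rw-main "
    "risingwave_meta::rpc::election::etcd: lease keeper failed "
    'error=grpc request error: status: Unavailable, message: '
    '"etcdserver: request timed out, waiting for the applied index took too long"',
]

slow, errors = summarize(sample)
print("slow barriers:", slow)
print("etcd errors at:", errors)
```

If the slow-barrier timestamps consistently fall close to the etcd error timestamps, that would support the idea that etcd latency is what stalls the barrier loop; if they don't, the cause may lie elsewhere.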