I'm facing an issue with a stalled RisingWave cluster that has been deployed in Kubernetes using a Helm chart. All deployments and statefulsets appear healthy, but the cluster is stalled. Previously, the stall was resolved by switching from MinIO to GCS for storage and restarting the risingwave-meta
statefulset. However, restarting components hasn't helped this time. The meta service is logging errors about failing to collect barriers, and the compute service is logging errors related to Kafka timeouts. There are also logs indicating issues with the compactor. We've checked that other services are able to read/write to Kafka successfully. Here are some of the log messages:
2024-04-24T15:45:41.368572196Z INFO failure_recovery{error=gRPC request failed: Internal error: failed to collect barrier. epoch: [6339814191005696, 6339814355435520, ...], err: Actor 310 exited unexpectedly: failed to send message to actor 289, message: Barrier(Barrier { epoch: EpochPair { curr: 6339815207469056, prev: 6339815141933056 }, mutation: None, kind: Checkpoint, tracing_context: TracingContext(Context { entries: 0 }), passed_actors: [] })
2024-04-24T15:53:39.106495551Z ERROR risingwave_stream::task::stream_manager: actor exit with error actor_id=1374 error=Executor error: Sink error: Kafka error: Message production error: MessageTimedOut (Local: Message timed out)
2024-04-24T19:21:09.73566754Z INFO risingwave_storage::hummock::compactor: running_parallelism_count=0 pull_task_ack=false pending_pull_task_count=5
I would appreciate any help with troubleshooting this issue.
Paul Weinger
Asked on Apr 24, 2024
To troubleshoot the stalled RisingWave cluster, Yuhao Su suggested checking the status of the downstream Kafka cluster, given the timeout errors. Even though other services could read and write to Kafka, it was important to verify that RisingWave's Kafka sinks specifically were operational. Additionally, Yuhao Su recommended the following steps:
- Run set sink_decouple to true; before creating the sink, to decouple the sink from the compute node.
- Run set streaming_parallelism to 8; before creating the sink.
- Set properties.message.timeout.ms = 20000 in the CREATE SINK statement.
Yuhao Su also confirmed that librdkafka producer settings can be specified in the CREATE SINK statement and provided a link to the documentation for additional Kafka parameters. They mentioned that the ability to set producer.acks would be added soon.
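The recommended settings could be combined in one session along the following lines. This is a hedged sketch: the sink name, source materialized view, broker address, and topic are illustrative placeholders, not values from the original question.

```sql
-- Decouple the sink from the compute node (run before CREATE SINK).
SET sink_decouple TO true;

-- Limit the parallelism of the new sink's streaming job.
SET streaming_parallelism TO 8;

-- Hypothetical sink; names, broker, and topic are placeholders.
CREATE SINK my_kafka_sink FROM my_materialized_view
WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'broker:9092',
    topic = 'my_topic',
    -- Time out produce attempts after 20s instead of the librdkafka default,
    -- so a stuck broker surfaces as an error sooner.
    properties.message.timeout.ms = '20000'
) FORMAT PLAIN ENCODE JSON;
```

Because sink_decouple and streaming_parallelism are session settings, they only affect sinks created afterwards in that session.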
In the event of a future stall, capturing an await-tree dump while the cluster is stalled was recommended, as it would be very helpful for debugging. Yuhao Su also noted that RisingWave already has configuration options for limiting how much is sent in a single produce request to a broker: properties.batch.num.messages caps the number of messages, and properties.batch.size caps the batch size in bytes.
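These batching caps are passed through to the Kafka producer in the same WITH clause. A minimal sketch, assuming the same placeholder names as above; the specific values shown are illustrative, not recommendations from the thread:

```sql
-- Hypothetical sink limiting produce-request batching; names are placeholders.
CREATE SINK batched_kafka_sink FROM my_materialized_view
WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'broker:9092',
    topic = 'my_topic',
    -- Cap the number of messages batched into one produce request.
    properties.batch.num.messages = '1000',
    -- Cap the total size of a message batch, in bytes (here 1 MiB).
    properties.batch.size = '1048576'
) FORMAT PLAIN ENCODE JSON;
```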