troubleshooting

How to resolve the 'Failed to send_heartbeat: timeout deadline has elapsed' error in RisingWave?

My RisingWave deployment has a meta node that keeps crashing and recovering, and one of the compute nodes logs this warning: WARN risingwave_rpc_client::meta_client: Failed to send_heartbeat: timeout deadline has elapsed. I deployed with the Helm chart and have not set the rustBacktrace variable. The logs also show a DNS error when the compute node tries to connect to the meta node. How can I resolve this?

André Falk

Asked on Dec 27, 2023

This looks like a network issue between the compute node and the meta node, possibly caused by a DNS server problem. Try the following steps:

  1. Set the rustBacktrace variable to full in the Helm chart values so that the logs include the entire backtrace for debugging.
  2. Manually delete the meta pod and the compute node (cn) pod so that Kubernetes respawns them, which may clear a transient network glitch.
  3. Check for DNS server issues that could be preventing the compute node from resolving the meta node's address.

Additionally, consider upgrading to RisingWave v1.5.4. When a network connection fails, it prints the IP address being connected to and the address resolved by the DNS server, which gives more information when debugging the underlying infrastructure, such as Kubernetes.
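
To narrow down the DNS check in step 3, a small script run from inside the compute node pod (or any pod in the same namespace) can separate "the name does not resolve" from "the name resolves but the port is unreachable". The following is only a sketch using the Python standard library: the function name check_meta_reachable is made up for illustration, and the meta service hostname and port must come from your own deployment.

```python
import socket


def check_meta_reachable(host: str, port: int, timeout: float = 3.0) -> dict:
    """Resolve `host`, then try a TCP connection to each resolved address.

    Hypothetical helper for illustration; not part of RisingWave.
    """
    result = {"host": host, "port": port, "resolved": [],
              "connected": None, "error": None}

    # Stage 1: DNS resolution. A failure here points at the DNS server
    # or at a wrong service name, not at the meta node itself.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"DNS resolution failed: {exc}"
        return result

    result["resolved"] = sorted({info[4][0] for info in infos})

    # Stage 2: TCP connect. A failure here means the name resolved but
    # the address is not accepting connections on that port.
    for family, socktype, proto, _canon, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            result["connected"] = sockaddr[0]
            result["error"] = None
            return result
        except OSError as exc:
            result["error"] = f"TCP connect to {sockaddr[0]}:{port} failed: {exc}"

    return result
```

For example, print(check_meta_reachable("<your-meta-service-host>", 5690)) reports which stage failed. The port 5690 is assumed here to be the meta node's listen port; confirm it against your Helm values before relying on the result.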

Dec 28, 2023