troubleshooting

How to resolve a RisingWave meta node startup failure due to an empty checkpoint file?

I'm experiencing an issue where our RisingWave meta node is unable to start, and the logs indicate a panic due to an unwrap() on a None value related to the checkpoint file. Here's the relevant part of the log:

thread 'rw-main' panicked at src/meta/src/hummock/manager/checkpoint.rs:43:90:
called `Option::unwrap()` on a `None` value

Upon checking, I found that the checkpoint file at [state store root]/hummock/checkpoint/0 is empty:

root@risingwave-compute-0:/SOMEPATH/hummock/checkpoint# ll
total 1012
drwxr-xr-x 2 root root    4096 Feb 16 17:13 ./
drwxr-xr-x 4 root root 1028096 Feb 19 18:07 ../
-rw-r--r-- 1 root root       0 Feb 19 18:07 0

It seems like the filesystem state store is not recommended for production as it doesn't commit write operations atomically. How can I resolve this startup failure, and what steps should I take to prevent such issues in the future?

Victor Müller

Asked on Feb 20, 2024

The meta node fails to start because the checkpoint file is empty, and the meta node cannot operate without it. Unfortunately, once the checkpoint file is lost or corrupted, the cluster is considered corrupted and its data is lost. To get running again and prevent future occurrences:

  1. Switch to a production-ready state store: Use Amazon S3 or a compatible object store for production environments. For local disk setups, MinIO is a good choice.

  2. Rebuild on S3: The current cluster is corrupted and cannot be migrated, so when you set up its replacement, point the state store at S3 or a similar service. These commit write operations atomically, which prevents this kind of failure.

  3. Run benchmarks: If your cluster does not run on AWS EC2, benchmark latency and throughput between your cluster and S3 to confirm the connection is fast enough.

  4. Avoid using the filesystem state store for production: It is intended for testing and does not guarantee atomic writes, so an interrupted or failed write can leave a file corrupted.

  5. Consider RisingWave Cloud: If your requirements change and you can use hosted solutions, RisingWave Cloud is an option to consider.

  6. Monitor and back up: Regularly monitor the health of your state store and keep backups so you can recover from data loss.
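
To make points 4 and 6 concrete, here is a minimal Python sketch. It is not RisingWave code, and the function names `write_atomically` and `checkpoint_is_unusable` are hypothetical. The first function shows the general write-to-a-temporary-file-then-rename pattern that gives atomic replacement on a local POSIX filesystem; the second is the kind of check a monitoring job could run to catch a missing or zero-byte checkpoint file before the next restart.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(path: Path, data: bytes) -> None:
    """Replace `path` so readers see either the old file or the complete new one."""
    # Create the temporary file in the same directory so the final rename
    # stays on a single filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic swap on POSIX filesystems
    except BaseException:
        # Remove the partial temporary file; the original file is untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def checkpoint_is_unusable(path: Path) -> bool:
    """Return True if the file is missing or zero bytes long."""
    try:
        return path.stat().st_size == 0
    except FileNotFoundError:
        return True
```

A scheduled job could call `checkpoint_is_unusable` on [state store root]/hummock/checkpoint/0 and raise an alert when it returns True. A non-empty file is not proof that the checkpoint is valid, so this check complements backups rather than replacing them.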

Feb 20, 2024