troubleshooting
How Can I Force Cancel a Stuck Job in RisingWave?
I have a job in RisingWave that I can't seem to cancel. I've tried various methods, including `CANCEL JOBS <job id>`, restarting the cluster, and allowing the job to run over the weekend, but it's stuck at 0.0% progress. I also attempted `FLUSH` but received an error: `QueryError: internal error: Service unavailable: cluster is under recovering`. Are there any other steps I can take to remove this job?
Dominic Lindsay
Asked on Dec 11, 2023
To address the issue of a stuck job in RisingWave, you can try the following steps:
- Set the `pause_on_next_bootstrap` system parameter to `true` using the command `alter system set pause_on_next_bootstrap to true;`.
- Restart the meta node to stop it from entering a recovery loop.
- If you have any problematic sinks, try dropping them after the meta node is back up.
- If you're running a version prior to RisingWave 1.5, consider upgrading to 1.5, as it has better memory control and contains bug fixes related to job cancellation failures.
- Adjust the memory limit for the compute node to about 75% of the total available memory using the `--total-memory-bytes` flag to prevent OOM (out-of-memory) issues.
- Ensure that the `pause_on_next_bootstrap` parameter is effective by checking the meta node logs for confirmation that the streaming jobs are paused.
- If the pause is effective, try canceling the job again and then restart the meta node.
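The SQL side of the steps above might look roughly like the session below. This is only a sketch: the job ID `1010` and the sink name `my_sink` are hypothetical placeholders, and it assumes `SHOW PARAMETERS` and `SHOW JOBS` are available in your RisingWave version. Restarting the meta node happens outside of SQL, so it appears only as a comment.

```sql
-- Sketch only: 1010 and my_sink are hypothetical placeholders.

-- 1. Ask the cluster to pause streaming jobs on its next bootstrap.
ALTER SYSTEM SET pause_on_next_bootstrap TO true;

-- 2. Check that the new value was recorded
--    (assumes SHOW PARAMETERS exists in your version).
SHOW PARAMETERS;

-- (Restart the meta node here, outside of SQL, and check its logs
--  to confirm that the streaming jobs are paused.)

-- 3. List the streaming jobs still being created and note the ID
--    of the stuck one (assumes SHOW JOBS exists in your version).
SHOW JOBS;

-- 4. Drop any sink you suspect is causing trouble.
DROP SINK my_sink;

-- 5. Retry the cancellation using the ID reported by SHOW JOBS.
CANCEL JOBS 1010;
```

If `CANCEL JOBS` still fails at this point, the meta node logs from around the time of the attempt are the most useful thing to look at next.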
If these steps do not resolve the issue, you may need to investigate further by checking the logs for any errors or unusual activity that could be causing the job to remain stuck.
Dec 19, 2023