How Can I Force Cancel a Stuck Job in RisingWave?

I have a job in RisingWave that I can't seem to cancel. I've tried various methods including CANCEL JOBS <job id>, restarting the cluster, and allowing the job to run over the weekend, but it's stuck at 0.0% progress. I also attempted FLUSH but received an error: QueryError: internal error: Service unavailable: cluster is under recovering. Are there any other steps I can take to remove this job?

Dominic Lindsay

Asked on Dec 11, 2023

To remove a stuck job in RisingWave, try the following steps:

  1. Set the pause_on_next_bootstrap system parameter to true by running: ALTER SYSTEM SET pause_on_next_bootstrap TO true;
  2. Restart the meta node so that the parameter takes effect on the next bootstrap and the cluster does not re-enter the recovery loop.
  3. If you have any problematic sinks, try dropping them after the meta node is back up.
  4. If you're running a version earlier than RisingWave 1.5, consider upgrading to 1.5, which has better memory control and includes fixes for job cancellation failures.
  5. Adjust the memory limits for the compute node to about 75% of the total available memory using the --total-memory-bytes flag to prevent OOM (Out of Memory) issues.
  6. Ensure that the pause_on_next_bootstrap parameter is effective by checking the meta node logs for confirmation that the streaming jobs are paused.
  7. If the pause is effective, try canceling the job again and then restart the meta node.
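
The SQL side of the steps above can be sketched roughly as follows. This is a minimal sketch, not a verified procedure: the sink name and job ID are hypothetical placeholders, and it assumes you can still open a SQL session while the cluster is recovering. The meta node restart and the log check are done outside SQL and appear only as comments.

```sql
-- Step 1: ask the cluster to start paused on the next bootstrap
-- (the command given in the answer above).
ALTER SYSTEM SET pause_on_next_bootstrap TO true;

-- Optional check: list system parameters and look for pause_on_next_bootstrap.
SHOW PARAMETERS;

-- Steps 2 and 6 happen outside SQL: restart the meta node, then check its logs.

-- Step 3: list sinks and drop any that are causing trouble.
SHOW SINKS;
DROP SINK my_problem_sink;  -- hypothetical sink name, replace with your own

-- Step 7: find the stuck job's ID, then cancel it.
SHOW JOBS;
CANCEL JOBS 1010;  -- hypothetical job ID, use the ID reported by SHOW JOBS
```

For step 5, as an illustration on a hypothetical compute node with 16 GiB of memory, 75% is 12 GiB, which is 12 × 1024³ = 12884901888 bytes for --total-memory-bytes.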

If these steps do not resolve the issue, you may need to investigate further by checking the logs for any errors or unusual activity that could be causing the job to remain stuck.

Dec 19, 2023