all-things-risingwave

What is the approach for system recovery in case of a compute node failure in the discussed scenario?

When a compute node fails, the system currently recovers by performing a global rollback to the last checkpoint. The discussion also raises the possibility of investigating partial checkpoint and recovery mechanisms.

JJ

Asked on May 23, 2022

  • On a compute node failure, the system currently performs a global rollback to the last checkpoint.
  • Partial checkpoint and recovery mechanisms are being considered for investigation.

Example:

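# Current approach: one failed compute node triggers a global rollback.
# The two helpers below are illustrative placeholders, not RisingWave APIs.
def restart_failed_compute_node():
    print("restarting the failed compute node")

def rollback_all_compute_nodes_to_last_checkpoint():
    print("rolling back all compute nodes to the last checkpoint")

def system_recovery_on_failure():
    restart_failed_compute_node()
    rollback_all_compute_nodes_to_last_checkpoint()

# Under investigation: partial checkpoint and recovery
def investigate_partial_checkpoint_recovery():
    # No design is described in the discussion, so this is left unimplemented.
    pass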
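
To make the cost of a global rollback concrete, here is a minimal toy simulation. It is only a sketch: `ComputeNode`, `Cluster`, and the node names are hypothetical and do not correspond to RisingWave's actual implementation or API.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    """Hypothetical stand-in for a compute node holding in-memory state."""
    name: str
    state: dict = field(default_factory=dict)
    alive: bool = True


class Cluster:
    """Toy cluster that checkpoints all nodes together and rolls back globally."""

    def __init__(self, names):
        self.nodes = {n: ComputeNode(n) for n in names}
        self.last_checkpoint = {n: {} for n in names}

    def checkpoint(self):
        # Snapshot every node's state at the same logical point.
        self.last_checkpoint = {n: dict(node.state) for n, node in self.nodes.items()}

    def fail(self, name):
        # Simulate a crash: the node loses its in-memory state.
        node = self.nodes[name]
        node.alive = False
        node.state = {}

    def recover(self):
        # Global rollback: restart failed nodes, then reset every node,
        # including healthy ones, to the last checkpoint.
        for name, node in self.nodes.items():
            node.alive = True
            node.state = dict(self.last_checkpoint[name])


cluster = Cluster(["cn-1", "cn-2"])
cluster.nodes["cn-1"].state["count"] = 10
cluster.nodes["cn-2"].state["count"] = 20
cluster.checkpoint()

# Progress made after the checkpoint is lost on rollback.
cluster.nodes["cn-1"].state["count"] = 11
cluster.nodes["cn-2"].state["count"] = 21
cluster.fail("cn-1")
cluster.recover()

print({n: node.state for n, node in cluster.nodes.items()})
# prints {'cn-1': {'count': 10}, 'cn-2': {'count': 20}}
```

In this toy model the healthy node `cn-2` also loses its post-checkpoint progress. Avoiding that kind of cost is presumably one motivation for looking into partial checkpoint and recovery, though the discussion does not spell out a design.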