In the discussion, there is a mention of system recovery in case of a compute node failure. The current approach involves global rollback to the last checkpoint. There is also a consideration for investigating partial checkpoint and recovery mechanisms.
JJ
Asked on May 23, 2022
Example:
# Current approach
def system_recovery_on_failure():
restart_failed_compute_node()
rollback_all_compute_nodes_to_last_checkpoint()
# Investigating partial checkpoint and recovery
def investigate_partial_checkpoint_recovery():
# Code implementation for partial checkpoint and recovery
pass