Chain Halt Recovery
This document describes how to recover from a chain halt.
It assumes that the cause of the chain halt has been identified, and that the new release has been created and verified to function correctly.
Background
Pocket Network is built on top of the Cosmos SDK, which uses the CometBFT consensus engine. CometBFT's Byzantine Fault Tolerant (BFT) consensus algorithm requires more than 2/3 of the Validator voting power to be online and voting for the same block in order to reach consensus. To maintain liveness and avoid a chain halt, we therefore need more than 2/3 of Validators to participate and run the same version of the software. For example, on a network of 100 equally weighted Validators, at least 67 must be online and on the same version.
Resolving halts during a network upgrade
If the halt was caused by a network upgrade, the fix can be as simple as skipping the faulty upgrade (i.e. `unsafe-skip-upgrade`) and creating a new (fixed) upgrade.
Read more about upgrade contingency plans: https://dev.poktroll.com/develop/upgrades/chain_halt_upgrade_contigency_plans
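As an illustration, skipping a faulty upgrade is typically done by restarting each validator with the standard Cosmos SDK `--unsafe-skip-upgrades` flag; the binary name `poktrolld` and the height below are assumptions for the sake of the example.

```bash
# Hedged example: restart the node while skipping the faulty upgrade that was
# scheduled at height 103 (height and binary name are illustrative).
# The chain only resumes once >2/3 of validator voting power restarts with
# the same flag and height.
poktrolld start --unsafe-skip-upgrades 103
```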
Manual binary replacement (preferred)
Significant side effect: this breaks the ability to sync from genesis without manual intervention.
For example, when a node syncing from the first block reaches the height of the consensus-breaking change, its operator needs to manually replace the binary with the new one. There are efforts underway to mitigate this issue, including cosmovisor configuration that could automate the process.
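For reference, such automation would likely rely on cosmovisor's standard upgrades directory layout. The sketch below is a minimal illustration; the home path, upgrade name, and binary locations are assumptions rather than project-specific values.

```bash
# Hedged sketch of a standard cosmovisor setup (paths and names illustrative).
# Cosmovisor keeps multiple binary versions side by side and swaps them at the
# configured upgrade heights.
export DAEMON_NAME=poktrolld
export DAEMON_HOME=$HOME/.poktroll
export DAEMON_RESTART_AFTER_UPGRADE=true

mkdir -p "$DAEMON_HOME/cosmovisor/genesis/bin" \
         "$DAEMON_HOME/cosmovisor/upgrades/<upgrade-name>/bin"
cp ./poktrolld-old "$DAEMON_HOME/cosmovisor/genesis/bin/poktrolld"
cp ./poktrolld-new "$DAEMON_HOME/cosmovisor/upgrades/<upgrade-name>/bin/poktrolld"

cosmovisor run start
```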
Since the chain is not moving, it is impossible to issue an automatic upgrade with an upgrade plan. Instead, we need social consensus to manually replace the binary and get the chain moving.
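A hedged sketch of what that manual replacement could look like on a single validator; the binary name, install path, and systemd unit are illustrative assumptions.

```bash
# Hedged sketch: manually swap in the fixed binary on one validator
# (service name and paths are illustrative).
sudo systemctl stop poktrolld                       # stop the halted node
sudo cp ./poktrolld-fixed /usr/local/bin/poktrolld  # replace the binary
poktrolld version                                   # confirm the fixed version is in place
sudo systemctl start poktrolld                      # restart; the chain resumes once >2/3
                                                    # of voting power runs the new binary
```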
TODO_IMPROVE(@okdas):
For step 2: Investigate if the CometBFT rounds/steps need to be aligned as in Morse chain halts. See https://docs.cometbft.com/v1.0/spec/consensus/consensus
For step 3: Add cosmovisor documentation so it's configured to automatically replace the binary when syncing from genesis.
Rollback, fork and upgrade
We do not currently use x/gov or onchain voting for upgrades. Instead, all participants in our DAO vote on upgrades off-chain, and the Foundation executes transactions on their behalf.
This approach should be avoided, or at least requires more testing. In our tests, the full nodes kept propagating the existing blocks signed by the Validators, making it hard to roll back.
Performing a rollback is analogous to forking the network at the older height.
Revert validator set prior to halt
Disconnect the Validator set from the rest of the network 3 blocks prior to the height of the chain halt. For example:
Assume an issue at height 103.
Revert the validator set to height 100.
Submit an upgrade transaction at height 101.
Upgrade the chain at height 102.
Avoid the issue at height 103.
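A hedged sketch of reverting a validator's data for this example, assuming `poktrolld` exposes the standard Cosmos SDK `rollback` command (each invocation rewinds one height):

```bash
# Hedged sketch: rewind one validator's state from the halt height back to 100.
# --hard removes the last block as well as the application state.
poktrolld rollback --hard   # e.g. 103 -> 102
poktrolld rollback --hard   # 102 -> 101
poktrolld rollback --hard   # 101 -> 100
```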
Isolate validator set from full nodes
Isolate the validator set from full nodes (see validator isolation risks). This is necessary to avoid full nodes gossiping blocks that have been rolled back. This may require using a firewall or a private network. Validators should only be permitted to gossip blocks amongst themselves.
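As an illustration, the isolation could be enforced with host firewall rules that only allow the other Validators to reach the CometBFT P2P port; the peer addresses and the use of `ufw` are assumptions for the sake of the example.

```bash
# Hedged sketch: only the other validators (illustrative addresses) may reach
# this validator's CometBFT P2P port (26656 by default); everything else is
# denied so rolled-back blocks cannot be re-gossiped by full nodes.
sudo ufw default deny incoming
sudo ufw allow from 10.0.0.2 to any port 26656 proto tcp
sudo ufw allow from 10.0.0.3 to any port 26656 proto tcp
sudo ufw enable
```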
Start validator set and perform upgrade
Start the validator set and perform the upgrade. Example:
Start all Validators at height 100.
On block 101, submit the `MsgSoftwareUpgrade` transaction with `Plan.height` set to 102 (see the transaction sketch after these steps).
`x/upgrade` will perform the upgrade in the `EndBlocker` of block 102.
Each node will then stop advancing and report an error, waiting for the upgrade to be performed (i.e. the binary to be replaced):
Cosmovisor deployments automatically replace the binary.
Manual deployments will require a manual replacement at this point.
Start the node back up.
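A hedged sketch of the upgrade transaction referenced above: the message is the standard Cosmos SDK `MsgSoftwareUpgrade` with `Plan.height` set to 102, while the authority address, upgrade name, info payload, and the `authz exec` submission path are illustrative assumptions.

```bash
# Hedged sketch: a MsgSoftwareUpgrade with Plan.height = 102. The authority
# address, upgrade name, and submission via an authz grant are illustrative.
cat > upgrade_tx.json <<'EOF'
{
  "body": {
    "messages": [
      {
        "@type": "/cosmos.upgrade.v1beta1.MsgSoftwareUpgrade",
        "authority": "pokt1...",
        "plan": {
          "name": "vX.Y.Z-fix",
          "height": "102",
          "info": ""
        }
      }
    ]
  }
}
EOF
poktrolld tx authz exec upgrade_tx.json --from <grantee-key> --yes
```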
Troubleshooting
Data rollback - retrieving snapshot at a specific height (step 5)
There are two ways to get a snapshot from a prior height.
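As one possible illustration of retrieving state at a specific height, assuming `poktrolld` exposes the standard Cosmos SDK `snapshots` subcommands (heights and format below are placeholders):

```bash
# Hedged sketch, assuming poktrolld exposes the standard Cosmos SDK snapshot
# subcommands (heights and format are placeholders).
# On a node that still has application state for the target height:
poktrolld snapshots export --height 100   # create a local snapshot at height 100
poktrolld snapshots list                  # note the snapshot's height and format

# On the node being restored, after transferring the snapshot data
# (e.g. via the `snapshots dump` / `snapshots load` subcommands):
poktrolld snapshots restore 100 3         # restore snapshot at height 100, format 3
```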
Validator Isolation - risks (step 6)
Having at least one connected node that still has knowledge of the pre-rollback chain can jeopardize the whole process. In particular, the following log errors are a sign of nodes syncing blocks from the wrong fork:
`found conflicting vote from ourselves; did you unsafe_reset a validator?`
`conflicting votes from validator`
