Chain Halt Recovery
This document describes how to recover from a chain halt.
It assumes that the cause of the chain halt has been identified, and that the new release has been created and verified to function correctly.
Background
Pocket Network is built on top of the Cosmos SDK, which uses the CometBFT consensus engine. CometBFT's Byzantine Fault Tolerant (BFT) consensus algorithm requires more than 2/3 of the Validator voting power to be online and voting for the same block in order to reach consensus. To maintain liveness and avoid a chain halt, we therefore need more than 2/3 of Validators to participate and run the same version of the software. For example, on a network of 100 equally weighted Validators, at least 67 must be online and on the same version.
Resolving halts during a network upgrade
If the halt was caused by a network upgrade, the fix can be as simple as skipping the faulty upgrade (i.e. `unsafe-skip-upgrade`) and creating a new (fixed) upgrade.
Read more about upgrade contingency plans: https://dev.poktroll.com/develop/upgrades/chain_halt_upgrade_contigency_plans
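As an illustration, skipping a faulty upgrade is typically done by restarting each validator with the standard Cosmos SDK `--unsafe-skip-upgrades` flag; the binary name `poktrolld` and the height below are assumptions for the sake of the example.

```bash
# Hedged example: restart the node while skipping the faulty upgrade that was
# scheduled at height 103 (height and binary name are illustrative).
# The chain only resumes once >2/3 of validator voting power restarts with
# the same flag and height.
poktrolld start --unsafe-skip-upgrades 103
```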
Manual binary replacement (preferred)
Significant side effect: this breaks the ability to sync from genesis without manual intervention.
For example, when a node syncing from the first block reaches the height of the consensus-breaking change, its operator needs to manually replace the binary with the new one. There are efforts underway to mitigate this issue, including cosmovisor configuration that could automate the process.
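For reference, such automation would likely rely on cosmovisor's standard upgrades directory layout. The sketch below is a minimal illustration; the home path, upgrade name, and binary locations are assumptions rather than project-specific values.

```bash
# Hedged sketch of a standard cosmovisor setup (paths and names illustrative).
# Cosmovisor keeps multiple binary versions side by side and swaps them at the
# configured upgrade heights.
export DAEMON_NAME=poktrolld
export DAEMON_HOME=$HOME/.poktroll
export DAEMON_RESTART_AFTER_UPGRADE=true

mkdir -p "$DAEMON_HOME/cosmovisor/genesis/bin" \
         "$DAEMON_HOME/cosmovisor/upgrades/<upgrade-name>/bin"
cp ./poktrolld-old "$DAEMON_HOME/cosmovisor/genesis/bin/poktrolld"
cp ./poktrolld-new "$DAEMON_HOME/cosmovisor/upgrades/<upgrade-name>/bin/poktrolld"

cosmovisor run start
```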
Since the chain is not moving, it is impossible to issue an automatic upgrade with an upgrade plan. Instead, we need social consensus to manually replace the binary and get the chain moving.
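A hedged sketch of what that manual replacement could look like on a single validator; the binary name, install path, and systemd unit are illustrative assumptions.

```bash
# Hedged sketch: manually swap in the fixed binary on one validator
# (service name and paths are illustrative).
sudo systemctl stop poktrolld                       # stop the halted node
sudo cp ./poktrolld-fixed /usr/local/bin/poktrolld  # replace the binary
poktrolld version                                   # confirm the fixed version is in place
sudo systemctl start poktrolld                      # restart; the chain resumes once >2/3
                                                    # of voting power runs the new binary
```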
TODO_IMPROVE(@okdas):
For step 2: Investigate if the CometBFT rounds/steps need to be aligned as in Morse chain halts. See https://docs.cometbft.com/v1.0/spec/consensus/consensus
For step 3: Add cosmovisor documentation so it's configured to automatically replace the binary when syncing from genesis.
Rollback, fork and upgrade
We do not currently use x/gov or onchain voting for upgrades. Instead, all participants in our DAO vote on upgrades off-chain, and the Foundation executes transactions on their behalf.
This approach should be avoided, or at least requires more testing. In our tests, the full nodes kept propagating the existing blocks signed by the Validators, making it hard to roll back.
Performing a rollback is analogous to forking the network at the older height.
Revert validator set prior to halt
Disconnect the Validator set from the rest of the network 3 blocks prior to the height of the chain halt. For example:
Assume an issue at height 103.
Revert the validator set to height 100.
Submit an upgrade transaction at height 101.
Upgrade the chain at height 102.
Avoid the issue at height 103.
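A hedged sketch of reverting a validator's data for this example, assuming `poktrolld` exposes the standard Cosmos SDK `rollback` command (each invocation rewinds one height):

```bash
# Hedged sketch: rewind one validator's state from the halt height back to 100.
# --hard removes the last block as well as the application state.
poktrolld rollback --hard   # e.g. 103 -> 102
poktrolld rollback --hard   # 102 -> 101
poktrolld rollback --hard   # 101 -> 100
```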
Isolate validator set from full nodes
Isolate the validator set from full nodes (see validator isolation risks). This is necessary to avoid full nodes gossiping blocks that have been rolled back. This may require using a firewall or a private network. Validators should only be permitted to gossip blocks amongst themselves.
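As an illustration, the isolation could be enforced with host firewall rules that only allow the other Validators to reach the CometBFT P2P port; the peer addresses and the use of `ufw` are assumptions for the sake of the example.

```bash
# Hedged sketch: only the other validators (illustrative addresses) may reach
# this validator's CometBFT P2P port (26656 by default); everything else is
# denied so rolled-back blocks cannot be re-gossiped by full nodes.
sudo ufw default deny incoming
sudo ufw allow from 10.0.0.2 to any port 26656 proto tcp
sudo ufw allow from 10.0.0.3 to any port 26656 proto tcp
sudo ufw enable
```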
Start validator set and perform upgrade
Start the validator set and perform the upgrade. Example:
Start all Validators at height 100.
On block 101, submit the `MsgSoftwareUpgrade` transaction with `Plan.height` set to 102 (see the transaction sketch after these steps).
`x/upgrade` will perform the upgrade in the `EndBlocker` of block 102.
Each node will then stop advancing and report an error, waiting for the upgrade to be performed (i.e. the binary to be replaced):
Cosmovisor deployments automatically replace the binary.
Manual deployments will require a manual replacement at this point.
Start the node back up.
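A hedged sketch of the upgrade transaction referenced above: the message is the standard Cosmos SDK `MsgSoftwareUpgrade` with `Plan.height` set to 102, while the authority address, upgrade name, info payload, and the `authz exec` submission path are illustrative assumptions.

```bash
# Hedged sketch: a MsgSoftwareUpgrade with Plan.height = 102. The authority
# address, upgrade name, and submission via an authz grant are illustrative.
cat > upgrade_tx.json <<'EOF'
{
  "body": {
    "messages": [
      {
        "@type": "/cosmos.upgrade.v1beta1.MsgSoftwareUpgrade",
        "authority": "pokt1...",
        "plan": {
          "name": "vX.Y.Z-fix",
          "height": "102",
          "info": ""
        }
      }
    ]
  }
}
EOF
poktrolld tx authz exec upgrade_tx.json --from <grantee-key> --yes
```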
Troubleshooting
Data rollback - retrieving snapshot at a specific height (step 5)
There are two ways to get a snapshot from a prior height.
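As one possible illustration of retrieving state at a specific height, assuming `poktrolld` exposes the standard Cosmos SDK `snapshots` subcommands (heights and format below are placeholders):

```bash
# Hedged sketch, assuming poktrolld exposes the standard Cosmos SDK snapshot
# subcommands (heights and format are placeholders).
# On a node that still has application state for the target height:
poktrolld snapshots export --height 100   # create a local snapshot at height 100
poktrolld snapshots list                  # note the snapshot's height and format

# On the node being restored, after transferring the snapshot data
# (e.g. via the `snapshots dump` / `snapshots load` subcommands):
poktrolld snapshots restore 100 3         # restore snapshot at height 100, format 3
```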
Validator Isolation - risks (step 6)
Having at least one connected node that still has knowledge of the pre-rollback chain can jeopardize the whole process. In particular, the following log errors are a sign of nodes syncing blocks from the wrong fork:
`found conflicting vote from ourselves; did you unsafe_reset a validator?`
`conflicting votes from validator`
