Skip to main content

How to set up Failover/Backup node

Overview​

There are two main ways to set up a failover node, each with its own pros and cons. Read through both options and choose the one that works best for your situation.

Option 1 is best if you need a failover node that can also serve as an RPC node, or if you need one failover node for multiple validators. However, it requires a restart when the failover node becomes a validator.

Option 2 is best for minimizing downtime, as the failover happens almost instantly. However, the failover node must be dedicated to a single validator and cannot serve as an RPC node.

Important Note​

  • Once a node becomes a validator node and begins tracking a specific shard, it cannot be reverted to track all shards.
  • During the failover process, a node operator may lose one RPC node when the node transitions to a validator node (option 1). This is a known issue, and it is on the team's roadmap for resolution.

[Option 1] RPC node as a failover node​

This is the traditional recovery plan which has been available on the mainnet, where you have a primary validator node and a secondary failover node.

Pros​

  • It is possible to use an RPC node as a failover node.
  • Failover node can be used for multiple validator nodes, each tracking different shards.

Cons​

  • A restart of neard is required before a failover node can be promoted to a new validator node.

Setup for the failover node while it is on standby​

In config.json

  • Set tracked_shards to [0]
  • Set store.load_mem_tries_for_tracked_shards to false

Procedure​

  • Copy validator_key.json to the failover node.
  • In config.json file of the failover node:
    • Set tracked_shards to []
    • Set store.load_mem_tries_for_tracked_shards to true
  • Note: You don’t need to swap the node_key.json file on the failover node. The network identifies nodes by their key and IP address, so changing the IP address might prevent successful syncing.
  • Stop the primary validator node.
  • Restart the failover node.

[Option 2] Validator key hot swap​

This method allows you to quickly transfer the validator key to the failover node with very little downtime.

Pros​

  • No need to restart the failover node during transition.
  • Failover can happen in seconds, minimizing downtime.

Cons​

  • The failover node cannot be used as an RPC node while it is on standby.
  • The failover node is dedicated to just one validator node.

Setup for the failover node while it is on standby​

  • In config.json
    • Add “tracked_shadow_validator”: “<validator_id>” (where <validator_id> is the pool ID of the validator).
    • Set tracked_shards to [].
    • Set store.load_mem_tries_for_tracked_shards to true.
  • Note: The failover node must be dedicated to a single validator and cannot be used as an RPC node during failover. Since mem_trie doesn’t work well with RPC nodes, the failover node won’t be able to perform RPC functions.

Procedure​

With the changes made to the config.json file and the validator key hot swap procedure, the failover node can quickly take over validator responsibilities.

  • Copy validator_key.json to the failover node.
  • [Optional] Remove “tracked_shadow_validator”: “<validator_id>” from config.json file of the failover node.
  • Note: You don’t need to swap the node_key.json file on the failover node. The network identifies nodes by their key and IP address, so changing the key might prevent successful syncing.
  • Stop the primary node.
  • Send a SIGHUP signal to the failover node (without restarting it).
  • The failover node will pick up the validator key and start validating.