Troubleshooting Resharding
Resharding Timeline
The 1.37.0 release contains a protocol upgrade that splits shard 3 into two shards.
When the network upgrades to protocol version 64, it will have 5 shards defined by these border accounts: vec!["aurora", "aurora-0", "kkuuue2akv_1630967379.near", "tge-lockup.sweat"].
Any code that hardcodes the number of shards or the mapping of an account to a shard ID may break. If you are not sure whether your tool will work after mainnet updates to protocol version 64, test it on testnet, which is already running on 5 shards.
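As an illustration of deriving the shard count and the account-to-shard mapping from the border accounts instead of hardcoding them, here is a minimal Python sketch. It assumes that each border account is the first account of the next shard and that accounts are compared as plain strings. The function names are hypothetical and are not part of any NEAR library.

```python
from bisect import bisect_right

# Border accounts of the protocol version 64 shard layout, as listed above.
BORDER_ACCOUNTS = [
    "aurora",
    "aurora-0",
    "kkuuue2akv_1630967379.near",
    "tge-lockup.sweat",
]


def num_shards(border_accounts=BORDER_ACCOUNTS):
    # N border accounts split the account space into N + 1 shards.
    return len(border_accounts) + 1


def account_to_shard(account_id, border_accounts=BORDER_ACCOUNTS):
    # Assumption: the shard index equals the number of border accounts that
    # sort at or before the account ID, so a border account itself belongs
    # to the shard that starts at it. The list must be sorted.
    return bisect_right(border_accounts, account_id)


if __name__ == "__main__":
    for account in ["alice.near", "aurora", "near", "token.sweat"]:
        print(account, "->", account_to_shard(account))
```

A tool written this way only needs a new border account list when the layout changes, rather than a rewritten mapping table.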
Resharding will happen in the epoch preceding the protocol upgrade. So, if the voting happens in epoch X, resharding will happen in epoch X+1, and the protocol upgrade will happen in epoch X+2. Voting for the upgrade to protocol version 64 will start on Monday 2024-03-11 18:00:00 UTC. By our estimates, resharding will start on Tuesday 2024-03-12 07:00:00 UTC, and the first epoch with 5 shards will start on Tuesday 2024-03-12 23:00:00 UTC.
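To make the timeline easier to reason about in scripts (for example, to refuse an automated restart during the resharding epoch), the dates above can be encoded as follows. This is only a sketch: the two later timestamps are the estimates quoted above, the actual epoch boundaries may differ, and the phase names are made up for illustration.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Dates from the timeline above; the last two are estimates.
VOTING_START = datetime(2024, 3, 11, 18, 0, tzinfo=UTC)
RESHARDING_START = datetime(2024, 3, 12, 7, 0, tzinfo=UTC)
FIVE_SHARDS_START = datetime(2024, 3, 12, 23, 0, tzinfo=UTC)


def phase(now):
    """Return a hypothetical phase label for a timezone-aware datetime."""
    if now < VOTING_START:
        return "before voting"
    if now < RESHARDING_START:
        return "voting"
    if now < FIVE_SHARDS_START:
        return "resharding"  # do not restart the node in this window
    return "five shards"


if __name__ == "__main__":
    print(phase(datetime.now(UTC)))
    print("estimated resharding window:", FIVE_SHARDS_START - RESHARDING_START)
```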
Resharding runs as a background process while the node operates normally. It takes hours to finish and should not be interrupted. A node that fails to reshard will not be able to sync with the network.
1.37.0 release resharding
General recommendations
- Do not restart your node during the resharding epoch. It may result in your node not being able to finish resharding.
- Disable state sync until your node successfully transitions to the epoch with protocol version 64.
You should disable it before the voting date (Monday 2024-03-11 18:00:00 UTC).
It should be safe to enable it on Thursday 2024-03-14.
To disable state sync, assign false to the state_sync_enabled field in config.
- Make sure that state snapshot compaction is disabled.
Your node will create a state snapshot for resharding.
State snapshot compaction may lead to stack overflow.
Make sure that fields store.state_snapshot_config.compaction_enabled and store.state_snapshot_compaction_enabled are set to false.
- Ensure that you have an additional 200 GB of free space on your .near/data disk.
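The recommendations above can be checked with a short script before the voting date. The sketch below assumes the node home is ~/.near and the config file is config.json inside it. It reads "200 GB" as 200 * 10^9 bytes and treats a missing field as a problem, which is stricter than the text requires. All function names are hypothetical.

```python
import json
import shutil
from pathlib import Path

# Fields that the recommendations above say must be false.
MUST_BE_FALSE = [
    "state_sync_enabled",
    "store.state_snapshot_config.compaction_enabled",
    "store.state_snapshot_compaction_enabled",
]

REQUIRED_FREE_BYTES = 200 * 10**9  # assumption: decimal gigabytes


def lookup(config, dotted_path):
    """Return the value at a dotted path such as 'store.x.y', or None if absent."""
    node = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def config_problems(config):
    """List every field that is not explicitly set to false."""
    problems = []
    for path in MUST_BE_FALSE:
        value = lookup(config, path)
        if value is not False:
            problems.append(f"{path} is {value!r}, expected false")
    return problems


def has_enough_space(data_dir, required=REQUIRED_FREE_BYTES):
    """Check free space on the disk that holds data_dir."""
    return shutil.disk_usage(data_dir).free >= required


def check_node_home(home):
    """Run both checks against a node home directory (assumed layout)."""
    home = Path(home)
    config = json.loads((home / "config.json").read_text())
    problems = config_problems(config)
    if not has_enough_space(home / "data"):
        problems.append("less than 200 GB free on the data disk")
    return problems
```

For example, `check_node_home(Path.home() / ".near")` returns an empty list when all three fields are false and enough disk space is available.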
Before resharding
If your node is out of sync
If your node is far behind the network, consider downloading the latest DB snapshot provided by Pagoda from s3 Node Data Snapshots. Your node will likely fail resharding if it is not in sync with the network for the majority of the resharding epoch.
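One way to check whether a node is still catching up is to look at the sync information it reports. The sketch below assumes the node serves its RPC on localhost:3030 and that the status response contains a sync_info.syncing flag. Adjust the URL to your setup and treat the helper names as hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_status(url="http://localhost:3030/status"):
    # Assumption: the node's RPC listens on port 3030 on this host.
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def is_syncing(status):
    # True while the node reports that it is still syncing.
    return bool(status["sync_info"]["syncing"])


if __name__ == "__main__":
    print("node is syncing:", is_syncing(fetch_status()))
```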
If you run a legacy archival node
We don’t expect legacy archival nodes to be able to finish resharding and stay in sync on mainnet. We highly recommend migrating to split storage archival nodes as soon as possible. The easiest way is to download DB snapshots provided by Pagoda from s3. Be aware that the cold DB of a mainnet split storage node is about 22 TB, and it may take a long time to download. You can find instructions on how to migrate to split storage on the Split Storage page.
During resharding
Monitoring
To monitor resharding, you can use the metrics near_resharding_status, near_resharding_batch_size, and near_resharding_batch_prepare_time_bucket.
You can read more on github.
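If you do not have a metrics dashboard, the resharding metrics can be filtered from the node's Prometheus text output with a small script. The sketch below assumes the metrics are served at http://localhost:3030/metrics. Names are matched by prefix so that related series of the same metric are included. The function names are hypothetical.

```python
from urllib.request import urlopen

# Metric names mentioned above, used as prefixes.
RESHARDING_METRICS = (
    "near_resharding_status",
    "near_resharding_batch_size",
    "near_resharding_batch_prepare_time_bucket",
)


def resharding_lines(metrics_text, prefixes=RESHARDING_METRICS):
    """Return the sample lines whose metric name starts with one of the prefixes."""
    selected = []
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefixes):
            selected.append(line)
    return selected


def fetch_metrics(url="http://localhost:3030/metrics"):
    # Assumption: the node exposes Prometheus metrics on its RPC port.
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    for line in resharding_lines(fetch_metrics()):
        print(line)
```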
If you observe problems with block production or resharding performance, you can adjust the resharding throttling configuration. This does not require a node restart: you can send a signal to the neard process to make it load the new config. Read more on GitHub.
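Sending the signal can be scripted. The text above does not name the signal, so the sketch below assumes it is SIGHUP, and it leaves the process ID lookup to you. Confirm both against the linked documentation before relying on it. The function name is hypothetical.

```python
import os
import signal


def request_config_reload(pid, sig=signal.SIGHUP):
    """Send a signal to a running process, asking it to reload its config.

    Assumption: neard reloads its config on SIGHUP. The signal is not
    stated in this document.
    """
    os.kill(pid, sig)
```

For example, after editing the config you would call `request_config_reload(neard_pid)`, where `neard_pid` is the process ID of your running node.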
After resharding
If your node failed to reshard or is not able to sync with the network after the protocol upgrade, you will need to download the latest DB snapshot provided by Pagoda from s3 Node Data Snapshots. We will try to ensure that these snapshots are uploaded as soon as possible, but you may need to wait several hours for them to be available.
Pagoda s3 DB snapshots have a timestamp of their creation in the file path. Check that you are downloading a snapshot that was taken after the switch to protocol version 64.
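The timestamp check can be automated. The sketch below assumes the path contains a UTC timestamp of the form YYYY-MM-DDTHH:MM:SSZ, which may not match the real snapshot layout. It compares against the estimated start of the first 5-shard epoch, so replace that value with the actual upgrade time once it is known.

```python
import re
from datetime import datetime, timezone

# Estimated start of the first epoch with protocol version 64 (see the timeline).
UPGRADE_TIME = datetime(2024, 3, 12, 23, 0, tzinfo=timezone.utc)

# Assumed timestamp format inside the snapshot path (hypothetical).
TIMESTAMP_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})Z")


def snapshot_time(path):
    """Extract the first timestamp found in the path, or None if there is none."""
    match = TIMESTAMP_RE.search(path)
    if match is None:
        return None
    return datetime(*map(int, match.groups()), tzinfo=timezone.utc)


def taken_after_upgrade(path, upgrade_time=UPGRADE_TIME):
    """True only if the path carries a timestamp later than the upgrade time."""
    taken_at = snapshot_time(path)
    return taken_at is not None and taken_at > upgrade_time
```

A path without a recognizable timestamp is rejected rather than accepted, so an unexpected path format fails safely.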