MainNet Postmortem (2020-07-10)

Mainnet block production stopped for a few hours today. On-chain services (mainly subscribe) were affected. Off-chain data transmission (relay) was NOT affected.

Based on our investigation, the outage was triggered by a massive number of nodes leaving the network, which disrupted block propagation. Currently, a node uses only a small set of random neighbors to propagate blocks. To save resources aggressively, this neighbor set is small (<= 8 outbound neighbors per node) and is updated slowly (1 neighbor per 5 minutes). When too many nodes leave the network in a short period of time, a newly produced block might not reach the whole network, leaving some nodes unable to update their ledgers.
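To illustrate the failure mode, here is a minimal simulation sketch. It is not the actual node implementation: it assumes blocks are flooded only along each node's outbound random neighbors, that departed nodes remain as stale entries in their peers' neighbor sets, and that no neighbor updates happen during the churn. The function names and default parameters (`build_overlay`, `reach_fraction`, `simulate`, 500 nodes, 450 leaving) are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


def build_overlay(num_nodes, out_degree, rng):
    """Give every node `out_degree` distinct random outbound neighbors.

    Assumption: a simplified stand-in for the random neighbor set.
    """
    overlay = {}
    for node in range(num_nodes):
        candidates = [n for n in range(num_nodes) if n != node]
        overlay[node] = rng.sample(candidates, out_degree)
    return overlay


def reach_fraction(overlay, alive, source):
    """Return the fraction of live nodes that receive a block sent by `source`."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for peer in overlay[node]:
            # A departed peer is a stale entry: it neither receives nor forwards.
            if peer in alive and peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return len(seen) / len(alive)


def simulate(num_nodes=500, out_degree=8, num_leaving=450, seed=1):
    """Build an overlay, remove `num_leaving` nodes at once, and measure reach."""
    rng = random.Random(seed)
    overlay = build_overlay(num_nodes, out_degree, rng)
    leaving = set(rng.sample(range(num_nodes), num_leaving))
    alive = set(range(num_nodes)) - leaving
    source = min(alive)  # any surviving node can act as the block producer
    return reach_fraction(overlay, alive, source)


if __name__ == "__main__":
    for degree in (8, 16):
        print(f"out_degree={degree}: reach={simulate(out_degree=degree):.2f}")
```

In this model, the more nodes leave at once, the fewer live entries remain in each neighbor set, so a block is increasingly likely to stall before it reaches every surviving node.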

The lesson we learned is that resource usage should not be optimized so aggressively that it affects the robustness of the network. To solve the problem and prevent it from happening again, a new version, v2.0.1, was released with the following relevant changes:

  1. The size of the random neighbor set is doubled. We might increase it further in the future if that is not enough.
  2. The random neighbor set is fully refilled each time it is updated.
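The effect of the second change can be sketched as follows. This is an illustrative model only, not the v2.0.1 code: it assumes an update first drops neighbors that have left, then either adds a single replacement (the earlier behavior, as described above) or refills every empty slot (the new behavior). The helper names `refresh` and `updates_until_full` are hypothetical.

```python
import random


def refresh(neighbors, live_peers, size, rng, fill_all):
    """Run one neighbor-set update.

    Drops departed neighbors, then adds one new peer (fill_all=False)
    or as many as are needed to reach `size` (fill_all=True).
    `live_peers` is assumed to exclude the node itself.
    """
    kept = [n for n in neighbors if n in live_peers]
    candidates = sorted(live_peers - set(kept))
    missing = max(0, size - len(kept))
    wanted = missing if fill_all else min(1, missing)
    return kept + rng.sample(candidates, min(wanted, len(candidates)))


def updates_until_full(neighbors, live_peers, size, rng, fill_all):
    """Count the updates needed until the set holds `size` live neighbors."""
    rounds = 0
    while True:
        live_count = sum(1 for n in neighbors if n in live_peers)
        if live_count >= size:
            return rounds
        updated = refresh(neighbors, live_peers, size, rng, fill_all)
        if len(updated) <= live_count:
            return rounds  # no candidates left; the set cannot be filled
        neighbors = updated
        rounds += 1


if __name__ == "__main__":
    rng = random.Random(0)
    old_neighbors = list(range(8))           # 8 outbound neighbors
    live = {6, 7} | set(range(100, 200))     # 6 of the 8 have left
    print(updates_until_full(old_neighbors, live, 8, rng, fill_all=False))
    print(updates_until_full(old_neighbors, live, 8, rng, fill_all=True))
```

Under these assumptions, a node that loses six of its eight neighbors needs six updates to recover when only one slot is filled per update, but a single update when the set is fully refilled.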

This outage also exposed one potential drawback of the current routing algorithm: nodes do not have any economic incentive to be reliable. To address this issue, we created [NKP-0021] Prefer stable nodes in routing, which introduces a mechanism that incentivizes nodes to be more stable.