[NKP-0021] Prefer stable nodes in routing to incentivize nodes to be more stable

The current relay packet routing algorithm (proximity routing) only takes latency and packet loss into consideration, and has no preference for node stability. As we have seen recently, a massive number of nodes going online and offline, although it has no impact on off-chain data transmission (relay), can potentially cause issues for block propagation. Therefore, I’m proposing a routing algorithm that also prefers to relay packets to nodes that have stayed online for a longer time. This mechanism can economically incentivize nodes to stay online longer and be more reliable without any centralized monitor.

The current routing algorithm can be briefly described as follows: A node actively measures the latency from itself to all its neighbors. When the node is given X (currently 3) next hop candidates, it will choose the one with the lowest latency.
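The selection step described above can be sketched roughly as follows. This is only an illustration, not the nknd implementation: the function name `pick_lowest_latency` and the `latency_ms` dictionary are hypothetical.

```python
def pick_lowest_latency(candidates, latency_ms):
    """Return the candidate with the lowest measured latency.

    candidates: node ids offered as next hops (hypothetical representation).
    latency_ms: the node's own latency measurement for each neighbor, in ms.
    """
    return min(candidates, key=lambda node_id: latency_ms[node_id])


# Example with X = 3 candidates and made-up measurements.
measured = {"node-a": 40.0, "node-b": 25.0, "node-c": 90.0}
print(pick_lowest_latency(["node-a", "node-b", "node-c"], measured))
```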

The proposed routing algorithm can be briefly described as follows: A node actively measures the latency from itself to all its neighbors. In addition, it keeps track of how long each neighbor connection has been established, which will be reset if the connection is lost. When a node is given X next hop candidates, it will compute a score that depends on both latency and connection uptime, and will choose the neighbor with the highest score. An example of such a function is uptime/latency.
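A minimal sketch of the proposed selection, using the uptime/latency example score. The `Neighbor` class, its fields, and the function names are assumptions made for illustration and are not taken from nknd.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Neighbor:
    # All names here are hypothetical; nknd may track this differently.
    node_id: str
    latency_ms: float
    connected_at: float = field(default_factory=time.monotonic)

    def on_reconnect(self, now=None):
        # A lost connection resets the uptime counter.
        self.connected_at = time.monotonic() if now is None else now

    def uptime_s(self, now=None):
        now = time.monotonic() if now is None else now
        return max(now - self.connected_at, 0.0)


def score(neighbor, now=None):
    # Example score from the proposal: uptime / latency.
    # The small floor only guards against division by zero.
    return neighbor.uptime_s(now) / max(neighbor.latency_ms, 1e-3)


def pick_next_hop(candidates, now=None):
    now = time.monotonic() if now is None else now
    return max(candidates, key=lambda n: score(n, now))


# A long-connected but slower neighbor beats a fast, freshly connected one.
fast_new = Neighbor("fast-new", latency_ms=20.0, connected_at=990.0)
slow_old = Neighbor("slow-old", latency_ms=60.0, connected_at=0.0)
print(pick_next_hop([fast_new, slow_old], now=1000.0).node_id)
```

With this particular score, uptime can outweigh a large latency difference, which is why the choice of score function matters.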

Expected results of using such an algorithm:

  1. Nodes will prefer to stay online for longer and minimize downtime.
  2. Nodes with a more stable network (but the same latency) will get higher rewards.
  3. Nodes will prefer to upgrade to the latest nknd version, because nodes that restart together with most other nodes will have higher connection uptime from other nodes’ perspective.

These expectations, if achieved, will make the network more stable.

Please vote on whether you think we should or shouldn’t implement this proposal. You are also welcome to share your thoughts in the comments.

  • Yes
  • No

One potential drawback: newly joined nodes will have an even smaller chance of being selected as relay nodes, which reduces their likelihood of becoming block proposer and getting the mining reward. The time to first block might increase further, making new miners wait longer. Over a longer period of time, it is not an issue. But new miners are always anxious for their first block, as we have seen so many times in this forum and elsewhere.

I think this concern can be partially resolved by choosing a score function whose advantage does not keep growing indefinitely as uptime increases, but it’s still true that time to first block will increase to some degree.
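One way such a bounded score could look is to cap the uptime term, so that uptime beyond some limit earns no extra advantage. This is just a sketch: the cap value of one day and the name `capped_score` are assumptions, not something decided in this proposal.

```python
ONE_DAY_S = 24 * 60 * 60  # hypothetical cap; the real value would need tuning


def capped_score(uptime_s, latency_ms, cap_s=ONE_DAY_S):
    """Uptime/latency score where uptime stops counting beyond cap_s."""
    effective_uptime = min(uptime_s, cap_s)
    return effective_uptime / max(latency_ms, 1e-3)


# After the cap is reached, a node connected for a week scores the same
# as one connected for a day (at equal latency).
print(capped_score(7 * ONE_DAY_S, 50.0) == capped_score(ONE_DAY_S, 50.0))
```

A new node still starts at a disadvantage, but it can fully catch up once its connection uptime reaches the cap.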

Sounds like a good idea.

Downside: I am mining from my Raspberry Pi 4 and I have to reboot it every 2-3 days because it goes offline after a while. I have no idea why. When I presented this problem in the Telegram group, some others mentioned having similar issues.

Is this a known issue? Any fix for this?

No it’s not a known issue. Could you please send me your logs? I’m not sure if you can do it in the forum, but if not you can find me on both Discord and Telegram. Discord might be better for tech discussion.

First of all: thank you for taking the time to respond to my issue.

I will first relaunch my node (it has been offline since the update to 2.0.2). Next time my node goes offline, I will contact you via Discord and send you the log.

I think this is crucial and should be capped, also in an ideal world I imagine the protocol should be sufficiently flexible that nodes could be going online and offline all the time without that causing any significant issues (perhaps a few chunks dropped incidentally here and there when a node goes down without notice).

Yeah technically we are quite near the state where nodes going online and offline doesn’t affect anything. But economically we still want to make nodes stable because that enables the node to provide more services. CDN, for example, definitely wants nodes to be stable because the client side (browser) will show an error instead of reconnecting if a current connection is cut off.

I’ve been mining for over two weeks with still no nkn. relayMessageCount: 1951020

Has this algorithm been implemented yet?

Not yet. Is the proposalSubmitted field of your node still 0?

Yes still 0. What does it mean? Thanks for replying.

It means your node hasn’t been elected to be a block proposer yet. You probably just need to wait :slight_smile:

I have a couple of nodes in the netherlands (geoip doesn’t show it as such, but they really are). Those nodes just drop dead with no neighbors every once in a while.

I did make a bug report about that and it was advised to just restart the node if it shut down itself due to no more neighbors. In my case i’m letting docker automatically restart nodes that i didn’t manually stop. Sometimes it takes a long while (my nodes in the netherlands had their latest automatic restart ~2 weeks ago), sometimes it’s in mere days. If you were to proceed with this proposal (which sounds awesome btw!) then you should probably look into a way to not shut down a node if it doesn’t have neighbors anymore.

Because if you do proceed and this neighbor issue remains, you’ve (probably unintentionally) added an incentive to start a node in a region where more are active already. Thus you don’t incentivise starting a node in an area that has no or just a few nodes. That would, in my opinion, be a bit of a downside to this proposal.

If a node has slow Internet (but is not completely disconnected), it should never lose all its neighbors, because for each neighbor that disconnects, the node will immediately try to find another neighbor as a replacement. Because a node typically has around 100 neighbors, it’s basically impossible for a node with working Internet to lose all its neighbors no matter where the node is. The only possible case when nknd quits due to loss of all neighbors is when the node completely loses its Internet connection for a period of time (e.g. 10+ seconds). In my experience it’s usually the ISP’s problem…
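The behavior described above can be illustrated with a toy model. It is not nknd code: `NeighborTable` and the `find_replacement` callback are hypothetical, and the callback simply returns `None` to stand in for a node whose Internet is completely down.

```python
class NeighborTable:
    """Toy model: replace each lost neighbor, quit only when none are left."""

    def __init__(self, neighbors, find_replacement):
        self.neighbors = set(neighbors)
        # Hypothetical callback: returns a new node id, or None if no
        # replacement can be reached (e.g. the Internet is fully down).
        self.find_replacement = find_replacement

    def on_disconnect(self, node_id):
        self.neighbors.discard(node_id)
        replacement = self.find_replacement(self.neighbors)
        if replacement is not None:
            self.neighbors.add(replacement)

    def should_quit(self):
        return not self.neighbors


# With working Internet, every loss is replaced and the table never empties.
online = NeighborTable({"n1", "n2"}, find_replacement=lambda current: "n3")
online.on_disconnect("n1")
print(online.should_quit())

# With no connectivity, replacements fail and the table eventually empties.
offline = NeighborTable({"n1", "n2"}, find_replacement=lambda current: None)
offline.on_disconnect("n1")
offline.on_disconnect("n2")
print(offline.should_quit())
```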

This is with nodes hosted on vultr…
Try it out, put some in the netherlands :slight_smile:

Just to note, the internet connection in the netherlands (and in the data center where it seems to be hosted) is definitely top tier, with practically no downtime.

DO also has Netherlands nodes but we didn’t observe such behavior. For example, 178.128.136.86 is one of our nodes in the Netherlands and it has never had this problem. Actually we run nodes in every DO data center and also use 3rd party services to monitor their uptime. They are only down when DO is “migrating” the instances or having temporary network issues.

Hmm, i can’t find the logs for that anymore.
It did very definitely happen a couple times as that was the sole reason for me opening https://github.com/nknorg/nkn/issues/706 which is why i’m now adding a restart policy to all nkn instances :wink:

I’ll report back when i catch it again. Is there anything i can do in terms of monitoring to catch this? And if so, how?

Lastly, please don’t let my reply stop you from implementing this NKP! I’m sure you’ll find a fitting solution if the disconnect still happens.

Oh I know this happens for sure! Actually I have also seen it from time to time, like when my node at home occasionally lost Internet, or when a DO droplet is migrating, etc. Even if this NKP is implemented, it shouldn’t cause mining reward to drop noticeably as long as it happens only very occasionally.

Why does a large number of nodes coming online and offline slow down block propagation? Can you explain the specifics?

I don’t think this is a good solution. It doesn’t solve the root issue. I’d rather see an approach that makes it so that large numbers of nodes coming online/offline doesn’t affect block propagation.

This does discourage node operators from frequently turning on and off their nodes, but if a node operator decides to stop mining NKN altogether, this doesn’t discourage them from turning off all their nodes at once (disrupting block propagation).

This has little to do with block propagation; it is about relay and other network-related services (e.g. tuna).

When a packet is being transmitted along a path, the sudden disappearance of any node on the path will have a negative effect on the packet transmission, e.g. packet loss, latency increase, throughput decrease. Currently there is no mechanism to economically encourage nodes to work collaboratively to be more stable, and that’s why we have this NKP.