Nodes keep restarting even after the latest update to 1.0.6

MiserableOracle · August 6, 2019, 5:48pm

I’ve been keeping a close eye on my nodes since I noticed that they were restarting quite often. After checking syslog, I found the following:

[22991] 2019/08/06 11:11:11.511375 [WARN ] GID 77, Error handling msg: reveive vote at 139298 for 0000000000000000000000000000000000000000000000000000000000000000 error: Election has already stopped
[22991] 2019/08/06 11:11:11.562320 [INFO ] GID 229, Change expected block height to 139300
[22991] fatal error: runtime: out of memory
[22991] runtime stack:
[22991] runtime.throw(0xca0a6b, 0x16)
[22991]     /usr/local/go/src/runtime/panic.go:617 +0x72
[22991] runtime.sysMap(0xc030000000, 0x4000000, 0x1422b58)
[22991]     /usr/local/go/src/runtime/mem_linux.go:170 +0xc7

I need more information to debug this problem. Please suggest what I should check next.

I’m seeing these two errors frequently:

Out of memory
Concurrent map read and map write

Aug 6 13:31:55 – nknd[27501]: 2019/08/06 13:31:55.587982 [INFO ] GID 236, Receive block info 5ba4cefb331487fd3a231facaeaf334232b9a33528272778aba98583c75f5278, 145 txn found in pool, 111 txn to request
Aug 6 13:31:55 – nknd[27501]: 2019/08/06 13:31:55.639029 [INFO ] GID 236, Receive block proposal 5ba4cefb331487fd3a231facaeaf334232b9a33528272778aba98583c75f5278 (256 txn, 390199 bytes) by a5dbe02a766b59b2e6710c85d5f17675d17353c15c46b257132374851e9b6fa1
Aug 6 13:31:56 – nknd[27501]: fatal error: concurrent map read and map write
Aug 6 13:31:56 – nknd[27501]: goroutine 74 [running]:
Aug 6 13:31:56 – nknd[27501]: runtime.throw(0xcaa721, 0x21)
Aug 6 13:31:56 – nknd[27501]:     /usr/local/go/src/runtime/panic.go:617 +0x72 fp=0xc00f4418d8 sp=0xc00f4418a8 pc=0x42da52
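
For context on the second error: the Go runtime aborts with "fatal error: concurrent map read and map write" when it detects one goroutine reading a built-in map while another writes to it without synchronization. The usual remedy is to guard the map with a lock. The sketch below only illustrates that general pattern; it is not nknd's code, and the names txnPool, Add, Has and Len are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// txnPool is a hypothetical stand-in for any map shared between goroutines.
type txnPool struct {
	mu   sync.RWMutex
	txns map[string]int // transaction hash -> size in bytes
}

func newTxnPool() *txnPool {
	return &txnPool{txns: make(map[string]int)}
}

// Add takes the write lock, so no reader can touch the map mid-update.
func (p *txnPool) Add(hash string, size int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.txns[hash] = size
}

// Has takes the read lock; several readers may hold it at the same time.
func (p *txnPool) Has(hash string) bool {
	p.mu.RLock()
	defer p.mu.RUnlock()
	_, ok := p.txns[hash]
	return ok
}

// Len reports the number of entries under the read lock.
func (p *txnPool) Len() int {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return len(p.txns)
}

func main() {
	pool := newTxnPool()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func(n int) { // writer
			defer wg.Done()
			pool.Add(fmt.Sprintf("txn-%d", n), n)
		}(i)
		go func(n int) { // reader running alongside the writers
			defer wg.Done()
			pool.Has(fmt.Sprintf("txn-%d", n))
		}(i)
	}
	wg.Wait()
	fmt.Println(pool.Len()) // prints 100
}
```

Removing the locks from Add and Has turns this into a data race that the runtime may catch with the same fatal error; running with `go run -race` reports it reliably.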

zbruceli · August 6, 2019, 8:24pm

Once the vast majority of nodes have upgraded to v1.0.6, the system will return to normal. Currently there are still nodes on the older v1.0.5.

yilun · August 6, 2019, 9:26pm

Can you also post your node spec (e.g. RAM)?

MiserableOracle · August 7, 2019, 3:29pm

I’m using the standard droplet settings from DigitalOcean:
1 GB Memory / 25 GB Disk / BLR1 - Debian 9.7 x64
Let me know if you need more information and how I can collect it.

I can only check syslog after I notice that a node has restarted. If there’s a way to investigate the problem using nknc debug information, I guess that would be helpful to you too.

Thanks.

MiserableOracle · August 7, 2019, 4:09pm

Update

I haven’t observed the issue in the last 5-6 hours. I think it was as @zbruceli said: once the majority of nodes switched to 1.0.6, the network seems to have stabilized compared to before.

MiserableOracle · August 7, 2019, 8:03pm

Looks like the problem came up again. Half of my nodes restarted… Can we check some logs and debug this?

zbruceli · August 7, 2019, 9:52pm

Yes, our devs are testing a fix, to be released as v1.0.7.

MiserableOracle · August 8, 2019, 4:08pm

Thank you.
I just updated all my nodes to 1.0.7.
I’ll keep an eye on the performance and update here in case I see the restart problem again.

MiserableOracle · August 15, 2019, 2:47pm

Out-of-memory issue in 1.0.8:
Aug 15 14:25:16 nknd[26234]: fatal error: runtime: out of memory
Aug 15 14:25:16 nknd[26234]: runtime stack:
Aug 15 14:25:16 nknd[26234]: runtime.throw(0xca2cad, 0x16)
Aug 15 14:25:21 systemd[1]: nkn.service: Service hold-off time over, scheduling restart.
Aug 15 14:25:21 systemd[1]: Stopped nkn.
Aug 15 14:25:21 systemd[1]: Started nkn.

zbruceli · August 15, 2019, 4:03pm

Hi, the developer team is aware of the issue and looking into it.

yilun · August 19, 2019, 4:49am

You can try setting TxPoolMaxMemorySize to a lower value (e.g. 4) in config.json and see whether that resolves the problem.