Post-mortem on SCCN and MCCN instability

Both networks suffered instability. Funds were lost, but they were reimbursed and refunded. The networks finally achieved stability after fixes were rolled out.

Summary

Both networks suffered instability over the last two weeks. Around $800k in funds was lost in total across both networks, and nodes were slashed by around $1.2m as a result. The treasury fully covered all losses, and nodes were refunded all slashes. The reasons for the fund losses were varied and complex, but they were eventually solved by the team and community. The protocol engineers have put fixes in place to prevent a recurrence.

SCCN Instability

The instability began when over a third of nodes suddenly lost connection with their Binance nodes. This was partly due to some nodes having a config with a small cluster for Binance, which filled up and halted the daemon. As a result, inbound transactions could not reach consensus and were left ignored.

Manual Observations

The team released updates to help nodes kick-start their daemons, along with a feature that lets nodes make a manual observation of the stuck transactions to bump them into consensus. Unfortunately, it seems that some nodes did this without a fully synced thord, which caused those nodes to enter consensus-halt. This appears to be a bug with Cosmos V0.37. The crashed nodes then needed to LEAVE the network to recover.

Large Churn

Because so many nodes were leaving, the retiring vault did not have many operational nodes. SCCN uses delegated TSS parties, and keeps looping until it finds a committee whose members are all active.

Because many nodes were offline, it took a very long time to find an online party before the migration finally completed.
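
To give a feel for why the search took so long, here is a minimal sketch of a "keep drawing until everyone is online" loop. It is an illustration only: the random draw, the committee size and the function name `find_active_committee` are assumptions, not how thord actually picks its TSS parties.

```python
import random

def find_active_committee(members, online, size, rng, max_attempts=100_000):
    """Draw committees until every member is online.

    Returns (committee, attempts), or (None, max_attempts) if none was found.
    Hypothetical selection rule: a uniform random draw from the vault members.
    """
    for attempt in range(1, max_attempts + 1):
        committee = rng.sample(members, size)
        if all(node in online for node in committee):
            return committee, attempt
    return None, max_attempts

members = [f"node{i}" for i in range(20)]
rng = random.Random(7)

# Healthy vault: 19 of 20 members online.
healthy = set(members[:19])
_, tries_healthy = find_active_committee(members, healthy, 8, rng)

# Degraded vault: only 12 of 20 members online.
degraded = set(members[:12])
committee, tries_degraded = find_active_committee(members, degraded, 8, rng)

print(tries_healthy, tries_degraded)
```

With 12 of 20 members online and a committee of 8, only 495 of the 125,970 possible committees are fully online, so a random draw succeeds roughly 0.4% of the time. That is the shape of the problem the retiring vault ran into.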

Fund Loss

Additionally, there was a bug in the tssKeySignFail message, which again crashed thord. This caused fund loss.

The outbound transactions were continually re-scheduled (since the network thought they were not sent), causing the network to bleed out funds in duplicate transactions. When they were finally observed, nodes were accused of stealing and slashed. All in all, around $1m was lost.

TradingHalt

At this point the system was very unhealthy and users were still sending funds. The team made the decision to halt trading, which is a mimir setting that sets a flag and stops all further observations (but does not inhibit withdrawals). In addition, it refunds via Asgard. This stopped the fund loss.

All bots and interfaces should be subscribed to this flag.
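
As a rough sketch of what "subscribed to this flag" could look like for a bot, the snippet below checks the halt flag before every send. The key name `HALTTRADING` and the `fetch_mimir` callable are assumptions made for illustration; check the node API you use for the real key and endpoint.

```python
from typing import Callable, Mapping

HALT_KEY = "HALTTRADING"  # assumed key name, verify against your node

def trading_halted(mimir: Mapping[str, int], key: str = HALT_KEY) -> bool:
    """Treat a missing key or 0 as 'not halted', anything else as halted."""
    return int(mimir.get(key, 0)) != 0

def guarded_send(fetch_mimir: Callable[[], Mapping[str, int]],
                 send: Callable[[], None]) -> bool:
    """Re-read the flag right before sending; skip the send if trading is halted."""
    if trading_halted(fetch_mimir()):
        return False
    send()
    return True

# Usage with stand-ins for the real network calls:
sent = []
guarded_send(lambda: {"HALTTRADING": 1}, lambda: sent.append("swap"))  # skipped
guarded_send(lambda: {}, lambda: sent.append("swap"))                  # sent
```

Reading the flag immediately before each send, rather than once at start-up, is the point: the halt can be switched on at any time.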

V0.19.4

Finally V0.19.4 went out containing the root-cause fix to the tssKeySignFail bug. This was adopted by nodes and the network fully recovered some time later.

TradingHalt2

Immediately after V0.19.4 was rolled out, the system re-entered chaos and trading was halted again. One of the fixes in V0.19.4 had a follow-on issue that caused loss of consensus on inbounds again. The team used mimir to halt trading and rushed out V0.19.5, which fixed the issue.

Summary

The root cause was the crashed Binance Daemon, which caused nodes to miss observations. The fix, manual observations, caused some nodes to enter consensus halt, and those nodes had to leave. So many nodes had to leave that the retiring vault ran into large amounts of TSS Failure. A bug in the tssFailure tx caused thord to crash and led to loss of funds from duplicate txs, so TradingHalt was invoked. A rushed fix for the tssFailure issues created a new issue where restarts would miss observations, causing TradingHalt2. Finally it was fixed and the network recovered.

In summary, a small issue on the cluster for Binance nodes caused a cascading effect for two weeks straight, uncovering more issues, with some self-owns along the way. SCCN holds over $800m in funds, but only $800k was actually double-spent, with $1.2m slashed from nodes. The treasury fully insured the losses. Some of the funds were actually returned by multiple community members who received the double-spent transactions.💚

MCCN does not have any of the bugs found above.

MCCN Instability

After SCCN recovered, MCCN nodes immediately started reporting fund loss and slash events. The team very quickly found that BitcoinCash was exhibiting unexpected behaviour, and rushed a fix to stop the fund loss. The network recovered, but around $100k was lost.

BitcoinCash addresses

The team were aware that BitcoinCash has two address formats (the legacy format and the newer cash format). However, nobody was aware that BitcoinCash will quietly convert a legacy address to a cash address if you spend to a legacy address. It seems there is a conversion from legacy -> cash, and this is done either in the BCH mempool or in the chain itself.

The trigger was an update pushed to xchainjs to revert to legacy address formats, since the cash addresses have prefixes that are annoying for wallets. This decision was made by the wallet teams (although it was incorrect to do so).

This then caused THORChain to process a swap to a legacy address, but the address was converted to cash format in the resulting transaction. THORChain could not identify the address in the final transaction, so it slashed the node and then re-scheduled the outbound. The outbound was continually re-scheduled (THORChain is censorship-resistant) until the team rushed an update to stop it.
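
The mismatch is easier to see in code. Both formats encode the same 20-byte hash, so two address strings can be compared by decoding them rather than by comparing text. The sketch below is an illustration under stated assumptions, not THORChain's fix: the function names are made up, the CashAddr checksum is NOT verified, and the address type (P2PKH vs P2SH) is ignored. The example addresses are the well-known test pair from the CashAddr specification.

```python
import hashlib

B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
B32 = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"  # CashAddr character set

def legacy_hash160(addr: str) -> bytes:
    """Decode a Base58Check legacy address and return its 20-byte hash."""
    num = 0
    for ch in addr:
        num = num * 58 + B58.index(ch)
    raw = num.to_bytes(25, "big")  # 1 version byte + 20 hash bytes + 4 checksum bytes
    body, checksum = raw[:21], raw[21:]
    if hashlib.sha256(hashlib.sha256(body).digest()).digest()[:4] != checksum:
        raise ValueError("bad Base58Check checksum")
    return body[1:]

def cash_hash160(addr: str) -> bytes:
    """Decode a cash-format address and return its 20-byte hash (checksum not verified)."""
    payload = addr.lower().split(":")[-1]             # drop the optional prefix
    symbols = [B32.index(ch) for ch in payload][:-8]  # drop the 8-symbol checksum
    bits = "".join(f"{s:05b}" for s in symbols)
    usable = len(bits) - len(bits) % 8                # discard the padding bits
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data[1:]                                   # drop the version byte

def to_hash160(addr: str) -> bytes:
    """Assumed heuristic: cash addresses contain ':' or start with 'q' or 'p'."""
    if ":" in addr or addr[0] in "qp":
        return cash_hash160(addr)
    return legacy_hash160(addr)

def same_destination(a: str, b: str) -> bool:
    return to_hash160(a) == to_hash160(b)

legacy = "1BpEi6DfDAUFd7GtittLSdBeYJvcoaVggu"
cash = "bitcoincash:qpm2qsznhks23z7629mms6s4cwef74vcwvy22gdx6a"
print(legacy == cash, same_destination(legacy, cash))
```

A plain string comparison treats these as two different recipients, which is how a correctly paid outbound can end up looking like it went to an unknown address.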

Summary

Around $100k was lost, although the community member who received the double spends returned 90% of the funds after a 10% bounty was negotiated.

The network currently does not support legacy addresses, although that is being fixed in the next update.

In addition, the team are considering engineering an ability for a minority (>33%) of nodes to fully halt/isolate a chain for a maximum of 72 hours. A 33% minority of nodes can already halt THORChain if they want to, so this does not change the network's characteristics. If nodes ever spot an issue with a certain chain, a minority of them can move quickly to isolate it for 3 days, after which it is re-activated. This should be enough time to fix it.
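
Since this feature is only being considered, the following is just a sketch of the rule as described: more than a third of active nodes voting, with the halt expiring after 72 hours. All names are hypothetical, and ">33%" is read here as "strictly more than one third".

```python
from typing import Optional, Set

MAX_HALT_SECONDS = 72 * 60 * 60  # halt expires after 72 hours

def minority_reached(votes: Set[str], active: Set[str]) -> bool:
    """True if strictly more than one third of the active nodes voted to halt."""
    valid = votes & active  # ignore votes from nodes that are not active
    return 3 * len(valid) > len(active)

def chain_isolated(votes: Set[str], active: Set[str],
                   halted_at: Optional[float], now: float) -> bool:
    """A chain stays isolated only while the vote holds and the 72h window is open."""
    if halted_at is None or not minority_reached(votes, active):
        return False
    return now - halted_at < MAX_HALT_SECONDS

active = {f"node{i}" for i in range(9)}
print(minority_reached({"node0", "node1", "node2"}, active))           # 3 of 9: not enough
print(minority_reached({"node0", "node1", "node2", "node3"}, active))  # 4 of 9: enough
```

Using integer arithmetic (`3 * votes > total`) avoids rounding surprises right at the threshold.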

Additionally, the team are adding an exponential back-off to rescheduled transactions, so the more often a transaction is re-scheduled, the more slowly it will be retried. This will allow time to understand why it keeps failing.
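
A minimal sketch of what such a back-off could look like is below. The unit (blocks), the doubling factor and the cap are assumptions chosen for illustration, not the values the team will ship.

```python
def reschedule_delay(attempt: int, base_blocks: int = 1,
                     factor: int = 2, cap_blocks: int = 720) -> int:
    """Blocks to wait before retry number `attempt` (0-based), doubling each time up to a cap."""
    return min(cap_blocks, base_blocks * factor ** attempt)

# Delay before each of the first six retries.
delays = [reschedule_delay(n) for n in range(6)]
print(delays)
```

The cap matters: without it the delay grows without bound, and a transaction that is only temporarily stuck would effectively never be retried.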

Overall

Significant instability issues were seen in both networks, which tied up the team and community for two weeks. However, the network pulled through, funds were returned by the community and stability was achieved. THORChain is embryonic technology, not even 12 months old. MCCN is a month old. There will be chaos, but the team has a treasury to insure losses (for now) and progress is being made.

Once all the bugs are found the network should be as stable as the chains it is connected to.


Community

To keep up to date, please monitor community channels, particularly Telegram and Twitter: