Postmortem — Stuck Retiring Vault #q00t
A retiring vault got stuck halfway through a migration. Approximately $80m in funds were frozen until the team released a patch to continue the migration. The protocol remained operational.
Summary
The team noticed that a retiring vault had stopped migrating funds. Further investigation showed that an unusual series of events had put the system in a state where it lost TSS consensus.
The system remained operational throughout because the new vault held enough funds to process swaps and withdrawals. However, the queue began to grow because of the resources being consumed on each node, so the frontend was disabled.
A patch was released to Testnet and then to Chaosnet. Chaosnet took 6 hours to upgrade, then a further 2 hours to work through the backlog of TSS transactions and complete the migration. Approximately 2 days passed between the migration halting and resuming.
Root Cause
BUG 1: The root cause was that a node had left (churned out and unbonded), and its operator took it offline while it was still a member of the retiring vault and that vault was still migrating funds. The system should not have let the node unbond until the vault was completely retired; with the bond still locked, the node operator would have kept their node operational.
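The guard described in BUG 1 can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `Vault`, `VaultStatus` and `can_unbond` are hypothetical names invented for this example, and the real check in the node software may look quite different.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class VaultStatus(Enum):
    ACTIVE = "active"
    RETIRING = "retiring"   # still migrating funds to the new vault
    INACTIVE = "inactive"   # completely retired


@dataclass
class Vault:
    # Hypothetical model of a vault: its status and member node addresses.
    status: VaultStatus
    members: Set[str] = field(default_factory=set)


def can_unbond(node: str, vaults: List[Vault]) -> bool:
    """Refuse to unbond while the node belongs to a vault that is still retiring."""
    for vault in vaults:
        if vault.status is VaultStatus.RETIRING and node in vault.members:
            return False
    return True


retiring = Vault(VaultStatus.RETIRING, {"node-a", "node-b"})
print(can_unbond("node-a", [retiring]))  # False: vault is still migrating

retiring.status = VaultStatus.INACTIVE
print(can_unbond("node-a", [retiring]))  # True: vault is completely retired
```

The point of the sketch is that the bond stays locked for as long as the node is still needed to sign migration transactions.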
BUG 2: Since retiring vault members are still processing TSS transactions to move funds forward, but are marked as standby to the system, the system wouldn’t let them send tssKeySignFail transactions. This is a bug — retiring vault members are technically still active so should be able to send these transactions.
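As a rough sketch of the corrected rule in BUG 2, the permission check could treat retiring vault members as eligible even though the system marks them as standby. The names `Node`, `NodeStatus` and `may_report_keysign_failure` are assumptions made for this example and are not taken from the codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class NodeStatus(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


@dataclass(frozen=True)
class Node:
    # Hypothetical node record: address plus the status the system assigns it.
    address: str
    status: NodeStatus


def may_report_keysign_failure(node: Node, retiring_vault_members: FrozenSet[str]) -> bool:
    """Allow tssKeySignFail reports from active nodes and retiring vault members."""
    if node.status is NodeStatus.ACTIVE:
        return True
    # A standby node that still signs for a retiring vault is effectively active.
    return node.address in retiring_vault_members


members = frozenset({"node-a"})
print(may_report_keysign_failure(Node("node-a", NodeStatus.STANDBY), members))  # True
print(may_report_keysign_failure(Node("node-z", NodeStatus.STANDBY), members))  # False
```

Before the fix, the behaviour described above corresponds to checking only the node status, which rejected the first case.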
ISSUE 1: This normally wouldn’t have been an issue, but it coincided with a large churn, where over 67% of the members of the previous vault had been churned out as well (11 nodes had been churned). Thus there weren’t enough new, active, members who had visibility on the retiring vault and could report on the failing TSS. This is a separate issue — the system is too aggressive in churning out nodes and doesn’t give nodes a sufficient grace period.
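To make the arithmetic behind ISSUE 1 concrete, here is an illustrative calculation. Only the 11 churned nodes come from the incident itself. The vault size of 16, the 11 replacement nodes, the node names, and the two-thirds reporting threshold are all assumptions chosen for the example.

```python
from fractions import Fraction
from typing import Set


def witness_share(retiring_members: Set[str], active_nodes: Set[str]) -> Fraction:
    """Share of active nodes that are also members of the retiring vault."""
    if not active_nodes:
        return Fraction(0)
    witnesses = retiring_members & active_nodes
    return Fraction(len(witnesses), len(active_nodes))


def can_report_failure(retiring_members: Set[str], active_nodes: Set[str],
                       threshold: Fraction = Fraction(2, 3)) -> bool:
    # The threshold is an assumed supermajority, not a confirmed protocol value.
    return witness_share(retiring_members, active_nodes) >= threshold


# Assumed old vault of 16 members; 11 of them are churned out (about 69%).
old_vault = {f"old-{i}" for i in range(16)}
churned_out = {f"old-{i}" for i in range(11)}
new_nodes = {f"new-{i}" for i in range(11)}  # assumed replacements
active = (old_vault - churned_out) | new_nodes

print(witness_share(old_vault, active))       # 5/16
print(can_report_failure(old_vault, active))  # False
```

With so few remaining members able to see the retiring vault, the active set cannot agree that TSS is failing. This is why a gentler churn with a grace period would help.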
FIX
The fix (V0.18) was released to patch BUG 1 and BUG 2. A follow-on release (V0.19) will address ISSUE 1. Merge Requests:
Release version 0.19.0 to chaosnet
Other Considerations
The team also discovered that Testnet was unhealthy, and the rollout of the patch to Testnet was delayed. Testnet node operators should strive to keep Testnet healthy, since it needs to be ready to test fixes and patches before they reach Chaosnet. Reach out to the team if you notice any Testnet issues.
Community
To keep up to date, please monitor community channels, particularly Telegram and Twitter:
- Twitter: https://twitter.com/thorchain_org
- Telegram Community: https://t.me/thorchain_org
- Telegram Announcements: https://t.me/thorchain
- Reddit: https://reddit.com/r/thorchain
- Github: https://github.com/thorchain
- Medium: https://medium.com/thorchain