Do not use Btrfs for any application!

Posted by Logg

All Open RSC servers were offline on the 30th of November 2023, between about 7:30 AM and 3:10 PM Eastern time, an outage of nearly 8 hours. No player data was lost, and there was no rollback, but the extended downtime could have been avoided if Btrfs had not been used as the filesystem of the game server.

The servers crashed because of an apparent lack of free disk space. Backups of the game databases are made automatically every hour and stored temporarily on the local disk before being copied to long-term storage, so the disk can eventually fill up if it's not emptied in time.

However, traditional tools such as df reported 36 GB of free space when, in Btrfs's reality, free space was totally depleted. Btrfs has its own tools for checking free space, but regardless, when the servers crashed and I realized that no files could be written on the system, I announced we would be back online in about 20 minutes, copied all the backups off the drive, and deleted them, freeing up an additional 70 GB.
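For context, here is a minimal Python sketch of where a df-style number comes from. It assumes that tools like df take their figures from the statvfs system call, which reports one set of totals for the whole filesystem; the function name `statvfs_free_bytes` is made up for this illustration.

```python
import os


def statvfs_free_bytes(path: str) -> int:
    """Return the bytes statvfs reports as available to unprivileged users."""
    st = os.statvfs(path)
    # f_bavail is a count of free blocks; f_frsize is the size of one block in bytes.
    return st.f_bavail * st.f_frsize


if __name__ == "__main__":
    free = statvfs_free_bytes("/")
    print(f"statvfs reports {free / 2**30:.1f} GiB available on /")
```

On Btrfs this single figure does not show how much of the disk has already been handed out to data and metadata block groups. Btrfs's own view is available from `btrfs filesystem usage <mountpoint>`, which lists allocated and unallocated space separately.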

Disturbingly, after taking this action, leaving 60% of the drive empty, it was still impossible to write files. Typically, you can expect that deleting files on a filesystem will create free space. It turns out this is not always the case with Btrfs, and it would also turn out that this was an unrecoverable broken state for the host operating system of the game server.

Unrecoverable Error
[Image: no.space.left.png]

In order to make the 107 GB of free space actually usable by programs, it turns out that one must "balance" the disk. I don't fully understand this, but it has to do with how Btrfs is designed to work as software RAID across multiple disks. It wants to keep equal amounts of free space on each disk, so that reads are faster, since it can read from multiple disks at the same time. In Open RSC's case, Btrfs was on a single virtual hard drive, but the concept of "balancing" still seems to apply. According to this article, balancing your disk will also cause it to "scan through all unused block groups and reclaim the space".
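As a rough illustration of what balancing looks like in practice, here is a minimal Python sketch that builds a series of `btrfs balance start` commands using the `-dusage=N` filter, which limits the balance to data block groups with usage under N percent (0 covers completely unused ones). Starting low and working upward is a common approach because the emptiest block groups need the least working space. The function names, the threshold list, and the mount point `/mnt/game` are assumptions for this example, not what Open RSC ran.

```python
import subprocess
from typing import List, Sequence


def balance_commands(mount_point: str,
                     thresholds: Sequence[int] = (0, 5, 10, 25, 50)) -> List[List[str]]:
    """Build one `btrfs balance start` command per usage threshold, in the order given."""
    commands = []
    for pct in thresholds:
        if not 0 <= pct <= 100:
            raise ValueError(f"usage threshold must be 0-100, got {pct}")
        # -dusage=N restricts the balance to data block groups under N percent usage.
        commands.append(["btrfs", "balance", "start", f"-dusage={pct}", mount_point])
    return commands


def run_balance(mount_point: str, dry_run: bool = True) -> None:
    """Print the commands, or execute them (requires root) when dry_run is False."""
    for cmd in balance_commands(mount_point):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_balance("/mnt/game")  # hypothetical mount point; dry run only
```

With `dry_run=True` nothing is changed on disk; the commands are only printed so they can be reviewed first.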

Unfortunately, it is not possible to balance a disk when its space is completely depleted, since in order to balance, Btrfs needs somewhere to copy data before putting it where it belongs. With the disk already full, and the deletion of over 70 GB of files freeing up an amazing 0 bytes of space for Btrfs to work with, there was no way to recover from this directly.

Well, there are workarounds for this, such as resizing the disk to be larger, or temporarily adding another device to the filesystem, then balancing the disk from a cron job so that in the future, free space can be released before it's too late. Frankly, I felt these workarounds were about as much work as simply recreating the virtual machine without Btrfs, which is what I ended up doing. It will be much better not to have to worry about this issue of unusable free space in the future, which does not affect any other file system that I know of.
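For anyone who does want to attempt those workarounds, here is a hedged Python sketch of the command sequences involved. It only builds and prints the commands. The function names, the mount point `/mnt/game`, the spare device `/dev/sdz`, and the 10 percent threshold are all placeholders for illustration; none of this was run on the Open RSC server.

```python
from typing import List


def temporary_device_plan(mount_point: str, spare_device: str,
                          data_usage_pct: int = 10) -> List[List[str]]:
    """Commands to lend Btrfs a spare device, balance, then take the device back."""
    return [
        # Give the filesystem some unallocated space to work with.
        ["btrfs", "device", "add", spare_device, mount_point],
        # Compact mostly-empty data block groups so their space is released.
        ["btrfs", "balance", "start", f"-dusage={data_usage_pct}", mount_point],
        # Migrate anything written to the spare device back and detach it.
        ["btrfs", "device", "remove", spare_device, mount_point],
    ]


def resize_plan(mount_point: str) -> List[List[str]]:
    """Command to grow the filesystem after the underlying virtual disk was enlarged."""
    return [["btrfs", "filesystem", "resize", "max", mount_point]]


if __name__ == "__main__":
    for cmd in temporary_device_plan("/mnt/game", "/dev/sdz"):
        print(" ".join(cmd))
    for cmd in resize_plan("/mnt/game"):
        print(" ".join(cmd))
```

The order matters in the first plan: the spare device should only be removed after the balance has finished, and every step needs root.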

It's a "Known Issue"?
In the article linked above, SUSE says:
"This issue is known to occur on btrfs filesystems. Particularly ones with relatively high filesystem usage (>80%)"
This does describe how the Open RSC game server was being used. Compounding the issue, the disk was being filled and emptied over and over again as large backup files were created and then deleted.

The article links a blog post from 2014 about how to deal with the issue, meaning it has been known for nearly 10 years, and still remains an issue today.

Btrfs was released as "Stable" in 2013, and I remember a lot of hype at the time that it was going to replace ext4 entirely, and would soon become the default in Debian and other Linux distributions. When I chose Btrfs as the file system for the Open RSC game server in 2023, I expected that since it has been over a decade since release, the software must be stable by now, and it would be beneficial to use the newer file system design. Unfortunately, although I'm sure a lot of people have invested a lot of effort into Btrfs, I want to warn everyone that it is NOT stable, it probably will never BE stable at this rate, and BTRFS SHOULD NOT BE USED FOR ANY PURPOSE. If you are using Btrfs, recreationally or professionally, I would strongly advise that you migrate away, as we now have.

It is obviously unacceptable behaviour that, with no reliable warning system for how much free space is available, the Btrfs file system may suddenly enter a state where it is impossible to write files (an action typically very important for nearly all activities on a computer), and that recovery from that state without external devices is impossible.

Upcoming Open RSC maintenance as a result

I was travelling at the time of this disastrous file system failure, unable to physically access the server, and had to set up a temporary virtual machine on a local computer which Open RSC could run from until a permanent solution could be made. It is a very good computer, possibly better than the one that Open RSC normally runs on, so I don't anticipate any lag from the change of hardware. However, I will be replacing this with another new virtual machine, since the one Open RSC is currently running from uses Hyper-V as its hypervisor, and I prefer KVM/QEMU for increased stability (i.e., no downtime due to Windows Updates or bluescreens). There will likely be some 30 minutes of downtime for that transition in the near future, within a month I would guess.

Thanks for your support and understanding shown in the Discord. It meant a lot that everyone was supportive and offering help, instead of complaining about downtime.

— Logg