The Day I Found a Critical Bootloader Bug in RHEL/CentOS
Life as a sysadmin is constantly entertaining. Some days, even when you think you’ve accounted for every possible contingency, something happens that still manages to take you by surprise. Wednesday was one of those days.
I manage a few production Red Hat Enterprise Linux servers that, until Wednesday of this week, were all running RHEL 7. RHEL 7 is still well within its support window, but ever since RHEL 8 came out in May of last year I’ve been preparing to proactively upgrade my systems. By chance, I finished my preparations this week, so I scheduled an in-place rebuild of one of my less critical servers for Wednesday the 29th.
The Upgrade Begins
Because true hardware-based RAID controllers are blisteringly expensive, I like to run simple software RAID arrays with mdadm where possible. The RHEL installer makes it easy to place all your partitions — even the EFI system partition — on an mdadm array during the installation process, so I started my server rebuild by creating a few RAID 1 arrays on the server’s dual HDDs. Later, with the installation complete, I rebooted the server and was greeted by a fresh RHEL 8 login prompt at the console.
Shortly after that, things went sideways. I ran yum update to pull down security patches and discovered a kernel/firmware update and a GRUB2 update. During the update process, I noticed that the server had slowed to a crawl, so I checked /proc/mdstat and realized that mdadm was still building the RAID 1 arrays and was eating up all the bandwidth my HDDs could muster while doing so. Impatient, and eager to get out of the loud server room and back to a desk, I decided to reboot the server to apply the kernel update so I could finish setting things up over SSH.
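For anyone who would rather not eyeball /proc/mdstat by hand, here is a minimal Python sketch of the same check. It is not something I ran that day: the function name sync_progress and the sample text are made up for illustration, and the parsing assumes the usual /proc/mdstat layout, where each array starts with a line like "md126 : active raid1 ..." and an in-progress sync shows up as "resync = 8.5%" a line or two below it.

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout (not real output from my server).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md127 : active raid1 sdb1[1] sda1[0]
      614336 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active raid1 sdb2[1] sda2[0]
      976630464 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  8.5% (83013568/976630464) finish=121.4min speed=122624K/sec
      bitmap: 8/8 pages [32KB], 64KB chunk

unused devices: <none>
"""

# An array header line looks like "md126 : active raid1 ...".
ARRAY_RE = re.compile(r"^(md\w+) : ")
# A progress line contains e.g. "resync =  8.5%" or "recovery = 42.0%".
PROGRESS_RE = re.compile(r"(resync|recovery|reshape|check)\s*=\s*([\d.]+)%")


def sync_progress(mdstat_text):
    """Return {array_name: (operation, percent)} for arrays that are still syncing."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            # Remember which array the following indented lines belong to.
            current = header.group(1)
            continue
        found = PROGRESS_RE.search(line)
        if found and current is not None:
            progress[current] = (found.group(1), float(found.group(2)))
    return progress


print(sync_progress(SAMPLE_MDSTAT))
```

On a real machine you would pass in the contents of /proc/mdstat instead of the sample string. An empty result means no array is reporting a sync in progress, which is roughly the state I should have waited for before rebooting.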
Two minutes later, I was staring at a frozen BIOS splash screen. As I’d just installed a new network card, I immediately suspected hardware problems, so I powered the server down and checked things over. Nothing helped: The hardware seemed fine, but it still wouldn’t boot.
Mdadm is pretty resilient, but since I’d shut the server down mid-verification I hastily assumed I’d somehow broken my RAID setup. Because I hadn’t gotten very far post-installation, I decided to wipe the server and reinstall RHEL 8 to rule out any issues. This time, I let mdadm sit for an hour or so before I touched anything, and then patched and rebooted the server again. Cue the frozen BIOS splash screen.
In hindsight, the common factor was clearly the updates, but since I’d updated my RHEL 8 development server just the day before with no ill effects, I didn’t immediately consider a bad update as a possibility. Instead, I reset the BIOS to factory defaults and reviewed all my settings. When that didn’t help, I rummaged through my drawer of spare parts and grabbed an unused NVMe SSD to replace the server’s frustratingly slow HDDs in case they or the RAID configuration were the source of the problem. After installing RHEL 8 on the new drive, I rebooted the server several times to verify everything worked before applying updates. Once again, everything was fine until I applied the GRUB2 updates.
Verifying the Problem
Faced with what now seemed to be a bootloader issue, I went back to my RHEL 8 development server and updated it again. Sure enough, a new GRUB2 update popped up, and when I rebooted after applying it I got stuck at a black screen. Confident that I’d narrowed the issue down to a bugged update, I reinstalled RHEL 8 one last time on my production server — this time, skipping the update step — and set about reinstalling software on it.
When I finished later that night, I got the Red Hat daily digest email summarizing the latest RHEL updates. As it turned out, Red Hat had released patches for the BootHole vulnerability just a few hours before I arrived on-site in the afternoon ready to rebuild my server. (For reference, the RHEL 8 patch is RHSA-2020:3216 and the RHEL 7 patch is RHSA-2020:3217.) I quickly disabled automatic updates on the rest of my servers and wrote up a hasty bug report at 10:15pm.
I woke up on Thursday morning to 50+ email notifications from Bugzilla and a tweet from @nixcraft linking to the bug report. As the day went on, it became apparent that RHEL 7 was also affected and certain Ubuntu systems were suffering from the fallout of a similar patch.
As of this writing, the specific issue appears to lie with shim rather than GRUB2 itself. Right now, Red Hat is advising that people avoid the broken updates, and they’ve published various workarounds that may come in handy if you’ve already applied them. For the moment, I still have automatic updates disabled, and I’m hoping that Red Hat will publish fixed versions of GRUB2 and shim soon.
In the end, I spent four hours reinstalling RHEL 8, submitted a hastily written bug report, and became the anonymous “user” mentioned in the first paragraph of an Ars Technica article:
Early this morning, an urgent bug showed up at Red Hat’s bugzilla bug tracker—a user discovered that the RHSA_2020:3216 grub2 security update and RHSA-2020:3218 kernel security update rendered an RHEL 8.2 system unbootable.