Node unbootable after in-place upgrade to Fedora 30 converted grub config to BLS

TL;DR Upgrade f29 -> f30 breaks the node. This post describes the issue and a workaround, then asks a question.

Late update: 2019.05.08 -- Problem has been fixed in the Linode boot infrastructure. See this post. I've confirmed that upgrades to Fedora 30 (from up-to-date F29) are now completely trouble free. Nodes running Linode kernels are not affected by this problem. Otherwise, ignore this post and skip down to repair/prevention steps summarized in this reply below. Nodes running native kernels become unbootable and, when repaired, remain vulnerable to further issues. Skip to this post if you're in a hurry. For nodes running native kernels the problem is preventable if the right steps are taken before the upgrade. Read this post.

After

dnf system-upgrade download --releasever=30   # (from f29)
dnf system-upgrade reboot

the upgrade completes but the node is then unbootable. In the console one sees grub starting, but it never gets to a grub menu, grub shell, or grub rescue shell. This is reproducible simply by creating a new node with Fedora 29, upgrading & rebooting, then system-upgrading to f30 as above. A fresh node built with f30 is fine though.

In a finnix rescue shell, after mounting the broken node's rootfs and comparing /etc/default/grub with the same on a working f30 node, one can see the only significant difference is that

GRUB_ENABLE_BLSCFG=true

is true in the broken node, false in a working node. After chrooting into the broken node's rootfs as described at the end of https://www.linode.com/community/questions/332/my-disk-fails-to-mount-after-a-reboot-what-do-i-do, editing /etc/default/grub to disable BLS, and then running

grub2-mkconfig -o /boot/grub/grub.cfg

and rebooting the node, the problem is fixed. I should mention that I'm running distribution kernels on these nodes.
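For anyone repeating the comparison, here is a minimal sketch of how the two defaults files could be diffed. The function name grub_defaults_diff and the idea of working on copies of the two files are my own assumptions, not something the rescue image provides.

```shell
# Print the GRUB_* assignments that differ between two grub defaults files.
# Both paths are arguments, so this can be run against copies of the files
# (for example one taken from the broken node and one from a working node).
grub_defaults_diff() {
    broken=$1
    working=$2
    a=$(mktemp) && b=$(mktemp) || return 2
    # Sort first so that line order alone does not show up as a difference.
    grep '^GRUB_' "$broken" | sort > "$a"
    grep '^GRUB_' "$working" | sort > "$b"
    # diff exits 0 when the settings match and 1 when they differ.
    status=0
    diff "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

Lines prefixed with < come from the first file and lines prefixed with > from the second.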

Question: Is there a compatibility issue between the Linode boot ROM and BLS grub configs? Currently a new Fedora 30 node built by Linode's installer comes with BLS disabled. A vanilla f30 has BLS enabled, and an upgrade to f30 turns on BLS silently (with no -.rpmsave file :-o). BLS affects how the grub config is recreated after every kernel upgrade. I worry that BLS may someday be mandatory rather than optional in Fedora and other distros.

34 Replies

@tome - Thank you for writing this up! I was able to confirm that BLS is currently incompatible with our GRUB kernel, which is why we have it disabled in the stock Fedora 30 image.

Our recommendation at this time for anyone else upgrading from an older Fedora version would be to ensure BLS is disabled before rebooting.

Hi @jcardillo, thanks for the info!

However, be aware that the second reboot during the upgrade process is automatic. In-place upgrade stages are 1) download the upgrade, 2) system-upgrade reboot, 3) install all the downloaded packages, followed by 4) automatic reboot. The troublesome config change is made between 2) and 4) in a special environment implemented as a systemd target. The system is unavailable throughout step 3) and the user has no chance to intervene.

I haven't checked this, but it might be possible to edit the ephemeral systemd target just before invoking dnf system-upgrade reboot, in such a way as to defeat the automatic 2nd reboot and start up sshd instead, allowing the user to shell in and fix things.

Following up on my last comment, I haven't found any preventive measure that isn't more complicated and dangerous than fixing the problem after it happens.

Here is a more succinct recipe to repair after the upgrade:

Do a normal dnf system-upgrade download … dnf system-upgrade reboot sequence to release 30. While the upgrade is running following the reboot, the node is inaccessible, but if you have a lish console going you can see it progress up to a failed grub boot. This takes about 10 minutes for a stock Fedora 29 on a nanode. You should see

Booting from ROM...
Welcome to GRUB!

error: variable `prefix' isn't set

in the lish console, and it hangs there.

Now restart the node in rescue mode using the Linode Manager. Run the following in the rescue shell:

mount -o exec,barrier=0 /dev/sda
cd /media/sda
mount -t proc proc proc/
mount -t sysfs sys sys/
mount -o bind /dev dev/
mount -t devpts pts dev/pts/
chroot /media/sda /bin/bash <<EOF
cd /etc/default
grep BLSCFG=true grub || { echo 'Unexpected state, quitting'; exit 1; }
sed -i s/BLSCFG=true/BLSCFG=false/ grub
grep BLSCFG=false grub || { echo 'Repair failed, quitting'; exit 1; }
grub2-mkconfig -o /boot/grub2/grub.cfg
EOF
umount {proc,sys,dev/pts,dev}
cd
umount /media/sda

If that goes well, reboot the node and verify it's at Fedora 30.
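One way to do that verification is to read VERSION_ID from os-release. This is only a sketch; os_version_id is a name I made up, and the optional argument is there so it can also be pointed at a rescue-mounted root such as /media/sda.

```shell
# Print the VERSION_ID recorded in etc/os-release under the given root.
# With no argument it inspects the running system.
os_version_id() {
    root=${1:-/}
    # os-release is a list of KEY=value lines; values may be quoted.
    sed -n 's/^VERSION_ID=//p' "${root%/}/etc/os-release" | tr -d '"'
}
```

On the repaired node, os_version_id with no arguments should print the release number the upgrade was aiming for.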

Wow. What about booting with a Linode kernel, does that work?

@kmansoft, when booting a Linode kernel it does not use your node's grub config (nor anything else under /boot AFAICT), so the problem does not occur. Everything needed is built into the virtual ROM image used for the initial stages of the boot. In fact you can remove /boot/grub2/grub.cfg and it still boots just fine.

Thanks for asking the question! It brings out another way to repair the node without the rescue and chroot horseplay, and much easier: just add a Linode kernel configuration to the node, boot it, disable BLS in /etc/default/grub, run grub2-mkconfig, and then reboot with your regular node configuration using native kernels, and you're done. I was blind to that possibility because I didn't know enough about the Linode boot ROM structure when booting Linode kernels. I've confirmed it works. In hindsight, switching to Linode kernels before doing the release upgrade is probably the easiest course. Package updates continue to install new distribution kernels and update your grub.cfg regardless of what kernel is actually running, so you still have to disable BLS and rerun grub2-mkconfig before switching back to booting your preferred node configuration.

To summarize:

If the node runs Linode kernels it's not affected by this problem… go ahead and upgrade to F30.

If the node runs native kernels, temporarily configure the node to boot Linode kernels, do the release upgrade, disable BLS after the upgrade and rerun grub2-mkconfig, then switch back to native kernels. (Disable BLS means edit /etc/default/grub and change GRUB_ENABLE_BLSCFG=true to =false.) Watch out for future package updates that might turn on BLS again.

If the node runs native kernels and it's already broken by a release upgrade to F30, boot it one time with a Linode kernel and repair it as described in the previous paragraph.

Thanks @kmansoft!

PS: Evidently it's OK to run all binaries built against newer kernels with older kernels, within reason. Otherwise Linode kernels would not work as well as they do. Current Linode kernels are 4.x series; current Fedora kernels are 5.x series. I'm not sure how much risk is involved, if any.

another way to repair the node without the rescue and chroot horseplay, and much easier

Yes that's what I was thinking…. Thanks for confirming that as another workaround.

I'm also a Fedora fan but (at least for now) use it only on the desktop.

Current Linode kernels are 4.x series; current Fedora kernels are 5.x series

Linode's "latest" kernel at this time points to 4.18.16.

There is a 5.0.8 available in the drop-down, you just got to select that specifically.

I'm not sure why the gap - it's been like that for a couple of months - but on kernel.org, "longterm" is the 4.19 series and 5.0 has had "stable" status for a while now. And 4.18 isn't even listed.

But that's a whole different matter.

Evidently it's OK to run all binaries built against newer kernels with older kernels, within reason

One thing to keep in mind is - Linode kernels do not include SELinux, and Fedora by default runs with SELinux enabled.

Booting Fedora with an SE-less Linode kernel should be harmless and work normally in all other ways (except SELinux will not be available in the booted system).

Linode kernels do not include SELinux, and Fedora by default runs with SELinux enabled

Good point. The longer you run with SELinux disabled, the more likely there'll be trouble when you try to go back to it.

When you boot back into Fedora after performing this fix, you may see a lot of audit messages complaining about SELinux context. This is due to the recovery kernel not having SELinux and contexts getting messed up. The fix is:
fixfiles onboot
Reboot and the system will perform SELinux relabeling. This should solve any SELinux context problems you might be having.
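As far as I know, fixfiles onboot queues the relabel by leaving a /.autorelabel flag file in the root filesystem. Assuming that, a small check like the following (relabel_scheduled is a hypothetical helper name) can confirm the relabel is queued before you reboot.

```shell
# Report whether an SELinux relabel is queued for the next boot.
# ROOT defaults to /; pass a mount point to inspect a rescue-mounted disk.
relabel_scheduled() {
    root=${1:-/}
    if [ -e "${root%/}/.autorelabel" ]; then
        echo "relabel scheduled"
    else
        echo "no relabel scheduled"
        return 1
    fi
}
```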

Having learned a little more about what's involved in the issue, I hope you'll forgive me for revising the recommended solution again.

Here is more information to help you decide how to handle release upgrades to Fedora 30.

The problem is that Fedora 30 changes to a new grub configuration that is incompatible with the Linode boot ROM used to boot native kernels, rendering the node unbootable. Three alternative methods to repair or avoid an unbootable node have been discussed so far. They are

  1. Upgrade while running a native Fedora distribution kernel. The final reboot will fail. Rescue the node as described in this answer.
  2. Boot the node with a Linode kernel before doing the upgrade. When the upgrade completes it will reboot again, successfully, with the Linode kernel. Repair the node as recommended here. Then reboot back to your native kernel node configuration.
  3. Proceed as in method 1., but instead of working in a chroot in rescue mode, do the repair work while booted into a Linode kernel.

Method-1 is what I did to repair a production node when I hit this problem. It works.

Method-2 was suggested as a safer alternative to method-1. It works too, but when I tested method-2 I did it quickly without considering the SELinux implications. I just created a new nanode with native F29 kernel, brought it up to date, switched over to a Linode kernel for the upgrade and repair, switched back to native kernel, checked that it booted, destroyed the node. Never looked at any logs.

After reading dschadlich1's helpful comment I decided to run through scenario 2 again and pay more attention. In that case, I didn't have to run fixfiles onboot. SELinux detected the problem and fixed it automatically, reporting

*** Warning -- SELinux targeted policy relabel is required.
*** Relabeling could take a very long time, depending on file
*** system size and speed of hard drives.

during the last boot back into the native kernel.

The upshot is that release-upgrading under a Linode kernel makes massive changes in the root filesystem without SELinux support, messing up SELinux context all along the way, whereas post-upgrade repair in a chroot only touches two files and leaves their SELinux context unchanged. That makes method-1 a far less invasive intervention than method-2. Method-3 has the same disadvantage as method-2 on a smaller scale.

I consider method-1 the best option at this point. For a node that has the default disk configuration with root fs on /dev/sda you can run the rescue shell script listed here verbatim, otherwise adjust accordingly.

After the upgrade and repair.

The job's still not done after getting F30 to boot successfully. It can break again.

When I ran

dnf reinstall grub2-tools

it appended a line

GRUB_ENABLE_BLSCFG=true

at the end of /etc/default/grub and wrote out an unbootable /boot/grub2/grub.cfg.
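The appended line matters even if an earlier line says false. To my understanding grub2-mkconfig reads /etc/default/grub as a shell fragment, so the last assignment wins. Under that assumption, this sketch (effective_bls is my own name for it) reports the value that will actually take effect.

```shell
# Print the value of the LAST GRUB_ENABLE_BLSCFG assignment in the file,
# or "unset" when there is none. Assumes later assignments override
# earlier ones, as they would in a sourced shell fragment.
effective_bls() {
    file=${1:-/etc/default/grub}
    value=$(grep '^GRUB_ENABLE_BLSCFG=' "$file" | tail -n 1 | cut -d= -f2- | tr -d "\"'")
    echo "${value:-unset}"
}
```

Running it after any grub2-tools update is a cheap way to notice that BLS has been switched back on.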

I fixed that. When I ran

chmod -w /etc/default/grub
dnf reinstall grub2-tools

the reinstall completed without complaint and left my /boot/grub2/grub.cfg alone.

Hmmm, would that work during a release upgrade? I tried making /etc/default/grub read-only before the upgrade. Starting again from F29, I worked up to just before the upgrade reboot and ran

chmod -w /etc/default/grub
dnf system-upgrade reboot

That left the node unbootable and unrescueable: the rescue job refused to start, saying it couldn't find something on /dev/sda and /dev/sdb. Even after rebuilding the node in the Manager the rescue problem persisted, and the node had to be scrapped.

So, keeping /etc/default/grub readonly provides a small measure of safety, but still requires continued vigilance and hacking to avoid trouble in the future.

A Direct Disk node configuration is a Linode boot option that chainloads an MBR of your own creation. Native grub tooling on Fedora would then produce a complete boot environment compatible with itself, right down to the MBR. But that requires you to migrate the node to a partitioned disk, which disqualifies the node from using Linode auto backup services and limits disk resizing options. That's beyond the scope of this post. Read more about it here.

It would fix everything in the long term if Linode provided BLS-capable ROM images for booting native kernels.

Another day older and a little wiser.

The bad news is that installing a kernel update from the F30 repos renders the node unbootable again, even with BLS properly disabled in /etc/default/grub.

The good news is that this is an RTFM problem.

During kernel updates, grub2-mkconfig gets wrapped inside layers of grubby and package installation scripts that do not respect the BLS flag in /etc/default/grub. The fix is in a package called grubby-deprecated that's new in Fedora 30.

Fedora 30 ChangeSet doc for BLS-style configuration has an upgrade impact section that mentions

On Fedora 30, the script to switch to a BLS configuration will be automatically executed on grubby upgrade, and the old grubby tool will be moved to a grubby-deprecated package. So users can switch back to a non-BLS configuration by restoring the old configuration from a backup file and installing the grubby-deprecated package.

users can also switch back by installing the grubby-deprecated package, removing "GRUB_ENABLE_BLSCFG=true" from /etc/default/grub , and using grub2-mkconfig to re-generate their configuration file.

Actually it's not enough to remove "GRUB_ENABLE_BLSCFG=true" from /etc/default/grub. You have to explicitly say "GRUB_ENABLE_BLSCFG=false" or else the next grub2-tools update will set it true again and then ruin your /boot/grub2/grub.cfg file.
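Here is a rough sketch of making that explicit setting repeatable, so that running it twice does not pile up duplicate lines. The function name ensure_bls_disabled is hypothetical, and it only edits the defaults file; you still have to rerun grub2-mkconfig yourself.

```shell
# Make sure the file ends up with exactly one GRUB_ENABLE_BLSCFG line,
# set to false, placed last so nothing later in the file overrides it.
ensure_bls_disabled() {
    file=${1:-/etc/default/grub}
    [ -f "$file" ] || return 1
    tmp=$(mktemp) || return 1
    # Drop every existing assignment, whatever its value.
    # grep exits non-zero when no lines survive, which is fine here.
    grep -v '^GRUB_ENABLE_BLSCFG=' "$file" > "$tmp" || true
    echo 'GRUB_ENABLE_BLSCFG=false' >> "$tmp"
    # Write back through the original path to keep its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Try it on a copy of /etc/default/grub first if you want to see what it does.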

Putting the right pieces together before upgrading to F30 sidesteps recovery/rescue/repair altogether. See next comment.

Running native Fedora 30 kernels on a Linode requires the grubby-deprecated package and a properly configured /etc/default/grub. These changes must be in place before upgrading to F30 or the node will be unbootable after the upgrade. Fedora 29 does not have a grubby-deprecated package but you can shoehorn one in. Upgrade as follows.

Have a backup. Run

dnf system-upgrade download --releasever=30
# Downloaded packages will not include grubby-deprecated; but it has
# imported the GPG key for F30 repo and the following should succeed
rpm -i --replacefiles https://dl.fedoraproject.org/pub/fedora/linux/releases/30/Everything/x86_64/os/Packages/g/grubby-deprecated-8.40-30.fc30.x86_64.rpm
# Prepare grub defaults with BLS explicitly disabled
echo GRUB_ENABLE_BLSCFG=false >> /etc/default/grub
# Ready to upgrade
dnf system-upgrade reboot

and when the upgrade completes the node will reboot successfully. Done.

If you've already upgraded without this prep and the node is unbootable, start it in rescue mode and in the rescue shell run

mount -o exec,barrier=0 /dev/sda   # or wherever your root fs is
cp /media/sda/boot/grub2/grub.cfg{.rpmsave,}
umount /media/sda
reboot

When the node comes back it'll be running the last F29 kernel that was installed. Then run

echo GRUB_ENABLE_BLSCFG=false >> /etc/default/grub
dnf install grubby-deprecated
dnf reinstall kernel-core

The next reboot will boot latest installed F30 kernel. Done.

With this setup I've been able to continue applying package updates without any of the recurring damage mentioned in previous posts on this article.
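The rescue-shell copy of grub.cfg.rpmsave shown earlier overwrites the broken config without checking anything. A slightly more cautious sketch follows; restore_grub_cfg and the .bls-broken suffix are names I picked for illustration.

```shell
# Restore grub.cfg from its .rpmsave copy, keeping the broken file around.
# Pass the full path, e.g. /media/sda/boot/grub2/grub.cfg in rescue mode.
restore_grub_cfg() {
    cfg=$1
    if [ ! -f "$cfg.rpmsave" ]; then
        echo "no $cfg.rpmsave to restore from" >&2
        return 1
    fi
    # Keep the unbootable config for later inspection.
    if [ -f "$cfg" ]; then
        cp -p "$cfg" "$cfg.bls-broken"
    fi
    cp -p "$cfg.rpmsave" "$cfg"
}
```

If the function returns non-zero, nothing has been changed and the other repair recipes in this thread still apply.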

I was beating my head against a wall for a couple hours last night trying to figure out why my upgrade was failing until the kind support staff pointed me to this article. Thanks for the thorough research, @tome. I followed your steps in the previous post (install grubby-deprecated and edit /etc/default/grub before doing the first reboot) and it worked perfectly.

This is why I never install new releases on day one, and why I'm rather religious about keeping daily backups. :) (And here I thought I was the only person using Fedora around here…)

I'll echo your concerns about how this will be addressed going forward, however. I'm a little leery depending on a deprecated package, and I'm already starting to feel nervous about Fedora 31. I'm glad Linode was able to tweak new F30 installs, but if BLS will be required going forward, that could be a long-term problem.

if BLS will be required going forward, that could be a long-term problem

Indeed. Non-BLS already looks like the ugly stepchild of booting methods in Fedora 30. You have to opt out of BLS, and the ChangeSet instructions for opting out involve a recovery step even on bare metal.

Thanks @jtdarlington for your feedback and thanks for adding your support for a long term solution. I love Fedora and want to have a strong ecosystem at Linode to run it in.

As a last resort one can migrate the node to a partitioned disk, let the Fedora utilities build a complete boot environment right down to the MBR, and then use Linode's Direct Disk option to chain load your MBR.

Hi @tome and everyone else!

Just wanted to follow up on this thread to mention that BLS is now supported on Linode, and our Fedora 30 image has been updated to boot with a native BLS configuration. Upgrades from older versions of Fedora should now also work without any intervention.

Let me know if you are still experiencing any issues.

Hey @lblaboon, that's great news. Big thumbs up to Linode support and engineering!

@lblaboon I tried switching back to GRUB 2 from the Linode kernel, but I get stuck on the GRUB screen. Running configfile /boot/grub2/grub.cfg shows this:

error: file `/boot/grub/i386-pc/increment.mod' not found.
error: file `/boot/grub/i386-pc/blscfg.mod' not found.
error: can't find command `blscfg'.
error: file `/boot/grub/i386-pc/increment.mod' not found.
error: file `/boot/grub/grubenv' not found.

Running sudo grub2-install /dev/sda while booted with the Linode kernel shows this:

Installing for i386-pc platform.
grub2-install: warning: File system `ext2' doesn't support embedding.
grub2-install: warning: Embedding is not possible.  GRUB can only be installed in this setup by using blocklists.  However, blocklists are UNRELIABLE and their use is discouraged..
grub2-install: error: will not proceed with blocklists.

There is a possibly relevant Red Hat issue.

@mohd-akram Hey, sorry for the late reply (looks like your @ didn't work for some reason).

It looks like you might have run into the issue described in the bug report you linked to. The following commands should get your system working again:

cp /usr/lib/grub/i386-pc/{increment,blscfg}.mod /boot/grub2/i386-pc
grub2-mkconfig -o /boot/grub2/grub.cfg

Let me know if you are still running into any issues.
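Before copying anything, it may help to see which modules are actually missing. This is a sketch only; missing_grub_modules is a made-up helper, and the directories and module names are passed in rather than hard-coded.

```shell
# List which of the named GRUB modules are absent from the destination
# directory, and whether the source directory could supply them.
# Usage: missing_grub_modules SRC_DIR DST_DIR module...
missing_grub_modules() {
    src=$1
    dst=$2
    shift 2
    missing=0
    for mod in "$@"; do
        if [ ! -f "$dst/$mod.mod" ]; then
            if [ -f "$src/$mod.mod" ]; then
                echo "$mod.mod: missing from $dst, available in $src"
            else
                echo "$mod.mod: missing from $dst and from $src"
            fi
            missing=1
        fi
    done
    return "$missing"
}
```

For the case above the call would be missing_grub_modules /usr/lib/grub/i386-pc /boot/grub2/i386-pc increment blscfg, which returns non-zero if either module is missing.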

@lblaboon I think @'s might not be working in Safari. Tried it, but still the same issue. I'm not sure why it says it's looking in /boot/grub instead of /boot/grub2.

I guess they're buggy in Chrome too. @lblaboon

@mohd-akram /boot/grub should be a symlink on your system pointing to /boot/grub2. If it's not there you can create it with ln -s /boot/grub2 /boot/grub.
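If you want to script that check, something like the sketch below would do. ensure_grub_symlink is a hypothetical name, and unlike the ln -s command above it creates a relative link, which is my own choice so the link also resolves when the disk is mounted elsewhere (for example under /media/sda in rescue mode).

```shell
# Create BOOT/grub -> grub2 unless something already exists at BOOT/grub.
ensure_grub_symlink() {
    boot=${1:-/boot}
    if [ -e "$boot/grub" ] || [ -L "$boot/grub" ]; then
        echo "$boot/grub already exists, leaving it alone"
        return 0
    fi
    if [ ! -d "$boot/grub2" ]; then
        echo "$boot/grub2 not found" >&2
        return 1
    fi
    # Relative target: resolved from inside the boot directory itself.
    ln -s grub2 "$boot/grub"
}
```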

@lblaboon It got past the GRUB screen, but I got all sorts of avc errors and had to set SELinux=permissive in /etc/selinux/config to get SELinux to relabel properly.

Is the requirement to symlink /boot/grub2 to /boot/grub because of BLS? Previously only grub.cfg was symlinked from that directory which is what the docs here say. Perhaps they need to be updated.

I'm guessing this is a Linode thing as in a local fresh installation of Fedora 30 there is no /boot/grub at all.

@mohd-akram The reason for the symlink is because our copy of GRUB expects to find GRUB's files in /boot/grub, whereas Fedora puts them in /boot/grub2. Our Fedora images (and also a few others) have this symlink out of the box, but if you installed Fedora through some other means then you would need to manually create it.

@lblaboon Thanks, I've wondered about that symlink and now you've cleared the mystery. BTW while I'm writing this I discover that @user followed immediately by punctuation doesn't work. In fact, having typed any non-working @, say, by adding superfluous characters after it, or by backing up and inserting it immediately before other non-space characters, then it cannot be fixed by inserting a space at the correct location, it stays broken. Only appending an @user to the end of a message and using tab completion to finish it seems to work as intended (and then adding a space and continuing the message is fine). FWIW I'm composing this in Firefox.

Thank you for providing information about Fedora 30 kernels details and implementation for a solution.

@lblaboon I just ran into this problem after upgrading on CentOS 8. I was able to get booted via the Linode console, found this thread, and disabled BLS, but is there a known issue with BLS on CentOS 8?

@RogerT I am not aware of any issues with BLS on CentOS 8. Do you have any more details?

@lblaboon I guess RogerT was happy with just disabling BLS. I'm seeing problems with CentOS 8 too though, e.g.

grub> list_env
saved_entry=3e729c2d7c094902af0333ce40564ffe-4.18.0-147.5.1.el8_1.x86_64
kernelopts=root=/dev/sda ro console=ttyS0,19200n8 net.ifnames=0
crashkernel=auto rhgb
grub> configfile (hd0)/boot/grub2/grub.cfg

                             GNU GRUB  version 2.02

 +----------------------------------------------------------------------------+
 | CentOS Linux (4.18.0-147.5.1.el8_1.x86_64) 8 (Core)                        | 
 | CentOS Linux (4.18.0-147.3.1.el8_1.x86_64) 8 (Core)                        |
 | CentOS Linux (4.18.0-80.11.2.el8_0.x86_64) 8 (Core)                        |
 |*CentOS Linux (0-rescue-3e729c2d7c094902af0333ce40564ffe) 8 (Core)          |
 |                                                                            |
 |                                                                            |
 |                                                                            |
 |                                                                            |
 |                                                                            |
 |                                                                            |
 |                                                                            | 
 +----------------------------------------------------------------------------+

      Use the ^ and v keys to select which entry is highlighted.          
      Press enter to boot the selected OS, `e' to edit the commands       
      before booting or `c' for a command-line. ESC to return             
      previous menu.                                                      
   The highlighted entry will be executed automatically in 1s.                 

In this example, I rebooted after applying updates and the node didn't come back. Using Lish, I found it sitting at the grub screen with the bottom rescue option selected and no countdown. The configuration looks correct though.

In this case, it appears to have been resolved by removing the rescue entry, i.e.

# rm /boot/loader/entries/3e729c2d7c094902af0333ce40564ffe-0-rescue.conf

That entry contained invalid paths, missing the /boot prefix, seemingly built on the idea that /boot is its own partition, which for Linode default builds is not the case:

title CentOS Linux (0-rescue-3e729c2d7c094902af0333ce40564ffe) 8 (Core)
version 0-rescue-3e729c2d7c094902af0333ce40564ffe
linux /vmlinuz-0-rescue-3e729c2d7c094902af0333ce40564ffe
initrd /initramfs-0-rescue-3e729c2d7c094902af0333ce40564ffe.img
options $kernelopts
id centos-20200205020746-0-rescue-3e729c2d7c094902af0333ce40564ffe
grub_users $grub_users
grub_arg --unrestricted
grub_class kernel

I didn't try correcting the paths to see if that would somehow make it function properly, but, it seems that perhaps the rescue entry that's left as part of the default CentOS 8 image is confusing the version of grub that's in use.
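To spot entries like this one without reading each file by hand, a check along these lines could help. It is a sketch under assumptions: check_bls_entry is a name I invented, it treats the linux and initrd paths as relative to the root of the filesystem GRUB reads, and it only looks at the first path on each line.

```shell
# Check that the kernel and initrd named in a BLS entry exist.
# ROOT is where the filesystem GRUB boots from is mounted (/ on a running
# node with no separate /boot partition, /media/sda in rescue mode).
check_bls_entry() {
    root=${1%/}
    entry=$2
    status=0
    for path in $(awk '$1 == "linux" || $1 == "initrd" { print $2 }' "$entry"); do
        if [ -e "$root$path" ]; then
            echo "ok      $path"
        else
            echo "missing $path"
            status=1
        fi
    done
    return "$status"
}
```

Looping it over /boot/loader/entries/*.conf would flag any entry whose paths lack the /boot prefix on a single-partition disk.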

Another possibility here would be to configure grub to boot from the first menu entry.

If you edit /etc/default/grub, change GRUB_DEFAULT=saved to GRUB_DEFAULT=0, and then run

grub2-mkconfig -o /boot/grub2/grub.cfg

the system will then default to booting from the latest kernel.

@dtucny @rpeterson Thanks for the updates! We will be updating our CentOS images accordingly with the GRUB_DEFAULT change.

Hi @lblaboon

I seem to have this on Fedora 31. I just upgraded from Fedora 30 and noticed I was still booting a Fedora 29 kernel. I was able to reinstall grub2 etc. to get to where I could boot a Fedora 31 kernel. After setting GRUB_ENABLE_BLSCFG=false in /etc/default/grub, things work.

After seeing your comment "BLS is now supported on Linode" I tried setting GRUB_ENABLE_BLSCFG=true, and I get the problem described by @mohd-akram

I am able to boot linode latest 64bit to recover.

Is BLS broken on Grub2 using Fedora 31?

I have had issues with BLS on Fedora 32, never booting into the newest kernel.

I have had some success using the steps I documented on this post although it wasn’t 100% reproducible every time.

I ended up making an image with a working setup that I built my 2 Linodes from, and I’ve gone through 3 kernel updates since with no additional steps needed post-update and Fedora automatically boots the latest kernel.

Thanks @andysh

In my case each reboot I'm booting the right kernel already, with GRUB_DEFAULT=0 in /etc/default/grub, but once I change GRUB_ENABLE_BLSCFG=false to true, the system doesn't boot.

I ran grub2-switch-to-blscfg manually to ensure that /boot/loader/entries/ has what (I think?) BLS requires, no luck.

I don't know if this is related, but I also have another [glish grub2 issue](https://www.linode.com/community/questions/20473/glish-console-error-variable-prefix-isnt-set).

When your Linode doesn’t boot, what do you get on the console (Lish or Glish?)

I’m not overly familiar with Fedora just yet as I’ve only used it for a month or so having spent the last 10(ish) years with Ubuntu.

The only other thing I can suggest is booting up a fresh Fedora image and comparing config files related to Grub and BLS.
