Unexplained reboot

SUMMARY

My node rebooted without apparent cause, at midnight UTC between Saturday 21 and Sunday 22 Jan 2023. Support says it wasn't maintenance or a problem with the VM host. Nothing helpful in the on-node logs. I'm trying to find out what happened, or at least set things up so I can find out next time.

ENVIRONMENT

  • Simple Linode, just set up a few weeks ago
    • 1 CPU, 1 GB RAM, 25 GB VHD
    • No Linode Volumes or other add-on features in use
  • Debian "bullseye" 11.6
    • Kernel 5.10.0-20 / 5.10.158-2
    • Current with latest Debian updates at the time
  • Not running many services yet
    • SSH for admin
    • Serving DNS for a few domains
    • Postfix, Dovecot, and Apache are up for testing but not in use yet
    • Only one user account besides root (my own)
  • I am fairly confident in the security of the system
    • I have 2FA enabled on my Linode account (using TOTP)
    • I use SSH public key for all admin/access
    • Password SSH access is disabled on the node
    • I have a strong passphrase on my SSH private key
    • The only computer that has my private key should be my desktop
    • SSL is configured and required on Dovecot
    • The root password is long and completely random characters
    • The user account has a strong passphrase for its password
    • I use apticron (email-only) and typically install updates within 24 hours

INITIAL SYMPTOMS

I received an email from Linode, dated Sunday 2023 January 22 at 00:15:13 UTC, saying Lassie had booted my node. Completion time was given as 00:00:44 UTC.

Checking my desktop PC, I found all my SSH sessions to the node had been disconnected uncleanly ("broken pipe"). Whatever happened, it killed the SSH daemon on the node before it could close down the connection gracefully.

In the Linode web UI, the "Activity Feed" simply shows "Linode bean has been booted by the Lassie watchdog service" at 00:00 UTC.

I was able to SSH back in to the node without trouble. The system uptime agreed with the reboot reports.

INVESTIGATION

The on-node system logs show what looks like a normal boot process, with the first messages at 00:01:13 (kernel banner, systemd, etc.).

Immediately prior to that, the logs show nothing out of the ordinary. Overall the system was quiet, with no authentication attempts for hours before the reboot.

The last pre-boot log message is at 00:00:09 UTC, a routine on-node firewall (iptables) reject message. The message before that, at 00:00:00 UTC, is systemd finishing the "system activity accounting tool".

The same kernel version was running before and after, so it wasn't a kernel upgrade forced upon me somehow.

The LISH console showed a normal login prompt for the node, with Debian startup messages immediately prior. Using CTRL-A, ESC, I was able to scroll back through the boot messages, then the GRUB menu, then the QEMU SeaBIOS banner, at which point the scrollback stops. There is nothing from before the boot. All the messages that are there look like a normal boot to me.

I have no auto-upgrade or auto-reboot configured. apticron is installed but with defaults, which merely emails me when updates are available (and none were at the time). unattended-upgrades is not installed.
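For anyone who wants to double-check the same things on their own node, the absence of auto-reboot machinery can be confirmed along these lines (a sketch; the package and path names are standard Debian, but verify against your own system):

```shell
# Confirm unattended-upgrades is not installed (prints nothing if absent)
dpkg -l unattended-upgrades 2>/dev/null | grep '^ii'

# Confirm apt has no Automatic-Reboot settings configured anywhere
apt-config dump | grep -i 'Automatic-Reboot'

# apticron's default configuration only mails about available updates;
# it never installs anything itself
grep -v '^#' /etc/apticron/apticron.conf 2>/dev/null
```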

SUPPORT TICKET

I opened a ticket with Linode support. After I rejected the first canned reply, I got someone who looked at the problem, and they said:

  • No planned maintenance at the time
  • No other known host-related issues
  • No clues from the console output from before the reboot

They suggested I ask here.

CONCLUSIONS

I find the timing (right around 00:00:00 UTC) suspicious, but that's very inconclusive.

Unless I've missed something, I likely will never know what happened here, but maybe I can make things better for the next time.

Ideally, I would be able to have Lassie take no action if the OS requests a power off or halt, and have Lassie only reboot the node when the OS requests a reboot, but as I understand it, that's not an option. Either I turn Lassie off and OS-requests-reboot leaves the VM powered off, or I turn Lassie on and OS-requests-halt leads to a VM reboot. If there's a better way, please let me know.

Is there anything I can do to improve the availability of diagnostic information in a kernel-crash situation? Is there some way to preserve the console log across reboots?
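One thing I'm looking into on the kernel-crash side is Debian's kdump-tools package, which boots a reserved crash kernel on panic and saves a dump plus the dmesg buffer under /var/crash. I don't yet know whether it works under Linode's QEMU environment, and the memory reservation is painful on a 1 GB node, so treat this as a sketch:

```shell
# Install the crash-dump tooling (Debian bullseye)
apt install kdump-tools

# A crash kernel needs reserved memory via the kernel command line,
# e.g. crashkernel=128M added to GRUB_CMDLINE_LINUX_DEFAULT in
# /etc/default/grub (a big bite out of 1 GB), then:
update-grub

# After rebooting with the reservation in place, check the status
kdump-config show

# Make sure a panic actually reboots into the dump path instead of hanging
sysctl -w kernel.panic=10
sysctl -w kernel.panic_on_oops=1
```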

I'm open to other ideas and suggestions.

4 Replies

You may want to consider saving your console logs to a file with a cron job. This will keep the data persistent over reboots.

If you believe the reboot is suspicious, I suggest reading through this post from the Community Questions site titled "I've noticed some suspicious activity on my Linode, what do I do?". It offers some troubleshooting steps you can take to investigate behavior you do not recognize.

Finally, you could also set up some heftier monitoring for your instance. These guides offer some solutions for doing just that.

You may want to consider saving your console logs to a file with a cron job. This will keep the data persistent over reboots.

Unfortunately, if the kernel panics/hangs/crashes, the cron job isn't going to run. I have rsyslog configured to log kernel messages to /var/log/kernel without buffering; systemd's journal gets them too. So as much as possible, this is already done, and better. As noted, there was nothing of help there.
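For reference, the rsyslog rule in question looks something like this (the drop-in path is my choice; the key detail is the leading "-", which, if present, would enable buffered writes):

```
# /etc/rsyslog.d/kernel.conf -- sketch of the setup described above.
# Omitting the leading "-" on the file path asks rsyslog to sync after
# each message; on current rsyslog versions the sync must also be
# enabled globally:
$ActionFileEnableSync on
kern.*    /var/log/kernel
```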

When I say I want to capture the console, I mean, I'm hoping for some way that Linode's hosting environment can capture and preserve the output of the kernel as it crashes and burns. The cloud equivalent of plugging in the crash cart and reading the panic message on the screen.
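Failing that, the closest on-node approximation I know of is the kernel's netconsole module, which mirrors kernel console output over UDP to another machine, so the final panic messages can land somewhere that survives the crash. A sketch, assuming a listener at 192.0.2.10 and a local address of 203.0.113.5 (both made-up illustration addresses; also note that an empty target MAC means broadcast, which only works on the same L2 segment, otherwise you need the gateway's MAC):

```shell
# Format: netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6665@203.0.113.5/eth0,6666@192.0.2.10/

# On the receiving machine, anything that captures UDP will do, e.g.:
#   nc -u -l 6666 | tee -a node-console.log

# To make it permanent, load the module at boot with its options:
echo netconsole > /etc/modules-load.d/netconsole.conf
echo 'options netconsole netconsole=6665@203.0.113.5/eth0,6666@192.0.2.10/' \
    > /etc/modprobe.d/netconsole.conf
```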

If you believe the reboot is suspicious …

I find the timing suspicious in that a problem occurring right at 00:00:00 UTC suggests some kind of overflow or scheduled event. I just don't know what that would be.
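To chase the scheduled-event possibility, these are the places I know to look for anything that fires at exactly midnight (a sketch; adjust paths to taste):

```shell
# systemd timers, including inactive ones, with their next trigger times
systemctl list-timers --all

# Classic cron: system crontab, drop-in directories, and per-user crontabs
cat /etc/crontab
grep -r . /etc/cron.d/ 2>/dev/null
for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u "$u" 2>/dev/null; done

# at(1) jobs, if atd is installed
atq 2>/dev/null
```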

I have no reason to believe this was malicious. There's no sign of any failed authentication attempts to SSH ever. I run SSH on a non-standard port to cut down on log noise, and the port scanners haven't found this one yet. It's gotten probed for SMTP and HTTP, but only of the automated script/scan kind.

If the system is rootkit'ed, an intruder could be hiding from me, of course.

Finally, you could also set up some heftier monitoring for your instance.

I've already got logwatch, fail2ban, and UptimeRobot in place as a matter of course. To the best of my knowledge, none of the tools in that list will capture a kernel panic or other fatal system error. Am I missing something?

What output do you get from:

cat /etc/apt/apt.conf.d/50unattended-upgrades | grep "Unattended-Upgrade::Automatic-Reboot-Time"

And how long ago did you install these: Postfix, Dovecot, and Apache?

What output do you get from:
cat /etc/apt/apt.conf.d/50unattended-upgrades | grep "Unattended-Upgrade::Automatic-Reboot-Time"

cat: /etc/apt/apt.conf.d/50unattended-upgrades: No such file or directory

unattended-upgrades is not installed.

And how long ago did you install these: Postfix, Dovecot, and Apache?

Shortly after the OS was installed. Specifically, the apt install run with my main package list began at 2023-01-05 18:46:01 EST and finished at 19:01:22 EST.

Configuration and service enablement took place over the next several days. Postfix was enabled and started 2023-01-11 20:08:07 EST, Apache at 2023-01-14 19:49:16, Dovecot at 2023-01-16 16:11:25.
