Inconsistent performance across Linodes
A few weeks ago we moved from a large single-server setup to HAProxy plus several smaller web nodes.
The process was fairly painless and, although some requests are "heavier" than others, load across the server group appears more or less uniform. We monitor with New Relic (APM and Infrastructure) and use it for alerting.
All of these servers are 4 GB Linodes, and we use an automated scripting system to provision, set up, and deploy them. They run only nginx + php-fpm plus standard system services such as sendmail, ssh, and the New Relic daemons. They all run exactly the same versions of PHP and PHP-FPM with identical configurations.
Most recently over the weekend, and earlier last week as well, we have had issues with one of the servers running at a VERY high load average: top reports a 15-minute load average of ~43 (I've seen spikes up to 55), while the other servers are sitting comfortably far lower:
<img alt="new-relic" src="https://uccd2a804fa8d4a5bdef03f93469.dl.dropboxusercontent.com/cd/0/get/A74w_oS5qqcNzQaHPvfXc1TIafbm0FcVRr7VpTWRaJpFtfAyxCLe4zlAmBqPpvBUhZ73Q07yauuX-eBafExoWXwir097Mi5tR8ZntedoEi8vb-rULtUfSrNtUsXTD4v5680/file?dl=1">
Logging into the servers, the one that is triggering the alerts appears to be on a host with higher CPU steal. In the screenshot it shows 6%, but I've seen it spike to between 30% and 50% while logged into the server. I'm not sure how top calculates that figure.
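For what it's worth, you don't have to rely on top's display at all: on Linux the steal counter is the 8th field on the aggregate `cpu` line of `/proc/stat`. This is a minimal sketch (not top's exact method, which I don't know either) that samples that counter twice, one second apart, and reports the delta as a percentage of total CPU time over the interval:

```shell
#!/bin/sh
# Sample the aggregate "cpu" line of /proc/stat twice, one second apart.
# Field order: user nice system idle iowait irq softirq steal [guest ...]
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat

# Total jiffies elapsed across all counted states during the interval.
total=$(( (u2+n2+s2+i2+w2+q2+sq2+st2) - (u1+n1+s1+i1+w1+q1+sq1+st1) ))

# Steal time as a percentage of the interval.
steal_pct=$(( 100 * (st2 - st1) / total ))
echo "steal: ${steal_pct}%"
```

A one-second sample is noisy (steal tends to be bursty), so for anything beyond eyeballing you'd want to average over a longer window.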
I'm trying to figure out how to get consistent performance across all machines. Earlier in the week, when I had this problem, I had to tear down the server, provision a new one, and then wait to see whether it had the same issue. I know that I can provision "dedicated" machines, but surely if I provision and set up "a 4 GB Linode" it should perform roughly the same as my others, right?
Am I missing something here? Or is my only option to spin up a Linode, check whether it has CPU steal issues, and if it does, tear it down and try again?
CPU steal is an issue that can occur on any Linode that isn't running dedicated CPUs. You could provision a Linode, check its CPU steal, and if it's too high, provision another and hope you get a lower steal rate. The best way to get consistent performance, however, is to go with a dedicated Linode, so you aren't dealing with CPU steal at all.
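If you do go the provision-and-check route, the check itself is easy to fold into provisioning automation as an acceptance gate. This is only a sketch under assumptions: the 10% threshold and one-second sample are made-up values, and the surrounding provision/teardown calls are left to your existing tooling. The script exits non-zero when sampled steal exceeds the threshold, so a wrapper can discard the node and retry:

```shell
#!/bin/sh
# Hypothetical acceptance gate: run on a freshly provisioned node; a
# non-zero exit tells the calling automation to reject this node.
THRESHOLD=10   # percent steal we're willing to tolerate (assumed value)

# One-second steal sample from /proc/stat (8th field of the "cpu" line).
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat
total=$(( (u2+n2+s2+i2+w2+q2+sq2+st2) - (u1+n1+s1+i1+w1+q1+sq1+st1) ))
steal_pct=$(( 100 * (st2 - st1) / total ))

if [ "$steal_pct" -gt "$THRESHOLD" ]; then
    echo "steal ${steal_pct}% exceeds ${THRESHOLD}%, rejecting node" >&2
    exit 1
fi
echo "steal ${steal_pct}% within threshold"
```

In practice you'd sample over minutes rather than a single second, since a host can look quiet at provision time and get busy later.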
As far as I'm aware, the host chosen for a Linode is random. As a result, there's a chance you'll be placed on a host where CPU use is high across multiple Linodes, increasing the rate of CPU steal. With a dedicated Linode that won't be an issue, because each instance is provisioned its own CPU cores that no other instance can access.
Hopefully this information is helpful for you. Good luck!
Reach out to the support team; they are really accommodating, and with CPU steal that high I can't imagine they wouldn't do anything.
On a couple of occasions I've asked to be moved to a host with a newer CPU, or to a less-loaded host to resolve an issue with a block storage connection, and they've always found another host and set up a migration that I could kick off myself.
In my experience, their support team is one of the best in the industry.
While that's true, it couldn't be automated as easily as the scripted provisioning @quibblejibbets is already using. Support could definitely help, but for automating the deployment of Linodes that perform consistently, I think dedicated Linodes are the best way to go. It depends on whether you need consistent performance as quickly as possible or can tolerate some waiting: you'll be waiting on support even if you phone them to speed up the process, whereas scripted setup with dedicated Linodes involves almost no waiting.