4096 server performs much worse than a 2048?

My workplace is in the process of switching from dedicated servers to Linode for hosting our website, and I'm trying to decide which plans to choose for each server. This has been pretty straightforward for most of the servers, except for the one that'll host the main part of our site, which is a fairly complex Rails application. The application gets a lot of traffic, and I want a good idea of how Linode will perform before migrating. I decided to start by ordering a 4096 Linode and a 2048 Linode (both in the Newark, NJ datacenter), provisioning them identically using Puppet, and benchmarking them by running ApacheBench from a third Linode against the private IPs. Both Linodes are running CentOS 6.5 64-bit and have the same configuration profile.
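For reference, the benchmark runs were roughly like the sketch below. The request count, concurrency, and private IP are placeholders, not my actual numbers; on CentOS, ab comes from the httpd-tools package.

```shell
#!/bin/sh
# ApacheBench sketch: N requests, C at a time, over the private network.
# The target IP and path are placeholders -- substitute your own Linode's
# private IP and the endpoint you want to exercise.
REQUESTS=1000
CONCURRENCY=10
TARGET="http://192.168.133.2/"   # placeholder private IP

if command -v ab >/dev/null 2>&1; then
    ab -n "$REQUESTS" -c "$CONCURRENCY" "$TARGET" || echo "target unreachable"
else
    echo "ab not installed (on CentOS: yum install httpd-tools)"
fi
```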

To my surprise, the 2048 performed significantly better in terms of latency: average response times were about 60% higher on the 4096 Linode, and there was much more variation. The 4096 performed better in terms of throughput, since it could run more Passenger workers, but I'm more concerned about latency. The parts of the Rails application I'm benchmarking are mainly CPU-bound, so I'm guessing this is due to the noisy-neighbor problem (i.e., the host the 4096 is on has a lot more tenants using the CPU than the 2048's host). To check, I ran "sysbench --test=cpu" on both Linodes, and the results seem to confirm my suspicion:

4096 Linode:

# sysbench --test=cpu --cpu-max-prime=100000 --num-threads=2 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 2

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 100000

Test execution summary:
    total time:                          266.7814s
    total number of events:              10000
    total time taken by event execution: 533.4837
    per-request statistics:
         min:                                 33.73ms
         avg:                                 53.35ms
         max:                                202.86ms
         approx.  95 percentile:              89.52ms

Threads fairness:
    events (avg/stddev):           5000.0000/1.00
    execution time (avg/stddev):   266.7419/0.01

2048 Linode:

# sysbench --test=cpu --cpu-max-prime=100000 --num-threads=2 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 2

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 100000

Test execution summary:
    total time:                          141.9363s
    total number of events:              10000
    total time taken by event execution: 283.8482
    per-request statistics:
         min:                                 27.67ms
         avg:                                 28.38ms
         max:                                 32.22ms
         approx.  95 percentile:              29.39ms

Threads fairness:
    events (avg/stddev):           5000.0000/0.00
    execution time (avg/stddev):   141.9241/0.00

Is there something else that can explain this? The next thing I'm going to try is switching datacenters, but it feels like I'm missing something.

EDIT: I got the plans wrong in my original message: the server I said was an 8192 is actually a 4096, and the 4096 is actually a 2048. Sorry for any confusion!

7 Replies

Have you asked Linode support about this?

@Main Street James:

Have you asked Linode support about this?

No. I figured I'd ask here in case I was doing something stupid so I wouldn't bother support unnecessarily.

@masonm:

sysbench --test=cpu --cpu-max-prime=100000 --num-threads=2 run

Perhaps the difference is due to the fact that your benchmark only uses 2 threads. Similarly, your Rails app probably uses only one thread per request.

A lot of things could be different between the host that houses your 2GB Linode and the one that houses your 4GB Linode. One of the possibilities is that the 4GB host has a larger number of slower CPUs.

This would slow down single-threaded CPU-bound apps, but the total amount of CPU that is shared among the tenants would be similar, and there would be half as many tenants on average. (The fact that you only see 8 cores in both cases is irrelevant because you're never supposed to max out all the cores.)

Linode has gone through many generations of servers, so I wouldn't be surprised if this were the case. And of course there could be noisy neighbors as you said.

As hybinet brought up, the two servers may have different model CPUs. You can check that yourself with cat /proc/cpuinfo.

To see if noisy neighbors are limiting your CPU resources, run something CPU-intensive, open top or htop, and see if the "st" (steal) percentage is high.
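If you'd rather script it than eyeball top, steal time is the eighth time field on the "cpu" line of /proc/stat (see proc(5)). A rough sampler, assuming a Linux guest, would look something like this:

```shell
#!/bin/sh
# Rough steal-time sampler: read the aggregate "cpu" line from /proc/stat
# twice, one second apart, and report steal as a percentage of elapsed jiffies.
# Fields after "cpu": user nice system idle iowait irq softirq steal ...
read_cpu() { awk '/^cpu /{print $2,$3,$4,$5,$6,$7,$8,$9}' /proc/stat 2>/dev/null; }

s1=$(read_cpu); sleep 1; s2=$(read_cpu)

echo "$s1 $s2" | awk '{
    t1 = $1+$2+$3+$4+$5+$6+$7+$8        # total jiffies, first sample
    t2 = $9+$10+$11+$12+$13+$14+$15+$16 # total jiffies, second sample
    steal = $16 - $8                    # steal delta between samples
    total = t2 - t1
    if (total > 0) printf "steal: %.1f%%\n", 100 * steal / total
    else print "steal: n/a"
}'
```

On a quiet host this should hover near 0%; sustained double-digit steal means tenants on the same host are eating your CPU time.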

@mnordhoff:

As hybinet brought up, the two servers may have different model CPUs. You can check that yourself with cat /proc/cpuinfo.

Yes, I meant to include that in my original post. The 4096 has a Xeon E5-2670, while the 2048 has a Xeon E5-2680 v2. From some Googling, it looks like the E5-2680 is a bit faster, but not nearly enough to account for the differences I'm seeing.

> To see if noisy neighbors are limiting your CPU resources, run something CPU-intensive, open top or htop, and see if the "st" (steal) percentage is high.

Cool, I didn't know about the steal percentage metric. I ran sysbench again while monitoring the steal percentage on both hosts. On the 2048 it never went above 1%, while on the 4096 it fluctuated widely from ~6% to ~70%. That pretty much cinches it. I'm going to file a support ticket to have the 4096 moved to California and hope I have better neighbors this time.

Thanks for your help hybinet and mnordhoff!

Keep in mind that the "v2" in the model number means it's a different generation: Ivy Bridge vs. Sandy Bridge. On top of that, the E5-2670 is an 8-core CPU and the E5-2680 v2 is a 10-core CPU. And then the clock speed is higher…

Benchmarks seem to indicate a ~10% performance improvement from the newer architecture; then you've got an ~8% improvement in clock speed and a 25% improvement in core count. Overall, that should produce a ~48% performance improvement, which would account for a good chunk of the difference you're seeing.
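A quick back-of-the-envelope check, treating the three estimated factors as independent (they compound multiplicatively rather than add):

```shell
# Compound the three estimated gains: ~10% IPC, ~8% clock, 25% core count.
awk 'BEGIN { printf "combined speedup: %.1f%%\n", (1.10 * 1.08 * 1.25 - 1) * 100 }'
```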
