disk latency weirdness
I've been seeing occasional slow database updates, and they correspond closely with disk-latency spikes. You can see a 9-month latency graph (from Datadog) here. While the graph only shows spikes of a few milliseconds, the underlying spikes can be quite large (up to 7 seconds!).
I've contacted Linode support. They have been responsive, but haven't said that anything changed at the host level (or pointed to a noisy neighbour). Looking at the graph linked above, you can clearly see that something changed around March 13th and has persisted ever since.
Is this something anyone else has seen? Any suggestions on what to do about it? Does it seem likely that this is a Linode-level issue rather than anything I have done? I can't see any provisioning or app changes that were made around that time.
I tried to locate a ticket on your account to get a better understanding of what troubleshooting was suggested. I couldn't find one, so some of this may be redundant with what you've already tried.
Some contention is expected in a shared virtual hosting environment, though steal can also be caused by internal factors. By opening a Support ticket, we can check the status of the host that your Linode is on. The following Community Question site post provides some helpful commands you can run to get a better sense of what could be internally causing these performance issues:
Since you mentioned a database in your question, I also wanted to bring up some issues that may be a factor if you're using MySQL.
One MySQL option, sync_binlog, is set by default (a value of 1 in MySQL 5.7.7 and later) to cause every transaction to be written to the binary log before it is committed. Another option, innodb_flush_log_at_trx_commit, causes the contents of the InnoDB log buffer to be written out to the log file at each transaction commit, and the log file is then flushed to disk. Again, this takes place with every single database transaction. While these two options work to make the server ACID compliant and minimize the risk of data loss, they can cause serious IO overhead if you have a high volume of database transactions, especially on a journalling filesystem.
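Before changing either option, it may be worth confirming what your server is currently using. The following is a minimal sketch rather than a definitive procedure: durability_query is a hypothetical helper name, and it assumes the mysql client can already log in (for example via credentials in ~/.my.cnf).

```shell
#!/bin/sh
# Hypothetical helper: prints a query that reports the two durability
# settings discussed above. Only the variable names come from the post.
durability_query() {
    printf '%s' "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('sync_binlog', 'innodb_flush_log_at_trx_commit');"
}

# Usage (assumes the mysql client is installed and can authenticate):
#   mysql -e "$(durability_query)"
```

The output should list each variable alongside its current value, which gives you a baseline to compare against after any tuning.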
The post linked above regarding steal includes the following command:
for x in `seq 1 1 30`; do ps -eo state,pid,cmd | grep "^D"; echo "-"; sleep 2; done
I'd suggest running that, as it will list any processes that may be waiting on IO (processes in a D state), sampling every two seconds for about a minute and printing a "-" separator after each sample.
You may see something like jbd2, mysql, or a combination of them with other processes in this output.
jbd2 is a kernel process used to synchronize the filesystem journal to disk. If it's waiting on IO, the OS is having a hard time keeping up with journaling and MySQL is becoming write-bound. Example output showing jbd2 issues:
# for x in `seq 1 1 5`; do ps -eo state,pid,cmd | grep "^D"; echo "-"; sleep 5; done
-
D 2064 [jbd2/sda-8]
-
D 2064 [jbd2/sda-8]
-
D 2064 [jbd2/sda-8]
-
D 2064 [jbd2/sda-8]
-
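To line samples up with the latency spikes on a graph, it can help to print the time of each sample. Here is a minimal sketch of the same loop with a timestamp added. The function name sample_d_state and its two parameters (sample count and interval in seconds) are illustrative and not from the linked post. It assumes a Linux ps that supports the -eo state,pid,cmd format used above.

```shell
#!/bin/sh
# Illustrative variant of the loop above: sample_d_state COUNT INTERVAL
# Prints a timestamp, then any processes in uninterruptible sleep (state D),
# then a "-" separator, COUNT times, sleeping INTERVAL seconds between samples.
sample_d_state() {
    count=$1
    interval=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # Time of this sample, for matching against monitoring graphs
        date '+%Y-%m-%d %H:%M:%S'
        # Keep only lines whose state column starts with D;
        # "|| true" stops an empty result from being treated as a failure
        ps -eo state,pid,cmd | grep '^D' || true
        echo "-"
        i=$((i + 1))
        sleep "$interval"
    done
}

# Usage: 30 samples, 2 seconds apart, as in the original command
#   sample_d_state 30 2
```

Redirecting the output to a file during a period when spikes are likely would let you compare it with the graph afterwards.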
Here are some links for additional information about specific configuration options for performance tuning, though there are drawbacks to some specific changes.
I'd suggest looking into one or more of the suggestions in the ServerFault article to help troubleshoot and tune your database for better IO performance:
- IO Wait causing so much slowdown (EXT4 JDB2 at 99% IO ) During Mysql Commit
- Binary Logging Options and Variables: sync_binlog
- Binary Logging Options and Variables: innodb_flush_log_at_trx_commit
Before you make any changes, I highly recommend backing up your data. Our Backup Service is an option, though one of its limitations concerns highly transactional databases. With that in mind, it wouldn't hurt to use mysqldump to create a data dump as a backup of your database.
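As a rough sketch of what such a dump could look like, the function below writes all databases to a timestamped file. The name backup_all_databases and the file naming scheme are assumptions made for illustration. It also assumes that mysqldump can authenticate (for example via ~/.my.cnf) and that your tables are InnoDB, since --single-transaction only gives a consistent snapshot for transactional tables.

```shell
#!/bin/sh
# Illustrative helper: backup_all_databases OUTDIR
# Dumps every database into OUTDIR and prints the path of the dump file.
backup_all_databases() {
    outdir=$1
    outfile="$outdir/all-databases-$(date +%Y%m%d-%H%M%S).sql"
    # --single-transaction: consistent snapshot without locking InnoDB tables
    # --routines: include stored procedures and functions
    mysqldump --single-transaction --routines --all-databases > "$outfile" || return 1
    printf '%s\n' "$outfile"
}

# Usage (the target directory is an example path):
#   backup_all_databases /var/backups
```

It is worth checking that the resulting file is non-empty and, ideally, restoring it somewhere disposable before relying on it.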