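[b]Linux 2.6 on the Hosts[/b][/size]
6 of the 20 hosts are now running the 2.6 Linux kernel, with the CFQ fair-queueing disk scheduler.
Now that we have been running it on a few boxes for a while, I have a pretty good feel for how it performs. I've noticed that 2.6 performs better than 2.4 under some workloads and slightly worse under others (determined by comparing mrtg and vmstat output from before and after the move to 2.6). I'm optimistic that some additional gains can be had from a few of the VM tuning options (/proc/sys/vm/*).
Overall, I think 2.6 is "a good thing", and we'll eventually move the other hosts to 2.6.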
[b]Disk I/O Thrashing is no more![/b][/size]
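The main reason I wanted to move to 2.6 was its improved I/O performance over 2.4.
Linux is susceptible to what I call a "disk denial-of-service attack" when a large number of random read/write requests fill up the request queue. This causes latency problems for other requests and essentially bogs everything down.
This is exactly the workload that occurs when a Linode is continually thrashing its swap device (reading and writing rapidly), and when the host comes under pressure to write out those dirty pages (which, after a while, it always does). Unfortunately, the CFQ patch for 2.6 doesn't solve this problem. (Neither does the default anticipatory scheduler, nor the deadline scheduler.)
CFQ does help a little when many threads are doing random I/O (during the cron job party, for example), but it doesn't eliminate the possibility of one Linode wedging the entire host. Read on for the solution...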
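[b]UML I/O Request Token-Limiter patch[/b][/size]
I've implemented a simple Token Bucket Filter/Limiter around the async UBD driver inside UML. The token-bucket method is pretty neat. Here's how it works: every second, x tokens are added to the bucket. Each I/O request requires one token, so it has to wait until there are tokens in the bucket before it is allowed to perform its I/O.
This method allows a burstable, unlimited rate until the bucket is empty, and only then does throttling begin. Perfect!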
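As a rough illustration of the idea (not the code from the patch), here is a minimal token bucket in Python. The class name, the `try_io` method, and the numbers are all made up for the example. It also refills continuously in proportion to elapsed time rather than once per second, and it returns `False` where a real limiter would make the request wait.

```python
class TokenBucket:
    """Toy token bucket. Time is passed in explicitly so the example is deterministic."""

    def __init__(self, refill_per_sec, bucket_size):
        self.refill_per_sec = refill_per_sec  # tokens added per second (assumed value)
        self.bucket_size = bucket_size        # maximum tokens the bucket can hold
        self.tokens = bucket_size             # start full, so an idle guest can burst
        self.last = 0.0                       # time of the previous refill

    def _refill(self, now):
        # Add tokens for the time that has passed, capped at the bucket size.
        elapsed = now - self.last
        self.tokens = min(self.bucket_size, self.tokens + elapsed * self.refill_per_sec)
        self.last = now

    def try_io(self, now):
        """Spend one token and return True if an I/O request may run at time `now`."""
        self._refill(now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a real limiter would queue the request until a token arrives


bucket = TokenBucket(refill_per_sec=2, bucket_size=5)
burst = [bucket.try_io(now=0.0) for _ in range(7)]  # 7 back-to-back requests
later = bucket.try_io(now=1.0)                      # one more request a second later
```

The first five requests drain the full bucket and go through unthrottled; the next two are refused until the refill catches up, which is the burst-then-throttle behaviour described above.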
Links:
[url=http://www.theshore.net/~caker/patches/token-limiter-v1.patch]token-limiter-v1.patch[/url]
[url=http://www.theshore.net/~caker/patches/token-limiter-v1.README]token-limiter-v1.README[/url]
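[b][color=darkred]With this patch, a single Linode can no longer wedge the host![/color][/b]
This is a big deal, because when it did happen, the only way to correct the situation was for me to intervene and stop the offending Linode.
The limiter patch is in the 2.4.25-linode24-1um kernel (2.6 will follow shortly).
The defaults are set high, and I doubt any of you will be affected under normal use. I can change the refill rate and bucket size at runtime, so I'll be able to design a monitor for each host that changes the profile dynamically according to host load. This is a big deal! 🙂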
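A per-host monitor like that could be as simple as mapping host load to a limiter profile. The sketch below is purely hypothetical: the profile names, the token numbers, and the load thresholds are invented for illustration, and it leaves out how the values would be pushed into a running UML, since that interface is not described here.

```python
# Hypothetical profiles: (refill tokens per second, bucket size). All numbers are invented.
PROFILES = {
    "relaxed": (512, 4096),
    "normal": (256, 2048),
    "strict": (64, 512),
}


def pick_profile(load_avg, cpus):
    """Map the host's load average per CPU to a limiter profile name."""
    per_cpu = load_avg / cpus
    if per_cpu < 1.0:
        return "relaxed"   # host is mostly idle, let guests burst freely
    if per_cpu < 3.0:
        return "normal"
    return "strict"        # host is struggling, clamp everyone down


refill, bucket_size = PROFILES[pick_profile(load_avg=4.0, cpus=2)]
```

On a Unix host, the load figure could come from `os.getloadavg()`, with the monitor re-evaluating the profile every few seconds.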
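[b]Linux 2.6 for the Linodes[/b][/size]
I haven't officially announced the 2.6-um kernel yet. There are still some bugs and performance issues to work out. I don't recommend running the 2.6-um kernel in production yet; however, some adventurous users have already been testing it and have reported a few of the quirks involved in running it under each distribution. I'll try to write up a guide to migrating to 2.6 and publish it once the kernel is more stable.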
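[b]What's new in the UML world?[/b][/size] [/size]
We're long overdue for a new UML patch. My guess is that we'll have a new UML release (for both 2.4 and 2.6) within the next two weeks or so.
Besides the usual bug fixes, I know Jeff has been working on AIO support for the I/O drivers inside UML. AIO is a new feature implemented in 2.6 (on the host). Some of the benefits are:
[list][*] The ability to submit multiple I/O requests with a single system call.
[*] The ability to submit an I/O request without waiting for its completion, and to overlap the request with other processing.
[*] Optimization of disk activity by the kernel, through merging or reordering the individual requests of a batched I/O.
[*] Better CPU utilization and system throughput, by eliminating extra threads and reducing context switches.
[/list]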
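To make the "merging or reordering" point concrete, here is a small Python sketch of what coalescing a batch of block requests can look like. It is only an illustration of the general technique, not kernel or UML code: requests are modelled as hypothetical `(start_sector, sector_count)` tuples, and the difference between reads and writes is ignored.

```python
def merge_requests(requests):
    """Sort (start_sector, sector_count) requests and coalesce contiguous or overlapping ones."""
    merged = []
    for start, count in sorted(requests):  # reordering: ascending by start sector
        if merged and start <= merged[-1][0] + merged[-1][1]:
            # This request touches or overlaps the previous one: extend it instead.
            prev_start, prev_count = merged[-1]
            end = max(prev_start + prev_count, start + count)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, count))
    return merged


batch = [(100, 8), (0, 8), (108, 8), (8, 8), (500, 8)]
merged = merge_requests(batch)
```

Five scattered requests become three larger, sorted ones, which means fewer seeks for the disk to perform.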
More information on AIO:
http://lse.sourceforge.net/io/aio.html
http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf
-
That's all!
-Chris
Comments (12)
This may be naive, but wouldn’t it help tremendously to have all the swap partitions for a given linode on a different drive?
[quote:9a75d3e3be="diN0"]This may be naive, but wouldn’t it help tremendously to have all the swap partitions for a given linode on a different drive?[/quote]
It might, but that’s not the point, really. Before this patch, a single UML could consume all of the I/O (say, for a given device, like you suggested). It would still cause the same problem when other Linodes tried to access the device. The same effect can be had with “swap files” that exist on your filesystem (rather than actual ubd images) or heavy I/O on any filesystem.
With this patch, I am able to guarantee a minimum level of service. Previously that wasn’t possible.
-Chris
Great work chris, I genuinely can’t think of anything else you can improve upon! 😉
Chris,
I tried the 2.6 kernel of Redhat 9 (large) a few days ago. It failed to boot & I had to switch back to 2.4.
Another forum thread had the same problem.
dev/ubd/disc0: unknown partition table
/dev/ubd/disc1: unknown partition table
I am really excited about this. As you know I have been one of the most vocal proponents of some system of throttling disk I/O so that an overzealous Linode cannot DOS the host.
It sounds like this solution will require everyone to upgrade to a 2.6 kernel, which means that it cannot be applied until everyone is ready to go to 2.6 (and it will only be effective when *everyone* has upgraded to this fixed kernel). So I guess the solution is months away. But at least there is a plan in the works to solve this problem for good.
Great job man! Keep up the good work!
Just curious – why not solve this problem in the host kernel instead? Can the host kernel be patched to limit any one of its processes using the I/O token system that you have devised? Then the Linode themselves can run any kernel they want to and the host system will prevent any one from thrashing the disk.
Ideally this would be some kind of rlimit option, so that it could be applied just to the Linode processes themselves and not to the other processes of the host system.
I don’t know if the I/O layer that’s deeper in the kernel than the UML ubd driver is harder to work with though … perhaps it would be too complex to modify the fundamental Linux I/O code than it is to modify the ubd driver?
caker, thanks for all the hard work you’ve put in to keep the linode hosts in top shape.
It’s rather surprising that CFQ didn’t solve the I/O scheduling problem, though. The algorithm is supposed to be [i]completely fair[/i] towards each thread requesting I/O. 😛
[quote:52760ef410="Quik"]Great work chris, I genuinely can’t think of anything else you can improve upon! :wink:[/quote]
Thanks Quik 🙂
[quote:52760ef410="gmt"]Chris,
I tried the 2.6 kernel of Redhat 9 (large) a few days ago. It failed to boot & I had to switch back to 2.4.
Another forum thread had the same problem.
dev/ubd/disc0: unknown partition table
/dev/ubd/disc1: unknown partition table[/quote]
You can always ignore this warning message — it’s just telling you that the ubd devices are not partitioned. You’re using the entire block device as one giant ‘partition’.
To get 2.6 to work under RedHat, first rename /lib/tls to something else (since 2.6-um and NPTL don’t mix yet).
-Chris
[quote:2eaacf3890="bji"]I am really excited about this. As you know I have been one of the most vocal proponents of some system of throttling disk I/O so that an overzealous Linode cannot DOS the host.
It sounds like this solution will require everyone to upgrade to a 2.6 kernel, which means that it cannot be applied until everyone is ready to go to 2.6[/quote]
Not sure where you read that from my post. I’ve already patched the 2.4.25-linode24-1um kernel with the token-limiter patch, and 2.6-um to follow shortly.
[quote:2eaacf3890="bji"](and it will only be effective when *everyone* has upgraded to this fixed kernel). So I guess the solution is months away. But at least there is a plan in the works to solve this problem for good.[/quote]
Most/all of the repeat offenders have already been rebooted into the "linode24" kernel (with the limiter patch). So the solution is in effect right now. But you are correct, there are still many Linodes running un-limited.
[quote:2eaacf3890="bji"]Great job man! Keep up the good work![/quote]
Thanks!
-Chris
[quote:f066e66db0="bji"]Just curious – why not solve this problem in the host kernel instead? Can the host kernel be patched to limit any one of its processes using the I/O token system that you have devised? Then the Linode themselves can run any kernel they want to and the host system will prevent any one from thrashing the disk.
Ideally this would be some kind of rlimit option, so that it could be applied just to the Linode processes themselves and not to the other processes of the host system.
I don’t know if the I/O layer that’s deeper in the kernel than the UML ubd driver is harder to work with though … perhaps it would be too complex to modify the fundamental Linux I/O code than it is to modify the ubd driver?[/quote]
I agree — the correct solution is to get Linux fixed, or perhaps to get UML to use the host more efficiently. Some of the UML I/O rework is already under way (the AIO stuff), but that kind of thing *is* months away…
One interesting “feature” of the CFQ scheduler is an ionice priority level. But, I wasn’t able to get the syscalls working to test it.
-Chris
[quote:01c9cda963="griffinn"]caker, thanks for all the hard work you’ve put in to keep the linode hosts in top shape.
It’s rather surprising that CFQ didn’t solve the I/O scheduling problem, though. The algorithm is supposed to be [i]completely fair[/i] towards each thread requesting I/O. :P[/quote]
I’m not sure where the bottleneck is — but as far as I can tell, CFQ and the standard scheduler in 2.4 appear equally (non)responsive in the worst-case scenario. Go figure…
One interesting thing is that UML uses the no-op elevator. Jeff and I got into a discussion about this, and he says there’s no point to UML doing any request merging, but I disagree. I’d rather have UML do some of its own request merging and reordering than force the host to do it all. Plus, it makes UML appear to the host as more of a streaming type load than a random load…
Think back to the last set of tiobench benchmark results you’ve seen — look how poorly the random-i/o results are compared to “streaming-read” and “streaming-write”…
So .. another hack to the UML code (one-liner) to test…
-Chris
Thanks, Caker. I have a tiny linode and I make almost no demands on the system, so far at least. However, fairness is part of what you sell. It sounds like the leaky bucket in the UM kernel solves most of the problem with a minimum of effort. I’ve been implementing fairness algorithms for at least 30 years, so I have a few theoretical observations and questions:
You appear to be issuing tokens independently to each process at an absolute rate, independent of the actual resource availability. This means that a UML may get limited even if nobody else wants the resource, yes? It might be better for the host kernel to issue tokens at an overall rate to the UMLs. That way a particular UML can use the whole resource if nobody else wants it. Since everybody’s buckets are full, the instant anyone else wants to use the resource the original user is instantly throttled to 50% as the tokens are returned equally to the two users, and so on as more users are added. That is, the main kernel returns tokens to each UML with a non-full bucket equally, but does not add tokens to a bucket that is already full. The host kernel should dynamically adjust its token generation rate to just keep the resource occupied. I’ve successfully done this in the past by watching the resource: if the resource goes idle when there are any empty buckets, slightly increase the token rate. If the resource never goes idle, slightly decrease the token rate.
Next issue: Do you “oversubscribe” the host memory? That is, does the sum of the UML memory sizes exceed the size of the host’s real application space? If so, the host swapspace is used, causing disk activity at this level. This is independent of the swap activity within each UML as the user exceeds its “real” space and begins to use its swap partition. I’m guessing that host-level swapping does not count against any UML’s bucket, but that UML-level swapping does. This would be the fair way to do this. However, host-level swapping will reduce the overall amount of IO resource that is available to the users. The algorithm above will account for this.
Next issue: Do we have fairness issues with network bandwidth? Do you intend to add a token system to address this?
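For concreteness, the shared-rate scheme described above might be sketched roughly like this in Python. This is only one reading of the description, not anything that exists in the host kernel or UML: the function names, the integer token counts, and the 5% adjustment step are all assumptions.

```python
def distribute(new_tokens, buckets, capacity):
    """Share new_tokens equally among buckets that are not full; full buckets get nothing."""
    remaining = new_tokens
    while remaining > 0:
        open_names = [name for name, level in buckets.items() if level < capacity]
        if not open_names:
            break  # every bucket is full, so surplus tokens are simply dropped
        share = max(1, remaining // len(open_names))
        for name in open_names:
            give = min(share, capacity - buckets[name], remaining)
            buckets[name] += give
            remaining -= give
            if remaining == 0:
                break
    return buckets


def adjust_rate(rate, resource_idle, any_bucket_empty, step=0.05):
    """Nudge the global token rate so the disk stays just barely busy."""
    if resource_idle and any_bucket_empty:
        return rate * (1 + step)  # someone is starved while the disk sits idle
    if not resource_idle:
        return rate * (1 - step)  # disk never idles, back off slightly
    return rate


levels = distribute(8, {"a": 0, "b": 0, "c": 10}, capacity=10)
```

Here the idle guest `c` already has a full bucket, so the eight new tokens are split evenly between the two active guests.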
Again: I’m a happy camper. These are purely theoretical questions for me.