[b]Linux 2.6 on the hosts[/b]
Six of the 20 host servers are now running the 2.6 version of the Linux kernel, with the CFQ fair-queuing disk scheduler.
Now that it has been running on a few boxes for a while, I have a good feel for its performance. I've noticed that, compared to 2.4, 2.6 performs better under some workloads and slightly worse under others (determined by comparing mrtg and vmstat output from before and after the switch to 2.6). I'm optimistic that some of the VM tuning options (/proc/sys/vm/*) will yield some additional gains.
Overall, I think 2.6 is "a good thing," and we'll eventually move the rest of the hosts over to it.
[b]Disk I/O thrashing, begone![/b]
The main reason I wanted to move to 2.6 was its improved I/O performance over 2.4.
Linux is susceptible to what amounts to a "hard disk denial-of-service attack" when a flood of random read/write requests fills the request queue. This causes latency for other requests and essentially brings the system to a standstill.
That is exactly the workload generated when a Linode is thrashing its swap device (rapid reads and writes) while the host is under pressure to write out those dirty pages (which it always is, after a while). Unfortunately, the CFQ patch for 2.6 doesn't fix this. (Neither does the default anticipatory scheduler, nor the deadline scheduler.)
CFQ does help somewhat when many threads are performing random I/O (such as during the nightly cron jobs), but it doesn't eliminate the possibility of one Linode wedging the entire host. Read on for the solution...
[b]UML I/O request token-limiter patch[/b]
I've implemented a simple token-bucket filter/limiter around the asynchronous UBD driver inside UML. The token-bucket method is pretty simple. Here's how it works: every second, x tokens are added to the bucket. Each I/O request requires a token, so it has to wait until the bucket has some tokens before it is allowed to perform the I/O.
This method allows bursting at unrestricted speed until the bucket runs empty, at which point throttling kicks in. Perfect!
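The token-bucket scheme described above is easy to sketch in userspace (a minimal illustration of the idea only, not the actual UBD driver patch; the class and parameter names here are invented):

```python
import time

class TokenBucket:
    """Token-bucket limiter: `refill_rate` tokens accrue per second, capped
    at `bucket_size`; each I/O request consumes one token."""

    def __init__(self, refill_rate, bucket_size):
        self.refill_rate = refill_rate
        self.bucket_size = bucket_size
        self.tokens = float(bucket_size)      # start full: allow an initial burst
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.bucket_size,
                          self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_acquire(self):
        """Take one token if available; True means the request may proceed."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False    # caller must wait and retry (the driver would block)
```

With `refill_rate=1` and `bucket_size=5`, the first five requests burst straight through and the sixth is throttled until a token accrues, which is exactly the burst-then-throttle behavior described above.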
Links:
[url=http://www.theshore.net/~caker/patches/token-limiter-v1.patch]token-limiter-v1.patch[/url]
[url=http://www.theshore.net/~caker/patches/token-limiter-v1.README]token-limiter-v1.README[/url]
[b][color=darkred]With this patch, a single Linode can no longer wedge the host![/color][/b]
This is a big deal, because previously the only remedy when this happened was for me to intervene and stop the offending Linode.
The limiter patch is in the 2.4.25-linode24-1um kernel (a 2.6 version is coming soon).
The default values are set high, so under normal use I doubt any of you will ever be affected by it. I can change the refill and bucket-size values at runtime, so I'll be able to build a per-host monitor that dynamically changes the profiles based on host load. Good stuff! 🙂
[b]Linux 2.6 for the Linodes[/b]
I haven't formally announced the 2.6-um kernel yet. There are still a few bugs and performance issues to resolve. I don't recommend running 2.6-um in production yet, but some adventurous users have been testing it and have reported various issues with getting it running under their distributions. I'll try to write up a migration guide to 2.6 and release it once the kernel is more stable.
[b]What's new in the UML world?[/b]
We're long overdue for a new UML patch release. My guess is that we'll see new UML releases (for both 2.4 and 2.6) within the next two weeks or so.
Besides the usual bug fixes, I know Jeff has been working on AIO support for the I/O driver inside UML. AIO is a new feature implemented in 2.6 (on the host). Some of its benefits:
[list][*] The ability to submit multiple I/O requests with a single system call.
[*] The ability to submit an I/O request without waiting for it to complete, overlapping the request with other processing.
[*] Optimization of disk activity by the kernel through combining or reordering the individual requests of a batch of I/O.
[*] Better CPU utilization and system throughput by eliminating extra threads and reducing context switches.
[/list]
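Linux AIO itself is a C interface (io_setup/io_submit/io_getevents), but the submit-then-overlap pattern it enables can be sketched with an ordinary thread pool (purely illustrative; `aio_style_reads` is an invented name, not part of any AIO API):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def aio_style_reads(path, offsets, length):
    """Submit several reads at once, optionally do other work, then
    collect the completions -- the shape of the AIO workflow."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=4) as pool:
            # "Submit" every request without waiting for any to complete.
            futures = [pool.submit(os.pread, fd, length, off)
                       for off in offsets]
            # ...the caller could overlap other processing here...
            return [f.result() for f in futures]   # reap the completions
    finally:
        os.close(fd)
```

Real AIO goes further than this sketch: all the requests go down in one syscall and no extra threads are needed, which is where the context-switch savings in the list above come from.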
More information on AIO:
http://lse.sourceforge.net/io/aio.html
http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf
-
That's it for now!
-Chris
Comments (12)
This may be naive, but wouldn’t it help tremendously to have all the swap partitions for a given linode on a different drive?
[quote:9a75d3e3be=”diN0″]This may be naive, but wouldn’t it help tremendously to have all the swap partitions for a given linode on a different drive?[/quote]
It might, but that’s not the point, really. Before this patch, a single UML could consume all of the I/O (say, for a given device, like you suggested). It would still cause the same problem when other Linodes tried to access the device. The same effect can be had with “swap files” that exist on your filesystem (rather than actual ubd images) or heavy I/O on any filesystem.
With this patch, I am able to guarantee a minimum level of service. Previously that wasn’t possible.
-Chris
Great work chris, I genuinely can’t think of anything else you can improve upon! 😉
Chris,
I tried the 2.6 kernel of Redhat 9 (large) a few days ago. It failed to boot & I had to switch back to 2.4.
Another forum thread had the same problem.
dev/ubd/disc0: unknown partition table
/dev/ubd/disc1: unknown partition table
I am really excited about this. As you know I have been one of the most vocal proponents of some system of throttling disk I/O so that an overzealous Linode cannot DOS the host.
It sounds like this solution will require everyone to upgrade to a 2.6 kernel, which means that it cannot be applied until everyone is ready to go to 2.6 (and it will only be effective when *everyone* has upgraded to this fixed kernel). So I guess the solution is months away. But at least there is a plan in the works to solve this problem for good.
Great job man! Keep up the good work!
Just curious – why not solve this problem in the host kernel instead? Can the host kernel be patched to limit any one of its processes using the I/O token system that you have devised? Then the Linode themselves can run any kernel they want to and the host system will prevent any one from thrashing the disk.
Ideally this would be some kind of rlimit option, so that it could be applied just to the Linode processes themselves and not to the other processes of the host system.
I don’t know if the I/O layer that’s deeper in the kernel than the UML ubd driver is harder to work with though … perhaps it would be too complex to modify the fundamental Linux I/O code than it is to modify the ubd driver?
caker, thanks for all the hard work you’ve put in to keep the linode hosts in top shape.
It’s rather surprising that CFQ didn’t solve the I/O scheduling problem, though. The algorithm is supposed to be [i]completely fair[/i] towards each thread requesting I/O. 😛
[quote:52760ef410=”Quik”]Great work chris, I genuinely can’t think of anything else you can improve upon! :wink:[/quote]
Thanks Quik 🙂
[quote:52760ef410=”gmt”]Chris,
I tried the 2.6 kernel of Redhat 9 (large) a few days ago. It failed to boot & I had to switch back to 2.4.
Another forum thread had the same problem.
dev/ubd/disc0: unknown partition table
/dev/ubd/disc1: unknown partition table[/quote]
You can always ignore this warning message — it’s just telling you that the ubd devices are not partitioned. You’re using the entire block device as one giant ‘partition’.
To get 2.6 to work under RedHat, first rename /lib/tls to something else (since 2.6-um and NPTL don’t mix yet).
-Chris
[quote:2eaacf3890=”bji”]I am really excited about this. As you know I have been one of the most vocal proponents of some system of throttling disk I/O so that an overzealous Linode cannot DOS the host.
It sounds like this solution will require everyone to upgrade to a 2.6 kernel, which means that it cannot be applied until everyone is ready to go to 2.6[/quote]
Not sure where you read that from my post. I’ve already patched the 2.4.25-linode24-1um kernel with the token-limiter patch, and 2.6-um to follow shortly.
[quote:2eaacf3890=”bji”](and it will only be effective when *everyone* has upgraded to this fixed kernel). So I guess the solution is months away. But at least there is a plan in the works to solve this problem for good.[/quote]
Most/all of the repeat offenders have already been rebooted into the "linode24" kernel (with the limiter patch). So the solution is in effect right now. But, you are correct — there are still many Linodes running un-limited.
[quote:2eaacf3890=”bji”]Great job man! Keep up the good work![/quote]
Thanks!
-Chris
[quote:f066e66db0=”bji”]Just curious – why not solve this problem in the host kernel instead? Can the host kernel be patched to limit any one of its processes using the I/O token system that you have devised? Then the Linode themselves can run any kernel they want to and the host system will prevent any one from thrashing the disk.
Ideally this would be some kind of rlimit option, so that it could be applied just to the Linode processes themselves and not to the other processes of the host system.
I don’t know if the I/O layer that’s deeper in the kernel than the UML ubd driver is harder to work with though … perhaps it would be too complex to modify the fundamental Linux I/O code than it is to modify the ubd driver?[/quote]
I agree — the correct solution is to get Linux fixed, or perhaps to get UML to use the host more efficiently. Some of the UML I/O rework is already under way (the AIO stuff), but that kind of thing *is* months away…
One interesting “feature” of the CFQ scheduler is an ionice priority level. But, I wasn’t able to get the syscalls working to test it.
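For the curious, ioprio_set had no glibc wrapper at the time, so it has to be invoked as a raw syscall, which may be why the calls were hard to get working. A hedged sketch (syscall number 251 is x86-64 only, and the constants should be verified against the kernel's ioprio headers):

```python
import ctypes

# Constants as found in the kernel's I/O priority code (verify per kernel).
IOPRIO_CLASS_BE = 2          # best-effort scheduling class
IOPRIO_CLASS_SHIFT = 13
IOPRIO_WHO_PROCESS = 1
SYS_ioprio_set = 251         # x86-64; other architectures use other numbers

def ioprio_value(ioprio_class, level):
    """Pack a class and priority level the way the kernel expects."""
    return (ioprio_class << IOPRIO_CLASS_SHIFT) | level

def set_io_priority(pid, level):
    """Attempt to set a process's I/O priority via the raw syscall."""
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, pid,
                       ioprio_value(IOPRIO_CLASS_BE, level))
    if ret == -1:
        raise OSError(ctypes.get_errno(), "ioprio_set failed")
```

The priority only has an effect when the CFQ scheduler is active on the device, which matches the "feature of the CFQ scheduler" framing above.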
-Chris
[quote:01c9cda963=”griffinn”]caker, thanks for all the hard work you’ve put in to keep the linode hosts in top shape.
It’s rather surprising that CFQ didn’t solve the I/O scheduling problem, though. The algorithm is supposed to be [i]completely fair[/i] towards each thread requesting I/O. :P[/quote]
I’m not sure where the bottleneck is — but as far as I can tell, CFQ and the standard scheduler in 2.4 appear equally (non)responsive in the worst-case scenario. Go figure…
One interesting thing is that UML uses the no-op elevator. Jeff and I got into a discussion about this, and he says there’s no point to UML doing any request merging, but I disagree. I’d rather have UML do some of its own request merging and reordering than force the host to do it all. Plus, it makes UML appear to the host as more of a streaming type load than a random load…
Think back to the last set of tiobench benchmark results you’ve seen — look how poorly the random-i/o results are compared to “streaming-read” and “streaming-write”…
So .. another hack to the UML code (one-liner) to test…
-Chris
Thanks, Caker. I have a tiny linode and I make almost no demands on the system, so far at least. However, fairness is part of what you sell. It sounds like the leaky bucket in the UM kernel solves most of the problem with a minimum of effort. I’ve been implementing fairness algorithms for at least 30 years, so I have a few theoretical observations and questions:
You appear to be issuing tokens independently to each process at an absolute rate, independent of the actual resource availability. This means that a UML may get limited even if nobody else wants the resource, yes? It might be better for the host kernel to issue tokens at an over-all rate to the UMLs. That way a particular UML can use the whole resource if nobody else wants it. Since everybody’s buckets are full, the instant anyone else wants to use the resource the original user is instantly throttled to 50% as the tokens are returned equally to the two users, and so on as more users are added. That is, the main kernel returns tokens to each UML with a non-full bucket equally, but does not add tokens to a bucket that is already full. The host kernel should dynamically adjust its token generation rate to just keep the resource occupied. I’ve successfully done this in the past by watching the resource: if the resource goes idle when there are any empty buckets, slightly increase the token rate. If the resource never goes idle, slightly decrease the token rate.
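That adaptive shared-rate scheme can be written down as a toy simulation (purely illustrative, not the limiter patch; all names and the 5% adjustment step are invented):

```python
def step(buckets, bucket_size, rate, disk_went_idle):
    """One tick of the proposed scheme: share `rate` tokens equally among
    the non-full buckets, then nudge the global rate based on whether the
    disk went idle while anyone was still short of tokens."""
    hungry = [b for b in buckets if buckets[b] < bucket_size]
    if hungry:
        share = rate / len(hungry)
        for b in hungry:
            buckets[b] = min(bucket_size, buckets[b] + share)
    # Feedback loop: idle disk alongside empty buckets means tokens are
    # too scarce; a disk that never idles means they are too plentiful.
    if disk_went_idle and hungry:
        rate *= 1.05
    elif not disk_went_idle:
        rate *= 0.95
    return rate
```

Note how a lone active UML ends up with all the tokens (its bucket is the only non-full one), while a second contender immediately starts receiving half of each tick's issue, which is the 50% throttling behavior described above.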
Next issue: Do you “oversubscribe” the host memory? That is, does the sum of the UML memory sizes exceed the size of the host’s real application space? If so, the host swapspace is used, causing disk activity at this level. This is independent of the swap activity within each UML as the user exceeds its “real” space and begins to use its swap partition. I’m guessing that host-level swapping does not count against any UML’s bucket, but that UML-level swapping does. This would be the fair way to do it. However, host-level swapping will reduce the overall amount of I/O resource that is available to the users. The algorithm above will account for this.
Next issue: Do we have fairness issues with network bandwidth? Do you intend to add a token system to address this?
Again: I’m a happy camper. These are purely theoretical questions for me.