
Linode.com Status Update 04/06/04


[b]Linux 2.6 on the Hosts[/b]

Six of the twenty host servers are running version 2.6 of the Linux kernel, using CFQ (the completely fair queueing method) as the disk scheduler.

Compared to 2.4, we've noticed that 2.6 performs better under some workloads and slightly worse under others (judging by before/after comparisons of the mrtg and vmstat output). I'm hopeful that the VM tuning options (/proc/sys/vm/*) will buy us some more.

Overall, I think 2.6 is a Good Thing, and the other hosts will be moved to 2.6 eventually.

[b]No more disk I/O thrashing![/b]

The main reason I wanted to move to 2.6 was its improved I/O performance over 2.4.

When the request queue fills up with a flood of random read/write requests, Linux suffers what I'd call a "denial of service attack on the hard drive": latency on everyone else's requests goes through the roof, and the machine essentially grinds to a halt.

This is exactly the workload generated when one Linode continuously thrashes its swap device (rapid reads and writes) while the host is under pressure to write out those dirty pages (which, given enough time, it inevitably is). Unfortunately, the CFQ patch for 2.6 didn't fix this problem. (Neither do the default anticipatory or deadline schedulers.)

CFQ helps a little when many threads are doing random I/O (a cron-job party, say), but it doesn't eliminate the possibility of a single Linode wedging the entire host. Read on for the solution…

[b]UML I/O Request Token-Limiter patch[/b]

I've implemented a simple token-bucket filter/limiter around the asynchronous UBD driver inside UML. The token-bucket method is pretty neat. Here's how it works: each second, x tokens are added to the bucket. Every I/O request requires a token, so it has to wait until there's a token in the bucket before it's allowed to perform the I/O.

This allows a burstable/unlimited rate until the bucket runs empty, and then the throttling kicks in. Perfect!

Links:
[url=http://www.theshore.net/~caker/patches/token-limiter-v1.patch]token-limiter-v1.patch[/url]
[url=http://www.theshore.net/~caker/patches/token-limiter-v1.README]token-limiter-v1.README[/url]

[b][color=darkred]With this patch, it is no longer possible for a single Linode to wedge the host![/color][/b]

This is a big deal. Previously, the only way to fix the problem once it hit was for me to step in and shut down the offending Linode.

The limiter patch is in the 2.4.25-linode24-1um kernel (2.6 to follow shortly).

The defaults are set very high, so I doubt anyone will notice the limiter under normal use. Since the refill and bucket-size values can be changed at runtime, I envision designing a per-host monitor that dynamically changes each host's profile based on its load. This is huge! 🙂

[b]Linux 2.6 for the Linodes[/b]

I haven't officially announced the 2.6-um kernel yet; a few bugs and performance issues remain. I still don't recommend running 2.6-um in production, but a few adventurous users have been testing it and reporting the problems involved in getting it running under each distro. I plan to put together a guide for migrating to 2.6 and release it once the kernel is more stable.

[b]New in the UML world[/b]

A new UML patch release is long overdue. I expect new UML releases (for both 2.4 and 2.6) within the next two weeks or so.

In addition to the usual bug fixes, I know Jeff has been working on AIO support for the I/O driver inside UML. AIO is a new feature implemented in 2.6 (on the host). Among its benefits:
[list]
[*]The ability to submit multiple I/O requests with a single system call.
[*]The ability to submit an I/O request without waiting for it to complete, overlapping it with other processing.
[*]Optimization of disk activity by the kernel, by combining or reordering the individual requests of a batched I/O.
[*]Better CPU utilization and system throughput, by eliminating extra threads and reducing context switches.
[/list]
More on AIO here:
http://lse.sourceforge.net/io/aio.html
http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf

That's all for now.
Chris


Comments (12)

  1.

    This may be naive, but wouldn’t it help tremendously to have all the swap partitions for a given linode on a different drive?

  2.

    [quote:9a75d3e3be=”diN0″]This may be naive, but wouldn’t it help tremendously to have all the swap partitions for a given linode on a different drive?[/quote]
    It might, but that’s not the point, really. Before this patch, a single UML could consume all of the I/O (say, for a given device, like you suggested). It would still cause the same problem when other Linodes tried to access the device. The same effect can be had with “swap files” that exist on your filesystem (rather than actual ubd images) or heavy I/O on any filesystem.

    With this patch, I am able to guarantee a minimum level of service. Previously that wasn’t possible.

    -Chris

  3.

    Great work chris, I genuinely can’t think of anything else you can improve upon! 😉

  4.

    Chris,

    I tried the 2.6 kernel of Redhat 9 (large) a few days ago. It failed to boot & I had to switch back to 2.4.

    Another forum thread had the same problem.
    /dev/ubd/disc0: unknown partition table
    /dev/ubd/disc1: unknown partition table

  5.

    I am really excited about this. As you know I have been one of the most vocal proponents of some system of throttling disk I/O so that an overzealous Linode cannot DOS the host.

    It sounds like this solution will require everyone to upgrade to a 2.6 kernel, which means that it cannot be applied until everyone is ready to go to 2.6 (and it will only be effective when *everyone* has upgraded to this fixed kernel). So I guess the solution is months away. But at least there is a plan in the works to solve this problem for good.

    Great job man! Keep up the good work!

  6.

    Just curious – why not solve this problem in the host kernel instead? Can the host kernel be patched to limit any one of its processes using the I/O token system that you have devised? Then the Linodes themselves can run any kernel they want and the host system will prevent any one of them from thrashing the disk.

    Ideally this would be some kind of rlimit option, so that it could be applied just to the Linode processes themselves and not to the other processes of the host system.

    I don’t know if the I/O layer that’s deeper in the kernel than the UML ubd driver is harder to work with, though … perhaps modifying the fundamental Linux I/O code would be more complex than modifying the ubd driver?

  7.

    caker, thanks for all the hard work you’ve put in to keep the linode hosts in top shape.

    It’s rather surprising that CFQ didn’t solve the I/O scheduling problem, though. The algorithm is supposed to be [i]completely fair[/i] towards each thread requesting I/O. 😛

  8.

    [quote:52760ef410=”Quik”]Great work chris, I genuinely can’t think of anything else you can improve upon! :wink:[/quote]
    Thanks Quik 🙂

    [quote:52760ef410=”gmt”]Chris,

    I tried the 2.6 kernel of Redhat 9 (large) a few days ago. It failed to boot & I had to switch back to 2.4.

    Another forum thread had the same problem.
    /dev/ubd/disc0: unknown partition table
    /dev/ubd/disc1: unknown partition table[/quote]
    You can always ignore this warning message — it’s just telling you that the ubd devices are not partitioned. You’re using the entire block device as one giant ‘partition’.

    To get 2.6 to work under RedHat, first rename /lib/tls to something else (since 2.6-um and NPTL don’t mix yet).

    -Chris

  9.

    [quote:2eaacf3890=”bji”]I am really excited about this. As you know I have been one of the most vocal proponents of some system of throttling disk I/O so that an overzealous Linode cannot DOS the host.

    It sounds like this solution will require everyone to upgrade to a 2.6 kernel, which means that it cannot be applied until everyone is ready to go to 2.6[/quote]
    Not sure where you read that from my post. I’ve already patched the 2.4.25-linode24-1um kernel with the token-limiter patch, and 2.6-um to follow shortly.

    [quote:2eaacf3890=”bji”](and it will only be effective when *everyone* has upgraded to this fixed kernel). So I guess the solution is months away. But at least there is a plan in the works to solve this problem for good.[/quote]
    Most/all of the repeat offenders have already been rebooted into the “linode24” kernel (with the limiter patch). So the solution is in effect right now. But, you are correct — there are still many Linodes running un-limited.

    [quote:2eaacf3890=”bji”]Great job man! Keep up the good work![/quote]
    Thanks!

    -Chris

  10.

    [quote:f066e66db0=”bji”]Just curious – why not solve this problem in the host kernel instead? Can the host kernel be patched to limit any one of its processes using the I/O token system that you have devised? Then the Linodes themselves can run any kernel they want and the host system will prevent any one of them from thrashing the disk.

    Ideally this would be some kind of rlimit option, so that it could be applied just to the Linode processes themselves and not to the other processes of the host system.

    I don’t know if the I/O layer that’s deeper in the kernel than the UML ubd driver is harder to work with, though … perhaps modifying the fundamental Linux I/O code would be more complex than modifying the ubd driver?[/quote]
    I agree — the correct solution is to get Linux fixed, or perhaps to get UML to use the host more efficiently. Some of the UML I/O rework is already under way (the AIO stuff), but that kind of thing *is* months away…

    One interesting “feature” of the CFQ scheduler is an ionice priority level. But, I wasn’t able to get the syscalls working to test it.

    -Chris

  11.

    [quote:01c9cda963=”griffinn”]caker, thanks for all the hard work you’ve put in to keep the linode hosts in top shape.

    It’s rather surprising that CFQ didn’t solve the I/O scheduling problem, though. The algorithm is supposed to be [i]completely fair[/i] towards each thread requesting I/O. :P[/quote]
    I’m not sure where the bottleneck is — but as far as I can tell, CFQ and the standard scheduler in 2.4 appear equally (non)responsive in the worst-case scenario. Go figure…

    One interesting thing is that UML uses the no-op elevator. Jeff and I got into a discussion about this, and he says there’s no point in UML doing any request merging, but I disagree. I’d rather have UML do some of its own request merging and reordering than force the host to do it all. Plus, it makes UML appear to the host as more of a streaming-type load than a random load…

    Think back to the last set of tiobench benchmark results you’ve seen — look at how poorly the random-i/o results compare to “streaming-read” and “streaming-write”…

    So .. another hack to the UML code (one-liner) to test…

    -Chris

  12.

    Thanks, Caker. I have a tiny linode and I make almost no demands on the system, so far at least. However, fairness is part of what you sell. It sounds like the leaky bucket in the UM kernel solves most of the problem with a minimum of effort. I’ve been implementing fairness algorithms for at least 30 years, so I have a few theoretical observations and questions:

    You appear to be issuing tokens independently to each process at an absolute rate, independent of the actual resource availability. This means that a UML may get limited even if nobody else wants the resource, yes? It might be better for the host kernel to issue tokens at an overall rate to the UMLs. That way a particular UML can use the whole resource if nobody else wants it. Since everybody’s buckets are full, the instant anyone else wants to use the resource, the original user is instantly throttled to 50% as the tokens are returned equally to the two users, and so on as more users are added. That is, the main kernel returns tokens equally to each UML with a non-full bucket, but does not add tokens to a bucket that is already full. The host kernel should dynamically adjust its token generation rate to just keep the resource occupied. I’ve successfully done this in the past by watching the resource: if the resource goes idle when there are any empty buckets, slightly increase the token rate. If the resource never goes idle, slightly decrease the token rate.

    Next issue: Do you “oversubscribe” the host memory? That is, does the sum of the UML memory sizes exceed the size of the host’s real application space? If so, the host swap space is used, causing disk activity at this level. This is independent of the swap activity within each UML as a user exceeds its “real” space and begins to use its swap partition. I’m guessing that host-level swapping does not count against any UML’s bucket, but that UML-level swapping does. This would be the fair way to do it. However, host-level swapping will reduce the overall amount of I/O resource that is available to the users. The algorithm above will account for this.

    Next issue: Do we have fairness issues with network bandwidth? Do you intend to add a token system to address this?

    Again: I’m a happy camper. These are purely theoretical questions for me.
