
Linode.com Status Update 04/06/04

[b]Linux 2.6 on the hosts[/b]

Six of the 20 host servers are now running the 2.6 version of the Linux kernel, with CFQ disk queueing.

It's been running on a few boxes for a while now, so I have a pretty good feel for how it behaves. Compared to 2.4 (judging from before/after mrtg and vmstat output), 2.6 is better under some workloads and a little worse under others. I expect some additional gains from playing with a few of the VM tuning options (/proc/sys/vm/*).

Overall, 2.6 is A Good Thing, and the remaining hosts will eventually move over to it.

[b]No more disk I/O thrashing![/b]

The main reason I wanted to move to 2.6 was its improved I/O performance over 2.4.

Linux is susceptible to what I'd call a "hard drive denial of service attack": a flood of random read/write requests that fills up the request queue. This causes latency problems for other requests and essentially brings the box to a crawl.

This is exactly the kind of workload generated when a Linode continuously thrashes its swap device (rapid reads and writes) and the host comes under pressure to write out those dirty pages (which, given enough time, it always does). Unfortunately, the CFQ patch for 2.6 didn't fix this, and neither did the default anticipatory scheduler or the deadline scheduler.

CFQ does help a bit when many threads are doing random I/O (like during the cron-job party), but it doesn't eliminate the possibility of a single Linode wedging the entire host. Read on for the solution…

[b]UML I/O request token-limiter patch[/b]

I've implemented a simple token-bucket filter/limiter around the async UBD driver inside UML. The token-bucket method is pretty neat. Here's how it works: every second, x tokens are added to the bucket. Every I/O request requires a token, so it has to wait until there are some tokens in the bucket before the I/O is allowed.

This method allows bursting at an unlimited rate until the bucket is empty, and then the throttling kicks in. Perfect!
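
For the curious, the mechanism boils down to something like the sketch below. This is only an illustration in C, not the actual patch code (that's in the links that follow), and all of the names and numbers here are made up:

[code]
/* Minimal token-bucket sketch -- an illustration only, not the actual
 * token-limiter-v1.patch code.  All names and values are made up. */
#include <stdio.h>

struct token_bucket {
    unsigned long tokens;      /* tokens currently in the bucket   */
    unsigned long max_tokens;  /* bucket size (burst allowance)    */
    unsigned long refill;      /* tokens added per refill interval */
};

/* Called once per interval (e.g. every second): top the bucket up. */
static void tb_refill(struct token_bucket *tb)
{
    tb->tokens += tb->refill;
    if (tb->tokens > tb->max_tokens)
        tb->tokens = tb->max_tokens;
}

/* Called before each I/O request.  Returns 1 if the request may go
 * now, 0 if the caller has to wait for the next refill. */
static int tb_consume(struct token_bucket *tb)
{
    if (tb->tokens == 0)
        return 0;              /* bucket empty -> throttle */
    tb->tokens--;
    return 1;                  /* token spent -> I/O allowed */
}

int main(void)
{
    struct token_bucket tb = { .tokens = 5, .max_tokens = 5, .refill = 2 };
    /* Eight back-to-back requests: the first five burst through on the
     * full bucket, the rest are throttled until the next refill. */
    for (int i = 0; i < 8; i++)
        printf("request %d: %s\n", i, tb_consume(&tb) ? "allowed" : "throttled");
    tb_refill(&tb);
    printf("after refill: %lu tokens\n", tb.tokens);
    return 0;
}
[/code]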

Links:
[url=http://www.theshore.net/~caker/patches/token-limiter-v1.patch]token-limiter-v1.patch[/url]
[url=http://www.theshore.net/~caker/patches/token-limiter-v1.README]token-limiter-v1.README[/url]

[b][color=darkred]With this patch in place, a single Linode can no longer wedge the host![/color][/b]

This is a big deal, since until now the only fix was for me to step in and stop the offending Linode.

The limiter patch is in the 2.4.25-linode24-1um kernel (2.6-um to follow shortly).

The default values are set very high, and I doubt anyone will be affected by them during normal use. The refill rate and the bucket size can be changed at runtime, so I can design a per-host monitor that dynamically changes each Linode's profile based on the host's load. This is a big deal! 🙂
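
To illustrate the idea (and nothing more), such a monitor could look roughly like the sketch below. The /proc paths and values are placeholders invented for the example — see the README linked above for the limiter's actual knobs:

[code]
/* Hypothetical per-host monitor sketch.  The real limiter interface is
 * described in token-limiter-v1.README; the /proc path and numbers here
 * are placeholders, not the patch's actual interface. */
#include <stdio.h>
#include <unistd.h>

static void write_val(const char *path, long val)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return;
    fprintf(f, "%ld\n", val);
    fclose(f);
}

int main(void)
{
    for (;;) {
        double load = 0.0;
        FILE *f = fopen("/proc/loadavg", "r");
        if (f) {
            fscanf(f, "%lf", &load);
            fclose(f);
        }
        /* Loose profile while the host is idle, tighter when it's loaded. */
        long refill = (load < 4.0) ? 500 : 100;            /* invented values  */
        write_val("/proc/linode/limiter_refill", refill);  /* placeholder path */
        sleep(10);
    }
    return 0;
}
[/code]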

[b]Linux 2.6 for the Linodes[/b]

I haven't officially announced the 2.6-um kernel yet. There are still a few bugs and performance issues to work out. I still don't recommend running the 2.6-um kernel for production use, but a few adventurous users have been testing it and reporting the issues involved in getting it running under each distro. I intend to compile a guide on migrating to 2.6 and release it once the kernel is more stable.

[b]What's new in the world of UML?[/b]

We're long overdue for a new UML patch. There should be new UML releases (for both 2.4 and 2.6) within the next two weeks.

In addition to the usual bug fixes, Jeff has been working on AIO support for the I/O driver inside UML. AIO is a new feature implemented in 2.6 on the host. Some of the benefits are listed below, with a small sketch of the interface after the list:
[list][*] The ability to submit multiple I/O requests with a single system call.
[*] The ability to submit an I/O request without waiting for it to complete, and to overlap the request with other processing.
[*] Optimization of disk activity by the kernel through combining or reordering the individual requests of a batched I/O.
[*] Better CPU utilization and system throughput by eliminating extra threads and reducing context switches.
[/list]
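To give a feel for what the new interface looks like from userspace, here's a simplified sketch using the libaio wrappers on a 2.6 host (build with -laio). This is only an illustration of the host-side API, not UML's driver code:

[code]
/* Simplified host-side AIO sketch using libaio -- illustration only. */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    static char buf[4096];

    int fd = open("/tmp/testfile", O_RDONLY);
    if (fd < 0 || io_setup(8, &ctx) < 0)
        exit(1);

    /* Describe a 4 KB read at offset 0 and submit it with one syscall;
     * several iocbs could be batched into this single io_submit(). */
    io_prep_pread(&cb, fd, buf, sizeof(buf), 0);
    if (io_submit(ctx, 1, cbs) != 1)
        exit(1);

    /* Other work could overlap with the read here... */

    /* Reap the completion when the data is actually needed. */
    if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)
        printf("read completed: %ld bytes\n", (long)ev.res);

    io_destroy(ctx);
    return 0;
}
[/code]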
For more information on AIO, check out:
http://lse.sourceforge.net/io/aio.html
http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf

That's it for now!
-Chris


Comments (12)

  1.

    This may be naive, but wouldn’t it help tremendously to have all the swap partitions for a given linode on a different drive?

  2.

    [quote:9a75d3e3be=”diN0″]This may be naive, but wouldn’t it help tremendously to have all the swap partitions for a given linode on a different drive?[/quote]
    It might, but that’s not the point, really. Before this patch, a single UML could consume all of the I/O (say, for a given device, like you suggested). It would still cause the same problem when other Linodes tried to access the device. The same effect can be had with “swap files” that exist on your filesystem (rather than actual ubd images) or heavy I/O on any filesystem.

    With this patch, I am able to guarantee a minimum level of service. Previously that wasn’t possible.

    -Chris

  3.

    Great work chris, I genuinely can’t think of anything else you can improve upon! 😉

  4.

    Chris,

    I tried the 2.6 kernel of Redhat 9 (large) a few days ago. It failed to boot & I had to switch back to 2.4.

    Another forum thread had the same problem.
    dev/ubd/disc0: unknown partition table
    /dev/ubd/disc1: unknown partition table

  5.

    I am really excited about this. As you know I have been one of the most vocal proponents of some system of throttling disk I/O so that an overzealous Linode cannot DOS the host.

    It sounds like this solution will require everyone to upgrade to a 2.6 kernel, which means that it cannot be applied until everyone is ready to go to 2.6 (and it will only be effective when *everyone* has upgraded to this fixed kernel). So I guess the solution is months away. But at least there is a plan in the works to solve this problem for good.

    Great job man! Keep up the good work!

  6.

    Just curious – why not solve this problem in the host kernel instead? Can the host kernel be patched to limit any one of its processes using the I/O token system that you have devised? Then the Linodes themselves could run any kernel they want, and the host system would prevent any one of them from thrashing the disk.

    Ideally this would be some kind of rlimit option, so that it could be applied just to the Linode processes themselves and not to the other processes of the host system.

    I don’t know if the I/O layer that’s deeper in the kernel than the UML ubd driver is harder to work with, though … perhaps modifying the fundamental Linux I/O code would be more complex than modifying the ubd driver?

  7.

    caker, thanks for all the hard work you’ve put in to keep the linode hosts in top shape.

    It’s rather surprising that CFQ didn’t solve the I/O scheduling problem, though. The algorithm is supposed to be [i]completely fair[/i] towards each thread requesting I/O. 😛

  8.

    [quote:52760ef410=”Quik”]Great work chris, I genuinely can’t think of anything else you can improve upon! :wink:[/quote]
    Thanks Quik 🙂

    [quote:52760ef410=”gmt”]Chris,

    I tried the 2.6 kernel of Redhat 9 (large) a few days ago. It failed to boot & I had to switch back to 2.4.

    Another forum thread had the same problem.
    dev/ubd/disc0: unknown partition table
    /dev/ubd/disc1: unknown partition table[/quote]
    You can always ignore this warning message — it’s just telling you that the ubd devices are not partitioned. You’re using the entire block device as one giant ‘partition’.

    To get 2.6 to work under RedHat, first rename /lib/tls to something else (since 2.6-um and NPTL don’t mix yet).

    -Chris

  9.

    [quote:2eaacf3890=”bji”]I am really excited about this. As you know I have been one of the most vocal proponents of some system of throttling disk I/O so that an overzealous Linode cannot DOS the host.

    It sounds like this solution will require everyone to upgrade to a 2.6 kernel, which means that it cannot be applied until everyone is ready to go to 2.6[/quote]
    Not sure where you read that from my post. I’ve already patched the 2.4.25-linode24-1um kernel with the token-limiter patch, and 2.6-um to follow shortly.

    [quote:2eaacf3890=”bji”](and it will only be effective when *everyone* has upgraded to this fixed kernel). So I guess the solution is months away. But at least there is a plan in the works to solve this problem for good.[/quote]
    Most/all of the repeat offenders have already been rebooted into the “linode24” kernel (with the limiter patch). So the solution is in effect right now. But, you are correct — there are still many Linodes running un-limited.

    [quote:2eaacf3890=”bji”]Great job man! Keep up the good work![/quote]
    Thanks!

    -Chris

  10.

    [quote:f066e66db0=”bji”]Just curious – why not solve this problem in the host kernel instead? Can the host kernel be patched to limit any one of its processes using the I/O token system that you have devised? Then the Linode themselves can run any kernel they want to and the host system will prevent any one from thrashing the disk.

    Ideally this would be some kind of rlimit option, so that it could be applied just to the Linode processes themselves and not to the other processes of the host system.

    I don’t know if the I/O layer that’s deeper in the kernel than the UML ubd driver is harder to work with though … perhaps it would be too complex to modify the fundamental Linux I/O code than it is to modify the ubd driver?[/quote]
    I agree — the correct solution is to get Linux fixed, or perhaps to get UML to use the host more efficiently. Some of the UML I/O rework is already under way (the AIO stuff), but that kind of thing *is* months away…

    One interesting “feature” of the CFQ scheduler is an ionice priority level. But, I wasn’t able to get the syscalls working to test it.

    -Chris

  11.

    [quote:01c9cda963=”griffinn”]caker, thanks for all the hard work you’ve put in to keep the linode hosts in top shape.

    It’s rather surprising that CFQ didn’t solve the I/O scheduling problem, though. The algorithm is supposed to be [i]completely fair[/i] towards each thread requesting I/O. :P[/quote]
    I’m not sure where the bottleneck is — but as far as I can tell, CFQ and the standard scheduler in 2.4 appear equally (non)responsive in the worst-case scenario. Go figure…

    One interesting thing is that UML uses the no-op elevator. Jeff and I got into a discussion about this, and he says there’s no point to UML doing any request merging, but I disagree. I’d rather have UML do some of its own request merging and reordering than force the host to do it all. Plus, it makes UML appear to the host as more of a streaming-type load than a random load…

    Think back to the last set of tiobench benchmark results you’ve seen — look at how poor the random-I/O results are compared to “streaming-read” and “streaming-write”…

    So .. another hack to the UML code (one-liner) to test…

    -Chris

  12.

    Thanks, Caker. I have a tiny linode and I make almost no demands on the system, so far at least. However, fairness is part of what you sell. It sounds like the leaky bucket in the UM kernel solves most of the problem with a minimum of effort. I’ve been implementing fairness algorithms for at least 30 years, so I have a few theoretical observations and questions:

    You appear to be issuing tokens independently to each process at an absolute rate, independent of the actual resource availability. This means that a UML may get limited even if nobody else wants the resource, yes? It might be better for the host kernel to issue tokens at an overall rate to the UMLs. That way a particular UML can use the whole resource if nobody else wants it. Since everybody’s buckets are full, the instant anyone else wants to use the resource the original user is instantly throttled to 50% as the tokens are returned equally to the two users, and so on as more users are added. That is, the main kernel returns tokens to each UML with a non-full bucket equally, but does not add tokens to a bucket that is already full. The host kernel should dynamically adjust its token generation rate to just keep the resource occupied. I’ve successfully done this in the past by watching the resource: if the resource goes idle when there are any empty buckets, slightly increase the token rate. If the resource never goes idle, slightly decrease the token rate.
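
    For concreteness, the rate-adjustment loop described above could be sketched roughly like this (the names and the 10% step are invented for the example):

[code]
/* Illustrative sketch of the adaptive token-rate idea described above.
 * All names and thresholds are invented for the example. */
#include <stdio.h>

/* Called once per interval on the host with two observations:
 *   disk_was_idle    - did the disk go idle during the interval?
 *   any_bucket_empty - was at least one UML's bucket empty?        */
static void adjust_token_rate(long *tokens_per_sec,
                              int disk_was_idle, int any_bucket_empty)
{
    if (disk_was_idle && any_bucket_empty) {
        /* Capacity went unused while someone was throttled:
         * hand out tokens a little faster. */
        *tokens_per_sec += *tokens_per_sec / 10 + 1;
    } else if (!disk_was_idle) {
        /* The disk never caught up: hand out tokens a little slower. */
        *tokens_per_sec -= *tokens_per_sec / 10 + 1;
        if (*tokens_per_sec < 1)
            *tokens_per_sec = 1;
    }
    /* Otherwise (idle disk, full buckets) leave the rate alone. */
}

int main(void)
{
    long rate = 100;
    adjust_token_rate(&rate, 1, 1);  /* idle disk, someone throttled */
    printf("rate after increase: %ld\n", rate);
    adjust_token_rate(&rate, 0, 0);  /* saturated disk               */
    printf("rate after decrease: %ld\n", rate);
    return 0;
}
[/code]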

    Next issue: Do you “oversubscribe” the host memory? That is, does the sum of the UML memory sizes exceed the size of the host’s real application space? If so, the host swap space is used, causing disk activity at this level. This is independent of the swap activity within each UML as the user exceeds its “real” space and begins to use its swap partition. I’m guessing that host-level swapping does not count against any UML’s bucket, but that UML-level swapping does. This would be the fair way to do it. However, host-level swapping will reduce the overall amount of I/O resource that is available to the users. The algorithm above will account for this.

    Next issue: Do we have fairness issues with network bandwidth? Do you intend to add a token system to address this?

    Again: I’m a happy camper. These are purely theoretical questions for me.
