Advice on scaling horizontally

Hello -

I'm creating a web service for gaming that will be used by iPhone clients under my control. I have 350,000 unique iPhone users at this time, and they play about 20,000 games daily against an A.I. opponent.

Now I want them to be able to play each other over the Internet. Specifically, the game is a "turn based" game like chess, checkers, darts, connect four, bowling, etc.

With this many users, I have to be careful to scale from the outset.

If all goes well, I plan on licensing the infrastructure out to other iPhone developers and would love to scale the infrastructure along with the increase in load on the service brought on by additional iPhone developer games (and other games that I create).

I'm torn between using a service like Heroku (which is a bit pricey since I can do it all myself) vs. doing it myself with multiple Linodes. I do not mind (and rather enjoy) doing sysadmin-type work - it's fun.

I'm looking for advice on how to most effectively scale out my own little "cloud" of servers all on Linode. Any issues that you can think of, please mention them and if you care to share, how YOU solved them.

The general idea as I see it is to add linodes to match increases in load on the service. Great, but how do YOU effectively do this? DNS round robin, a single linode for a "VIP" of sorts, etc.?

The issues as I see them are:

  • deciding whether each linode is a full stack (nginx, upstream API/service handler, and FULL datastore [ALL developers, ALL iPhone users]) or shards (linode 1 gets iPhone users with username A-D, linode 2 gets users from E-G, etc)

  • bandwidth constraints (sure I can upgrade but I'd like to use bandwidth effectively)

  • database load (I'm using Tokyo Tyrant which supports master-master replication)

  • reducing impact of hardware failures and DoS attacks on a particular data center

As I understand it, in order to have replication between database instances that does not kill my bandwidth transfer limits, I'd use private IPs. But this means that the linodes need to reside in the same data center. How do YOU handle this?

I'm hoping for as much specific advice as you can offer. Thanks!

:)

23 Replies

While I'm not qualified to give advice on how to achieve your goals, it may be useful to remember that the Linode specs scale linearly (a linode with double the cost has double the RAM, double the bandwidth, and double the CPU allotment). The CPU allotment part may be important.

Guspaz - Thanks. That would be scaling vertically. It's good to be able to scale vertically but 1) you should spread the load horizontally to keep performance "good" and 2) you should have some level of redundancy in case there's a hardware failure (HA).

I'm scouring the forums for stories of people doing this on linode.

@z8000:

Guspaz - Thanks. That would be scaling vertically. It's good to be able to scale vertically but 1) you should spread the load horizontally to keep performance "good" and 2) you should have some level of redundancy in case there's a hardware failure (HA).

Where are you getting 1) from? As long as the box isn't bounded by something, what's the difference between horizontal and vertical?

If HA is your goal, that's a very different thing from performance. If you really want HA, get your performance up first, then multiply everything by 2.

I am getting 1 from exactly that - at some point a single box will not be able to handle infinite load.

I'm looking for a mix of HA and performance. Sorry if I was unclear.

I have thought about this a bit myself, and I would probably end up utilizing all the options you bring up in your posts.

1 small Linode at each data center to act as the VIP, with whatever Linux VIP software you like.

X servers on the back end behind the VIP, on whatever Linode plan gives you the performance you need.

X servers as the DB back ends (if you need this sort of architecture).

Once you have 2 data centers set up, you can use DNS round robin between the two VIPs, and you now have scalability and redundancy. You could add other data centers as you see fit, or just add more servers to the existing setups to scale.

I'm sure there is a gotcha to this setup, but I think with careful planning and monitoring you would get what you want. I'm sure HA and the private network could also be added into the mix for less public transfer and greater redundancy.

Sean

> DNS round robin, a single linode for a "VIP" of sorts, etc.?
DNS round robin works fine until you have half a dozen servers. After that you'll need to designate one or more of your own servers for load-balancing duties (Pound, etc.)

> - deciding whether each linode is a full stack (nginx, upstream API/service handler, and FULL datastore [ALL developers, ALL iPhone users]) or shards (linode 1 gets iPhone users with username A-D, linode 2 gets users from E-G, etc)
I'd put at least the database on a separate server/cluster, and make the application servers connect to it over the private IP. Try to stay away from sharding your data, as things tend to become rather difficult to maintain that way. You'll be able to scale quite a bit with replication alone.

> - bandwidth constraints (sure I can upgrade but I'd like to use bandwidth effectively)
Linode has a fantastic feature known as bandwidth pooling. For example, if you have 5 Linode 2880s, you have 8TB of bandwidth to divide any way you want among those linodes. You can have one front-end server use all of it, and it'll make no difference from the billing point of view. But please do try to use the private network for all inter-Linode traffic. It's free and more secure.

> - database load (I'm using Tokyo Tyrant which supports master-master replication)
No SQL? Good for you ;)

If you're into Tokyo Cabinet/Tyrant, you might want to look into Redis. Similar features, just a lot faster.

> - reducing impact of hardware failures and DoS attacks on a particular data center
Hardware failures are OK if your infrastructure is spread out across multiple linodes on multiple host machines and your load balancing solution can detect an outage. (You can specifically request that no two linodes reside on the same machine.) As for the second part, you'll need to set up an identical cluster in a different location, make sure that each cluster is capable of operating while the other is offline, but at the same time keep both in sync while both are online. Tricky :D

> As I understand it, in order to have replication between database instances that does not kill my bandwidth transfer limits, I'd use private IPs. But this means that the linodes need to reside in the same data center. How do YOU handle this?
AFAIK, there's no way around using your public bandwidth if you want geographical redundancy. But if you set up replication properly and only sync the changes, the bandwidth usage should be negligible compared to what your clients chew up. Just make sure that any traffic between the datacenters goes through an SSH tunnel, for security's sake.
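For illustration, one way to keep such a tunnel up (the hostname, user, and ports below are made up, and in practice you might prefer autossh or a proper VPN) is a tiny wrapper like this:

```python
# Minimal sketch: keep an SSH local port-forward up so the remote datastore
# looks like a local port. Hostname, user, and ports are hypothetical.
import subprocess
import time

REMOTE = "dbuser@linode-a.other-dc.example.com"   # hypothetical remote master
LOCAL_PORT = 1979                                  # where the tunnel listens locally
REMOTE_PORT = 1978                                 # Tokyo Tyrant's default port

def keep_tunnel_open():
    while True:
        # -N: no remote command, just forward; -L: local port -> remote port
        proc = subprocess.Popen([
            "ssh", "-N",
            "-o", "ExitOnForwardFailure=yes",
            "-L", f"{LOCAL_PORT}:127.0.0.1:{REMOTE_PORT}",
            REMOTE,
        ])
        proc.wait()        # blocks until the tunnel drops
        time.sleep(5)      # back off briefly, then re-establish

if __name__ == "__main__":
    keep_tunnel_open()
```

The replication target on the remote side then shows up locally as 127.0.0.1:1979, so the replication traffic never crosses the Internet in the clear.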

@z8000:

I am getting 1 from exactly that - at some point a single box will not be able to handle infinite load.

Sure, but you rejected the idea of a bigger linode based on this.

What I'm getting at is if you need more than a 360 and your options are 2 360s vs 1 720, the 720 shouldn't be ruled out for performance reasons. Also, 2 360s does not give you HA if 1 360 wouldn't be sufficient on its own.

Thanks for the great reply.

I love Redis – it's so simple and yet very powerful. I wrote the AUTH command for it, as well as a few open source clients (C++, Node.js), and I maintain the Perl client for Redis too. These are all on GitHub. This is a little OT, but I like Tokyo Cabinet/Tyrant because Redis keeps the entire dataset in memory, which just doesn't scale well given the considerable per-key/value memory overhead. Also nice with TT: you can run an in-memory TT database as the master with a local disk-based slave, where writes to the slave are asynchronous with respect to the original write. Reads and writes fly, and the data is still disk-backed. Redis only writes to disk every so often, and that, coupled with the excessive memory usage, is a showstopper for me. Sure, you can configure Redis to save to disk synchronously, but it's not optimized for that since it handles a single connection request at a time. In other words, it blocks, and it would also block on disk I/O. Bad, bad, bad.

I just learned about Linode's bandwidth pooling feature this morning, actually, in another thread. It's killer! It opens the door to running a single linode as a dedicated "VIP" or reverse proxy like nginx, routing over the private network to other linodes (in the same data center). Nginx ignores upstream processes that have stopped responding.

OK, with your input, this is what I envision. This is in one data center:

Linode A - reverse proxy (nginx) and datastore (Tokyo Tyrant)
Linode B - "upstream" processes (e.g. application logic)
Linode C - "upstream" processes (e.g. application logic)
Linode D - "upstream" processes (e.g. application logic)
Linode Z - "upstream" processes (e.g. application logic)

Linode A reverse-proxies to Linodes B-Z over the private network, and Linodes B-Z talk to the datastore on Linode A, also over the private network. I'd request that Linode put each of these (as much as possible) on different physical machines, like you mentioned.

This seems like a nice solution because 1) nginx scales to many thousands of simultaneous requests thanks to its use of async I/O (e.g. epoll) and 2) I can scale horizontally by adding more linodes (e.g. E, F, G, … Z) as needed.
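Since nginx just silently routes around dead upstreams, I'd probably pair it with a dumb out-of-band check over the private network so a failed node doesn't go unnoticed. A rough sketch (the private IPs and the /ping endpoint are placeholders I made up):

```python
# Rough sketch: poll each upstream app node over the private network and
# report which ones are down. IPs and the /ping path are hypothetical.
import urllib.request
import urllib.error

UPSTREAMS = {
    "linode-b": "http://192.168.100.11:8000/ping",
    "linode-c": "http://192.168.100.12:8000/ping",
    "linode-d": "http://192.168.100.13:8000/ping",
}

def check(url, timeout=3):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    for name, url in UPSTREAMS.items():
        print(f"{name}: {'up' if check(url) else 'DOWN'}")
```

Run it from cron (or wire it into Munin/Nagios later) and you at least know when nginx has quietly dropped a node.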

> As for the second part, you'll need to set up an identical cluster in a different location, make sure that each cluster is capable of operating while the other is offline, but at the same time keep both in sync while both are online. Tricky

Right! Tokyo Tyrant is nice here too. You can set up master-master ("dual-master") replication between instances. I suppose OpenVPN or an SSH tunnel between the "Linode A" in each data center would work. See the "Replication" section here: http://tokyocabinet.sourceforge.net/tyrantdoc/#tutorial

The only thing I'm still not sure about is how to implement the redundancy between the two data centers for web service requests – the TT replication covers the datastore part. For instance, what if the VIP/reverse-proxy goes down due to a hardware failure? I'd want all requests to go to the VIP/reverse-proxy of the other data center until I get the reverse-proxy/datastore up again in the first data center.

As I understand it, DNS round robin wouldn't really work for that. It would just cause 50% of the requests to fail. How would you do this? Just point DNS to data center one and, if there's a problem, switch DNS to point to the other data center? What do I need to know here? Is a short TTL all this requires? Any other ideas?

@glg:

@z8000:

I am getting 1 from exactly that - at some point a single box will not be able to handle infinite load.

Sure, but you rejected the idea of a bigger linode based on this.

What I'm getting at is if you need more than a 360 and your options are 2 360s vs 1 720, the 720 shouldn't be ruled out for performance reasons. Also, 2 360s does not give you HA if 1 360 wouldn't be sufficient on its own.

glg, these are all valid points.

I would definitely not rule out the 720 for performance reasons. My unstated (sorry) assumption is that I can do what I need in a 360 (or 720 vs 1440, etc.). I would potentially rule it out for HA.

@z8000:

For instance, what if the VIP/reverse-proxy goes down due to a hardware failure? I'd want all requests to go to the VIP/reverse-proxy of the other data center until I get the reverse-proxy/datastore up again in the first data center.

As I understand it, DNS round robin wouldn't really work for that. It would just cause 50% of the requests to fail. How would you do this? Just point DNS to data center one and, if there's a problem, switch DNS to point to the other data center? What do I need to know here? Is a short TTL all this requires? Any other ideas?

Short TTL will work well enough, I guess. If your TTL is something like 5 minutes (or even 5 seconds if you're crazy), you theoretically won't get more than 5 minutes of downtime even if one of the datacenters were flattened by a hurricane. And if you don't want that to happen while you're asleep, you can write a script to monitor your servers and change the DNS entries automatically when a failure is detected.
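As a rough sketch of that monitor-and-repoint idea (the health URL and datacenter hostnames are placeholders, and update_dns() is just a stub; the real change would go through whatever API or control panel your DNS provider exposes):

```python
# Sketch of a DNS failover monitor: if the primary VIP fails several checks
# in a row, repoint DNS at the other datacenter. All names are hypothetical.
import time
import urllib.request
import urllib.error

PRIMARY = "http://vip-dc1.example.com/healthz"   # hypothetical health URL
FAILOVER_TARGET = "vip-dc2.example.com"          # hypothetical second datacenter

def healthy(url, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def update_dns(hostname):
    """Placeholder: repoint the service record at `hostname` via your DNS provider."""
    print(f"would repoint DNS to {hostname}")

if __name__ == "__main__":
    failures = 0
    while True:
        failures = 0 if healthy(PRIMARY) else failures + 1
        if failures >= 3:          # require a few misses before acting
            update_dns(FAILOVER_TARGET)
            failures = 0
        time.sleep(60)
```

With a short TTL, clients pick up the new record within minutes of the switch.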

@hybinet:

And if you don't want that to happen while you're asleep, you can write a script to monitor your servers and change the DNS entries automatically when a failure is detected.

Or just use the linode failover feature :)

@freedomischaos:

@hybinet:

And if you don't want that to happen while you're asleep, you can write a script to monitor your servers and change the DNS entries automatically when a failure is detected.

Or just use the linode failover feature :)

That only works if your nodes are in the same data center.

You seem to have a big blind spot around operational considerations like configuration management and monitoring, or at least you haven't mentioned anything about them.

You want to be able to bring up a new node (and your entire cluster, for that matter) quickly and with minimal manual effort.

You want to be able to upgrade or reconfigure existing nodes en masse. You want to be able to track resource utilization, response times, errors, and availability, and notify and/or take automatic action when things are out of acceptable ranges.

A major advantage of Tokyo Tyrant over Redis is that its code lineage has been in high-volume production use for a while. Even so, you need to test its replication and failover features before you start developing on top of them, and then all along the way to deployment and beyond.
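For example, a trivial smoke test could write a key through one node and read it back from its peer. A sketch, assuming Tokyo Tyrant's HTTP-compatible interface is reachable on the hosts/ports shown (which are made up):

```python
# Toy replication smoke test: write through one Tokyo Tyrant node, then read
# from its peer and check the value arrived. Hosts/ports are hypothetical and
# this assumes ttserver's HTTP-compatible protocol is available.
import time
import urllib.request

MASTER = "http://192.168.100.20:1978"    # hypothetical master
REPLICA = "http://192.168.100.21:1978"   # hypothetical replicating peer

def put(base, key, value):
    req = urllib.request.Request(f"{base}/{key}", data=value.encode(), method="PUT")
    urllib.request.urlopen(req).read()

def get(base, key):
    with urllib.request.urlopen(f"{base}/{key}") as resp:
        return resp.read().decode()

if __name__ == "__main__":
    put(MASTER, "repl-test", "hello")
    time.sleep(1)                         # give replication a moment
    assert get(REPLICA, "repl-test") == "hello", "replica did not catch up"
    print("replication OK")
```

The same test, run repeatedly while you kill and restart nodes, tells you far more about the failover behavior than the documentation will.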

Personally, I would not put the datastore on the same linode as the front-end proxy. On the one hand, their resource needs are pretty complementary: the front-end proxy requires minimal memory, CPU, or disk I/O, while the datastore will need lots of RAM and disk I/O, and potentially CPU. On the other hand, I'd rather minimize the exposure of the datastore machine to the Internet at large.

I'd probably set up nginx as a front end proxy on all my application nodes, make one of them the master with the failover IP, and come up with a way to handle failover to one of the other nodes, at least until it is simpler to have a couple of dedicated Linode 360s.

For datacenter redundancy, consider partitioning your users between datacenters in some easy way so that you don't need multi-master replication and can instead have a hot spare. If there is a problem with the primary datacenter for those users, you can fail them over en masse to the backup. This can simplify replication setup, reduce data transfer between datacenters, and simplify dealing with latency issues.

For multi-datacenter, you are going to be limited to the shortcomings of a DNS-based solution unless you want to wade into getting your own IP address range and worry about specialized routing, or maybe there is a service, like Akamai, that would handle this for you. For the DNS solution, you could start with round robin, but you'd want the ability to automatically remove the record for the non-functioning datacenter. I can think of various ways to get this to work, and I know that there are inexpensive services like ZoneEdit that can provide it as a service. Even better is to have smarter logic for allocating requests when both datacenters are active. I know that there are commercial products that pick the optimal datacenter based on geography or network hops; there are probably services too, and there might even be some open source software.

At some point you will outgrow having your datastore on a single box. Spend a little time up-front figuring out how you'll deal with that. If you are using a key-value store, then you can probably come up with a simple layer to partition data among multiple machines, but it doesn't hurt to think up front about how you'd split things up.
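Such a layer can start out very small, for example a hash of the key modulo the number of nodes, as in this sketch (the node list is made up). The painful part is that naive modulo hashing reshuffles most keys when you add a node, which is exactly why it's worth thinking about early:

```python
# Minimal sketch of a key -> node partitioning layer for a key-value store.
# The node list is hypothetical; note that naive modulo hashing reshuffles
# most keys whenever a node is added, which is the part worth planning for.
import hashlib

NODES = [
    ("192.168.100.20", 1978),   # datastore node 1 (hypothetical)
    ("192.168.100.21", 1978),   # datastore node 2 (hypothetical)
]

def node_for(key: str):
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Example: every read/write for a given user consistently lands on one node.
print(node_for("user:alice"))
print(node_for("user:bob"))
```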

Back to configuration management, some tools people use: Cfengine, Puppet, Chef.

For monitoring, trending, and alerting, pieces of the puzzle include: Munin, Nagios, Monit, Ganglia, Zenoss, and others.

Configuration and monitoring are not something that I'm blind to. I just wanted to figure out the overall architecture first. I'm familiar with Puppet and Munin.

Re: partitioning/sharding. You can't have sharding and suddenly fail everyone over to another box. Their data won't be there. Am I missing something?

Thanks for the comments!

A solution I've used in the past that worked very very well in combination with DNS round robin is wackamole (http://www.backhand.org/wackamole/). But you need to be able to install stuff on your own and know how to fix compile errors if you want to run it.

As for protection from DoS, I could be wrong on this, but my impression is that the Dallas Linodes are behind ThePlanet's DoS mitigation service (Cisco Guard XT and Arbor Peakflow). No idea about the other Linode datacenters.

Let me ask you a question: if scaling is such an important issue for you, why are you trying to build this all from scratch on a couple of Linodes?

It sounds like you're in the iPhone app business, not in the cloud-building business, so why waste your time and resources building something that you can perfectly well buy somewhere else?

If I were building an Internet-based iPhone application, I'd probably host it on something like EC2 (or maybe Google App Engine).

You won't have to worry about failover, DDoS, adding linodes to your "cloud", DNS balancing, and administration in general, but can instead focus on the application work.

Just my 2c…

@oliver:

Let me ask you a question: if scaling is such an important issue for you, why are you trying to build this all from scratch on a couple of Linodes?

It sounds like you're in the iPhone app business, not in the cloud-building business, so why waste your time and resources building something that you can perfectly well buy somewhere else?

If I were building an Internet-based iPhone application, I'd probably host it on something like EC2 (or maybe Google App Engine).

You won't have to worry about failover, DDoS, adding linodes to your "cloud", DNS balancing, and administration in general, but can instead focus on the application work.

Just my 2c…

This is of course a great question. There are a couple of reasons:

  • flexibility

  • cost

By flexibility I mean I want to be able to architect whatever I want. Case in point: GAE is nice and all, but they don't let you do keep-alive or any form of chunked transfer / streaming back to connected clients, so you are left with polling. The associated connection overhead for iPhones in the wild (not just on Wi-Fi networks) is horrible. Something like EC2 would work for this, but that leads to my second point: cost.

I am a "startup" and have no funding. I do "OK" with the iPhone stuff and want to make it grow. This service might help that a lot. However, I have to do it with as little cost as I can manage, at least up front. While everyone says EC2 is not expensive, they are comparing it to the cost for larger companies than my own. To me, EC2 becomes pretty expensive compared with doing it myself on Linode. I don't need a true "cloud" solution really. I'm fine with over-provisioning on Linode precisely because it's cost-effective.

If I can hook up 3-4 linode 360s in 2 data centers, replicate the database between these 2 data centers, and somehow load-share between the main/public-facing linodes in each data center, I'd be thrilled.

I started this thread in hopes that others would have done this already and would offer some advice to a "scaling newbie" :)

I enjoy this kind of work, have always wanted to learn it, and like what I see at Linode. That, coupled with cost savings makes me at least want to design the thing head to toe before saying "yeah, screw that, let's just spend the money on EC2 or Heroku or XYZ".

Thanks!

And you will also learn a lot from doing this… So, did you come up with an elegant solution that you can share with others looking to do something similar?

Nope!

Started an iPhone consulting business instead!

Although z8000 has shelved his plans, I still found this thread really useful and I wanted to add something re: using DNS to manage "vip" failover across datacenters (which was only lightly touched upon).

One solution is to opt for a managed enterprise DNS platform. Apart from automatic failover support you also get other benefits like anycast routing of DNS queries which is useful if you have users from multiple continents.

These guys are awesome:

http://dyn.com/enterprise-dns/dynect-platform

and there are a few others offering something similar (UltraDNS?).

I feel that "Enterprise Linode" is an oxymoron.

Haha, true, "enterprise" does tend to get overused. In the case of managed DNS, I treat the term as meaning a dedicated service aimed at technical users, rather than some fluffy web interface that lets you input A records…

@gpuk:

Haha, true, "enterprise" does tend to get overused. In the case of managed DNS, I treat the term as meaning a dedicated service aimed at technical users, rather than some fluffy web interface that lets you input A records…

enterprise |ˈentərˌprīz| noun 1. expensive 2. non-fluffy

:roll:
