As a young boy I grew up around sheep farms in country Western Australia. There were always one or two unique puppies, and hundreds or thousands of near identical sheep. This week’s online Census failure can be explained by looking at the way different people set up their systems. In essence there are two ways that you can set up your web servers. You can treat them like puppies, or you can treat them like sheep.
When you have a puppy it becomes part of your family. You give it a name. You nurture it and care for it. And if it gets sick you take it to a vet and nurse it back to health. The old way of managing web servers, the IBM way, is to lovingly care for them in the same way.
It’s the way I did things when I started my IT career. My servers had names (the first two were Tweety and Sylvester) and I cared for them every day. To meet greater demand you would purchase more servers at considerable expense, then spend time, money and effort setting them up from scratch. But that makes sense, right?
Now of course, the smarter you are, the better you plan out your infrastructure. Knowing far in advance that there would be huge demand on Census night, IBM decided to prepare a number of web servers. But if you guess the demand wrong then you’re stuck. It can take months to get another server in place. By the time you realise on the night that you have a problem, it’s too late to fix it.
This isn’t new, of course. Remember the first Click Frenzy in 2012? Same deal. From the Wikipedia article on Click Frenzy: “Organisers boasted of their preparedness to deal with the expected popularity. ‘We’re expecting up to 1 million site visits [to clickfrenzy.com.au] and we’re prepared for this,’ the spokesperson said. However the site failed almost immediately after the sale period starting.”
The other way of managing servers is to treat them like sheep.
Millions of Australians were saying “Of course the Census site isn’t working, we’re all trying to access it at the same time!” They were saying it on Twitter and Facebook, all at the same time, and those websites were working just fine. Meeting huge demand is a solved problem.
Modern internet companies like Facebook, Amazon and Netflix set up their servers like sheep in a paddock. The web site can be delivered by any sheep/server so you can have many hundreds of them. They’re all the same, right? Unlike puppies you don’t treat them as individuals and you can bring in more or get rid of some without any drama.
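For the technically minded, here is a minimal sketch of what "any sheep can deliver the web site" might look like. It is an illustration only, not how Facebook, Amazon or Netflix actually do it: the `Paddock` class and the server names are made up for this example, and handing requests out in simple rotation (round-robin) is just one of several ways to spread the load.

```python
class Paddock:
    """A pool of interchangeable servers (hypothetical name, illustration only)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._turn = 0  # counts how many requests have been handed out

    def pick(self):
        """Return the next server in rotation; no server is special."""
        if not self.servers:
            raise RuntimeError("no servers in the paddock")
        server = self.servers[self._turn % len(self.servers)]
        self._turn += 1
        return server

    def add(self, server):
        """Bring another sheep into the paddock."""
        self.servers.append(server)

    def remove(self, server):
        """Take a sheep out; the rest carry on serving."""
        self.servers.remove(server)


if __name__ == "__main__":
    paddock = Paddock(["sheep-1", "sheep-2"])
    print([paddock.pick() for _ in range(4)])
    paddock.add("sheep-3")
    paddock.remove("sheep-1")
    print([paddock.pick() for _ in range(4)])
```

Nothing in the calling code ever asks for a particular server by name, which is the whole point: Tweety and Sylvester could not be swapped out this casually.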
To stretch the analogy: when your web servers behave like sheep in a paddock, it doesn’t matter which individual sheep is involved at the time. To the casual observer all sheep are near identical. If you have designed your web servers to run on Amazon Web Services, for example, you can use their Auto Scaling service and have a new server up and running in a matter of minutes. You can grow your paddock to essentially unlimited capacity in very little time.
With that degree of flexibility you only pay for what you use, when you use it. With a lot of smarts behind it, Auto Scaling allows you to set rules to the effect of “If I need more servers, pay for them. Then when the load goes down again, turn them off”.
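A rule like that can be sketched in a few lines. This is a simplified illustration and not Amazon's actual algorithm or API: the function name, the 50% CPU target and the minimum and maximum flock sizes are all assumptions picked for the example. The idea is simply to size the flock in proportion to how busy the current servers are.

```python
import math


def desired_servers(current: int, avg_cpu_percent: float,
                    target_cpu_percent: float = 50.0,
                    min_servers: int = 2, max_servers: int = 500) -> int:
    """Return how many identical servers to run for the observed load.

    Hypothetical rule: keep average CPU near the target by growing or
    shrinking the flock in proportion to the current load.
    """
    if current < 1:
        # Nothing running yet, so start with the minimum flock.
        return min_servers
    # If 10 servers run at 95% and we want 50%, we need 10 * 95 / 50 = 19.
    wanted = math.ceil(current * avg_cpu_percent / target_cpu_percent)
    # Never drop below the floor or pay for more than the ceiling.
    return max(min_servers, min(max_servers, wanted))


if __name__ == "__main__":
    print(desired_servers(10, 95.0))  # busy night: add servers
    print(desired_servers(10, 10.0))  # quiet again: turn most of them off
```

The same rule handles both halves of the quote: when load rises it asks for more servers, and when load falls it hands them back so you stop paying for them.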
In a world where you can scale to meet the demands of millions of users, why limit yourself to a more expensive and slower option?
Setting up your servers this way can be a challenge; it takes time and effort from a lot of smart people. In the startup world it can be overkill and a waste of limited resources, since a small business is unlikely to get millions of people banging on its door all of a sudden. But the Census was a multimillion-dollar project outsourced to one of the biggest companies on the planet, with significant resources at its disposal. Why did they decide to do things the way they did? I’m afraid that I don’t know.
Lastly, on the topic of attacks on the servers: even the smallest bank gets frequent and severe attacks, as does almost every non-trivial website in the world. Mitigating such attacks is, again, a solved problem. This wasn’t the first website to have been targeted, and there are proven, repeatable and scalable techniques for defending against those attacks.
Whatever findings from the Census failure are released to the Australian public, we’ll only be told what happened. I have no doubt the biggest question will remain: Why?