Dark Side of the Cloud: Problems with Storage
We recently moved from Amazon on-demand “cloud” hosting to our own
dedicated servers. It took about three months to order and set up the
new servers versus a few minutes to get servers on Amazon. However, the
new servers are 2.5X faster and so far, more reliable.
We love Amazon for fostering development and innovation. Cloud
computing systems are great at getting you new servers. This helps a
lot when you are trying to innovate because you can quickly get new
servers for your new services. If you are in a phase of trying new
things, cloud hosts will help you.
Cloud hosts also help a lot when you are testing. It’s amazing how
many servers it takes to run an Internet service. You don’t just need
production systems. You need failover systems. You need development
systems. You need staging/QA systems. You will need a lot of servers,
and you may need to go to a cloud host.
However, there are problems with cloud hosting that emerge if you
need high data throughput. The problems aren’t with the servers but
instead, with storage and networking. To see why, let’s look at how a
cloud architecture differs from a local box architecture. You can’t
directly attach each storage location to the box that it servers. You
have to use network attached storage.
DEDICATED ARCHITECTURE: Server Box -> bus or lan or SAN -> Storage
CLOUD ARCHITECTURE: Server Box -> Mesh network -> Storage cluster with network replication
1) Underlying problem: Big data, slow networks
Network attached storage becomes a problem because there is a
fundamental mismatch between networking and storage. Storage capacity
almost doubles every year. Networking speed grows by a factor of ten
about every 10 years – 100 times lower. The net result is that storage
gets much bigger than network capacity, and it takes a really long time
to copy data over a network. I first heard this trend analyzed by John
Landry, who called it “Landry’s law.” In my experience, this problem
has gotten to the point where even sneakernet (putting on sneakers and
carrying data on big storage media) cannot save us because after you
lace up your sneakers, you have to copy the data OVER A NETWORK to get
it onto the storage media and then copy it again to get it off. When we
replicated the Assembla data to the new datacenter, we realized that it
would be slower to do those two copies than to replicate over the
Internet, which is slower than sneakernet for long distance transport
but only requires one local network copy.
2) Mesh network inconsistency
The Internet was designed as a hub and spoke network, and that part
of it works great. When you send a packet up from your spoke, it
travels a predictable route through various hubs to its destination.
When you plug dedicated servers into the Internet, you plug a spoke into
the hub, and it works in the traditional way. The IP network inside a
cloud datacenter is more of a “mesh.” Packets can take a variety of
routes between the servers and the storage. The mesh component is
vulnerable to both packet loss and capacity problems. I can’t present
any technical reason why this is true, but in our observation, it is
true. We have seen two different issues:
* Slowdowns and brownouts: This is a problem at both Amazon and
GoGrid, but it is easier to see at Amazon. Their network, and
consequently their storage, has variable performance, with slow periods
that I call “brownouts.”
* Packet loss: This is related to the capacity problems as routers
will throw away packets when they are overloaded. However, the source
of the packet loss seems to be much harder to debug in a mesh network.
We see these problems on the GoGrid network, and their attempts to
diagnose it are often ineffectual.
3) Replication stoppages
The second goal of cloud computing is to provide high availability.
The first goal is to never lose data. When there is a failure in the
storage cluster, the first goal (don’t lose data) kicks in and stomps on
the second goal (high availability). Systems will stop accepting new
data and make sure that old data gets replicated. Network attached
storage will typically start replicating data to a new node. It may
either refuse new data until it can be replicated reliably, or it will
absorb all network capacity and block normal operation in the mesh.
Note that in a large complex systems, variations in both network
speed and storage capacity will follow a power law distribution. This
happens "chaotically." When the variation reaches a certain low level
of performance, the system fails because of the replication problem.
I think that we should be able to predict the rate of major failures
by observing the smaller variations and extrapolating them with a power
law. Amazon had a major outage in April 2011. Throughout the previous
18 months, they had performance brownouts, and I think the frequency of
one could be predicted from the other.
So, if your application is storage intensive and high availability, you must either:
1) Design it so that lots of replication is running all of the time,
and you can afford to lose access to any specific storage node. This
places limits on the speed that your application can absorb data because
you need to reserve a big percentage of scarce network capacity for
replication. So, you will have only a small percentage of network
capacity available to for absorbing external data. However, it is the
required architecture for very large systems. It works well if you
have a high ratio of output to input, since output just uses the
replicated data rather than adding to it.
If you try this replication strategy, you will need to deal with two
engineering issues. First, you will think through replication
specifically for your application. There are many new database
architectures that make this tradeoff in various ways. Each has
strengths and weaknesses, so if you design a distributed system, you
will probably end up using several of these new architectures. Second,
you will need to distribute across multiple mesh network locations. It's
not enough just to have several places to get your data, in the same
network neighborhood. If there is a problem, the entire mesh will jam
up. Ask about this.
2) Use local storage