April 26, 2010

Terabytes on demand? Cloud storage options

If you use cloud virtual servers, you will sooner or later need a big pile of hosted storage to go with them. Assembla.com uses a lot of storage for repositories. We also have customers of our cloud development/outsourcing practice that use a lot of storage for photos and other media. So, I have been doing some research. If you need storage, read on for a description of the options I found for cloud-hosted storage.

In working with hosted storage, you will need to remember that disks are much bigger than network pipes. Disk capacity doubles every two years or so. Network capacity doubles every four years or so - much, much more slowly. As a result, you can get a 1 terabyte drive for a few hundred dollars, but it will take two whole days to transfer the contents of this disk over an expensive Internet connection with an average speed of 50 Mbit per second.
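
To make that concrete, here is the back-of-the-envelope arithmetic in plain Python, using the round numbers assumed above for drive size and link speed:

```python
# How long does it take to move a full 1 TB drive over a
# 50 Mbit/s Internet connection? Round, illustrative numbers.
drive_bytes = 1 * 10**12            # 1 TB (decimal terabyte)
link_bits_per_second = 50 * 10**6   # 50 Mbit/s average throughput

transfer_seconds = drive_bytes * 8 / float(link_bits_per_second)
print("%.1f days" % (transfer_seconds / 86400))   # about 1.9 days
```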

If you have many terabytes, as we do, your only practical option for moving data around the Internet is to use the old sneakernet - actually putting on your sneakers and delivering the media. Make sure that your storage provider will handle this type of delivery, both in and out.

Even if you don't intend to move between hosting locations, you will find that the network places a huge constraint on restore times. You can back up your data to remote locations, or even to a second device in the same datacenter, by using incremental processes like rsync or backup software that moves only the changes. However, if you ever need to restore in a disaster recovery scenario, it will take you a long time, even over gigabit internal connections. If your customers are as demanding as Assembla customers, you might find yourself out of business before the restore completes between two devices. You will need storage that has redundancy inside the storage device.
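
As a minimal sketch of the incremental approach, assuming rsync is installed and you have SSH access to a backup host (the paths and hostname here are hypothetical placeholders):

```python
# Incremental backup sketch: push only the changes to a remote
# host with rsync. Paths and hostname are hypothetical.
import subprocess

SRC = "/var/data/"
DEST = "backup@backup-host.example.com:/backups/data/"

# -a: archive mode (recurse, keep permissions and timestamps)
# -z: compress data on the wire
# --delete: remove files on the target that were deleted locally
subprocess.check_call(["rsync", "-az", "--delete", SRC, DEST])
```

The nightly runs are quick because only changed files move; a full restore, by contrast, has to move every byte back through the same pipe.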

S3 bucket-style file storage
Amazon created this category with their S3 service, which allows you to add and read files in a type of filesystem that they call "buckets". Other vendors have followed. You can use various APIs and protocols to put and get files, including HTTP GET directly from storage. It is big, expandable storage, it is highly available on the Internet, and it is cheap. This type of storage has a simplification that makes it almost, but not quite, like a filesystem. You can add files, replace files, and read files, but you can't modify the files. The vendor can use home-grown caching, layering, and redundancy without having to worry about locking any single version of a file. It's great for photos, videos, messages, message attachments, document repositories, and backup. Measured by byte count, it will probably dominate Internet storage. It's not useful for databases, repositories, indexes, or other systems that update, append to, and modify files.
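
For example, storing and fetching an object with boto, a common Python library for S3, looks roughly like this (the bucket and file names are hypothetical, and credentials are assumed to be set in the environment):

```python
# Put and get a whole object in an S3 bucket with the boto library.
# Bucket and file names are hypothetical examples.
from boto.s3.connection import S3Connection

conn = S3Connection()   # reads AWS credentials from the environment
bucket = conn.create_bucket("example-media-bucket")

# Write: each PUT replaces the whole object; there is no in-place edit.
key = bucket.new_key("photos/vacation.jpg")
key.set_contents_from_filename("vacation.jpg")

# Read: fetch the whole object back.
key.get_contents_to_filename("vacation-copy.jpg")
```

Note that the only write operation is "replace the whole object", which is exactly the simplification described above.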

Single-mount storage
Amazon offers "Elastic Block Store", and many of the new cloud vendors offer even more integrated storage for your virtual servers. This is mounted like a local disk, but it is stored on a SAN or fileserver somewhere. If you need to restart your virtual server, it gets reattached automatically (in the integrated version) or manually (in the EBS version). This is a nice hosted version of a traditional hard disk, and it will satisfy most storage needs. It is what the cloud market is providing now.
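
As a rough sketch of what attaching EBS storage looks like through boto (the size, zone, instance ID, and device name are placeholder values):

```python
# Create a 100 GB EBS volume and attach it to a running instance.
# Size, zone, instance ID, and device name are placeholders.
from boto.ec2.connection import EC2Connection

conn = EC2Connection()   # credentials from the environment
volume = conn.create_volume(100, "us-east-1a")    # size in GB, zone
conn.attach_volume(volume.id, "i-12345678", "/dev/sdf")
# The guest OS still has to format and mount /dev/sdf like a local disk.
```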

These network-mounted volumes have the advantage of using RAID and/or SAN for underlying storage, so they presumably have redundancy and seldom need any backup or restore operations. However, in this case you do need to ask about the backup and restore plan if the underlying storage device fails. Don't just take it on faith that it will be well managed or rapidly restored, because some of these systems use file servers that can fail. You may find that you need an external backup, and this will introduce the long restore times.

My biggest complaint about this type of storage is size limits. For example, on Amazon you can get a volume up to 1 TB in size, and it can be mounted on one virtual computer. If you have more than 1 TB, you are going to do a lot of work to allocate files between multiple servers.
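
To give a feel for that work, here is a toy placement scheme that hashes each file path to one of several 1 TB volumes (the mount points are hypothetical):

```python
# Toy scheme for spreading files across several 1 TB volumes:
# hash the path and pick a volume. Mount points are hypothetical.
import hashlib

VOLUMES = ["/mnt/vol0", "/mnt/vol1", "/mnt/vol2", "/mnt/vol3"]

def volume_for(path):
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return VOLUMES[int(digest, 16) % len(VOLUMES)]

print(volume_for("photos/vacation.jpg"))
```

The catch is that adding a fifth volume reshuffles almost every file, which is exactly the kind of bookkeeping a single big shared volume saves you.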

Shared cloud storage on a home-made file server
I have worked with vendors that sell "cloud storage" that you can mount as a shared file system. Typically these can go bigger than single-mount storage, and sharing is a great advantage: it allows you to increase capacity by adding front-end servers and to provide higher reliability through failover. You can run two servers for everything, both connected to the same storage; if one goes offline, the second one is still running, and you don't have to re-attach anything to keep going. Unfortunately, I have had bad experiences with the reliability and capacity of these devices, because they are often "home grown" rather than the expensive NAS devices that I will describe later.

Dedicated file server or SAN
Many virtual server hosting companies have a "hybrid" option where they can offer you a dedicated fileserver or a SAN (storage area network), which is a device like a fileserver, but shareable and attached with special high-speed fiber. This is, in fact, the best way to get high performance. And, if you are using databases that do a lot of file locking, it might be your only realistic option. It may also be the cheapest option over a two-year device lifetime.

However, this option requires an up-front investment in fixed hardware that might look expensive and archaic if you are used to buying on-demand services. It also creates a reliability problem. You are responsible for performance management, capacity management, and failover planning for this device. If the hosting company has trouble maintaining the reliability of "cloud storage" options in their own datacenter, it is unlikely that you will do a better job remotely.

Big, shared, modern network attached storage
In the last few months, some hosting companies have started installing modern "Network Attached Storage" devices like NetApp and EMC Atmos. These devices provide all of the advantages of the previous options. They are mounted as real filesystems, so they are easy to use and support any type of application. With a gigabit network, you can use them for locking-intensive applications. They can give you hundreds of terabytes. They provide all of the failover and scaling advantages of shared volumes, supporting up to hundreds of clients. The only case where this isn't the easiest storage to work with is when you need the performance of a dedicated fiber SAN. In any other case, you will probably find that this storage is the simplest option. However, there is one catch: cost.

One magic trick of these devices is the way that they do "snapshots" and backup. You can ask them to take a snapshot of a database or filesystem, and they will copy and save an image, without any interruption, even while taking new changes and locks from dozens of client machines. This is a technical trick that requires a lot of software and redundancy at all hardware levels. It makes them expensive. I am not sure exactly how much this storage will cost because the hosting vendors don't have much experience with it, and I haven't gotten firm quotes.
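
NetApp and EMC expose snapshots through their own management interfaces, but the same idea shows up in Amazon's EBS snapshots; here is a rough sketch with boto (the volume ID is a placeholder):

```python
# Take a point-in-time snapshot of a volume. The volume ID is a
# placeholder; NAS vendors offer the same concept through their
# own management tools.
from boto.ec2.connection import EC2Connection

conn = EC2Connection()
snapshot = conn.create_snapshot("vol-12345678", "nightly repo snapshot")
print(snapshot.id)
```

The difference is that for a busy database on EBS you still have to quiesce writes to get a consistent image; the high-end NAS devices handle that coordination in their own software, which is part of what you are paying for.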

My guess is that the future of cloud storage will boil down to:
1) A commodity market of S3 bucket-style storage for discrete files.
2) Big shared NAS volumes for systems that require locally modified filesystems. These will get cheaper and eventually go open source.

Until then, you can use this list to find the storage that is right for you.

Andy Singleton is the Founder & President of Assembla. This blog post was originally published on March 26, 2010. You can find this post, as well as additional content, on the Assembla Blog.