Distributed Storage on the Premises

The holy grail of storage is a system that is cheap, reliable, and resilient. Over the past decade, we’ve gone from storage systems built on large enterprise SANs to far more reasonable systems built on the normal principles of distributed systems.

Essentially, you take a consistency model and mix it with a block storage system: take a consensus algorithm, mix it with reliable local storage, add a fairly reliable network, and you’ve managed to build a delectable storage system.

If you’d like a fairly simple introduction to practical distributed systems, I’d suggest reading Distributed systems for fun and profit.

It’s likely that you can find an open source distributed storage system to meet your needs, whether they be private or cloud-based and whether you need to store documents, rows, or blocks. I imposed a slightly unrealistic set of criteria on myself and ended up with a solution that meets my needs.

What are we replacing?

My personal infrastructure falls somewhere between slightly beyond reasonable and over-the-top. However, it presents a fairly reasonable testing ground for new software technology. Over the course of several expansions, upgrades, replacements, and general evolution, I ended up with an infrastructure built around:

  • A 2U server with eight hot-swap drive bays running Oracle Solaris 11.1. The eight drive bays were populated with 500GB drives configured using ZFS in RAIDZ2 mode. In addition to serving as a NAS and backup server, the system also hosted a virtualized (using VirtualBox) 32-bit Microsoft Windows Server 2008 domain controller. Virtual machines were stored on local ZFS volumes.
  • An AMD Opteron 6274 mid-tower running Debian Linux serving all other virtualization and experimentation needs. This server also hosted a virtualized 64-bit Microsoft Windows Server 2012 domain controller. Virtual machines were stored on local disk.

Aside: Why are you running a Windows Domain? The short answer is that it provides a fair amount of infrastructure without a lot of hassle. The slightly longer answer will be the subject of a future post.

At first glance, this seems like a fairly reasonable configuration. And it was. But my 2U server was quite loud, drew unreasonable amounts of power, and was aging into obsolescence. It also served as a single point of failure for my family’s storage needs.

What now?

The goals for my redesign were fairly simple. I wanted to easily perform backups, migrate VMs across physical machines, expand storage as necessary, and do it all while withstanding a reasonable amount of hardware failure.

I’d like to have one system to store everything. Every component that can be consolidated means one less thing to configure, monitor, and repair when it breaks.

My fallback plan was to build a newer version of my previous infrastructure based on Linux rather than Solaris. If I ended up with too little time to do even that, replacing the network storage with a Synology was my last resort.

Thinking about requirements

Block device support

Since I wanted to host VMs on my storage cluster, I needed a storage system with support for distributed block devices. Looking at the qemu source code1, this means using Gluster, RBD, or Sheepdog.

Ideally, these block devices would support copy-on-write snapshots and could be resized.
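
To make the requirement concrete, here is a rough sketch of what programmatic access to a resizable, snapshot-capable distributed block device looks like, using the librbd Python bindings that ship with Ceph. The pool and image names are placeholders of my own choosing, not anything you’ll find on a real cluster.

```python
# A minimal sketch, assuming the librbd/librados Python bindings are
# installed and /etc/ceph/ceph.conf points at a working cluster.
# The pool and image names are placeholders.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')                         # pool name (placeholder)
    try:
        rbd.RBD().create(ioctx, 'vm-disk-01', 10 * 1024**3)   # 10 GiB block image
        image = rbd.Image(ioctx, 'vm-disk-01')
        try:
            image.resize(20 * 1024**3)                        # grow the device online
            image.create_snap('before-upgrade')               # copy-on-write snapshot
        finally:
            image.close()
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```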

Network filesystem support

…for at least Windows, Mac, and Linux. While I could re-export a FUSE-based filesystem via NFS or CIFS to accomplish this goal, it wouldn’t be an ideal solution. The same applies to building a NAS system on a VM stored on a distributed block device.

Expandable storage

One of the major drawbacks of my ZFS-based system was the inability to expand it easily. The new storage system should support online addition of more storage.

Resiliency in the face of failure

At minimum, the failure of an individual disk or machine shouldn’t take the storage cluster offline. The storage system should also have some way to detect blocks with bad content (bitflips, read errors, etc.).

Efficient storage

Most systems use replication for resiliency. Erasure coding can provide the same guarantees while making much more efficient use of storage.2
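
To put rough numbers on that claim (these are illustrative, not measurements from any particular system), compare n-way replication with a (k, m) erasure code, where k data shards are encoded into k + m shards and any k of them suffice to reconstruct the data:

```latex
% Storage overhead of n-way replication vs. a (k, m) erasure code,
% where k data shards are encoded into k + m shards and any k of
% them are sufficient to reconstruct the original data.
\[
  \text{overhead}_{\text{replication}} = n,
  \qquad
  \text{overhead}_{\text{erasure}} = \frac{k + m}{k}
\]
% Example: 3-way replication costs 3x raw storage and survives the
% loss of any 2 copies; an (8, 4) code costs (8 + 4)/8 = 1.5x raw
% storage and survives the loss of any 4 shards.
```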

Simple storage interface

There are a great many clients written for Amazon’s S3 API. It would be ideal if my storage system worked with some of those clients. This would be especially pleasant for things like photo management, as my wife and I are both avid photographers.
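
This is also the kind of interface that is trivial to script against. As a sketch only (the endpoint, credentials, bucket, and file names are all placeholders), uploading a photo with the stock boto S3 client to an S3-compatible endpoint looks roughly like this:

```python
# Sketch: pointing a stock S3 client (boto) at an S3-compatible
# endpoint. The hostname, credentials, bucket, and file names are
# placeholders, not a real deployment.
import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='storage.example.home',       # local S3-compatible gateway
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
bucket = conn.create_bucket('photos')
key = bucket.new_key('2013/12/holiday-dinner.jpg')
key.set_contents_from_filename('holiday-dinner.jpg')   # upload a photo
```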

Integrated cloud backup

If the storage system could back itself up to a cloud storage provider (encrypted, of course), that would be even better.

Accelerated by faster hardware

One of my favorite features of ZFS was the ability to improve write speeds by offloading the ZFS intent log (ZIL) to an SSD and to improve read speeds by providing an SSD-based cache. The new system should support these kinds of acceleration without breaking its consistency guarantees.

Filtering the candidates

As mentioned in the section above, only a few systems met my requirement of block device access. Of these, Gluster and Ceph are the most mature. Sheepdog appears to have come a long way, but I didn’t find enough of a community backing the project for me to feel comfortable depending on the system.

Other projects

If you have a different set of requirements, I suggest you examine some of the following projects:

  • etcd (see the sketch after this list)
    • key-value store with a simpler interface than ZooKeeper
    • simple HTTP interface
    • consensus-based, using Raft3
    • written in Go, so deployments are very simple (single binary)
  • Riak, Cassandra
    • based on the Amazon Dynamo paper5
    • client-resolved conflicts
    • Riak is document-based, Cassandra is row-based
  • CouchDB
    • eventual consistency
    • offline access via replicas with later conflict resolution
  • Tahoe-LAFS
    • encrypted, efficient, offline cloud storage (http://eprint.iacr.org/2012/524.pdf)
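
As an example of etcd’s simple HTTP interface mentioned above, a small sketch follows; it assumes a local etcd exposing the v2 keys API on its default client port, and the key and value are arbitrary examples of my own.

```python
# Sketch: using etcd's HTTP key-value interface with the requests
# library. Assumes a local etcd exposing the v2 keys API on its
# default client port; key and value are arbitrary examples.
import requests

BASE = 'http://127.0.0.1:4001/v2/keys'

# set a key
requests.put(BASE + '/config/storage_root', data={'value': '/srv/data'})

# read it back
resp = requests.get(BASE + '/config/storage_root')
print(resp.json()['node']['value'])    # -> /srv/data
```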

You’ll notice I didn’t mention MongoDB. I’ve had poor performance experiences with MongoDB and find it a reasonable fit only if your dataset will never exceed the size of physical memory, your workload is read-intensive, and your queries wouldn’t benefit from an index. However, this is only my current opinion; the guys at MongoDB, Inc. might make some improvements in the future. If you insist on using MongoDB, I’d suggest you take a look at the TokuMX distribution from Tokutek, implemented using fractal trees6,7.

If your requirements include POSIX filesystem-style access, but not network access or a distributed backend, I suggest you make sure you are using a filesystem with block checksumming and volume support. This usually means one of ZFS8, BtrFS9, or Hammer10.

Gluster

Gluster was designed first and foremost as a highly parallel filesystem for large compute clusters. The native configuration of Gluster is striped, with no resiliency or replication. Files placed into Gluster are distributed according to the rules of the current distribution graph. To add functionality, another translator is added to the graph. For example, multiple volumes are combined into a single logical volume by adding a distribute translator to the file distribution graph. A real-world setup with resiliency would have both a distribute and a replicate translator in the distribution graph.

There are some experimental translators for providing erasure coding support, but nothing is ready for use.

Since Gluster doesn’t support an S3-style API or copy-on-write, and doesn’t have a good story for preventing corruption of individual files, I chose not to use it for my storage system. Note that you can get S3-style access to Gluster if you are willing to use a Gluster-backed Swift implementation11.

Ceph

As you’ve probably guessed by now, I ended up building a small Ceph cluster to meet my storage needs. The current versions of Ceph don’t meet every requirement I had, but it’s important to note that, as far as I can tell, there isn’t a single open source project that meets them all.

The key differentiators of Ceph are that all services are built on RADOS12, a common object storage layer, and that object placement is determined by CRUSH13, a placement algorithm that can take real-world topology into account. For larger clusters, CRUSH can model actual failure domains, which means avoiding some of the problems exposed in the Copysets paper.

Ceph (the project) builds three major services on top of RADOS:

  1. rbd - network block device with Linux and qemu drivers
  2. radosgw - S3/Swift compatible HTTP based storage
  3. cephfs - POSIX network filesystem

though at this time, only the first two are both well-tested and performant. Developers are also free to write directly to RADOS itself.
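
As a quick illustration of that last point, storing and fetching an object directly in RADOS via the librados Python bindings looks roughly like the following sketch; the pool name is a placeholder.

```python
# Sketch: storing and fetching an object directly in RADOS using the
# librados Python bindings. The pool name 'data' is a placeholder.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('data')
    try:
        ioctx.write_full('greeting', b'hello, rados')    # object name -> bytes
        print(ioctx.read('greeting'))                    # b'hello, rados'
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```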

One of the main reasons I’m excited about Ceph is the sane architecture built around each component. For example, cluster management is handled by the ceph-mon processes, using Paxos14/Raft15 for consensus. This metadata is stored directly in RADOS.

Since cephfs has only FUSE and Linux kernel drivers and has not received the same development effort as the other two components, I was not comfortable deploying it for real-world use. This means I had to find an alternate method for providing filesystem access to Mac and Windows clients. The best solution I’ve come up with so far is to build a NAS device in a VM until I build out radosgw. For the time being, I’ve selected FreeNAS to provide CIFS-based storage.

Ceph does not currently support erasure coding, though support for erasure-coded pools should land in their February 2014 release. In the short term, I’m fine with this, as there are always tradeoffs involved in the use of erasure coding16. Their implementation is fairly interesting, as it relies on having a tiered storage configuration. Erasure-coded blocks are never written directly. Instead, blocks are written to a regular replicated pool. As they age out of this pool, they are transformed into erasure-coded blocks and stored in the appropriate lower-tier pool. Upon read, the blocks are promoted to the higher tier and the cycle repeats.

Ceph also does not support direct backups to cloud storage. My backup provider of choice is CrashPlan, which is not well supported on FreeBSD, so I’ve built another virtual machine to serve as a backup aggregator. All necessary files are synced to that server, then uploaded to CrashPlan.

What’s next?

My storage system is currently a three node Ceph cluster providing RBD services using six Ceph OSDs (two per node) and a replication factor of three. Virtualization uses the same cluster with all virtual machine disks accessed using RBD. A future post may provide some additional technical details on setup and configuration.
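
For a sense of how raw capacity translates into usable capacity at a replication factor of three, here is a small sketch using the librados Python bindings; dividing by the replication factor is a back-of-the-envelope estimate that ignores journals, metadata, and uneven pool utilization.

```python
# Sketch: back-of-the-envelope usable capacity for a replicated
# cluster, via the librados Python bindings. With a replication
# factor of 3, usable space is roughly one third of raw space.
import rados

REPLICATION_FACTOR = 3

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    stats = cluster.get_cluster_stats()    # dict with kb, kb_used, kb_avail
    raw_tb = stats['kb'] / (1024.0 ** 3)
    print('raw:    %.2f TB' % raw_tb)
    print('usable: %.2f TB (approx.)' % (raw_tb / REPLICATION_FACTOR))
finally:
    cluster.shutdown()
```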

I expect this project to evolve over the next several years. So far, I’ve realized that my images are RBDv1 and need to be migrated to RBDv2 to add support for more advanced cloning features. Performance is adequate so far, but not excellent. In order to improve this, I’ll likely need to upgrade my interconnect beyond bonded 1Gbit links, add some SSDs for journaling, and possibly add additional OSDs.
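
For reference, the cloning workflow that RBDv2 (format 2) images enable looks roughly like the sketch below, using the librbd Python bindings; the image and snapshot names are placeholders, and it assumes client libraries new enough to support format 2 and the layering feature.

```python
# Sketch: create a format 2 (RBDv2) image with layering enabled,
# protect a snapshot of it, and make a copy-on-write clone.
# Image and snapshot names are placeholders.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

rbd_inst = rbd.RBD()
rbd_inst.create(ioctx, 'debian-base', 20 * 1024**3,
                old_format=False, features=rbd.RBD_FEATURE_LAYERING)

parent = rbd.Image(ioctx, 'debian-base')
try:
    parent.create_snap('golden')
    parent.protect_snap('golden')    # clones require a protected snapshot
finally:
    parent.close()

# copy-on-write clone of the protected snapshot
rbd_inst.clone(ioctx, 'debian-base', 'golden', ioctx, 'vm-web-01',
               features=rbd.RBD_FEATURE_LAYERING)

ioctx.close()
cluster.shutdown()
```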

I’m excited about the upcoming erasure coding features and hope to be an early adopter there. I expect my first target for erasure coding would be my backup aggregator and possibly even my FreeNAS system, as those files do not change frequently.

In terms of providing additional features locally, I’d like to deploy radosgw and migrate my family photo management to Ceph using the S3 API.

  1. http://git.qemu.org/?p=qemu.git;a=tree;f=block;h=d6a72c872754bd6fdccdc73f6ea4261328989d7b;hb=master

  2. http://www.cs.rice.edu/Conferences/IPTPS02/170.pdf

  3. https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf

  4. http://hyperdex.org/papers/hyperdex.pdf

  5. http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

  6. http://en.oreilly.com/mysql2010/public/schedule/detail/13265

  7. http://www.cs.rutgers.edu/~farach/pubs/cache-oblivious-Btree-full.pdf

  8. http://research.cs.wisc.edu/adsl/Publications/zfs-corruption-fast10.pdf

  9. http://domino.watson.ibm.com/library/CyberDig.nsf/1e4115aea78b6e7c85256b360066f0d4/6e1c5b6a1b6edd9885257a38006b6130!OpenDocument&Highlight=0,BTRFS

  10. http://www.dragonflybsd.org/hammer/

  11. https://www.gluster.org/2013/06/glusterfs-3-4-and-swift-where-are-all-the-pieces/#more-2597

  12. http://ceph.com/papers/weil-rados-pdsw07.pdf

  13. http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf

  14. http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf

  15. http://www.spinics.net/lists/ceph-devel/msg16578.html

  16. http://iptps05.cs.cornell.edu/PDFs/CameraReady_223.pdf
