Tuesday, June 26, 2012

Introducing ElephantDB: a distributed database specialized in exporting data from Hadoop

ElephantDB is a database that specializes in exporting key/value data from Hadoop. We have been running it in production at BackType for over half a year now and are excited to be open-sourcing it. In this post, I'll introduce ElephantDB, show how to use it, and then compare it to other databases out there. ElephantDB is hosted on GitHub here.

Unlike most other databases, ElephantDB dissassociates the creation of a database index from the serving of that index. ElephantDB is comprised of two components. The first is a library that is used in a MapReduce job to create an indexed key/value dataset that is stored on a distributed filesystem. The second component, ElephantDB server, is a daemon that downloads a subset of a dataset and serves it in a read-only, random-access fashion. A group of ElephantDB servers working together to serve a full dataset is called a ring. Both the creation and serving of a dataset are done in a fully distributed fashion.

99.9% of the complexity of distributed databases comes from supporting random writes. Since ElephantDB doesn't support random writes, it is extremely simple -- only a few thousand lines of code. Once a server loads up its subset of data, it does very little. This leads to ElephantDB being rock-solid in production, since there's almost no moving parts.

Why did we create ElephantDB?

We rely heavily on batch computation via Hadoop at BackType. We have over 25TB of social data (tweets, blog comments, etc.) stored on a distributed filesystem, and we use Hadoop to compute views of that data that we serve in our product. For example, one view is influence scores for everyone on Twitter. Another view is tweet counts for each url. We needed a database that could serve these various views to our product.

We looked into using Voldemort, but it lacked the incremental capabilities we wanted. We then realized that a database that doesn't support random writes is really simple and would be easy to write ourselves. We also thought that a database focused on only handling the batch exporting workflow could be made extremely simple to configure and operate. ElephantDB only took a few weeks to create and get into production.

High level overview

Before we dive into ElephantDB, let's get some terminology out of the way. An ElephantDB "domain" is analagous to a "table" from relational databases -- a single, related set of data. An ElephantDB domain is comprised of a fixed number of "shards" of a "Local Persistence". A "shard" is a subset of a domain, and a "Local Persistence" is a regular key/value database that reads and writes to disk. ElephantDB comes bundled with a "Local Persistence" implementation for Berkeley DB Java Edition. An ElephantDB "ring" is a group of servers that work together to serve a set of ElephantDB domains. Each server serves a subset of the data, and each server knows what data the other servers have. 

The following picture illustrates how an ElephantDB domain is created. Key/value pairs are sharded and indexed in a MapReduce job and stored on a distributed filesystem:

Overview1
To serve the data, you point an ElephantDB ring where the domain is stored on the distributed filesystem. Each ElephantDB server will then download a subset of the shards and serve them, as illustrated in the following picture:

Overview2

As you can see, the distributed filesystem is used as a distribution point for an ElephantDB domain.

Creating a domain of data via MapReduce

Let's look at how ElephantDB works in more detail. Let's start with how ElephantDB creates or updates an ElephantDB domain.

A domain is created or updated via a MapReduce job and stored on the distributed filesystem. Here's what a domain looks like on the distributed filesytem:

Edb-domain

There can be multiple versions of a domain. Any given version of a domain is immutable, and everytime a domain is updated, a new version of the domain is created. In this picture, each numeric directory is a different version of the domain. A domain also contains a metadata file, domain-spec.yaml, that indicates what Local Persistence this domain is comprised of and how many shards are in this domain. The contents of the domain-spec.yaml for this domain is:

1 2 3 
---
local_persistence: elephantdb.persistence.JavaBerkDB
num_shards: 32

If we dive into one of these versions, we see the following directory structure. Each of these folders is a shard for this version of the domain:

Edb-domain-version

Diving into one of these shards, we can see the Berkeley DB files for the shard:

Edb-shard

When given a set of key/value pairs to turn into an ElephantDB domain, ElephantDB performs the following steps in a MapReduce job to create a version for the domain:

  1. Select a shard for each key using consistent hashing (take the hash of the key and mod it by the number of shards)
  2. Group all the key/value pairs by their shard number
  3. In the reduce task, open a local persistence locally and stream all the key/value pairs for the shard into it.
  4. At end of reduce task, copy the files for the local persistence to the distributed filesystem.

At no point in the creation of a domain is an ElephantDB server involved. The creation of a domain happens independently from serving the domain.

The code to create an ElephantDB domain is really simple. Here's an example using Cascading:

1 2 3 4 5 6 
Tap source = new Hfs(new SequenceFile(new Fields("key", "value")), "/tmp/key-value-pairs");
DomainSpec spec = new DomainSpec(new JavaBerkDB(), 32);
ElephantDBTap sink = new ElephantDBTap("/data/output/my-edb-domain", spec);
Pipe p = new Pipe("pipe");
p = new ElephantTailAssembly(p, sink);
new FlowConnector().connect(source, sink, p).complete();

First, we create an ElephantDBTap and configure it with how many shards we want and what LocalPersistence to use. At the end of our Cascading flow, we pipe the data through an ElephantTailAssembly which shards and groups the data to prepare it for the output tap.

ElephantDB is even easier to use with Cascalog:

1 2 3 
(?-
  (elephant-tap "/data/output/my-edb-domain" {:num-shards 32 :persistence-factory (JavaBerkDB.)} {})
  (name-vars (hfs-seqfile "/tmp/key-value-pairs") ["?key" "?value"]))

"elephant-tap" from elephantdb-cascalog abstracts away the sharding/grouping process completely so that you can write key/value pairs to it just like you would write them to a regular file-based tap.  

Serving ElephantDB domains with ElephantDB server

Now let's take a look at how ElephantDB server works. A group of ElephantDB servers work together to serve a set of domains. Each ElephantDB server will serve a subset of each domain. The first thing you have to do is configure what domains the ring should be serving. The configuration is stored on the distributed filesystem and looks something like this:

1 2 3 4 5 6 7 8 
{ :replication 1
  :hosts ["edb1.mycompany.com" "edb2.mycompany.com" "edb3.mycompany.com"]
  :port 3578
  :domains {"tweet-counts" "/data/output/tweet-counts-edb"
            "influenced-by" "/data/output/influenced-by-edb"
            "influencer-of" "/data/output/influencer-of-edb"
            }
}

This configuration is as simple as it gets. You just have to list all the hosts in the ring, the replication factor, what domains to serve, and what port each server should use.

ElephantDB server decides what shards each host should serve by running a deterministic algorithm that assigns hosts to shards. Since every ElephantDB server gets the same input, they all make the same conclusion and know where all the data is located across the ring.

Each ElephantDB server also has a simple local configuration containing where to cache shards locally as well as machine-specific configuration for the local persistences. See the example conf in the project for an example of this.

ElephantDB server has a Thrift interface, so you can access it from any language. The interface contains methods for doing gets, multi-gets, and getting the status as to whether domains have been loaded from the distributed filesystem yet. Here's an example of doing a get to ElephantDB from Clojure:

1 2 
(with-elephant-connection "edb1.mycompany.com" 3578 handler
  (.getString handler "tweet-counts" "http://backtype.com"))

When you do a get to an ElephantDB server, it will route your request to the server that has the data for that key. This happens transparently.

When ElephantDB starts up, it asynchronously downloads shards for each domain from the distributed filesystem locally. You can use the status methods on its Thrift interface to know when it's fully loaded.

Currently, updating an ElephantDB server with new data requires taking downtime on the server. We are working on adding hot-swapping so that the server does this automatically in the background without taking downtime. 

Additional Features

Incremental updates

ElephantDB supports doing incremental updates of a domain. The steps ElephantDB performs to do an incremental update are just a little bit different than the steps ElephantDB uses to create a brand new domain:

  1. Select a shard for each key using consistent hashing (just like before)
  2. Group all the key/value pairs by their shard number (just like before)
  3. Download the most recent version of the shard from the distributed filesystem
  4. Execute updating logic for each new key/value pair
  5. Upload the updated shard to the distributed filesystem as a new version

The updating logic is completely pluggable. By default, it just does a replace of whatever key/value pair was already in the local persistence. At BackType, we do all sorts of custom logic in the updater, from incrementing counts to merges of the old value with the new value.

All the incrementalization happens on the MapReduce side. Incrementally updating shards in this manner is a dramatic improvement over re-indexing the entire domain from scratch. 

Note that ElephantDB server updates itself by downloading all the new shards from scratch -- it's not able to download only the "new stuff". This typically isn't so bad as copying is quite fast, but it's certainly an area for improvement.

ElephantDB as an input source

Those ElephantDB domains sitting on the distributed filesystem can be used as an input source to your jobs. This is really useful. This lets you do analytics on your data without touching the servers that are serving the data to your application.

Furthermore, with this feature ElephantDB is useful even without ElephantDB server. ElephantDB can be used as an indexed key/value file format on top of Hadoop. There are lots and lots of applications for this.

Here's an example of using ElephantDB as an input source with Cascading. As you can see, the same tap that's used to create ElephantDB domains can be used as a source. The tap will emit key/value pairs into the processing flow.

1 2 3 4 
ElephantDBTap source = new ElephantDBTap("/data/output/my-edb-domain");
Pipe p = new Pipe("pipe");
p = new Each(p, new Fields("key", "value"), new ProcessKeyValuePairs(), Fields.RESULTS);
...

Comparison to other database technologies

There are a lot of distributed databases out there. Most of these databases, like Cassandra and Riak, aren't comparable as they are read/write databases that create the index in the same place they serve the index. The most comparable database is Voldemort, which has a read-only mode with similarities to ElephantDB (Voldemort can also be used as a Dynamo-style read/write distributed database). Here's a tutorial from LinkedIn on how to use Voldemort's exporting features. Below is a comparison of Voldemort's read-only mode to ElephantDB. I'm sure I'm missing important aspects of Voldemort below. If so, let me know and I'll update the comparison.

Voldemort advantages:

  1. Better performance: Voldemort has had a lot more optimization work done to it than ElephantDB, so it will have better performance. This is an area where ElephantDB may "borrow" from Voldemort (in particular, we'd like to try out Krati as an alternative to Berkeley DB JE).
  2. Supports hot swapping: Voldemort databases can do live swaps of new versions of a domain in the background. This is functionality we're currently adding to ElephantDB so this difference will not last long.

ElephantDB advantages:

  1. Simpler to configure: Since ElephantDB is a more specialized database, it benefits from being easier to configure. Whereas with Voldemort you need to manually assign hosts to partitions, ElephantDB just figures out the hosts to shard mapping on its own.
  2. Support for incremental updates to a domain.
  3. Support for reading ElephantDB domains stored on the distributed filesystem from MapReduce.
  4. Very small codebase: ElephantDB is only a few thousand lines of code. This makes it easy to understand and extend.

The future

ElephantDB has been a very successful project at BackType. Future improvements we plan or would like to make to ElephantDB include the following:

  1. Performance optimization: In particular, we'd like to try out Krati as an alternative to Berkeley DB JE.
  2. Hot swapping: Currently, to update a domain without taking downtime, you'd need to do a rolling update of multiple ElephantDB rings. To simplify this workflow, we'd instead like for a live ElephantDB ring to download new shards and swap them in the background.
  3. Richer data model: We'd like to explore adding a richer data model to ElephantDB beyond key/value.

Otherwise, we want to keep ElephantDB simple. We believe that a simple database is more reliable and makes life easier for everyone.

Conclusion

We spend almost no time on operational or maintenance issues with ElephantDB at BackType. It just works. As a small startup, it is important for us to eliminate complexity from our systems so that we can focus on customer problems. ElephantDB removes all the friction from serving batch-computed views of our 25TB dataset into our application. Its incremental capabilities let us keep those views updated with minimal cost, and it wears multiple hats by letting us use an ElephantDB domain as an input source to our jobs.

If you're interested in using ElephantDB, be sure to join the user group and follow the project on GitHub.

You should follow the BackType tech team on Twitter here.

 

Permalink | Leave a comment  »

Source: http://tech.backtype.com/introducing-elephantdb-a-distributed-database

computer repair prices computer repair jobs computer repair services computer repair technician

0 Comments:

Post a Comment

Subscribe to Post Comments [Atom]

<< Home