Skip to content

[RT] MyCouch

October 7, 2009

The below post is an edited version of a $work e-mail, re-posted here at request of some colleagues that wanted to forward the story. My apologies if some of the bits are unclear due to lack-of-context. In particular, let me make clear:

  • we have had a production CouchDB setup for months that works well
  • we are planning to keep that production setup roughly intact for many more months and we are not currently planning to migrate away from CouchDB at all
  • overall we are big fans of the CouchDB project and its community and we expect great things to come out of it

Nevertheless using pre-1.0 software based on an archaic language with rather crappy error handling can get frustrating πŸ™‚

Subject: [RT] MyCouch
From: Leo Simons 
To: Forge Engineering 

This particular RT gives one possible answer to the question “what would be a good way to make this KV debugging somewhat less frustrating?” (we have been fighting erratic response times from CouchDB under high load while replicating and compacting)

That answer is “we could probably replace CouchDB with java+mysql, and it might even be easy to do so”. And, then, “if it really is easy, that’s extra cool (and _because of_ CouchDB)”.)

Why replace CouchDB?

Things we really like about CouchDB (as the backend for our KV service):

  • The architecture: HTTP/REST all the way down, MVCC, many-to-many replication, scales without bound, neat composable building blocks makes an evolvable platform.
  • Working system: Its in production, its running, its running pretty well.
  • Community: open source, active project, know the developers, “cool”.
  • Integrity: it hasn’t corrupted or lost any data yet, and it probably won’t anytime soon.

Things we like less:

  • Debugging: cryptic error messages, erlang stack straces, process deaths.
  • Capacity planning: many unknown and changing performance characteristics.
  • Immaturity: pre-1.0.
  • Humanware: lack of erlang development skills, lack of DBA-like skills, lack of training material (or trainers) to gain skills.
  • Tool support: JProfiler for erlang? Eclipse for erlang? Etc.
  • Map/Reduce and views: alien concept to most developers, hard to audit and manage free-form javascript from tenants, hard to use for data migrations and aggregations.
  • JSON: leads to developers storing JSON which is horribly inefficient.

Those things we don’t like about couch unfortunately aren’t going to change very quickly. For example, the effort required to train up a bunch of DBAs so they can juggle CouchDB namespaces and instances and on-disk data structures is probably rather non-trivial.

The basic idea

It is not easy to see what other document storage system out there would be a particularly good replacement. Tokyo Cabinet, Voldemort, Cassandra, … all of these are also young and immature systems with a variety of quirks. Besides, we really really like the CouchDB architecture.

So why don’t we replace CouchDB with a re-implemented CouchDB? We keep the architecture almost exactly the same, but re-implement the features we care about using technology that we know well and is in many ways much more boring. “HTTP all the way down” should mean this is possible.

We could use mysql underneath (but not use any of its built-in replication features). The java program on top would do the schema and index management, and most importantly implement the CouchDB replication and compaction functionality.

We could even keep the same deployment structure. Assuming one java server is paired with one mysql database instance, we’d end up with 4 tomcat instances on 4 ports (5984-5987) and 4 mysql services on 4
other ports (3306-3309). Use of mysqld_multi probably makes sense. Eventually we could perhaps optimize a bit more by having one tomcat process and one mysql process – it’ll make better use of memory.

Now, what is really really really cool about the CouchDB architecture and its complete HTTP-ness is that we should be able to do any actual migration one node at a time, without downtime. Moving the data across
is as simple as running a replication. Combined with the fact that we’ve been carefully avoiding a lot of its features, CouchDB is probably one of the _easiest_ systems to replace πŸ˜€

Database implementation sketch

How would we implement the database? If we think of our KV data as having the form

  ns1:key1 [_rev=1-12345]: { ...}
  ns1:key2 [_rev=2-78901]: { subkey1: ..., }
  ns2:key3 [_rev=1-43210]: { subkey1: ..., subkey2: ...}

where the first integer part of the _rev is dubbed “v” and the remainder part as “src”, then a somewhat obvious database schema looks like (disclaimer: schema has not been tested, do not use :-)):

CREATE TABLE namespace (
  id varchar(64) NOT NULL PRIMARY KEY
      CHARACTER SET ascii COLLATE ascii_bin,
  state enum('enabled','disabled','deleted') NOT NULL
) ENGINE=InnoDB;

CREATE TABLE {namespace}_key (
  ns varchar(64) NOT NULL
      CHARACTER SET ascii COLLATE ascii_bin,
  key varchar(180) NOT NULL
      CHARACTER SET ascii COLLATE ascii_bin,
  v smallint UNSIGNED NOT NULL,
  src int UNSIGNED NOT NULL,

  PRIMARY KEY (ns, key, v, src),
  FOREIGN KEY (ns) REFERENCES namespace(id)
) ENGINE=InnoDB;

CREATE TABLE {namespace}_value (
  ns varchar(64) NOT NULL
      CHARACTER SET ascii COLLATE ascii_bin,
  key varchar(180) NOT NULL
      CHARACTER SET ascii COLLATE ascii_bin,
  v smallint UNSIGNED NOT NULL,
  src int UNSIGNED NOT NULL,
  subkey varchar(255) NOT NULL
      CHARACTER SET utf8 COLLATE utf8_general_ci,
  small_value varchar(512) DEFAULT NULL
      CHARACTER SET utf8 COLLATE utf8_general_ci
      COMMENT 'will contain the value if it fits',
  large_value mediumtext DEFAULT NULL
      CHARACTER SET utf8 COLLATE utf8_general_ci
      COMMENT 'will contain the value if its big',

  PRIMARY KEY (ns, key, v, src, subkey),
  FOREIGN KEY (ns) REFERENCES namespace(id),
  FOREIGN KEY (ns, key, v, src)
      REFERENCES {namespace}_key(ns, key, v, src)
      ON DELETE CASCADE
) ENGINE=InnoDB;

With obvious queries including

  SELECT id FROM namespace WHERE state = 'enabled';

  SELECT key FROM {namespace}_key WHERE namespace_id = ?;
  SELECT key, v, src FROM {namespace}_key WHERE namespace_id = ?;
  SELECT v, src FROM {namespace}_key WHERE namespace_id = ?
      AND key = ?;
  SELECT v, src FROM {namespace}_key WHERE namespace_id = ?
      AND key = ? ORDER BY version DESC LIMIT 1;
  SELECT subkey, small_value FROM {namespace}_value
      WHERE namespace_id = ? AND key = ? AND v = ? AND src = ?;
  SELECT large_value FROM {namespace}_value
      WHERE namespace_id = ? AND key = ? AND v = ? AND src = ?
      AND subkey = ?;

  BEGIN;
  CREATE TABLE {namespace}_key (...);
  CREATE TABLE {namespace}_value (...);
  INSERT INTO namespace(id) VALUES (?);
  COMMIT;

  UPDATE namespace SET state = 'disabled' WHERE id = ?;
  UPDATE namespace SET state = 'deleted' WHERE id = ?;

  BEGIN;
  DROP TABLE {namespace}_value;
  DROP TABLE {namespace}_key;
  DELETE FROM namespace WHERE id = ?;
  COMMIT;

  INSERT INTO {namespace}_key (ns,key,v,src)
      VALUES (?,?,?,?);
  INSERT INTO {namespace}_value (ns,key,v,src,small_value)
      VALUES (?,?,?,?,?),(?,?,?,?,?),(?,?,?,?,?),(?,?,?,?,?);
  INSERT INTO {namespace}_value (ns,key,v,src,large_value)
      VALUES (?,?,?,?,?);

  DELETE FROM {namespace}_key WHERE ns = ? AND key = ?;
  DELETE FROM {namespace}_key WHERE ns = ? AND key = ?
      AND v < ?;
  DELETE FROM {namespace}_key WHERE ns = ? AND key = ?
      AND v = ? AND src =?;

The usefulness for {namespace}_value is debatable; it helps a lot when implementing CouchDB views or some equivalent functionality (“get my all the documents in this namespace where subkey1=…”), but if we decide not to care, then its redundant and {namespace}_key can grow some additional small_value (which should then be big enough to contain a typical JSON document, i.e. maybe 1k) and large_value columns instead.

Partitioning the tables by {namespace} manually isn’t needed if we use MySQL 5.1 or later; table partitions could be used instead.

I’m not sure if we should have a ‘state’ on the keys and do soft-deletes; that might make actual DELETE calls faster; it could also reduce the impact of compactions.

Webapp implementation notes

The java “CouchDB” webapp also does not seem that complicated to build (famous last words?). I would probably build it roughly the same way as [some existing internal webapps].

The basic GET/PUT/DELETE operations are straightforward mappings onto queries that are also rather straightforward.

The POST /_replicate and POST /_compact operations are of course a little bit more involved, but not that much. Assuming some kind of a pool of url fetchers and some periodic executors…

Replication:

  1. get last-seen revision number for source
  2. get list of updates from source
  3. for each update
    • INSERT key
    • if duplicate key error, ignore and don’t update values
    • INSERT OR REPLACE all the values

Compaction:

  1. get list of namespaces
  2. for each namespace:
    • SELECT key, v, src FROM {namespace}_key WHERE namespace_id = ? ORDER BY key ASC, v DESC, src DESC;
    • skip the first row for each key
    • if the second row for the key is the same v, conflict, don’t compact for this key
    • DELETE IGNORE FROM {namespace}_key WHERE ns = ? AND key = ? AND v = ? AND src =?;

So we need some kind of a replication record; once we have mysql available using “documents” seems awkward; let’s use a database table. We might as well have one more MySQL database on each server with a
full copy of a ‘kvconfig’ database, which is replicated around (using mysql replication) to all the nodes. Might also want to migrate away from NAMESPACE_METADATA documents…though maybe not, it is nice and flexible that way.

Performance notes

In theory, the couchdb on-disk format should be much faster than innodb for writes. In practice, innodb has seen quite a few years of tuning. More importantly, in our tests on our servers raw mysql performance seems to be rather better than couchdb. Some of that is due to the extra fsyncs in couchdb, but not all of it.

In theory, the erlang OTP platform should scale out much better than something java-based. In practice, the http server inside couchdb is pretty much a standard fork design using blocking I/O. More importantly, raw tomcat can take >100k req/s on our hardware, which is much much more than our disks can do.

In theory, having the entire engine inside one process should be more efficient than java talking to mysql over TCP. In practice, I doubt this will really show up if we run java and mysql on the same box. More importantly, if this does become an issue, longer-term we may be able to “flatten the stack” by pushing the java “CouchDB” up into the service layer and merging it with the KV service, at which point java-to-mysql will be rather more efficient than java-to-couch.

In theory and in practice innodb has better indexes for the most common SELECTs/GETs so it should be a bit faster. It also is better at making use of large chunks of memory. I suspect the two most common requests (GET that returns 200, GET that returns 404) will both be faster, which incidentally are the most important for us to optimize, too.

We might worry java is slow. That’s kind-of silly :). In theory and in practice garbage collection makes software go faster. We just need to avoid doing those things that make it slow.

The overhead of ACID guarantees might be a concern. Fortunately MySQL is not _really_ a proper relational database if you don’t want it to be. We can probably set the transaction isolation level to READ UNCOMMITTED safely, and the schema design / usage pattern is such that we don’t need transactions in most places. More importantly we are keeping the eventual consistency model, with MVCC and all, on a larger scale. Any over-ACID-ness will be local to the particular node only.

Most importantly, this innodb/mysql thing is mature/boring technology that powers a lot of the biggest websites in the world. As such, you can buy books and consultancy and read countless websites about mysql/innodb/tomcat tuning. Its performance characteristics are pretty well-known and pretty predictable, and lots of people (including here at $work) can make those predictions easily.

So when are we doing this?

No no, we’re not, that’s not the point, this is just a RT! I woke up (rather early) with this idea in my head so I wrote it down to make space for other thoughts. At a minimum, I hope the above helps propagate some ideas:

  • just how well we applied REST and service-oriented architecture here and the benefits its giving us
  • in particular because we picked the right architecture we are not stuck with / tied to CouchDB, now or later
  • we can always re-engineer things (though we should have good enough reasons)
  • things like innodb and/or bdb (or any of the old dbs) are actually great tools with some great characteristics

Just like FriendFeed?

Bret Taylor has a good explanation how FriendFeed built a non-relational database on top of a relational one. The approach outlined above reminds rathe a lot of the solution they implemented, though there’s also important differences.

Advertisements
2 Comments
  1. Chris Smith permalink
    October 14, 2009 13:41

    Excellent idea. I’ve looked at the same thing (using MySQL as a storage engine) and building a similar solution on top using C++, Boost’s network server stuff, Google’s V8 JavaScript engine for processing and protocol buffers for node-to-node communications rather than HTTP for performance.

    One of the frustrations I had with CouchDB is that the performance (apart from reading) was beyond terrible due to the amount of times it has to spawn other processes. Keeping it all in-process and using V8 would eliminate a chunk of that. Protobufs is smaller over the wire and more people “get” C++.

    And all the advantages of Erlang can be thrown in the bin when you think of the fact they can be replicated easily. Look at retlang for C#: http://code.google.com/p/retlang/ – the same could be applied to C++ with some templates and some thought.

  2. Leo Simons permalink*
    October 14, 2009 14:47

    Hey Chris, interesting πŸ™‚

    Are you sure you are getting hit by process spawns, though? We haven’t been able to really see the cost of those — because instead on writes you’ll normally pay a hefty price due to having 2 fsync() ops per write (which with 0.10 becomes configurable I think) making all write operations pretty firmly disk-speed-bound.

    Just about any other storage system that does 2 fsyncs for one write (for example log file and data file) will give you just about the same speed. No programming language can change that πŸ™‚

Comments are closed.

%d bloggers like this: