PUT ♥ MySQL

MySQL has a non-standard extension called INSERT ... ON DUPLICATE KEY UPDATE. When you’re building a RESTful application, it really rocks, because you can use it quite easily to have just the right PUT semantics. For example, you can have queries such as

INSERT INTO Book
           (id, deleted, a, b, c)
    VALUES (#id#, 0, #a#, #b#, #c#)
    ON DUPLICATE KEY UPDATE
        a = #a#, b = #b#, c=#c#,
        deleted = 0,
        revision = revision + 1

which you might call from java (example uses iBatis) using something like

    /* PUT a book, return an HTTP status code or throw an exception */
    public int putBook(final Book book) throws SQLException {
        final int affected = sqlMapClient.update("putBook", book);
        assert affected == 1 || affected == 2;
        return 202 - affected;
    }

Someone at MySQL was clever enough to hijack the affected-rows value – it will be 1 if this was an insert (resulting in 201 Created) or 2 if it was an update (resulting in 200 OK). I don’t know about you, but to me that looks pretty sweet!

In this simple case, there is only one SQL query that happens to do this PUT, and so I don’t even have to bother with transactions.

Oh, just realized there’s one other trick in this example: the revision column is int(11) UNSIGNED NOT NULL DEFAULT 1 which is the basis of my ETag.
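For illustration, here’s roughly how that revision could become the ETag in the HTTP layer. This is a hedged sketch only; the class and method names are made up for the example, not taken from the actual app:

    import javax.servlet.http.HttpServletResponse;

    public class BookETags {
        /* Turn the revision column into a strong ETag and set it on the response. */
        public static void writeETag(final HttpServletResponse response, final long revision) {
            response.setHeader("ETag", "\"" + revision + "\"");
        }

        /* True if the client's If-None-Match still matches the current revision (304 Not Modified). */
        public static boolean isNotModified(final String ifNoneMatch, final long revision) {
            return ifNoneMatch != null && ifNoneMatch.equals("\"" + revision + "\"");
        }
    }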

Web application platform technology choices

The hardest bit in the web application platform challenge is making reasonable choices. Here’s a stab at some of them…

Hosting models

I see these basic choices:

  1. LAMP virtual hosting. If you can build everything you need with mysql+php and you have few enough users that you need only one database server, this is by far the easiest and cheapest option.
  2. Application hosting. Code on github, project management with basecamp or hosted jira, build on AppEngine or Heroku or force.com. You don’t have to do your own infrastructure but you’re limited in what you can build. It also comes with a large chance of lock-in.
  3. Managed hosting. Rent (virtual) servers with pre-installed operating systems and managed networking. Expensive for large deployments, but you don’t need a full web operations team and you have a lot of flexibility (famously, twitter does this).
  4. Dedicated hosting. Buy or rent servers, rent rackspace or build your own data center. You need network engineers and people who can handle hardware. Usually the only cost-effective option beyond a certain size.

Given our stated requirements, we are really only talking about option #4, but I wanted to mention the alternatives because they will make sense for a lot of people. Oh, and I think all the other options are these days called cloud computing 🙂

Hardware platform

I’m not really a hardware guy, normally I leave this kind of stuff to others. Anyone have any good hardware evaluation guides? Some things I do know:

  • Get at least two of everything.
  • Get quality switches. Many of the worst outages have something to do with blown-up switches, and since you usually have only a few, losing one during a traffic spike is uncool.
  • Get beefy database boxes. Scaling databases out is hard, but they scale up nicely without wasting resources.
  • Get beefy (hardware) load balancers. Going to more than 2 load balancers is complicated, and while the load balancers have spare capacity they can help with SSL, caching, etc.
  • Get beefy boxes to run your monitoring systems (remember, two of everything). In my experience most monitoring systems suffer from pretty crappy architectures, and so are real resource hogs.
  • Get hardware RAID (RAID 5 seems common) with a battery-backed write cache, for all storage systems. That is, unless you have some other redundancy architecture and you don’t need RAID for redundancy.
  • Don’t forget about hardware for backups. Do you need tape?

Other thoughts:

  • Appliances. I really like the idea. Things like the schooner appliances for mysql and memcache, or the kickfire appliance for mysql analytics. I have no firsthand experience with them (yet) though. I’m guessing oracle+sun is going to be big in this space.
  • SSD. It is obviously the future, but right now they seem to come with limited warranties, and they’re still expensive enough that you should only use them for data that will actually get hot.

Operating system

Choice #1: unix-ish or windows or both. The Microsoft Web Platform actually looks pretty impressive to me these days but I don’t know much about it. So I’ll go for unix-ish.

Choice #2: ubuntu or red hat or freebsd or opensolaris.

I think Ubuntu is currently the best of the debian-based linuxes. I somewhat prefer ubuntu to red hat, primarily because I really don’t like RPM. Unfortunately red hat comes with better training and certification programs, better hardware vendor support and better available support options.

FreeBSD and solaris have a whole bunch of advantages (zfs, zones/jails, smf, network stack, many-core, …) over linux that would make linux seem like a useless toy, if it weren’t for the fact that linux sees so much more use. This is important: linux has the largest array of pre-packaged software that works on it out of the box, linux runs on more hardware (like laptops…), and many more developers are used to linux.

One approach would be solaris for database (ZFS) and media (ZFS!) hosting, and linux for application hosting. The cost of that, of course, would be the complexity in having to manage two platforms. The question then is whether the gain in manageability offsets the price paid in complexity.

And so, red hat gains another (reluctant) customer.

Database

As much sympathy as I have for the NoSQL movement, the relational database is not dead, and it sure as hell is easier to manage. When dealing with a wide variety of applications by a wide variety of developers, and a lot of legacy software, I think a SQL database is still the default model to go with. There’s a large range of options there.

Choice #1: clustered or sharded. At some point some application will have more data than fits on one server, and it will have to be split. Either you use a fancy database that supports clustering (like Oracle or SQL Server), or you use some fancy clustering middleware (like continuent), or you teach your application to split up the data (using horizontal partitioning or sharding) and you use a more no-frills open source database (mysql or postgres).
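For what it’s worth, the application-level splitting itself can be a very small amount of code; the hard parts are resharding and operations, not the routing. A minimal sketch (hypothetical names, fixed shard count chosen up front):

    import java.util.List;

    public class ShardRouter {
        private final List<String> shardJdbcUrls; // one JDBC URL per shard; the order must never change

        public ShardRouter(final List<String> shardJdbcUrls) {
            this.shardJdbcUrls = shardJdbcUrls;
        }

        /* Horizontal partitioning: the same user id always lands on the same shard. */
        public String urlForUser(final long userId) {
            return shardJdbcUrls.get((int) (userId % shardJdbcUrls.size()));
        }
    }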

I suspect that the additional cost of operating an oracle cluster may very well be worth paying: besides not having to do application-level clustering, the excellent management and analysis tools are worth it. I wish someone would do a model/spreadsheet to prove it. Anyone?

However, it is much easier to find developers skilled with open source databases, and it is much easier for developers to run a local copy of their database for development. Again there’s a tradeoff.

The choice between mysql and postgres has a similar tradeoff. Postgres has a much more complete feature set, but mysql is slightly easier to get started with and has significantly easier-to-use replication features.

And so, mysql gains another (reluctant) customer.

With that choice made, I think it’s important to invest early on in providing some higher-level APIs, so that while the storage engine might be InnoDB and the access to that storage engine might be MySQL, many applications are coded to talk to a more constrained API. Things like Amazon’s S3, SimpleDB and the Google AppEngine data store provide good examples of constrained APIs that are worth emulating.
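To make “constrained” concrete, the kind of higher-level API I have in mind is not much bigger than this (a sketch, not an existing interface of ours):

    /*
     * A deliberately constrained storage API in the spirit of S3, SimpleDB or the
     * AppEngine data store: keyed reads and writes plus a limited scan, and nothing
     * else, so that the MySQL/InnoDB underneath can be tuned, sharded or even
     * replaced without touching the applications.
     */
    public interface SimpleStore {
        byte[] get(String namespace, String key);

        void put(String namespace, String key, byte[] value);

        void delete(String namespace, String key);

        /* A limited range scan instead of arbitrary SQL. */
        Iterable<String> listKeys(String namespace, String keyPrefix, int limit);
    }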

HTTP architecture

Apache HTTPD. Easiest choice so far. Its swiss army knife characteristic is quite important. It’s what everyone knows. Things like nginx are pretty cool and can be used as the main web server, but I suspect most people who switch to them should’ve spent some time tuning httpd instead. Since I know how to do that… I’ll stick with what I know.

As easy as that choice is, the choice of what to put between HTTPD and the web seems to be harder than ever. The basic sanctioned architecture these days seems to use BGP load sharing to have the switches direct traffic at some fancy layer 7 load balancers where you terminate SSL and KeepAlive. Those fancy load balancers may then point at a layer of caching reverse proxies, which then point at the (httpd) app servers.

I’m going to assume we can afford a pair of F5 Big-IPs per datacenter. Since they can do caching, too, we might avoid building that reverse proxy layer until we need it (at which point we can evaluate squid, varnish, HAProxy, nginx and perlbal, with that evaluation showing we should go with Varnish 🙂 ).

Application architecture

Memcache is nearly everywhere, obviously. Or is it? If you’re starting mostly from scratch and most stuff can be AJAX, http caching in front of the frontends (see above) might be nearly enough.

Assuming a 3-tier (web, middleware, db) system, reasonable choices for the front-end layer might include PHP, WSGI+Django, and mod_perl. I still can’t see myself rolling out Ruby on Rails on a large scale. Reasonable middleware choices might include java servlets, unix daemons written in C/C++ and more mod_perl. I’d say Twisted would be an unreasonable but feasible choice 🙂

Communication between the layers could be REST/HTTP (probably going through the reverse proxy caches) but I’d like to try and make use of thrift. Latency is a bitch, and HTTP doesn’t help.

I’m not sure whether considering a 2-tier system (i.e. PHP direct to database, or perhaps PHP link against C/C++ modules that talk to the database) makes sense these days. I think the layered architecture is usually worth it, mostly for organizational reasons: you can have specialized backend teams and frontend teams.

If it was me personally doing the development, I’m pretty sure I would go 3-tier, with (mostly) mod_wsgi/python frontends using (mostly) thrift to connect to (mostly) daemonized python backends (to be re-written in faster/more concurrent languages as usage patterns dictate) that connect to a farm of (mostly) mysql databases using raw _mysql, with just about all caching in front of the frontend layer. I’m not so sure it’s easy to teach a large community of people that pattern; it’d be interesting to try 🙂

As for the more boring choice… PHP frontends with java and/or C/C++ backends with REST in the middle seems easier to teach and evangelize, and it’s also easier to patch up bad apps by sticking custom caching stuff (and, shudder, mod_rewrite) in the middle.

Messaging

If there’s anything obvious in today’s web architecture it is that deferred processing is absolutely key to low-latency user experiences.

The obvious way to do asynchronous work is by pushing jobs on queues. One hard choice at the moment is what messaging stack to use. Obvious contenders include:

  • Websphere MQ (the expensive incumbent)
  • ActiveMQ (the best-known open source system with stability issues)
  • OpenAMQ (AMQP backed by interesting startup)
  • 0MQ (AMQP bought up by same startup)
  • RabbitMQ (AMQP by another startup; erlang yuck)
  • MRG (or QPid, AMQP by red hat which is not exactly a startup).

A less obvious way to do asynchronous work is through a job architecture such as gearman, app engine cron or quartz, where the queue is not explicit but rather exists as a “pending connections” set of work.

I’m not sure what I would pick right now. I’d probably still stay safe and use AMQ with JMS and/or STOMP with JMS semantics. 2 months from now I might choose differently.
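For the safe option, the producing side is pleasantly boring; something like this plain-JMS sketch against ActiveMQ (the broker URL and queue name are invented for the example):

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class DeferredJobs {
        /* Push a job description onto a queue; a worker somewhere else picks it up later. */
        public static void enqueue(final String jobJson) throws Exception {
            final ConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://localhost:61616");
            final Connection connection = factory.createConnection();
            try {
                final Session session =
                        connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                final Queue queue = session.createQueue("deferred.jobs");
                final MessageProducer producer = session.createProducer(queue);
                producer.send(session.createTextMessage(jobJson));
            } finally {
                connection.close();
            }
        }
    }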

[RT] MyCouch

The below post is an edited version of a $work e-mail, re-posted here at the request of some colleagues who wanted to forward the story. My apologies if some of the bits are unclear due to lack of context. In particular, let me make clear:

  • we have had a production CouchDB setup for months that works well
  • we are planning to keep that production setup roughly intact for many more months and we are not currently planning to migrate away from CouchDB at all
  • overall we are big fans of the CouchDB project and its community and we expect great things to come out of it

Nevertheless using pre-1.0 software based on an archaic language with rather crappy error handling can get frustrating 🙂

Subject: [RT] MyCouch
From: Leo Simons 
To: Forge Engineering 

This particular RT gives one possible answer to the question “what would be a good way to make this KV debugging somewhat less frustrating?” (we have been fighting erratic response times from CouchDB under high load while replicating and compacting)

That answer is “we could probably replace CouchDB with java+mysql, and it might even be easy to do so”. And, then, “if it really is easy, that’s extra cool (and _because of_ CouchDB)”.

Why replace CouchDB?

Things we really like about CouchDB (as the backend for our KV service):

  • The architecture: HTTP/REST all the way down, MVCC, many-to-many replication, scales without bound, neat composable building blocks make an evolvable platform.
  • Working system: It’s in production, it’s running, and it’s running pretty well.
  • Community: open source, active project, know the developers, “cool”.
  • Integrity: it hasn’t corrupted or lost any data yet, and it probably won’t anytime soon.

Things we like less:

  • Debugging: cryptic error messages, erlang stack traces, process deaths.
  • Capacity planning: many unknown and changing performance characteristics.
  • Immaturity: pre-1.0.
  • Humanware: lack of erlang development skills, lack of DBA-like skills, lack of training material (or trainers) to gain skills.
  • Tool support: JProfiler for erlang? Eclipse for erlang? Etc.
  • Map/Reduce and views: alien concept to most developers, hard to audit and manage free-form javascript from tenants, hard to use for data migrations and aggregations.
  • JSON: leads to developers storing JSON which is horribly inefficient.

Those things we don’t like about couch unfortunately aren’t going to change very quickly. For example, the effort required to train up a bunch of DBAs so they can juggle CouchDB namespaces and instances and on-disk data structures is probably rather non-trivial.

The basic idea

It is not easy to see what other document storage system out there would be a particularly good replacement. Tokyo Cabinet, Voldemort, Cassandra, … all of these are also young and immature systems with a variety of quirks. Besides, we really really like the CouchDB architecture.

So why don’t we replace CouchDB with a re-implemented CouchDB? We keep the architecture almost exactly the same, but re-implement the features we care about using technology that we know well and is in many ways much more boring. “HTTP all the way down” should mean this is possible.

We could use mysql underneath (but not use any of its built-in replication features). The java program on top would do the schema and index management, and most importantly implement the CouchDB replication and compaction functionality.

We could even keep the same deployment structure. Assuming one java server is paired with one mysql database instance, we’d end up with 4 tomcat instances on 4 ports (5984-5987) and 4 mysql services on 4 other ports (3306-3309). Use of mysqld_multi probably makes sense. Eventually we could perhaps optimize a bit more by having one tomcat process and one mysql process – it’ll make better use of memory.

Now, what is really really really cool about the CouchDB architecture and its complete HTTP-ness is that we should be able to do any actual migration one node at a time, without downtime. Moving the data across is as simple as running a replication. Combined with the fact that we’ve been carefully avoiding a lot of its features, CouchDB is probably one of the _easiest_ systems to replace 😀

Database implementation sketch

How would we implement the database? If we think of our KV data as having the form

  ns1:key1 [_rev=1-12345]: { ...}
  ns1:key2 [_rev=2-78901]: { subkey1: ..., }
  ns2:key3 [_rev=1-43210]: { subkey1: ..., subkey2: ...}

where the first integer part of the _rev is dubbed “v” and the remainder “src”, then a somewhat obvious database schema looks like this (disclaimer: the schema has not been tested, do not use :-)):

CREATE TABLE namespace (
  id varchar(64)
      CHARACTER SET ascii COLLATE ascii_bin
      NOT NULL PRIMARY KEY,
  state enum('enabled','disabled','deleted') NOT NULL
) ENGINE=InnoDB;

CREATE TABLE {namespace}_key (
  ns varchar(64)
      CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  `key` varchar(180)
      CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  v smallint UNSIGNED NOT NULL,
  src int UNSIGNED NOT NULL,

  PRIMARY KEY (ns, `key`, v, src),
  FOREIGN KEY (ns) REFERENCES namespace(id)
) ENGINE=InnoDB;

CREATE TABLE {namespace}_value (
  ns varchar(64)
      CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  `key` varchar(180)
      CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  v smallint UNSIGNED NOT NULL,
  src int UNSIGNED NOT NULL,
  subkey varchar(255)
      CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  small_value varchar(512)
      CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL
      COMMENT 'will contain the value if it fits',
  large_value mediumtext
      CHARACTER SET utf8 COLLATE utf8_general_ci
      COMMENT 'will contain the value if its big',

  PRIMARY KEY (ns, `key`, v, src, subkey),
  FOREIGN KEY (ns) REFERENCES namespace(id),
  FOREIGN KEY (ns, `key`, v, src)
      REFERENCES {namespace}_key(ns, `key`, v, src)
      ON DELETE CASCADE
) ENGINE=InnoDB;

With obvious queries including

  SELECT id FROM namespace WHERE state = 'enabled';

  SELECT `key` FROM {namespace}_key WHERE ns = ?;
  SELECT `key`, v, src FROM {namespace}_key WHERE ns = ?;
  SELECT v, src FROM {namespace}_key WHERE ns = ?
      AND `key` = ?;
  SELECT v, src FROM {namespace}_key WHERE ns = ?
      AND `key` = ? ORDER BY v DESC, src DESC LIMIT 1;
  SELECT subkey, small_value FROM {namespace}_value
      WHERE ns = ? AND `key` = ? AND v = ? AND src = ?;
  SELECT large_value FROM {namespace}_value
      WHERE ns = ? AND `key` = ? AND v = ? AND src = ?
      AND subkey = ?;

  BEGIN;
  CREATE TABLE {namespace}_key (...);
  CREATE TABLE {namespace}_value (...);
  INSERT INTO namespace(id, state) VALUES (?, 'enabled');
  COMMIT;

  UPDATE namespace SET state = 'disabled' WHERE id = ?;
  UPDATE namespace SET state = 'deleted' WHERE id = ?;

  BEGIN;
  DROP TABLE {namespace}_value;
  DROP TABLE {namespace}_key;
  DELETE FROM namespace WHERE id = ?;
  COMMIT;

  INSERT INTO {namespace}_key (ns,`key`,v,src)
      VALUES (?,?,?,?);
  INSERT INTO {namespace}_value (ns,`key`,v,src,subkey,small_value)
      VALUES (?,?,?,?,?,?),(?,?,?,?,?,?),(?,?,?,?,?,?),(?,?,?,?,?,?);
  INSERT INTO {namespace}_value (ns,`key`,v,src,subkey,large_value)
      VALUES (?,?,?,?,?,?);

  DELETE FROM {namespace}_key WHERE ns = ? AND `key` = ?;
  DELETE FROM {namespace}_key WHERE ns = ? AND `key` = ?
      AND v < ?;
  DELETE FROM {namespace}_key WHERE ns = ? AND `key` = ?
      AND v = ? AND src = ?;

The usefulness of {namespace}_value is debatable; it helps a lot when implementing CouchDB views or some equivalent functionality (“get me all the documents in this namespace where subkey1=…”), but if we decide not to care, then it’s redundant and {namespace}_key can grow some additional small_value (which should then be big enough to contain a typical JSON document, i.e. maybe 1k) and large_value columns instead.

Partitioning the tables by {namespace} manually isn’t needed if we use MySQL 5.1 or later; table partitions could be used instead.

I’m not sure if we should have a ‘state’ on the keys and do soft-deletes; that might make actual DELETE calls faster; it could also reduce the impact of compactions.

Webapp implementation notes

The java “CouchDB” webapp also does not seem that complicated to build (famous last words?). I would probably build it roughly the same way as [some existing internal webapps].

The basic GET/PUT/DELETE operations are straightforward mappings onto queries that are also rather straightforward.
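For example (plain JDBC against the untested schema above, so very much a sketch; in a real implementation the table-name prefix must come from a validated namespace, never from user input):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class KvGet {
        /*
         * GET: find the newest (v, src) for a key, then fetch one of its values.
         * Returns null when the key does not exist, which maps to a 404;
         * a non-null result maps to a 200.
         */
        public String getSmallValue(final Connection db, final String ns,
                                    final String key, final String subkey) throws SQLException {
            try (PreparedStatement latest = db.prepareStatement(
                    "SELECT v, src FROM " + ns + "_key WHERE ns = ? AND `key` = ?"
                            + " ORDER BY v DESC, src DESC LIMIT 1")) {
                latest.setString(1, ns);
                latest.setString(2, key);
                try (ResultSet rs = latest.executeQuery()) {
                    if (!rs.next()) {
                        return null; // 404 Not Found
                    }
                    try (PreparedStatement value = db.prepareStatement(
                            "SELECT small_value FROM " + ns + "_value"
                                    + " WHERE ns = ? AND `key` = ? AND v = ? AND src = ? AND subkey = ?")) {
                        value.setString(1, ns);
                        value.setString(2, key);
                        value.setInt(3, rs.getInt(1));
                        value.setLong(4, rs.getLong(2));
                        value.setString(5, subkey);
                        try (ResultSet vrs = value.executeQuery()) {
                            return vrs.next() ? vrs.getString(1) : null;
                        }
                    }
                }
            }
        }
    }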

The POST /_replicate and POST /_compact operations are of course a little bit more involved, but not that much. Assuming some kind of a pool of url fetchers and some periodic executors…

Replication (sketched in code below):

  1. get last-seen revision number for source
  2. get list of updates from source
  3. for each update
    • INSERT the key
    • if that hits a duplicate key error, ignore it and don’t update the values
    • otherwise REPLACE (or INSERT … ON DUPLICATE KEY UPDATE) all the values
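A rough JDBC sketch of step 3 (the HTTP fetching in steps 1 and 2 is left out, and the Update holder class is invented for the example):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class Replicator {
        /* One update pulled from the source node; field names are illustrative only. */
        public static class Update {
            public String ns, key, subkey, smallValue;
            public int v;
            public long src;
        }

        public void apply(final Connection db, final List<Update> updates) throws SQLException {
            for (final Update u : updates) {
                // INSERT the key; IGNORE implements "if duplicate key error, ignore".
                try (PreparedStatement key = db.prepareStatement(
                        "INSERT IGNORE INTO " + u.ns + "_key (ns,`key`,v,src) VALUES (?,?,?,?)")) {
                    key.setString(1, u.ns);
                    key.setString(2, u.key);
                    key.setInt(3, u.v);
                    key.setLong(4, u.src);
                    if (key.executeUpdate() == 0) {
                        continue; // we already had this revision; don't touch the values
                    }
                }
                // REPLACE the values (REPLACE INTO, or INSERT ... ON DUPLICATE KEY UPDATE).
                try (PreparedStatement value = db.prepareStatement(
                        "REPLACE INTO " + u.ns + "_value (ns,`key`,v,src,subkey,small_value)"
                                + " VALUES (?,?,?,?,?,?)")) {
                    value.setString(1, u.ns);
                    value.setString(2, u.key);
                    value.setInt(3, u.v);
                    value.setLong(4, u.src);
                    value.setString(5, u.subkey);
                    value.setString(6, u.smallValue);
                    value.executeUpdate();
                }
            }
        }
    }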

Compaction (also sketched below):

  1. get list of namespaces
  2. for each namespace:
    • SELECT `key`, v, src FROM {namespace}_key WHERE ns = ? ORDER BY `key` ASC, v DESC, src DESC;
    • skip the first row for each key
    • if the second row for the key is the same v, conflict, don’t compact for this key
    • DELETE IGNORE FROM {namespace}_key WHERE ns = ? AND `key` = ? AND v = ? AND src = ?;
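And the corresponding compaction sketch (same caveats as above; it relies on the ON DELETE CASCADE to clean up the values):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class Compactor {
        /* Keep the newest revision per key, skip conflicted keys, delete the rest. */
        public void compact(final Connection db, final String ns) throws SQLException {
            try (PreparedStatement all = db.prepareStatement(
                    "SELECT `key`, v, src FROM " + ns + "_key WHERE ns = ?"
                            + " ORDER BY `key` ASC, v DESC, src DESC")) {
                all.setString(1, ns);
                try (ResultSet rs = all.executeQuery()) {
                    String currentKey = null;
                    int newestV = -1;
                    boolean conflict = false;
                    while (rs.next()) {
                        final String key = rs.getString(1);
                        final int v = rs.getInt(2);
                        final long src = rs.getLong(3);
                        if (!key.equals(currentKey)) {
                            currentKey = key; // first row per key is the newest: keep it
                            newestV = v;
                            conflict = false;
                            continue;
                        }
                        if (v == newestV) {
                            conflict = true;  // second row with the same v: unresolved conflict
                        }
                        if (conflict) {
                            continue;         // don't compact this key
                        }
                        try (PreparedStatement del = db.prepareStatement(
                                "DELETE IGNORE FROM " + ns + "_key WHERE ns = ?"
                                        + " AND `key` = ? AND v = ? AND src = ?")) {
                            del.setString(1, ns);
                            del.setString(2, key);
                            del.setInt(3, v);
                            del.setLong(4, src);
                            del.executeUpdate();
                        }
                    }
                }
            }
        }
    }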

So we need some kind of a replication record; once we have mysql available, using “documents” seems awkward; let’s use a database table. We might as well have one more MySQL database on each server with a full copy of a ‘kvconfig’ database, which is replicated around (using mysql replication) to all the nodes. Might also want to migrate away from NAMESPACE_METADATA documents… though maybe not, it is nice and flexible that way.

Performance notes

In theory, the couchdb on-disk format should be much faster than innodb for writes. In practice, innodb has seen quite a few years of tuning. More importantly, in our tests on our servers raw mysql performance seems to be rather better than couchdb. Some of that is due to the extra fsyncs in couchdb, but not all of it.

In theory, the erlang OTP platform should scale out much better than something java-based. In practice, the http server inside couchdb is pretty much a standard fork design using blocking I/O. More importantly, raw tomcat can take >100k req/s on our hardware, which is much much more than our disks can do.

In theory, having the entire engine inside one process should be more efficient than java talking to mysql over TCP. In practice, I doubt this will really show up if we run java and mysql on the same box. More importantly, if this does become an issue, longer-term we may be able to “flatten the stack” by pushing the java “CouchDB” up into the service layer and merging it with the KV service, at which point java-to-mysql will be rather more efficient than java-to-couch.

In theory and in practice innodb has better indexes for the most common SELECTs/GETs so it should be a bit faster. It also is better at making use of large chunks of memory. I suspect the two most common requests (GET that returns 200, GET that returns 404) will both be faster, which incidentally are the most important for us to optimize, too.

We might worry java is slow. That’s kind-of silly :). In theory and in practice garbage collection makes software go faster. We just need to avoid doing those things that make it slow.

The overhead of ACID guarantees might be a concern. Fortunately MySQL is not _really_ a proper relational database if you don’t want it to be. We can probably set the transaction isolation level to READ UNCOMMITTED safely, and the schema design / usage pattern is such that we don’t need transactions in most places. More importantly we are keeping the eventual consistency model, with MVCC and all, on a larger scale. Any over-ACID-ness will be local to the particular node only.
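Relaxing the isolation level is a one-liner per connection in JDBC (it can also be set server-wide); a sketch:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class KvConnections {
        /* Open a connection with the relaxed isolation level discussed above. */
        public static Connection open(final String jdbcUrl) throws SQLException {
            final Connection db = DriverManager.getConnection(jdbcUrl);
            db.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
            return db;
        }
    }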

Most importantly, this innodb/mysql thing is mature/boring technology that powers a lot of the biggest websites in the world. As such, you can buy books and consultancy and read countless websites about mysql/innodb/tomcat tuning. Its performance characteristics are pretty well-known and pretty predictable, and lots of people (including here at $work) can make those predictions easily.

So when are we doing this?

No no, we’re not, that’s not the point, this is just an RT! I woke up (rather early) with this idea in my head so I wrote it down to make space for other thoughts. At a minimum, I hope the above helps propagate some ideas:

  • just how well we applied REST and service-oriented architecture here, and the benefits it’s giving us
  • in particular because we picked the right architecture we are not stuck with / tied to CouchDB, now or later
  • we can always re-engineer things (though we should have good enough reasons)
  • things like innodb and/or bdb (or any of the old dbs) are actually great tools with some great characteristics

Just like FriendFeed?

Bret Taylor has a good explanation of how FriendFeed built a non-relational database on top of a relational one. The approach outlined above resembles the solution they implemented rather a lot, though there are also important differences.

The headache of mapping shards to servers

At work we had a lot of headache figuring out how to reshard our CouchDB data. We have 2 data centers with 16 CouchDB instances each. One server holds 4 CouchDB nodes. At the moment each data center has one copy of the data. We want to improve resilience, so we are changing this so that each data center has two copies of the data (on different nodes, of course). Figuring out how to reshard was not simple at all.

This inspired some thinking about how we would do the same to our MySQL instances. It’s not a challenge yet (we have much more KV data than relational data), but if we’re lucky we’ll get much more data pretty soon, and the issue will pop up in a few months.

I was thinking about what makes this so hard to think about (for someone with a small brain like me at least). It is probably about letting go of symmetry. Do you have the same trouble?

Imagine you have a database with information about users and perhaps data for/by those users. You might use horizontal partitioning with consistent hashing to distribute this data across two machines, which use master/slave replication between them for resilience. You might partition the data into four shards so you can scale out to 4 physical master servers later without repartitioning. It’d look like this:

diagram of shards on 2 mirrored servers

Now imagine that you add a new data center and you need to split the data between the two data centers. Assuming your database only does master/slave (like MySQL) and you prefer to daisy-chain the replication it might look like this:

diagram of shards on 2 servers in 2 data centers

Now imagine adding a third data center and an additional machine in each data center to provide extra capacity for reads. Maybe:

diagram of shards on 3 servers in 3 data centers

You can see that the configuration has suddenly become a bit unbalanced and also somewhat non-obvious. Given this availability of hardware, the most resilient distribution of masters and slaves is not symmetric at all. When you lose symmetry the configuration becomes much more complicated to understand and manage.

Now imagine a bigger database. To highlight how to deal with asymmetry, imagine 4 data centers with 3 nodes per data center. Further imagine having 11 shards to distribute, wanting at least one slave in the same data center as its master. Further imagine wanting 3 slaves for each master. Further imagine wanting to have the data split as evenly as possible across all data centers, so that you can lose up to half of them without losing any data.

Can you work out how to do that configuration? Maybe:

diagram showing many shards on many servers in many data centers

Can you work out how to do it for, oh, 27 data centers, 400 shards with at least 3 copies of each shard, with 7 data centers having 15% beefier boxes and one data center having twice the number of boxes?

As the size of the problem grows, a diagram of a good solution looks more and more like a circle that quickly becomes hard to draw.

When you add in the need to be able to reconfigure server allocations, to fail over masters to one of their slaves, and more, it turns out you eventually need software assistance to solve the mapping of shards to servers. We haven’t written such software yet; I suspect we can steal it from Voldemort by the time we need it 🙂

The tipping point is when you lose symmetry (the 3rd data centre, the 5th database node, etc).

Consistent hashing helps make resharding and node rewiring easier, especially when you’re not (only) dependent on something like mysql replication. But figuring out which bucket goes on which server is not as easy as figuring out which bucket a piece of data goes in. Unless you have many hundreds of buckets; then maybe you can assume the distribution of buckets is even enough that you can simply hash the bucket id to find the server id.
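To make that last point concrete, here is a minimal consistent-hash ring sketch (hypothetical code; MD5 as the hash, a configurable number of points per server). It only covers the simple hashed placement that works when you have lots of buckets, not the constrained master/slave/data-center placement discussed above:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class HashRing {
        private final SortedMap<Long, String> ring = new TreeMap<Long, String>();

        /* Place each server on the ring at several points ("virtual nodes"). */
        public HashRing(final Iterable<String> servers, final int pointsPerServer) {
            for (final String server : servers) {
                for (int i = 0; i < pointsPerServer; i++) {
                    ring.put(hash(server + "#" + i), server);
                }
            }
        }

        /* Key -> bucket is just a hash; bucket -> server walks the ring clockwise. */
        public String serverFor(final String bucketId) {
            final long h = hash(bucketId);
            final SortedMap<Long, String> tail = ring.tailMap(h);
            return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
        }

        private static long hash(final String s) {
            try {
                final byte[] md5 = MessageDigest.getInstance("MD5")
                        .digest(s.getBytes(StandardCharsets.UTF_8));
                long h = 0;
                for (int i = 0; i < 8; i++) {
                    h = (h << 8) | (md5[i] & 0xff);
                }
                return h;
            } catch (final NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }
    }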