Improving accessibility of this blog

I’ve been studying accessibility. After reading a whole bunch of stuff I decided I should try to make my blog as accessible as reasonably possible. This blog post chronicles some of the changes I’ve made. If this helps you enjoy my blog more, please do let me know. If I made anything worse, please also let me know 🙂

Different wordpress theme

This new theme features:

  • High-contrast design (good old black-on-white)
  • Relative font sizes in the CSS
  • A reasonable serif font
  • Clean “pure” CSS for layout
  • Reasonable layout and ordering of the HTML elements on the page

Different theme configuration

Some of the changes:

  • Re-ordered the widgets
  • Replaced the calendar widget with a list of archived posts
  • Added more descriptive titles to the widgets
  • Changed the custom HTML for the RSS link (relative font size, alt text for the icon)
  • Stopped using drop-down menus for lists of links
  • Made the blog tagline more descriptive

I also tried replacing the search widget with something more reasonable but failed to hack that up. If someone from wordpress could please make it resemble something like

<form method="get" id="search_form" action="https://lsimons.wordpress.com/">
    <div>
        <label for="s">Search:</label><br />
        <input type="text" name="s" id="s" />
        <input type="submit" value="Go" />
    </div>
</form>

that would be great. Or, if someone knows how to do an accessible search box on a wordpress.com blog, please let me know 🙂

Secondly, it would be nice if the theme could be modified to have “skip to content” and “skip to navigation” links, which I seem to have no way of doing myself.

Change of blog contents

Some of the changes:

  • Fixed some bits of broken HTML
  • Added reasonable (descriptive but less than 100 characters) alt text to images
  • Replaced use of <h4> with <h3> in blog post contents (so that the header structure goes h1 – h2 – h3 properly)
  • Cleaned up the page navigation to have just Home and About pages
  • Cleaned up the About page, in particular getting rid of the strike-through links, a subtlety that is easily lost when using a screen reader

“Doing” accessibility using tools

I got rather depressed a couple of years ago with the complete lack of tool support, and basically just gave up. Recently I talked to some front-end engineers at the BBC (which cares a great deal about accessibility) who gave me some useful pointers. Tool support has now gotten a lot better. You should take a look around!

Here are three links:

  • WebAIM WAVE, an accessibility validator. I feel this validator is much better than any of the others; for example it doesn’t accuse me of using ASCII art when it encounters a block of java code. It also doesn’t sound like a stern kindergarten teacher or some haughty user experience expert.
  • Fangs, an easy-to-use firefox plugin to do “screen reader emulation”. Basically it processes a web page and then spits out text that is roughly what a screen reader would say.
  • WebAIM Screen reader simulator, gives you an idea of what a screen reader does, good to try if you’ve never seen an actual screen reader in action.

Dive into accessibility

If you’ve never bothered about accessibility before but you’re interested now, I suggest you start your reading with Dive into Accessibility, a free online book with lots of clear practical advice by someone that knows his stuff and actually builds websites out there in the real world. From the introduction:

Don’t panic if you are not an HTML expert. Don’t panic if the only web site you have is a personal weblog, you picked your template out of a list on your first day of blogging, and you’ve never touched it since. I am not here to tell you that you need to radically redesign your web site from scratch, rip out all your nested tables, and convert to XHTML and CSS. This is about taking what you have and making it better in small but important ways.

First look at open source IntelliJ

IntelliJ IDEA was open sourced yesterday!

Codebase overview

  • over 20k java source files, totalling just over 2M lines
  • over 150 jar files
  • over 500 xml files
  • build system based on ant, gant, and a library called jps for running intellij builds for which the source apparently is not available yet (see IDEA-25160)
  • Apache license header applied to most of the files, copyrights held by both jetbrains and a variety of individuals, license data not quite complete, no NOTICE.txt (see IDEA-25161)
  • ./platform is the core system
  • ./plugins plug into the core platform
  • ./java and ./xml are bigger plugin-collection-ish subsystems

Building…

  • Install ant (there is an ant in ./lib/ant)
  • Run ant
  • Build takes about 7 minutes on my macbook

Running…

On Mac OS X I run into 64-bit problems. Falling back to a 32-bit version of JDK 5.0 works for me…seems like jetbrains may have just fixed it.

cd /System/Library/Frameworks/JavaVM.framework/Versions/1.5.0/Home/bin
sudo bash
mv java java.orig
# strip the x86_64 slice so this JDK runs 32-bit only
lipo java.orig -remove x86_64 -output java_32
ln -s java_32 java
cd -
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.5.0/Home
export PATH=$JAVA_HOME/bin:$PATH
rm -Rf out
ant
cd out/artifacts
unzip ideaIC-90.SNAPSHOT.mac.zip
open ./Maia-IC-90.SNAPSHOT.app

Loading the IDEA source code into your just-built IDE works seamlessly (just navigate to your git repo; an IntelliJ project is already set up in the .idea directory).

Reading the code

com.intellij.idea.Main uses Bootstrap and MainImpl to invoke IdeaApplication.run(). We’re in IntelliJ OpenAPI land now. Somewhere further down the call stack something creates an ApplicationImpl which uses PicoContainer. w00t! That makes much more sense to me than the heavyweight OSGi/equinox that’s underpinning eclipse. It’s where plugins and extensions get loaded, after which things become very fluid and multi-threaded and harder to follow.

So now I’m thinking I should find a way to hook up IntelliJ into a debugger inside another IntelliJ…though it’d be cool if intellij was somehow “self-hosting” in that sense. Here’s hoping the intellij devs will write some how-to-hack docs soon!

Java 1.6 exposes the system load average

See the javadoc.

Example usage:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadAverage {
    public static void main(String[] args) {
        final OperatingSystemMXBean osStats =
                ManagementFactory.getOperatingSystemMXBean();
        final double loadAverage = osStats.getSystemLoadAverage();
        System.out.println(String.format("load average: %f", loadAverage));
    }
}

This is a rather useful feature if you are writing software that should do less when the overall system load is high.

For example, if you’re me, you might be working on a java daemon that is instructing some CouchDB instances on the same box to do database compactions and/or replications, and you could use this to tune down the concurrency or the frequency if the load average is above a threshold.
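
For illustration, here is a minimal sketch of that kind of throttling. The class name and the threshold are my own inventions and the threshold would need tuning per machine:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadAwareScheduler {
    // Hypothetical threshold: back off when the load average exceeds this.
    private static final double LOAD_THRESHOLD = 4.0;

    private final OperatingSystemMXBean osStats =
            ManagementFactory.getOperatingSystemMXBean();

    /** Returns true when it seems ok to kick off a compaction or replication. */
    public boolean mayStartHeavyWork() {
        final double loadAverage = osStats.getSystemLoadAverage();
        // getSystemLoadAverage() returns a negative value if the platform
        // cannot provide one; in that case, simply don't throttle.
        return loadAverage < 0 || loadAverage < LOAD_THRESHOLD;
    }
}

A periodic executor could call mayStartHeavyWork() before each run and simply try again later when it returns false.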

You don’t know and you don’t understand

You know much less than you think you know. You misunderstand many more things than you think you do. You’re also much more wrong much more often than you think.

(Don’t worry, it’s not just you, the same is true for everyone else.)

Even better, this is how science works. Being a scientist is all about actively trying to be wrong (and proving everyone else wrong), all the time. When you do science, you don’t know, and what you learn doing the science, you don’t ever know for sure.

The scientific method

Here are the basic steps in the scientific method:

  1. Based on your past experience and that of others, try to make some sense of a problem
  2. Try to find a reasonable explanation for the problem
  3. If the explanation is correct, what else would you be able to see or measure?
  4. Try to disprove the explanation by doing those observations and measurements

Scientists do this all day every day, they do it together on a world-wide scale, and they do it to each other.

Experimentation

In uni, studying applied physics, I was trained in a specific application of the scientific method to experimentation, which went something like:

  1. Define a question to answer.
  2. Define what you already know (or will assume) that is related.
  3. Form a hypothesis of what the answer may be.
  4. Figure out what you can measure.
  5. Define how those measurements could be interpreted to verify or disprove the hypothesis.
  6. Do the experiments and collect the measurements.
  7. Analyze the data.
  8. Assess the internal consistency of the experimental data by applying statistics.
  9. Draw conclusions from the analysis.

The course was called Introduction to Experimentation, and it included many more specifics than just that process. For example, it was also about teamwork basics, the use of lab journals, safe lab practices, how to think about accuracy and precision, and quite a lot of engineering discipline.

The course was nearly completely free of actually interesting math or physics content. For example, the first two 4-hour practicums of the course centered around the measurement of the resistance of a 10 ohm resistor. Some of the brightest 18- and 19-year-olds in the country would leave that practicum feeling properly stupid for the first time, very frustrated that they had “proven” the resistor to have a resistance of 11+/-0.4 Ohm (where in reality the resistor was “known” to be something like 10.000+/-0.001 Ohm).

The art of being wrong

Teaching that same course (some 2 years later) has turned out to be one of the most valuable things I’ve ever done in my life. One of the key things that students learned in that course was that the teacher might not know either – after all a lab is a strange and wonderful place, and volt meters can in fact break! The teacher in turn learned that even when teaching something seemingly trivial it is possible to be utterly wrong. Powerful phrases that I learned to use included “I don’t know either”, “You are probably right, but I really don’t understand what’s going on”, “Are you sure?”, “I’m not sure”, “How can you be so sure?”, “How can we test that?”, and the uber-powerful “Ah yes, so I was wrong” (eclipsed in power only by “Ok, enough of this, let’s go drink beer”).

This way of inquisitive thinking, with its fundamental acceptance of uncertainty and being wrong, was later amplified by studying things like quantum mechanics with its horrible math and even more horrible concepts. “I don’t know” became my default mind-state. Today, it is one of the most important things I contribute to my work environment (whether it is doing software development, project management, business analytics doesn’t matter) – the power to say “I don’t know” and/or “I was wrong”.

For the last week or two I’ve had lots of fun working closely with a similarly schooled engineer (he really doesn’t know anything either…) to try and debug and change a complex software system. It’s been useful staring at the same screen, arguing with each other that really we don’t know enough about X or Y or Z to even try and form a hypothesis. Communicating out to the wider group, I’ve found that almost everyone cringes at the phrase “we don’t know” or my recent favorite “we still have many unknown unknowns”. Not knowing seems to be a horrible state of mind, rather than the normal one.

Bits and bytes don’t lie?

I have a hypothesis about that aversion to the unknown: people see computers as doing simple boolean logic on bits and bytes, so it should be quite possible to just know everything about a software system. As they grow bigger, all that changes is that there are more operations on more data, but you never really stop knowing. A sound and safe castle of logic!

In fact, I think that’s a lot of what computer science teaches (as far as I know, I never actually studied computer science in university, I just argued a lot with the computer so-called-scientists). You start with clean discrete math and through state machines and automata and functional programming you can eventually find your way to the design of distributed systems and all the way to the nirvana of artificial intelligence. (AI being much better than the messy biological reality of forgetting things and the like.) Dealing with uncertainty and unknowns is not what computer science seems to be about.

The model of “clean logic all the way down” is completely useless when doing actual software development work. Do you really know which compiler was used on which version of the source code that led to the firmware that is now in your raid controller, and that there are no relevant bugs in it or in that compiler? Are you sure the RAM memory is plugged in correctly in all your 200 boxes? Is your data centre shielded enough from magnetic disturbances? Is that code you wrote 6 months ago really bug-free? What about that open source library you’re using everywhere?

In fact, this computer-science focus on logic and algorithms, and its high appreciation of building systems, is worse than just useless. It creates real problems. It means the associated industry sees its output in terms of lines of code written, features delivered, etc. The most revered super star engineers are those that crank out new software all the time. Web frameworks are popular because you can build an entire blog with them in 5 minutes.

Debugging and testing, that’s what people that make mistakes have to do. Software design is a group activity but debugging is something you do on your own without telling anyone that you can’t find your own mistake. If you are really good you will make fewer mistakes, will have to spend less time testing, and so produce more and better software more quickly. If you are really really good you might do test-driven development and with your 100% test coverage you just know that you cannot be wrong…

The environment in which we develop software is not nearly as controlled as we tend to assume. Our brains are not nearly as powerful as we believe. By not looking at the environment, by not accepting that there is quite a lot we don’t know, we become very bad at forming a reasonable hypothesis, and worse at interpreting our test data.

Go measure a resistor

So here’s my advice to people that want to become better software developers: try and measure some resistors. Accept that you’re wrong, that you don’t know, and that you don’t understand.

[RT] MyCouch

The post below is an edited version of a $work e-mail, re-posted here at the request of some colleagues who wanted to forward the story. My apologies if some bits are unclear due to lack of context. In particular, let me make clear:

  • we have had a production CouchDB setup for months that works well
  • we are planning to keep that production setup roughly intact for many more months and we are not currently planning to migrate away from CouchDB at all
  • overall we are big fans of the CouchDB project and its community and we expect great things to come out of it

Nevertheless using pre-1.0 software based on an archaic language with rather crappy error handling can get frustrating 🙂

Subject: [RT] MyCouch
From: Leo Simons 
To: Forge Engineering 

This particular RT gives one possible answer to the question “what would be a good way to make this KV debugging somewhat less frustrating?” (we have been fighting erratic response times from CouchDB under high load while replicating and compacting)

That answer is “we could probably replace CouchDB with java+mysql, and it might even be easy to do so”. And, then, “if it really is easy, that’s extra cool (and _because of_ CouchDB)”.

Why replace CouchDB?

Things we really like about CouchDB (as the backend for our KV service):

  • The architecture: HTTP/REST all the way down, MVCC, many-to-many replication, scales without bound, neat composable building blocks make an evolvable platform.
  • Working system: It’s in production, it’s running, and it’s running pretty well.
  • Community: open source, active project, know the developers, “cool”.
  • Integrity: it hasn’t corrupted or lost any data yet, and it probably won’t anytime soon.

Things we like less:

  • Debugging: cryptic error messages, erlang stack traces, process deaths.
  • Capacity planning: many unknown and changing performance characteristics.
  • Immaturity: pre-1.0.
  • Humanware: lack of erlang development skills, lack of DBA-like skills, lack of training material (or trainers) to gain skills.
  • Tool support: JProfiler for erlang? Eclipse for erlang? Etc.
  • Map/Reduce and views: alien concept to most developers, hard to audit and manage free-form javascript from tenants, hard to use for data migrations and aggregations.
  • JSON: leads to developers storing JSON which is horribly inefficient.

Those things we don’t like about couch unfortunately aren’t going to change very quickly. For example, the effort required to train up a bunch of DBAs so they can juggle CouchDB namespaces and instances and on-disk data structures is probably rather non-trivial.

The basic idea

It is not easy to see what other document storage system out there would be a particularly good replacement. Tokyo Cabinet, Voldemort, Cassandra, … all of these are also young and immature systems with a variety of quirks. Besides, we really really like the CouchDB architecture.

So why don’t we replace CouchDB with a re-implemented CouchDB? We keep the architecture almost exactly the same, but re-implement the features we care about using technology that we know well and is in many ways much more boring. “HTTP all the way down” should mean this is possible.

We could use mysql underneath (but not use any of its built-in replication features). The java program on top would do the schema and index management, and most importantly implement the CouchDB replication and compaction functionality.

We could even keep the same deployment structure. Assuming one java server is paired with one mysql database instance, we’d end up with 4 tomcat instances on 4 ports (5984-5987) and 4 mysql services on 4
other ports (3306-3309). Use of mysqld_multi probably makes sense. Eventually we could perhaps optimize a bit more by having one tomcat process and one mysql process – it’ll make better use of memory.

Now, what is really really really cool about the CouchDB architecture and its complete HTTP-ness is that we should be able to do any actual migration one node at a time, without downtime. Moving the data across
is as simple as running a replication. Combined with the fact that we’ve been carefully avoiding a lot of its features, CouchDB is probably one of the _easiest_ systems to replace 😀

Database implementation sketch

How would we implement the database? If we think of our KV data as having the form

  ns1:key1 [_rev=1-12345]: { ...}
  ns1:key2 [_rev=2-78901]: { subkey1: ..., }
  ns2:key3 [_rev=1-43210]: { subkey1: ..., subkey2: ...}

where the first integer part of the _rev is dubbed “v” and the remainder “src”, then a somewhat obvious database schema looks like (disclaimer: schema has not been tested, do not use :-)):

CREATE TABLE namespace (
  id varchar(64) NOT NULL PRIMARY KEY
      CHARACTER SET ascii COLLATE ascii_bin,
  state enum('enabled','disabled','deleted') NOT NULL
) ENGINE=InnoDB;

CREATE TABLE {namespace}_key (
  ns varchar(64) NOT NULL
      CHARACTER SET ascii COLLATE ascii_bin,
  key varchar(180) NOT NULL
      CHARACTER SET ascii COLLATE ascii_bin,
  v smallint UNSIGNED NOT NULL,
  src int UNSIGNED NOT NULL,

  PRIMARY KEY (ns, key, v, src),
  FOREIGN KEY (ns) REFERENCES namespace(id)
) ENGINE=InnoDB;

CREATE TABLE {namespace}_value (
  ns varchar(64) NOT NULL
      CHARACTER SET ascii COLLATE ascii_bin,
  key varchar(180) NOT NULL
      CHARACTER SET ascii COLLATE ascii_bin,
  v smallint UNSIGNED NOT NULL,
  src int UNSIGNED NOT NULL,
  subkey varchar(255) NOT NULL
      CHARACTER SET utf8 COLLATE utf8_general_ci,
  small_value varchar(512) DEFAULT NULL
      CHARACTER SET utf8 COLLATE utf8_general_ci
      COMMENT 'will contain the value if it fits',
  large_value mediumtext DEFAULT NULL
      CHARACTER SET utf8 COLLATE utf8_general_ci
      COMMENT 'will contain the value if its big',

  PRIMARY KEY (ns, key, v, src, subkey),
  FOREIGN KEY (ns) REFERENCES namespace(id),
  FOREIGN KEY (ns, key, v, src)
      REFERENCES {namespace}_key(ns, key, v, src)
      ON DELETE CASCADE
) ENGINE=InnoDB;

With obvious queries including

  SELECT id FROM namespace WHERE state = 'enabled';

  SELECT key FROM {namespace}_key WHERE ns = ?;
  SELECT key, v, src FROM {namespace}_key WHERE ns = ?;
  SELECT v, src FROM {namespace}_key WHERE ns = ?
      AND key = ?;
  SELECT v, src FROM {namespace}_key WHERE ns = ?
      AND key = ? ORDER BY v DESC LIMIT 1;
  SELECT subkey, small_value FROM {namespace}_value
      WHERE ns = ? AND key = ? AND v = ? AND src = ?;
  SELECT large_value FROM {namespace}_value
      WHERE ns = ? AND key = ? AND v = ? AND src = ?
      AND subkey = ?;

  BEGIN;
  CREATE TABLE {namespace}_key (...);
  CREATE TABLE {namespace}_value (...);
  INSERT INTO namespace(id) VALUES (?);
  COMMIT;

  UPDATE namespace SET state = 'disabled' WHERE id = ?;
  UPDATE namespace SET state = 'deleted' WHERE id = ?;

  BEGIN;
  DROP TABLE {namespace}_value;
  DROP TABLE {namespace}_key;
  DELETE FROM namespace WHERE id = ?;
  COMMIT;

  INSERT INTO {namespace}_key (ns,key,v,src)
      VALUES (?,?,?,?);
  INSERT INTO {namespace}_value (ns,key,v,src,subkey,small_value)
      VALUES (?,?,?,?,?,?),(?,?,?,?,?,?),(?,?,?,?,?,?),(?,?,?,?,?,?);
  INSERT INTO {namespace}_value (ns,key,v,src,subkey,large_value)
      VALUES (?,?,?,?,?,?);

  DELETE FROM {namespace}_key WHERE ns = ? AND key = ?;
  DELETE FROM {namespace}_key WHERE ns = ? AND key = ?
      AND v < ?;
  DELETE FROM {namespace}_key WHERE ns = ? AND key = ?
      AND v = ? AND src = ?;

The usefulness of {namespace}_value is debatable; it helps a lot when implementing CouchDB views or some equivalent functionality (“get me all the documents in this namespace where subkey1=…”), but if we decide not to care, then it’s redundant and {namespace}_key can grow some additional small_value (which should then be big enough to contain a typical JSON document, i.e. maybe 1k) and large_value columns instead.

Partitioning the tables by {namespace} manually isn’t needed if we use MySQL 5.1 or later; table partitions could be used instead.

I’m not sure if we should have a ‘state’ on the keys and do soft-deletes; that might make actual DELETE calls faster; it could also reduce the impact of compactions.

Webapp implementation notes

The java “CouchDB” webapp also does not seem that complicated to build (famous last words?). I would probably build it roughly the same way as [some existing internal webapps].

The basic GET/PUT/DELETE operations are straightforward mappings onto queries that are also rather straightforward.
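
As a sketch of what that mapping might look like (the class and method names are invented, and I’m hand-waving over connection pooling and error responses), a GET for the newest revision of a key boils down to one of the queries listed earlier. Note that key needs backquotes in real MySQL, because KEY is a reserved word:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class KeyReader {
    private final Connection conn; // assume a pooled connection gets injected

    public KeyReader(Connection conn) {
        this.conn = conn;
    }

    /** Returns the newest revision ("v-src") of a key, or null, which maps to a 404. */
    public String latestRevision(String keyTable, String ns, String key)
            throws SQLException {
        // The table name comes from our own namespace bookkeeping, never from
        // user input, so interpolating it here is ok.
        String sql = "SELECT v, src FROM " + keyTable
                + " WHERE ns = ? AND `key` = ? ORDER BY v DESC, src DESC LIMIT 1";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, ns);
            ps.setString(2, key);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null;
                }
                return rs.getInt("v") + "-" + rs.getLong("src");
            }
        }
    }
}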

The POST /_replicate and POST /_compact operations are of course a little bit more involved, but not that much. Assuming some kind of a pool of url fetchers and some periodic executors…

Replication:

  1. get last-seen revision number for source
  2. get list of updates from source
  3. for each update
    • INSERT key
    • if duplicate key error, ignore and don’t update values
    • INSERT OR REPLACE all the values
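
Hand-waving over the url fetching and error handling, the replication loop might look something like this (all class and method names below are invented for the sketch):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class Replicator {
    private final Connection conn;      // local mysql connection
    private final SourceClient source;  // hypothetical HTTP client for the peer node

    public Replicator(Connection conn, SourceClient source) {
        this.conn = conn;
        this.source = source;
    }

    /** @param since last-seen sequence for this source, from the replication record */
    public void replicate(String keyTable, String ns, long since) throws Exception {
        for (Update u : source.changesSince(ns, since)) {        // get list of updates
            if (insertKeyIgnoringDuplicates(keyTable, ns, u)) {  // INSERT key
                source.copyValues(conn, keyTable, u);            // INSERT/REPLACE the values
            }
            // on a duplicate key we already have this revision, so skip its values
        }
    }

    /** Returns false when the (ns, key, v, src) row already existed. */
    private boolean insertKeyIgnoringDuplicates(String keyTable, String ns, Update u)
            throws SQLException {
        // INSERT IGNORE swallows the duplicate-key error; `key` needs backquotes
        // because KEY is a reserved word in MySQL.
        String sql = "INSERT IGNORE INTO " + keyTable
                + " (ns, `key`, v, src) VALUES (?,?,?,?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, ns);
            ps.setString(2, u.key);
            ps.setInt(3, u.v);
            ps.setLong(4, u.src);
            return ps.executeUpdate() > 0; // zero rows inserted means duplicate
        }
    }

    /** Minimal shapes, just enough for the sketch. */
    public interface SourceClient {
        List<Update> changesSince(String ns, long since) throws Exception;
        void copyValues(Connection conn, String keyTable, Update u) throws Exception;
    }

    public static class Update {
        public final String key;
        public final int v;
        public final long src;

        public Update(String key, int v, long src) {
            this.key = key;
            this.v = v;
            this.src = src;
        }
    }
}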

Compaction:

  1. get list of namespaces
  2. for each namespace:
    • SELECT key, v, src FROM {namespace}_key WHERE ns = ? ORDER BY key ASC, v DESC, src DESC;
    • skip the first row for each key
    • if the second row for the key is the same v, conflict, don’t compact for this key
    • DELETE IGNORE FROM {namespace}_key WHERE ns = ? AND key = ? AND v = ? AND src = ?;
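
And a similarly hand-wavy sketch of the compaction pass for a single namespace, following the steps above (again, the class name is invented):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class Compactor {
    private final Connection conn;

    public Compactor(Connection conn) {
        this.conn = conn;
    }

    /** Removes all but the newest revision of each key, leaving conflicted keys alone. */
    public void compact(String keyTable, String ns) throws SQLException {
        String select = "SELECT `key`, v, src FROM " + keyTable
                + " WHERE ns = ? ORDER BY `key` ASC, v DESC, src DESC";
        String delete = "DELETE IGNORE FROM " + keyTable
                + " WHERE ns = ? AND `key` = ? AND v = ? AND src = ?";
        try (PreparedStatement sel = conn.prepareStatement(select);
             PreparedStatement del = conn.prepareStatement(delete)) {
            sel.setString(1, ns);
            try (ResultSet rs = sel.executeQuery()) {
                String currentKey = null;
                int newestV = -1;
                boolean conflict = false;
                while (rs.next()) {
                    String key = rs.getString("key");
                    int v = rs.getInt("v");
                    long src = rs.getLong("src");
                    if (!key.equals(currentKey)) {
                        // first row for this key holds the newest revision: keep it
                        currentKey = key;
                        newestV = v;
                        conflict = false;
                        continue;
                    }
                    if (v == newestV) {
                        // another row with the newest v: a conflict, don't compact this key
                        conflict = true;
                    }
                    if (conflict) {
                        continue;
                    }
                    del.setString(1, ns);
                    del.setString(2, key);
                    del.setInt(3, v);
                    del.setLong(4, src);
                    del.executeUpdate();
                }
            }
        }
    }
}

In practice you’d stream the result set and batch the deletes, but the shape really is that simple.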

So we need some kind of a replication record; once we have mysql available using “documents” seems awkward; let’s use a database table. We might as well have one more MySQL database on each server with a
full copy of a ‘kvconfig’ database, which is replicated around (using mysql replication) to all the nodes. Might also want to migrate away from NAMESPACE_METADATA documents…though maybe not, it is nice and flexible that way.

Performance notes

In theory, the couchdb on-disk format should be much faster than innodb for writes. In practice, innodb has seen quite a few years of tuning. More importantly, in our tests on our servers raw mysql performance seems to be rather better than couchdb. Some of that is due to the extra fsyncs in couchdb, but not all of it.

In theory, the erlang OTP platform should scale out much better than something java-based. In practice, the http server inside couchdb is pretty much a standard fork design using blocking I/O. More importantly, raw tomcat can take >100k req/s on our hardware, which is much much more than our disks can do.

In theory, having the entire engine inside one process should be more efficient than java talking to mysql over TCP. In practice, I doubt this will really show up if we run java and mysql on the same box. More importantly, if this does become an issue, longer-term we may be able to “flatten the stack” by pushing the java “CouchDB” up into the service layer and merging it with the KV service, at which point java-to-mysql will be rather more efficient than java-to-couch.

In theory and in practice innodb has better indexes for the most common SELECTs/GETs so it should be a bit faster. It also is better at making use of large chunks of memory. I suspect the two most common requests (GET that returns 200, GET that returns 404) will both be faster, which incidentally are the most important for us to optimize, too.

We might worry java is slow. That’s kind-of silly :). In theory and in practice garbage collection makes software go faster. We just need to avoid doing those things that make it slow.

The overhead of ACID guarantees might be a concern. Fortunately MySQL is not _really_ a proper relational database if you don’t want it to be. We can probably set the transaction isolation level to READ UNCOMMITTED safely, and the schema design / usage pattern is such that we don’t need transactions in most places. More importantly we are keeping the eventual consistency model, with MVCC and all, on a larger scale. Any over-ACID-ness will be local to the particular node only.
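
In JDBC terms that is a single call per connection (a sketch; in practice it would just be a property on the connection pool):

import java.sql.Connection;
import java.sql.SQLException;

public final class ConnectionTuning {
    private ConnectionTuning() {
    }

    /** Applies the relaxed isolation level discussed above to a freshly obtained connection. */
    public static void relaxIsolation(Connection connection) throws SQLException {
        connection.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
    }
}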

Most importantly, this innodb/mysql thing is mature/boring technology that powers a lot of the biggest websites in the world. As such, you can buy books and consultancy and read countless websites about mysql/innodb/tomcat tuning. Its performance characteristics are pretty well-known and pretty predictable, and lots of people (including here at $work) can make those predictions easily.

So when are we doing this?

No no, we’re not, that’s not the point, this is just an RT! I woke up (rather early) with this idea in my head so I wrote it down to make space for other thoughts. At a minimum, I hope the above helps propagate some ideas:

  • just how well we applied REST and service-oriented architecture here and the benefits it’s giving us
  • in particular because we picked the right architecture we are not stuck with / tied to CouchDB, now or later
  • we can always re-engineer things (though we should have good enough reasons)
  • things like innodb and/or bdb (or any of the old dbs) are actually great tools with some great characteristics

Just like FriendFeed?

Bret Taylor has a good explanation of how FriendFeed built a non-relational database on top of a relational one. The approach outlined above resembles their solution rather a lot, though there are also important differences.

The headache of mapping shards to servers

At work we had a lot of headache figuring out how to reshard our CouchDB data. We have 2 data centers with 16 CouchDB instances each. One server holds 4 CouchDB nodes. At the moment each data center has one copy of the data. We want to improve resilience so we are changing this so that each data center has two copies of the data (on different nodes of course). Figuring out how to reshard was not so simple at all.

This inspired some thinking about how we would do the same to our MySQL instances. It’s not a challenge yet (we have much more KV data than relational data) but if we’re lucky we’ll get much more data pretty soon, and the issue will pop up in a few months.

I was thinking about what makes this so hard to think about (for someone with a small brain like me at least). It is probably about letting go of symmetry. Do you have the same trouble?

Imagine you have a database with information about users and perhaps data for/by those users. You might use horizontal partitioning with consistent hashing to distribute this data across two machines, which use master/slave replication between them for resilience. You might partition the data into four shards so you can scale out to 4 physical master servers later without repartitioning. It’d look like this:

[diagram of shards on 2 mirrored servers]

Now imagine that you add a new data center and you need to split the data between the two data centers. Assuming your database only does master/slave (like MySQL) and you prefer to daisy-chain the replication it might look like this:

[diagram of shards on 2 servers in 2 data centers]

Now imagine adding a third data center and an additional machine in each data center to provide extra capacity for reads. Maybe:

[diagram of shards on 3 servers in 3 data centers]

You can see that the configuration has suddenly become a bit unbalanced and also somewhat non-obvious. Given this availability of hardware, the most resilient distribution of masters and slaves is not symmetric at all. When you lose symmetry the configuration becomes much more complicated to understand and manage.

Now imagine a bigger database. To highlight how to deal with asymmetry, imagine 4 data centers with 3 nodes per data center. Further imagine having 11 shards to distribute, wanting at least one slave in the same data center as its master. Further imagine wanting 3 slaves for each master. Further imagine wanting to have the data split as evenly as possible across all data centers, so that you can lose up to half of them without losing any data.

Can you work out how to do that configuration? Maybe:

[diagram showing many shards on many servers in many data centers]

Can you work out how to do it for, oh, 27 data centers, 400 shards with at least 3 copies of each shard, with 7 data centers having 15% beefier boxes and one data center having twice the number of boxes?

As the size of the problem grows, a diagram of a good solution looks more and more like a circle that quickly becomes hard to draw.

When you add in the need to be able to reconfigure server allocations, to fail over masters to one of their slaves, and more, it turns out you eventually need software assistance to solve the mapping of shards to servers. We haven’t written such software yet; I suspect we can steal it from Voldemort by the time we need it 🙂

The tipping point is when you lose symmetry (the 3rd data center, the 5th database node, etc.).

Consistent hashing helps make it easier to do resharding and node rewiring, especially when you’re not (only) dependent on something like mysql replication. But figuring out which bucket goes on which server is not as easy as figuring out what goes in which bucket. Unless you have many hundreds of buckets: then maybe you can assume your distribution of buckets is even enough if you hash the bucket id to find the server id.
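
To make that last shortcut concrete, here is a toy sketch (a made-up class, not something we run) of hashing bucket ids onto a ring of servers. Note it says nothing about masters versus slaves, replica placement or data centers, which is exactly the part that gets hard:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy consistent-hash ring: each server gets a number of points on the ring,
 *  and a bucket lands on the first server point at or after its own hash. */
public class BucketRing {
    private static final int POINTS_PER_SERVER = 100; // "virtual nodes" per server

    private final SortedMap<Long, String> ring = new TreeMap<Long, String>();

    public BucketRing(List<String> servers) {
        for (String server : servers) {
            for (int i = 0; i < POINTS_PER_SERVER; i++) {
                ring.put(hash(server + "#" + i), server);
            }
        }
    }

    /** Which server holds this bucket. Only roughly even with many buckets. */
    public String serverFor(String bucketId) {
        long h = hash(bucketId);
        SortedMap<Long, String> tail = ring.tailMap(h);
        Long point = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(point);
    }

    private static long hash(String s) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) { // first 8 bytes of the MD5 as a long
                h = (h << 8) | (digest[i] & 0xff);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}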