Setting up a new mac

I got a new macbook pro. Rather than migrate my settings automatically I decided I’d go for a little spring cleaning by building it up from scratch.

Initial setup stuff

  • start software update
  • go through system preferences, add appropriate paranoia
  • open keychain utility, add appropriate paranoia
  • run software update
  • insert Mac OS X CD, install XCode
  • remove all default bookmarks from safari
  • open iTunes, sign in, turn off annoying things like ping
  • download, extract, install, run once, enter serial (where needed)
    • Chrome
    • Firefox
      • set master password
    • Thunderbird
      • set master password
    • Skype
    • OmniGraffle
    • OmniPlan
    • TextMate
    • VMware Fusion
    • Colloquy
    • IntelliJ IDEA; disable most enterpriseish plugins; then open plugin manager and install
      • python plugin
      • ruby plugin
      • I have years of crap in my intellij profile which I’m not adding to this machine
    • VLC
    • Things
  • download stunnel, extract, sudo mkdir /usr/local && sudo chown $USER /usr/local, ./configure --disable-libwrap, make, sudo make install
  • launch terminal, customize terminal settings (font size, window size, window title, background color) and save as default

Transfer from old mac

  • copy over secure.dmg, mount, set to automount on startup
  • import certs into keychain, firefox, thunderbird
  • set up thunderbird with e-mail accounts
  • copy over ~/.ssh and ~/.subversion
  • set up stunnel for colloquy (see the config sketch after this list), run colloquy, add localhost as server, set autojoin channels, change to tabbed interface
  • copy over keychains
  • copy over Office, run office setup assistant
  • copy over documents, data
  • copy over virtual machines
  • open each virtual machine, selecting “I moved it”
  • copy over itunes library
  • plug in, pair and sync ipad and iphone
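
For the stunnel-in-front-of-Colloquy bit, a minimal client-mode config along these lines does the trick (the IRC host and ports here are placeholders, adjust to whatever the real server is):

  # /usr/local/etc/stunnel/stunnel.conf
  client = yes

  [irc]
  accept = 127.0.0.1:6667
  connect = irc.example.net:6697

Colloquy then talks plain IRC to localhost:6667 and stunnel wraps it in SSL.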

Backup

  • set up time machine

Development packages

I like to know what is in my /usr/local, so I don’t use MacPorts or fink or anything like it.

  • download and extract pip, python setup.py install
  • pip install virtualenv Django django-piston south django-audit-log django-haystack httplib2 lxml
    • Ah mac 10.6 has a decent libxml2 and libxslt so lxml just installs. What a breeze.
  • download and install 64-bit mysql .dmg from mysql.com, also install preference pane
  • Edit ~/.profile, set CLICOLOR=yes, set PATH, add /usr/local/mysql/bin to the path, source ~/.profile (see the sketch after this list)
    • again, I have years of accumulated stuff in my bash profile that I’m dumping. It’s amazing how fast bash starts when it doesn’t have to execute a gazillion lines of shell script…
  • pip install MySQL-python, install_name_tool -change libmysqlclient.16.dylib /usr/local/mysql/lib/libmysqlclient.16.dylib /Library/Python/2.6/site-packages/_mysql.so
  • take care of the many dependencies for yum, mostly standard ./configure && make && make install of
    • pkg-config
    • gettext
    • libiconv
    • gettext again, --with-libiconv-prefix=/usr/local
    • glib, --with-libiconv=gnu
    • popt
    • db-5.1.19.tar.gz, cd build_unix && ../dist/configure --enable-sql && make && make install, cp sql/sqlite3.pc /usr/local/lib/pkgconfig/, cd /usr/local/BerkeleyDB.5.1/include && mkdir db51 && cd db51 && ln -s ../*.h .
    • neon, ./configure --without-gssapi --with-ssl
    • rpm (5.3), export CPATH=/usr/local/BerkeleyDB.5.1/include, sudo mkdir /var/local && sudo chown $USER /var/local, ./configure --with-db=/usr/local/BerkeleyDB.5.1 --with-python --disable-nls --disable-openmp --with-neon=/usr/local && make && make install
    • pip install pysqlite pycurl
    • urlgrabber, sudo mkdir /System/Library/Frameworks/Python.framework/Versions/2.6/share && sudo chown $USER /System/Library/Frameworks/Python.framework/Versions/2.6/share && python setup.py install
    • intltool
    • yum-metadata-parser, export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:/usr/lib/pkgconfig; python setup.py install
  • install yum. Yum comes without decent install scripts (the idea is you install it from rpm). I did some hacks to get somewhere reasonable:
    cat Makefile | sed -E -e 's/.PHONY(.*)/.PHONY\1 install/' > Makefile.new
    mv Makefile.new Makefile
    cat etc/Makefile | sed -E -e 's/install -D/install/' > etc/Makefile.new
    mv etc/Makefile.new etc/Makefile
    make DESTDIR=/usr/local PYLIBDIR=/Library/Python/2.6 install
    mv /usr/local/Library/Python/2.6/site-packages/* /Library/Python/2.6/site-packages/
    mv /usr/local/usr/bin/* /usr/local/bin/
    mkdir /usr/local/sbin
    mv /usr/local/usr/sbin/* /usr/local/sbin/
    rsync -av /usr/local/usr/share/ /usr/local/share/
    rm -Rf /usr/local/usr/
    cat /usr/local/bin/yum | sed -E -e 's|/usr/share/yum-cli|/usr/local/share/yum-cli|' > /tmp/yum.new
    mv /tmp/yum.new /usr/local/bin
    chmod +x /usr/local/bin/yum
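
For reference, the fresh ~/.profile ends up being something like this minimal sketch (the mysql.com DMG puts its binaries in /usr/local/mysql/bin):

  # ~/.profile -- keeping it minimal this time around
  export CLICOLOR=yes
  export PATH=/usr/local/bin:/usr/local/sbin:/usr/local/mysql/bin:$PATH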
    

That’s how far I got last night, resulting in 86GB of disk use (most of which is VMs and iTunes library), just enough to be productive on my current $work project. I’m sure there’s weeks of tuning in my future.

Engineering IT Supply Manifesto

I came across an Engineering IT Supply Manifesto. It’s a long rant from a former engineering manager at facebook who makes the case that taking longer than a business day to supply IT equipment (laptops, monitors, keyboards, …) is stupid.

Amen to that.

When I worked at Joost the IT folks kept a stack of spare preconfigured macbooks and thinkpads in a cupboard (famously the macbook stack was higher than the thinkpad stack, because the macs broke more). The experience was pretty great for the developers (going from the downer of “crap my laptop broke I hope I committed recently” to the upper of “ooh shiny new laptop” so quickly), and also for the sysadmins (actually getting thanked for their hard work and careful planning), and I’m pretty sure it meant a net saving of money.

As a contractor I’m used to keeping a spare laptop around. I don’t have one right now, but OTOH there’s an apple store 20 mins from work and 30 mins from home, so I know I can get away without that spare. I also tend to buy faster/better gear pretty frequently. At 2 years my current laptop really deserves replacing, just haven’t gotten around to it yet because I can’t decide what size to switch to.

Hmm. I suspect EU public sector organizations of a certain size have to do “big chunk” EU procurement for their IT…anyone know of public sector organizations that have drafted decent framework agreements compatible with this manifesto? 🙂

Using long-lived stable branches

For the last couple of years I’ve been using subversion on all the commercial software projects I’ve done. At Joost and after that at the BBC we’ve usually used long-lived stable branches for most of the codebases. Since I cannot find a good explanation of the pattern online I thought I’d write up the basics.

Working on trunk

Imagine a brand new software project. There’s two developers: Bob and Fred. They create a new project with a new trunk and happily code away for a while:

Example flow diagram of two developers committing to trunk

Stable branch for release

Flow diagram of two developers creating a stable branch to cut releases

At some point (after r7, in fact) the project is ready to start getting some QA, and it’s Bob’s job to cut a first release and get it to the QA team. Bob creates the new stable branch (svn cp -r7 ../trunk ../branches/stable, resulting in r8). Then he fixes one last thing (r9), which he merges to stable (using svnmerge, r10). (Not paying much attention to the release work, Fred has continued working and fixed a bug in r11.) Bob then makes a tag of the stable branch (svn cp branches/stable tags/1.0.0, r12) to create the first release.

QA reproduce the bug Fred has already fixed, so Bob merges that change to stable (r14) and tags 1.0.1 (r15). 1.0.1 passes all tests and is eventually deployed to live.
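
In command form, Bob’s side of this looks roughly like the sketch below, using the svnmerge.py contrib script for merge tracking (the repository URL is made up, and the svnmerge bookkeeping commits are glossed over in the story’s revision numbers):

  REPO=http://svn.example.com/repo   # made-up repository URL
  # create the stable branch from trunk as of r7 (r8)
  svn cp -r7 $REPO/trunk $REPO/branches/stable -m "cut stable from trunk@7"
  # in a working copy of branches/stable: set up svnmerge tracking,
  # then merge the last fix (r9) across and commit the merge (r10)
  svnmerge.py init
  svn ci -F svnmerge-commit-message.txt
  svnmerge.py merge -r9
  svn ci -F svnmerge-commit-message.txt
  # tag the first release (r12)
  svn cp $REPO/branches/stable $REPO/tags/1.0.0 -m "tag 1.0.0"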

Release branch for maintenance

Flow diagram of creating a release branch for hosting a bug fix

A few weeks later, a problem is found on the live environment. Since it looks like a serious problem, Bob and Fred both drop what they were doing (working on the 1.1 release) and hook up on IRC to troubleshoot. Fred finds the bug and commits the fix to trunk (r52), tells Bob on IRC, and then continues hacking away at 1.1 (r55). Bob merges the fix to stable (r53) and makes the first 1.1 release (1.1.0, r54) so that QA can verify the bug is fixed. It turns out Fred did fix the bug, so Bob creates a new release branch for the 1.0 series (r56), merges the fix to the 1.0 release branch (r57) and tags a new release 1.0.2 (r58). QA run regression tests on 1.0.2 and tests for the production bug. All seems ok so 1.0.2 is rolled to live.
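
The 1.0 maintenance work translates to roughly the following sketch (again with a made-up repository URL; I’m assuming the release branch is cut from the 1.0.1 tag that is running in live):

  REPO=http://svn.example.com/repo   # made-up repository URL
  # cut the 1.0 release branch from the 1.0.1 tag (r56)
  svn cp $REPO/tags/1.0.1 $REPO/branches/1.0 -m "cut 1.0 release branch from 1.0.1"
  # in a working copy of branches/1.0: bring over Fred's fix from trunk (r57)
  svn merge -c52 $REPO/trunk .
  svn ci -m "merge r52 from trunk: fix for the live problem"
  # tag the maintenance release (r58)
  svn cp $REPO/branches/1.0 $REPO/tags/1.0.2 -m "tag 1.0.2"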

Interaction with continuous integration

Flow diagram showing what continuous integration projects use what branch

Every commit on trunk may trigger a trunk build. The trunk build has a stable period of just a few minutes. Every successful trunk build may trigger an integration deploy. The integration deploy has a longer stable period, about an hour or two. It is also frequently triggered manually when an integration deploy failed or deployed broken software.

Ideally the integration deploy takes the artifacts from the latest successful trunk build and deploys those, but due to the way maven projects are frequently set up it may have to rebuild trunk before deploying it.

Every merge to stable may trigger a stable build. The stable build also has a stable period of just a few minutes, but it doesn’t run as frequently as the trunk build simply because merges are not done as frequently as trunk commits. The test deploy is not automatic – an explicit decision is made to deploy to the test environment and typically a specific version or svn revision is deployed.

Reflections

Main benefits of this approach

  • Reasonably easy to understand (even for the average java weenie that’s a little scared of merging, or the tester that doesn’t touch version control at all).
  • Controlled release process.
  • Development (on trunk) never stops, so that there is usually no need for feature branches (though you can still use them if you need to) and communication overhead between developers is limited.
  • Subversion commit history tells the story of what actually happened reasonably well.

Why just one stable?

A lot of people seeing this might expect to see a 1.0-STABLE, a 1.1-STABLE, and so forth. The BSDs and mozilla do things that way, for example. The reason not to have those comes down to tool support – with a typical svn / maven / hudson / jira toolchain, branching is not quite as cheap as you’d like it to be, especially on large crufty java projects. It’s simpler to work with just one stable branch, and you can often get away with it.

From a communication perspective it’s also just slightly easier this way – rather than talk about “the current stable branch” or “the 1.0 stable branch”, you can just say “the stable branch” (or “merge to stable”) and it is never ambiguous.

Why a long-lived stable?

In the example above, Bob and Fred have continued to evolve stable as they worked on the 1.1 release series – for example we can see that Bob merged r46,47,49 to stable. When continuously integrating on trunk, it’s quite common to see a lot of commits to trunk that in retrospect are best grouped together and considered a single logical change set. By identifying and merging those change sets early on, the story of the code evolution on stable gives a neat story of what features were code complete when, and it allows for providing QA with probably reasonably stable code drops early on.

This is usually not quite cherry-picking — it’s more likely melon-picking, where related chunks of code are kept out of stable for a while and then merged as they become stable. The more coarse-grained chunking tends to be rather necessary on “agile” java projects where there can be a lot of refactoring, which tends to make merging hard.

Why not just release from trunk?

The simplest model does not have a stable branch, and it simply cuts 1.0.0 / 1.0.1 / 1.1.0 from trunk. When a maintenance problem presents itself, you then branch from the tag for 1.0.2.

The challenge with this approach is sort-of shown in these examples — Fred’s commit r13 should not make it into 1.0.1. By using a long-lived stable branch Bob can essentially avoid creating the 1.0 maintenance branch. It doesn’t look like there’s a benefit here, but when you consider 1.1, 1.2, 1.3, and so forth, it starts to matter.

The alternative trunk-only approach (telling Fred to hold off committing r13 until 1.0 is in production) is absolutely horrible for what are hopefully obvious reasons, and I will shout at you if you suggest it to me.

For small and/or mature projects I do often revert back to having just a trunk. When you have high quality code that’s evolving in a controlled fashion, with small incremental changes that are released frequently, the need to do maintenance fixes becomes very rare and you can pick up some speed by not having a stable branch.

What about developing on stable?

It’s important to limit commits (rather than merges) that go directly to stable to an absolute minimum. By always committing to trunk first, you ensure that the latest version of the codebase really has all the latest features and bugfixes. Secondly, merging in just one direction greatly simplifies merge management and helps avoid conflicts. That’s relatively important with subversion because its ability to untangle complex merge trees without help is still a bit limited.

But, but, this is all massively inferior to distributed version control!

From an expert coder’s perspective, definitely.

For a team that incorporates people that are not all that used to version control and working with multiple parallel versions of a code base, this is very close to the limit of what can be understood and communicated. Since 80% of the cost of a typical (commercial) software project has nothing to do with coding, that’s a very significant argument. The expert coders just have to suck it up and sacrifice some productivity for the benefit of everyone else.

So the typical stance I end up taking is that those expert coders can use git-svn to get most of what they need, and they assume responsibility for transforming their own many-branches view back to a trunk+stable model for consumption by everyone else. This is quite annoying when you have three expert coders that really want to use git together. I’ve not found a good solution for that scenario; the cost of setting up decent server-side git hosting is quite difficult to justify even when you’re not constrained by audit-ability rules.

But, but this is a lot of work!

Usually when explaining this model to a new group of developers they realize at some point that someone (like Bob), or some group of people, will have to do the work of merging changes from trunk to stable, and that the tool support for stuff like that is a bit limited. They’ll also need extra hudson builds, and they worry a great deal about how on earth to deal with maven’s need to have the version number inside the pom.xml file.

To many teams it just seems easier to avoid all this branching mess altogether, and instead they will just be extra good at their TDD and their agile skills. Surely it isn’t that much of a problem to avoid committing for a few hours and working on your local copy while people are sorting out how to bake a release with the right code in it. Right?

The resolution usually comes from the project managers, release managers, product managers, and testers. In service-oriented architecture setups it can also come from other developers. All those stakeholders quickly realize that all this extra work that the developers don’t really want to do is exactly the work that they do want the developers to do. They can see that if the developers spend some extra effort as they go along to think about what is “stable” and what isn’t, the chance of getting a decent code drop goes up.

New year, new job

I’m still at the BBC, but instead of dealing with doing dynamic web applications at scale, I’m going back to this other thing I really like doing: digital media. I’ve joined the Digital Media Initiative, which is all about changing the BBC’s TV production workflow to be more digital.

DMI is a huge, huge undertaking that has already been ongoing for a few years. I’ve joined a small team of architects who have joint responsibility for the architecture of the whole system. It’s great to be back to working with and thinking about video all day every day. I get to peek into TV studios, talk to TV producers, mess about with MXF and AAF and mind-bogglingly messy data models.

Even if perhaps there aren’t as many requests per second flowing around DMI as there are heading for the BBC website, the scaling things theme in my work is still firmly there – we are planning to archive petabyte upon petabyte of 100mbit video on digital tape, to have hundreds if not thousands of professionals depending on the system for their daily workflow, etc etc.

Unfortunately DMI cannot currently be quite as open in its outward-facing communication as the web platform side of FM&T tends to be, so my blog is probably going to go rather quiet for a while. Don’t worry, I’m not dead, I’m just having fun elsewhere 🙂

Capacity planning for the network

Lesson learned: on our HP blades, with the standard old crappy version of memcached that comes with red hat, when we use them as the backend for PHP’s object cache, we can saturate a 1gbit ethernet connection with CPU usage of about 20-70%:

Zenoss/RRD graph of memcache I/O plateau at 1gbit

No, we did not learn this lesson in a controlled load test, and no, we didn’t know this was going to be our bottleneck. Fortunately, it seems we degraded pretty gracefully, and so as far as we know most of the world didn’t really notice 🙂

Immediate response:

  1. take some frontend boxes out of load balancer to reduce pressure on memcached
  2. repurpose some servers for memcached, add frontend boxes back into pool
  3. tune object cache a little

Some of the follow-up:

  • redo some capacity planning paying more attention to network
  • see if we can get more and/or faster interfaces into the memcached blades
  • test if we can/should make the object caches local to the frontend boxes
  • test if dynamically turning on/off object caches in some places is sensible

I have to say it’s all a bit embarrassing – forgetting about network capacity is a bit of a rookie mistake. In our defense, most people doing scalability probably don’t deal with applications that access over 30 megs of object cache memory to service one request. The shape of our load spikes (when we advertise websites on primetime on BBC One) is probably a little unique, too.
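
Rough arithmetic, assuming most of those 30 megs really do cross the wire for each request: 1 gbit/s is on the order of 120 megabytes/s, so a single interface tops out somewhere around 4 requests per second per memcached box – long before the CPUs get anywhere near busy.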

update: I was mistakenly using “APC” in the above to mean “cache” but APC is just the “opcode cache” and is completely disjoint from “object cache”. D’oh!

You don’t know and you don’t understand

You know much less than you think you know. You misunderstand many more things than you think you do. You’re also much more wrong much more often than you think.

(Don’t worry, it’s not just you, the same is true for everyone else.)

Even better, this is how science works. Being a scientist is all about actively trying to be wrong (and proving everyone else wrong), all the time. When you do science, you don’t know, and what you learn doing the science, you don’t ever know for sure.

The scientific method

Here are the basic steps in the scientific method:

  1. Based on your own and others’ past experience, try to make some sense of a problem
  2. Try to find a reasonable explanation for the problem
  3. If the explanation is correct, what else would you be able to see or measure?
  4. Try to disprove the explanation by making those observations and measurements

Scientists do this all day every day, they do it together on a world-wide scale, and they do it to each other.

Experimentation

In uni, studying applied physics, I was trained in a specific application of the scientific method to experimentation, which went something like:

  1. Define a question to answer.
  2. Define what you already know (or will assume) that is related.
  3. Form a hypothesis of what the answer may be.
  4. Figure out what you can measure.
  5. Define how those measurements could be interpreted to verify or disprove the hypothesis.
  6. Do the experiments and collect the measurements.
  7. Analyze the data.
  8. Assert the internal consistency of the experimental data by applying statistics.
  9. Draw conclusions from the analysis.

The course was called Introduction to Experimentation, and it included many more specifics than just that process. For example, it was also about teamwork basics, the use of lab journals, safe lab practices, how to think about accuracy and precision, and quite a lot of engineering discipline.

The course was nearly completely free of actually interesting math or physics content. For example, the first two 4-hour practicums of the course centered around the measurement of the resistance of a 10 ohm resistor. Some of the brightest 18- and 19-year olds in the country would leave that practicum feeling properly stupid for the first time, very frustrated that they had “proven” the resistor to have a resistance of 11+/-0.4 Ohm (where in reality the resistor was “known” to be something like 10.000+/-0.001 Ohm).

The art of being wrong

Teaching that same course (some 2 years later) has turned out to be one of the most valuable things I’ve ever done in my life. One of the key things that students learned in that course was that the teacher might not know either – after all a lab is a strange and wonderful place, and volt meters can in fact break! The teacher in turn learned that even when teaching something seemingly trivial it is possible to be utterly wrong. Powerful phrases that I learned to use included “I don’t know either”, “You are probably right, but I really don’t understand what’s going on”, “Are you sure?”, “I’m not sure”, “How can you be so sure?”, “How can we test that?”, and the uber-powerful “Ah yes, so I was wrong” (eclipsed in power only by “Ok, enough of this, let’s go drink beer”).

This way of inquisitive thinking, with its fundamental acceptance of uncertainty and being wrong, was later amplified by studying things like quantum mechanics with its horrible math and even more horrible concepts. “I don’t know” became my default mind-state. Today, it is one of the most important things I contribute to my work environment (whether it is doing software development, project management, business analytics doesn’t matter) – the power to say “I don’t know” and/or “I was wrong”.

For the last week or two I’ve had lots of fun working closely with a similarly schooled engineer (he really doesn’t know anything either…) to try and debug and change a complex software system. It’s been useful staring at the same screen, arguing with each other that really we don’t know enough about X or Y or Z to even try and form a hypothesis. Communicating out to the wider group, I’ve found that almost everyone cringes at the phrase “we don’t know” or my recent favorite “we still have many unknown unknowns”. Not knowing seems to be a horrible state of mind, rather than the normal one.

Bits and bytes don’t lie?

I have a hypothesis about that aversion to the unknown: people see computers as doing simple boolean logic on bits and bytes, so it should be quite possible to just know everything about a software system. As they grow bigger, all that changes is that there are more operations on more data, but you never really stop knowing. A sound and safe castle of logic!

In fact, I think that’s a lot of what computer science teaches (as far as I know, I never actually studied computer science in university, I just argued a lot with the computer so-called-scientists). You start with clean discrete math and through state machines and automata and functional programming you can eventually find your way to the design of distributed systems and all the way to the nirvana of artificial intelligence. (AI being much better than the messy biological reality of forgetting things and the like.) Dealing with uncertainty and unknowns is not what computer science seems to be about.

The model of “clean logic all the way down” is completely useless when doing actual software development work. Do you really know which compiler was used on which version of the source code that led to the firmware that is now in your raid controller, and that there are no relevant bugs in it or in that compiler? Are you sure the RAM memory is plugged in correctly in all your 200 boxes? Is your data centre shielded enough from magnetic disturbances? Is that code you wrote 6 months ago really bug-free? What about that open source library you’re using everywhere?

In fact, this computer-science focus on logic and algorithms and its high appreciation of building systems is worse than just useless. It creates real problems. It means the associated industry sees its output in terms of lines of code written, features delivered, etc. The most revered super star engineers are those that crank out new software all the time. Web frameworks are popular because you can build an entire blog with them in 5 minutes.

Debugging and testing, that’s what people that make mistakes have to do. Software design is a group activity but debugging is something you do on your own without telling anyone that you can’t find your own mistake. If you are really good you will make fewer mistakes, will have to spend less time testing, and so produce more and better software more quickly. If you are really really good you might do test-driven development and with your 100% test coverage you just know that you cannot be wrong…

The environment in which we develop software is not nearly as controlled as we tend to assume. Our brains are not nearly as powerful as we believe. By not looking at the environment, by not accepting that there is quite a lot we don’t know, we become very bad at forming a reasonable hypothesis, and worse at interpreting our test data.

Go measure a resistor

So here’s my advice to people that want to become better software developers: try and measure some resistors. Accept that you’re wrong, that you don’t know, and that you don’t understand.

[RT] MyCouch

The below post is an edited version of a $work e-mail, re-posted here at request of some colleagues that wanted to forward the story. My apologies if some of the bits are unclear due to lack-of-context. In particular, let me make clear:

  • we have had a production CouchDB setup for months that works well
  • we are planning to keep that production setup roughly intact for many more months and we are not currently planning to migrate away from CouchDB at all
  • overall we are big fans of the CouchDB project and its community and we expect great things to come out of it

Nevertheless using pre-1.0 software based on an archaic language with rather crappy error handling can get frustrating 🙂

Subject: [RT] MyCouch
From: Leo Simons 
To: Forge Engineering 

This particular RT gives one possible answer to the question “what would be a good way to make this KV debugging somewhat less frustrating?” (we have been fighting erratic response times from CouchDB under high load while replicating and compacting)

That answer is “we could probably replace CouchDB with java+mysql, and it might even be easy to do so”. And, then, “if it really is easy, that’s extra cool (and _because of_ CouchDB)”.

Why replace CouchDB?

Things we really like about CouchDB (as the backend for our KV service):

  • The architecture: HTTP/REST all the way down, MVCC, many-to-many replication, scales without bound, neat composable building blocks makes an evolvable platform.
  • Working system: It’s in production, it’s running, it’s running pretty well.
  • Community: open source, active project, know the developers, “cool”.
  • Integrity: it hasn’t corrupted or lost any data yet, and it probably won’t anytime soon.

Things we like less:

  • Debugging: cryptic error messages, erlang stack traces, process deaths.
  • Capacity planning: many unknown and changing performance characteristics.
  • Immaturity: pre-1.0.
  • Humanware: lack of erlang development skills, lack of DBA-like skills, lack of training material (or trainers) to gain skills.
  • Tool support: JProfiler for erlang? Eclipse for erlang? Etc.
  • Map/Reduce and views: alien concept to most developers, hard to audit and manage free-form javascript from tenants, hard to use for data migrations and aggregations.
  • JSON: leads to developers storing JSON which is horribly inefficient.

Those things we don’t like about couch unfortunately aren’t going to change very quickly. For example, the effort required to train up a bunch of DBAs so they can juggle CouchDB namespaces and instances and on-disk data structures is probably rather non-trivial.

The basic idea

It is not easy to see what other document storage system out there would be a particularly good replacement. Tokyo Cabinet, Voldemort, Cassandra, … all of these are also young and immature systems with a variety of quirks. Besides, we really really like the CouchDB architecture.

So why don’t we replace CouchDB with a re-implemented CouchDB? We keep the architecture almost exactly the same, but re-implement the features we care about using technology that we know well and is in many ways much more boring. “HTTP all the way down” should mean this is possible.

We could use mysql underneath (but not use any of its built-in replication features). The java program on top would do the schema and index management, and most importantly implement the CouchDB replication and compaction functionality.

We could even keep the same deployment structure. Assuming one java server is paired with one mysql database instance, we’d end up with 4 tomcat instances on 4 ports (5984-5987) and 4 mysql services on 4 other ports (3306-3309). Use of mysqld_multi probably makes sense. Eventually we could perhaps optimize a bit more by having one tomcat process and one mysql process – it’ll make better use of memory.

Now, what is really really really cool about the CouchDB architecture and its complete HTTP-ness is that we should be able to do any actual migration one node at a time, without downtime. Moving the data across is as simple as running a replication. Combined with the fact that we’ve been carefully avoiding a lot of its features, CouchDB is probably one of the _easiest_ systems to replace 😀
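
Concretely, pointing each new node at its old counterpart is a single HTTP call per namespace, something like the below (host names and the namespace are made up; the same call should work whether the target is real CouchDB or the java+mysql impersonation):

  # on the new node: pull namespace ns1 across from the old node
  curl -X POST http://new-node:5984/_replicate \
      -H "Content-Type: application/json" \
      -d '{"source": "http://old-node:5984/ns1", "target": "ns1"}'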

Database implementation sketch

How would we implement the database? If we think of our KV data as having the form

  ns1:key1 [_rev=1-12345]: { ...}
  ns1:key2 [_rev=2-78901]: { subkey1: ..., }
  ns2:key3 [_rev=1-43210]: { subkey1: ..., subkey2: ...}

where the first integer part of the _rev is dubbed “v” and the remainder part as “src”, then a somewhat obvious database schema looks like (disclaimer: schema has not been tested, do not use :-)):

CREATE TABLE namespace (
  id varchar(64) NOT NULL PRIMARY KEY
      CHARACTER SET ascii COLLATE ascii_bin,
  state enum('enabled','disabled','deleted') NOT NULL
) ENGINE=InnoDB;

CREATE TABLE {namespace}_key (
  ns varchar(64) NOT NULL
      CHARACTER SET ascii COLLATE ascii_bin,
  key varchar(180) NOT NULL
      CHARACTER SET ascii COLLATE ascii_bin,
  v smallint UNSIGNED NOT NULL,
  src int UNSIGNED NOT NULL,

  PRIMARY KEY (ns, key, v, src),
  FOREIGN KEY (ns) REFERENCES namespace(id)
) ENGINE=InnoDB;

CREATE TABLE {namespace}_value (
  ns varchar(64) NOT NULL
      CHARACTER SET ascii COLLATE ascii_bin,
  key varchar(180) NOT NULL
      CHARACTER SET ascii COLLATE ascii_bin,
  v smallint UNSIGNED NOT NULL,
  src int UNSIGNED NOT NULL,
  subkey varchar(255) NOT NULL
      CHARACTER SET utf8 COLLATE utf8_general_ci,
  small_value varchar(512) DEFAULT NULL
      CHARACTER SET utf8 COLLATE utf8_general_ci
      COMMENT 'will contain the value if it fits',
  large_value mediumtext DEFAULT NULL
      CHARACTER SET utf8 COLLATE utf8_general_ci
      COMMENT 'will contain the value if its big',

  PRIMARY KEY (ns, key, v, src, subkey),
  FOREIGN KEY (ns) REFERENCES namespace(id),
  FOREIGN KEY (ns, key, v, src)
      REFERENCES {namespace}_key(ns, key, v, src)
      ON DELETE CASCADE
) ENGINE=InnoDB;

With obvious queries including

  SELECT id FROM namespace WHERE state = 'enabled';

  SELECT key FROM {namespace}_key WHERE ns = ?;
  SELECT key, v, src FROM {namespace}_key WHERE ns = ?;
  SELECT v, src FROM {namespace}_key WHERE ns = ?
      AND key = ?;
  SELECT v, src FROM {namespace}_key WHERE ns = ?
      AND key = ? ORDER BY v DESC LIMIT 1;
  SELECT subkey, small_value FROM {namespace}_value
      WHERE ns = ? AND key = ? AND v = ? AND src = ?;
  SELECT large_value FROM {namespace}_value
      WHERE ns = ? AND key = ? AND v = ? AND src = ?
      AND subkey = ?;

  BEGIN;
  CREATE TABLE {namespace}_key (...);
  CREATE TABLE {namespace}_value (...);
  INSERT INTO namespace(id) VALUES (?);
  COMMIT;

  UPDATE namespace SET state = 'disabled' WHERE id = ?;
  UPDATE namespace SET state = 'deleted' WHERE id = ?;

  BEGIN;
  DROP TABLE {namespace}_value;
  DROP TABLE {namespace}_key;
  DELETE FROM namespace WHERE id = ?;
  COMMIT;

  INSERT INTO {namespace}_key (ns,key,v,src)
      VALUES (?,?,?,?);
  INSERT INTO {namespace}_value (ns,key,v,src,small_value)
      VALUES (?,?,?,?,?),(?,?,?,?,?),(?,?,?,?,?),(?,?,?,?,?);
  INSERT INTO {namespace}_value (ns,key,v,src,large_value)
      VALUES (?,?,?,?,?);

  DELETE FROM {namespace}_key WHERE ns = ? AND key = ?;
  DELETE FROM {namespace}_key WHERE ns = ? AND key = ?
      AND v < ?;
  DELETE FROM {namespace}_key WHERE ns = ? AND key = ?
      AND v = ? AND src =?;

The usefulness of {namespace}_value is debatable; it helps a lot when implementing CouchDB views or some equivalent functionality (“get me all the documents in this namespace where subkey1=…”), but if we decide not to care, then it’s redundant and {namespace}_key can grow some additional small_value (which should then be big enough to contain a typical JSON document, i.e. maybe 1k) and large_value columns instead.

Partitioning the tables by {namespace} manually isn’t needed if we use MySQL 5.1 or later; table partitions could be used instead.

I’m not sure if we should have a ‘state’ on the keys and do soft-deletes; that might make actual DELETE calls faster; it could also reduce the impact of compactions.

Webapp implementation notes

The java “CouchDB” webapp also does not seem that complicated to build (famous last words?). I would probably build it roughly the same way as [some existing internal webapps].

The basic GET/PUT/DELETE operations are straightforward mappings onto queries that are also rather straightforward.
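
For instance, a plain document GET (CouchDB-style URL, made-up namespace and key) maps onto a couple of the SELECTs from the schema sketch:

  # fetch the latest revision of ns1/key1
  curl http://localhost:5984/ns1/key1
  # which the webapp would turn into roughly:
  #   SELECT v, src FROM ns1_key WHERE ns = 'ns1' AND key = 'key1'
  #       ORDER BY v DESC LIMIT 1;
  #   SELECT subkey, small_value FROM ns1_value
  #       WHERE ns = 'ns1' AND key = 'key1' AND v = ? AND src = ?;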

The POST /_replicate and POST /_compact operations are of course a little bit more involved, but not that much. Assuming some kind of a pool of url fetchers and some periodic executors…

Replication:

  1. get last-seen revision number for source
  2. get list of updates from source
  3. for each update
    • INSERT key
    • if duplicate key error, ignore and don’t update values
    • REPLACE INTO (or INSERT … ON DUPLICATE KEY UPDATE) all the values

Compaction:

  1. get list of namespaces
  2. for each namespace:
    • SELECT key, v, src FROM {namespace}_key WHERE ns = ? ORDER BY key ASC, v DESC, src DESC;
    • skip the first row for each key
    • if the second row for the key is the same v, conflict, don’t compact for this key
    • DELETE IGNORE FROM {namespace}_key WHERE ns = ? AND key = ? AND v = ? AND src =?;

So we need some kind of a replication record; once we have mysql available using “documents” seems awkward; let’s use a database table. We might as well have one more MySQL database on each server with a full copy of a ‘kvconfig’ database, which is replicated around (using mysql replication) to all the nodes. Might also want to migrate away from NAMESPACE_METADATA documents…though maybe not, it is nice and flexible that way.

Performance notes

In theory, the couchdb on-disk format should be much faster than innodb for writes. In practice, innodb has seen quite a few years of tuning. More importantly, in our tests on our servers raw mysql performance seems to be rather better than couchdb. Some of that is due to the extra fsyncs in couchdb, but not all of it.

In theory, the erlang OTP platform should scale out much better than something java-based. In practice, the http server inside couchdb is pretty much a standard fork design using blocking I/O. More importantly, raw tomcat can take >100k req/s on our hardware, which is much much more than our disks can do.

In theory, having the entire engine inside one process should be more efficient than java talking to mysql over TCP. In practice, I doubt this will really show up if we run java and mysql on the same box. More importantly, if this does become an issue, longer-term we may be able to “flatten the stack” by pushing the java “CouchDB” up into the service layer and merging it with the KV service, at which point java-to-mysql will be rather more efficient than java-to-couch.

In theory and in practice innodb has better indexes for the most common SELECTs/GETs so it should be a bit faster. It also is better at making use of large chunks of memory. I suspect the two most common requests (GET that returns 200, GET that returns 404) will both be faster, which incidentally are the most important for us to optimize, too.

We might worry java is slow. That’s kind-of silly :). In theory and in practice garbage collection makes software go faster. We just need to avoid doing those things that make it slow.

The overhead of ACID guarantees might be a concern. Fortunately MySQL is not _really_ a proper relational database if you don’t want it to be. We can probably set the transaction isolation level to READ UNCOMMITTED safely, and the schema design / usage pattern is such that we don’t need transactions in most places. More importantly we are keeping the eventual consistency model, with MVCC and all, on a larger scale. Any over-ACID-ness will be local to the particular node only.

Most importantly, this innodb/mysql thing is mature/boring technology that powers a lot of the biggest websites in the world. As such, you can buy books and consultancy and read countless websites about mysql/innodb/tomcat tuning. Its performance characteristics are pretty well-known and pretty predictable, and lots of people (including here at $work) can make those predictions easily.

So when are we doing this?

No no, we’re not, that’s not the point, this is just a RT! I woke up (rather early) with this idea in my head so I wrote it down to make space for other thoughts. At a minimum, I hope the above helps propagate some ideas:

  • just how well we applied REST and service-oriented architecture here and the benefits it’s giving us
  • in particular because we picked the right architecture we are not stuck with / tied to CouchDB, now or later
  • we can always re-engineer things (though we should have good enough reasons)
  • things like innodb and/or bdb (or any of the old dbs) are actually great tools with some great characteristics

Just like FriendFeed?

Bret Taylor has a good explanation of how FriendFeed built a non-relational database on top of a relational one. The approach outlined above resembles the solution they implemented rather a lot, though there are also important differences.