Writing one-line shell scripts with bash

If you are using ruby with bundler and Gemfiles properly, you probably know about running commands with bundle exec. However, sometimes this does not get you quite the right results, in particular if your Gemfile is not quite precise enough.

For example, I had an issue with cucumber + autotest + rails where I had both rails 3.1 and rails 3.2 apps using the same RVM. Since I was confused, and in a hurry, I figured one brute-force option to unconfuse me would be to simply remove all old versions of all gems from the environment. I did just that, and thought I’d explain the process of incrementally coming up with the right shell-script one-liner.

First things first, let’s figure out what we have installed:

$ gem
...
  Usage:
...
    gem command [arguments...] [options...]
...
  Examples:
...
    gem list --local
...

Ok, I guess we want gem list:

$ gem list
*** LOCAL GEMS ***

actionmailer (3.2.2, 3.1.1)
...
ZenTest (4.7.0)

actionmailer is part of rails, and we can see there are two versions installed. Let’s figure out how to remove one of them…

$ gem help commands
GEM commands are:
...
    uninstall         Uninstall gems from the local repository
...
$ gem uninstall --help
Usage: gem uninstall GEMNAME [GEMNAME ...] [options]

  Options:
...
    -I, --[no-]ignore-dependencies   Ignore dependency requirements while
                                     uninstalling
...
    -v, --version VERSION            Specify version of gem to uninstall
...

Great. Let’s try it:

$ gem uninstall actionmailer -v 3.1.1

You have requested to uninstall the gem:
	actionmailer-3.1.1
rails-3.1.1 depends on [actionmailer (= 3.1.1)]
If you remove this gems, one or more dependencies will not be met.
Continue with Uninstall? [Yn]  y
Successfully uninstalled actionmailer-3.1.1

Ok, so we need to have it not ask us that question. From studying the command-line options above, the magic switch is -I.

So once we have pinpointed a version to uninstall, our command becomes something like gem uninstall -I $gem_name -v $gem_version. Now we need the list of gems to do this on, so we can run that command a bunch of times.

We’ll now start building our big fancy one-line script. I tend to do this by typing the command, executing it, and then pressing the up arrow to go back in the bash history to re-edit the same command.

Looking at the gem list output again, we can see that any gem with multiple installed versions has a comma in the output, and gems with just one installed version do not. We can use grep to filter the list:

$ gem list | grep ','
actionmailer (3.2.2, 3.1.1)
...
sprockets (2.1.2, 2.0.3)

Great. Now we need to extract just the name of the gem and the problematic version. One way of looking at the listing is as a space-separated set of fields: gemname SPACE (version1, SPACE version2), so we can use cut to pick fields one and three:

$ gem list | grep ',' | cut -d ' ' -f 1,3
...
gherkin 2.5.4)
jquery-rails 2.0.1,
...

Wait, why does the jquery-rails line look different?

$ gem list | grep ',' | grep jquery-rails
jquery-rails (2.0.2, 2.0.1, 1.0.16)

Ok, so it has 3 versions. Really, in this instance, we need to pick out fields 3,4,5,… and loop over them, uninstalling all the old versions (there’s a sketch of that loop below). But that’s a bit hard to do in a one-liner. The alternative is to just pick out field 3 anyway, and run the same command a few times. The first time will remove jquery-rails 2.0.1, and then the second time the output will become something like

jquery-rails (2.0.2, 1.0.16)

and we will remove jquery-rails 1.0.16.
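For completeness, here’s a sketch of what that loop could look like written out, assuming gem list keeps its name (version1, version2, …) output format:

gem list | grep ',' | while read line; do
  name=${line%% *}                            # field 1: the gem name
  versions=$(echo "${line#* }" | tr -d '(),') # e.g. "2.0.2 2.0.1 1.0.16"
  for v in $(echo "$versions" | cut -d ' ' -f 2-); do
    echo gem uninstall -I "$name" -v "$v"     # drop the echo to run it
  done
done

But back to the simpler pick-field-3 approach.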

We’re almost there, but we still need to get rid of the ( and , in our output.

$ gem list | grep ',' | cut -d ' ' -f 1,3 | sed 's/,//' | sed 's/)//'
...
childprocess 0.2.2
...
rack 1.3.5

Looking nice and clean.

To run our gem uninstall command, we know we need to prefix the version with -v:

$ gem list | grep ',' | cut -d ' ' -f 1,3 | sed 's/,//' | sed 's/)//' | sed 's/ / -v /'
...
childprocess -v 0.2.2
...

Ok, so now at the start of the list we want to put gem uninstall -I . We can use the regular expression ‘^’ to match the beginning of the line. We’ll need sed to evaluate our regular expressions…

$ gem list | grep ',' | cut -d ' ' -f 1,3 | sed 's/,//' | sed 's/)//' | sed 's/ / -v /' | sed -r 's/^/gem uninstall -I/'
sed: illegal option -- r
usage: sed script [-Ealn] [-i extension] [file ...]
       sed [-Ealn] [-i extension] [-e script] ... [-f script_file] ... [file ...]

Ugh. -r is the switch used in the GNU version of sed. I’m on Mac OS X which comes with BSD sed, which uses -E.

$ gem list | grep ',' | cut -d ' ' -f 1,3 | sed 's/,//' | sed 's/)//' | sed 's/ / -v /' | sed -E 's/^/gem uninstall -I /'
...
gem uninstall -I childprocess -v 0.2.2
...

Ok. That looks like it’s the list of commands that we want to run. Since the next step will be the big one, before we actually run all the commands, let’s check that we can do so safely. A nice trick is to echo out the commands:

$ gem list | grep ',' | cut -d ' ' -f 1,3 | sed 's/,//' | sed 's/)//' | sed 's/ / -v /' | sed -E 's/^/echo gem uninstall -I /' | sh
gem uninstall -I childprocess -v 0.2.2

Ok, so evaluating through sh works. Let’s remove the echo:

$ gem list | grep ',' | cut -d ' ' -f 1,3 | sed 's/,//' | sed 's/)//' | sed 's/ / -v /' | sed -E 's/^/gem uninstall -I /' | sh
...
Successfully uninstalled childprocess-0.2.2
Removing rails
Successfully uninstalled rails-3.1.1
...

I have no idea why that rails gem gets the extra line of output. But it looks like it all went ok. Let’s remove the ‘sh’ again and check:

$ gem list | grep ',' | cut -d ' ' -f 1,3 | sed 's/,//' | sed 's/)//' | sed 's/ / -v /' | sed -E 's/^/gem uninstall -I /'
gem uninstall -I jquery-rails -v 1.0.16

Oh, that’s right, the jquery-rails gem had two versions. Let’s uninstall that, then. We press the up arrow twice to get back the command line that ends with | sh, and run that. Great: we’re all done!

Let’s look at the final command again:

gem list | grep ',' | cut -d ' ' -f 1,3 | sed 's/,//' | sed 's/)//' | sed 's/ / -v /' | sed -E 's/^/gem uninstall -I /' | sh
  1. gem list shows us the locally installed gems
  2. | grep ',' limits that output to lines with a comma in them, that is, gems with multiple versions installed
  3. | cut -d ' ' -f 1,3 splits the remaining lines by spaces, then picks fields one and three
  4. | sed 's/,//' removes all , from the output
  5. | sed 's/)//' removes all ) from the output
  6. | sed 's/ / -v /' replaces the (one remaining) space with -v
  7. | sed -E 's/^/gem uninstall -I /' puts gem uninstall -I at the start of every line
  8. | sh evaluates all the lines in the output as commands

Note that, as a reusable program or installable shell script, this command really isn’t good enough. For example:

  • It does not check the exit codes / statuses of the commands it runs; it just assumes they all succeed
  • It assumes the output of gem list will always match our expectations (for example, that the output header does not have a comma in it, or that gem names and versions cannot contain a space, comma, or closing parenthesis -- this may be true but I wouldn't know for sure)
  • It assumes that -I is the only switch needed to prevent gem uninstall from ever asking us questions

The best approach for a reusable command would probably be to write a script in ruby that used the RubyGems API. However, that would be much more work than writing our one-liner, which is “safe enough” for this use-once approach.
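A minimal sketch of what such a script could look like, assuming a RubyGems version where Gem::Specification is enumerable (the API has moved around between RubyGems releases):

require 'rubygems'

# keep only the newest version of each installed gem,
# checking the exit status of every uninstall
Gem::Specification.group_by(&:name).each do |name, specs|
  next if specs.size < 2
  specs.sort_by(&:version)[0..-2].each do |spec|
    ok = system('gem', 'uninstall', '-I', name, '-v', spec.version.to_s)
    abort "failed to uninstall #{name} #{spec.version}" unless ok
  end
end

Unlike the one-liner, this doesn’t depend on the exact gem list output format, and it notices when an uninstall fails.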

(For the record, this didn’t solve my rails + cucumber woes. Instead, it turns out I had run bundle install --without test at some point, and bundler remembers the --without. So I needed rm Gemfile.lock; bundle install --without ''.)

Web application platform technology choices

The hardest bit in the web application platform challenge is making reasonable choices. Here’s a stab at some of them…

Hosting models

I see these basic choices:

  1. LAMP virtual hosting. If you can build everything you need with mysql+php and you have few enough users that you need only one database server, this is by far the easiest and cheapest option.
  2. Application hosting. Code on github, project management with basecamp or hosted jira, build on AppEngine or Heroku or force.com. You don’t have to do your own infrastructure but you’re limited in what you can build. Also comes with a large chance of lock-in.
  3. Managed hosting. Rent (virtual) servers with pre-installed operating systems and managed networking. Expensive for large deployments but you don’t need all web operations team skills and you have a lot of flexibility (famously, twitter do this).
  4. Dedicated hosting. Buy or rent servers, rent rackspace or build your own data center. You need network engineers and people that can handle hardware. Usually the only cost-effective option beyond a certain size.

Given our stated requirements, we are really only talking about option #4, but I wanted to mention the alternatives because they will make sense for a lot of people. Oh, and I think all the other options are these days called cloud computing 🙂

Hardware platform

I’m not really a hardware guy; normally I leave this kind of stuff to others. Anyone have any good hardware evaluation guides? Some things I do know:

  • Get at least two of everything.
  • Get quality switches. Many of the worst outages have something to do with blown-up switches, and since you usually have only a few, losing one during a traffic spike is uncool.
  • Get beefy database boxes. Scaling databases out is hard, but they scale up nicely without wasting resources.
  • Get beefy (hardware) load balancers. Going to more than 2 load balancers is complicated, and while the load balancers have spare capacity they can help with SSL, caching, etc.
  • Get beefy boxes to run your monitoring systems (remember, two of everything). In my experience most monitoring systems suffer from pretty crappy architectures, and so are real resource hogs.
  • Get hardware RAID (RAID 5 seems common) with a battery-backed write cache, for all storage systems. That is, unless you have some other redundancy architecture and you don’t need RAID for redundancy.
  • Don’t forget about hardware for backups. Do you need tape?

Other thoughts:

  • Appliances. I really like the idea. Things like the schooner appliances for mysql and memcache, or the kickfire appliance for mysql analytics. I have no firsthand experience with them (yet) though. I’m guessing oracle+sun is going to be big in this space.
  • SSD. It is obviously the future, but right now they seem to come with limited warranties, and they’re still expensive enough that you should only use them for data that will actually get hot.

Operating system

Choice #1: unix-ish or windows or both. The Microsoft Web Platform actually looks pretty impressive to me these days but I don’t know much about it. So I’ll go for unix-ish.

Choice #2: ubuntu or red hat or freebsd or opensolaris.

I think Ubuntu is currently the best of the debian-based linuxes. I somewhat prefer ubuntu to red hat, primarily because I really don’t like RPM. Unfortunately red hat comes with better training and certification programs, better hardware vendor support and better available support options.

FreeBSD and solaris have a whole bunch of advantages (zfs, zones/jails, smf, network stack, many-core, …) over linux that would make linux seem like a useless toy, were it not for the fact that linux sees so much more use. This is important: linux has the largest array of pre-packaged software that works on it out of the box, linux runs on more hardware (like laptops…), and many more developers are used to linux.

One approach would be solaris for database (ZFS) and media (ZFS!) hosting, and linux for application hosting. The cost of that, of course, would be the complexity in having to manage two platforms. The question then is whether the gain in manageability offsets the price paid in complexity.

And so, red hat gains another (reluctant) customer.

Database

As much sympathy as I have for the NoSQL movement, the relational database is not dead, and it sure as hell is easier to manage. When dealing with a wide variety of applications by a wide variety of developers, and a lot of legacy software, I think a SQL database is still the default model to go with. There’s a large range of options there.

Choice #1: clustered or sharded. At some point some application will have more data than fits on one server, and it will have to be split. Either you use a fancy database that supports clustering (like Oracle or SQL Server), or you use some fancy clustering middleware (like continuent), or you teach your application to split up the data (using horizontal partitioning or sharding) and you use a more no-frills open source database (mysql or postgres).
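To make the sharding option concrete, a hypothetical sketch: the application derives the target database from the row’s key, here simply by taking the key modulo a fixed list of shards.

SHARDS = ['db1.example.com', 'db2.example.com',
          'db3.example.com', 'db4.example.com']

# hypothetical: pick the database that holds this user's data
def shard_for(user_id)
  SHARDS[user_id % SHARDS.size]
end

Everything around that little function is the hard part: resharding when you add servers, cross-shard queries, and transactions.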

I suspect that the additional cost of operating an oracle cluster may very well be worth paying for – besides not having to do application level clustering, the excellent management and analysis tools are worth it. I wish someone did a model/spreadsheet to prove it. Anyone?

However, it is much easier to find developers skilled with open source databases, and it is much easier for developers to run a local copy of their database for development. Again there’s a tradeoff.

The choice between mysql and postgres has a similar tradeoff. Postgres has a much more complete feature set, but mysql is slightly easier to get started with and has significantly easier-to-use replication features.

And so, mysql gains another (reluctant) customer.

With that choice made, I think it’s important to invest early on in providing some higher-level APIs, so that while the storage engine might be InnoDB and the access to that storage engine might be MySQL, many applications are coded to talk to a more constrained API. Things like Amazon’s S3, SimpleDB and the Google AppEngine data store provide good examples of constrained APIs that are worth emulating.
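To make “constrained” concrete, a hypothetical sketch: the whole storage API an application gets to see is keyed gets and puts, with no joins and no ad-hoc SQL, so the backing store can be re-partitioned or replaced later. A plain Hash stands in for the real store here, just to keep it runnable:

# hypothetical sketch of a deliberately constrained storage API
class ConstrainedStore
  def initialize
    @data = {}        # stand-in for InnoDB-via-MySQL, S3, ...
  end

  def get(key)
    @data[key]
  end

  def put(key, blob)
    @data[key] = blob
  end

  def delete(key)
    @data.delete(key)
  end
end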

HTTP architecture

Apache HTTPD. Easiest choice so far. Its swiss-army-knife characteristic is quite important. It’s what everyone knows. Things like nginx are pretty cool and can be used as the main web server, but I suspect most people that switch to them should’ve spent some time tuning httpd instead. Since I know how to do that (see the sketch below), I’ll stick with what I know.
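For what it’s worth, the kind of tuning I mean is mostly MPM and KeepAlive sizing. A sketch for a 2.2-era worker MPM, with purely illustrative numbers (not recommendations):

<IfModule mpm_worker_module>
    # MaxClients = ServerLimit * ThreadsPerChild
    ServerLimit         16
    ThreadsPerChild     25
    MaxClients         400
    StartServers         4
    MinSpareThreads     25
    MaxSpareThreads     75
</IfModule>
KeepAlive            On
KeepAliveTimeout     2
MaxKeepAliveRequests 100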

As easy as that choice is, the choice of what to put between HTTPD and the web seems to be harder than ever. The basic sanctioned architecture these days seems to use BGP load sharing to have the switches direct traffic at some fancy layer-7 load balancers, where you terminate SSL and KeepAlive. Those fancy load balancers may then point at a layer of caching reverse proxies, which then point at the (httpd) app servers.

I’m going to assume we can afford a pair of F5 Big-IPs per datacenter. Since they can do caching, too, we might avoid building that reverse proxy layer until we need it (at which point we can evaluate squid, varnish, HAProxy, nginx and perlbal, with that evaluation showing we should go with Varnish 🙂 ).

Application architecture

Memcache is nearly everywhere, obviously. Or is it? If you’re starting mostly from scratch and most stuff can be AJAX, http caching in front of the frontends (see above) might be nearly enough.

Assuming a 3-tier (web, middleware, db) system, reasonable choices for the front-end layer might include PHP, WSGI+Django, and mod_perl. I still can’t see myself rolling out Ruby on Rails on a large scale. Reasonable middleware choices might include java servlets, unix daemons written in C/C++, and more mod_perl. I’d say Twisted would be an unreasonable but feasible choice 🙂

Communication between the layers could be REST/HTTP (probably going through the reverse proxy caches) but I’d like to try and make use of thrift. Latency is a bitch, and HTTP doesn’t help.

I’m not sure whether considering a 2-tier system (i.e. PHP straight to the database, or perhaps PHP linked against C/C++ modules that talk to the database) makes sense these days. I think the layered architecture is usually worth it, mostly for organizational reasons: you can have specialized backend teams and frontend teams.

If it was me personally doing the development, I’m pretty sure I would go 3-tier, with (mostly) mod_wsgi/python frontends using (mostly) thrift to connect to (mostly) daemonized python backends (to be re-written in faster/more concurrent languages as usage patterns dictate) that connect to a farm of (mostly) mysql databases using raw _mysql, with just about all caching in front of the frontend layer. I’m not so sure it’s easy to teach a large community of people that pattern; it’d be interesting to try 🙂

As for the more boring choice… PHP frontends with java and/or C/C++ backends with REST in the middle seems easier to teach and evangelize, and it’s also easier to patch up bad apps by sticking custom caching stuff (and, shudder, mod_rewrite) in the middle.

Messaging

If there’s anything obvious in today’s web architecture it is that deferred processing is absolutely key to low-latency user experiences.

The obvious way to do asynchronous work is by pushing jobs on queues. One hard choice at the moment is what messaging stack to use. Obvious contenders include:

  • Websphere MQ (the expensive incumbent)
  • ActiveMQ (the best-known open source system with stability issues)
  • OpenAMQ (AMQP backed by interesting startup)
  • 0MQ (AMQP bought up by same startup)
  • RabbitMQ (AMQP by another startup; erlang yuck)
  • MRG (or QPid, AMQP by red hat which is not exactly a startup).

A less obvious way to do asynchronous work is through a job architecture such as gearman, app engine cron or quartz, where the queue is not explicit but rather exists as a “pending connections” set of work.

I’m not sure what I would pick right now. I’d probably still stay safe and use AMQ with JMS and/or STOMP with JMS semantics. 2 months from now I might choose differently.

A short url-safe identifier scheme

Let’s say you’re building a central-database system that you may want to make into a distributed system later. Then you don’t want to tie yourself to serial numeric identifiers (like the ones that Ruby on Rails is full of).

What do distributed platforms do?

They leave the id-generation problem to the user (though they will provide details based on some very-unique number). IDs are strings (UTF-8 or ascii-safe), and can be quite long:

250 characters seems like a pretty large upper limit.

128 random bits should be unique enough for anybody

UUIDs are 128 bits and are encoded as 32 characters (base16 with 4 dashes). The possibility of an identifier collision is really really tiny (random UUIDs have 122 random bits).

Unfortunately, UUIDs are ugly:

http://example.com/68ff9b72-7b6a-4ea4-b35f-77ff50f938fb

It is just not a nice url. It would be nice if we could take 128-bit numbers and encode them as base64, or maybe base62 or url-safe-base64, or maybe even as base36 for increased compatibility. A 128-bit number is 22 characters in base64, 25 characters in base36. You end up with:

http://example.com/f5lxx1zz5pnok6cyejdnd7ri9

What about 64 bits?

If we went with 64-bit numbers, we’d sacrifice quite a bit of collision-prevention: by the birthday bound, a collision becomes likely once you approach 2^32, or about four billion, identifiers. But maybe that is not so scary on a per-application basis.

What is also interesting is that lots of software supports operations on 64-bit numbers a lot better than on 128-bit numbers. We would end up with 13 characters in base36 (11 in base64). I.e. in base36 that looks like this:

http://app.example.com/3w5e11264sgsf

That seems kind-of good enough, for now. Letting identifier collisions surface as failed inserts into the database seems like a reasonable way to handle them, especially if our application is rather neat REST (so a failed PUT can be retried pretty safely).

Moving to a distributed system safely is possible if we have some reasonable identifier versioning scheme (13 characters = version 0, 14 characters = scheme version 1-10, more characters = TBD). Then in our app we match our identifiers using ^[0-9a-z][0-9a-z-]{11,34}[0-9a-z]$ (the upper bound of 34 on the middle part lets the regex also match 36-character dashed UUIDs).
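A quick sanity check of that pattern in irb, against the base36 id and the dashed UUID from earlier:

>> '3w5e11264sgsf' =~ /^[0-9a-z][0-9a-z-]{11,34}[0-9a-z]$/
=> 0
>> '68ff9b72-7b6a-4ea4-b35f-77ff50f938fb' =~ /^[0-9a-z][0-9a-z-]{11,34}[0-9a-z]$/
=> 0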

Some ruby

def encode_id(n)
  # 13 base36 characters fit any 64-bit number
  return n.to_s(36).rjust(13,'0')
end

def decode_id(s)
  return s.to_i(36)
end

def gen_id()
  # 18446744073709551615 == 2**64 - 1
  return encode_id( rand( 18446744073709551615 ) )
end
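A quick check in irb; note that the 13-character result is exactly the id from the example url above:

>> encode_id(18446744073709551615)
=> "3w5e11264sgsf"
>> decode_id("3w5e11264sgsf")
=> 18446744073709551615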

Some MySQL

Besides ports of the above functions, here are some ideas on how to maintain consistency of ids across data types (tables).

CREATE FUNCTION encode_id (n BIGINT) RETURNS char(13) NO SQL
  RETURN LPAD( LOWER(CONV(n,10,36)), 13, '0');

CREATE FUNCTION decode_id (n char(13)) RETURNS BIGINT NO SQL
  RETURN CONV(n,36,10);

CREATE FUNCTION gen_num_id () RETURNS BIGINT NO SQL
  -- note: the multiplier is roughly 2^64/100, presumably so the result
  -- stays inside the range of a signed BIGINT
  RETURN FLOOR(RAND() * 184467440737095516);

CREATE FUNCTION gen_id () RETURNS char(13) NO SQL
  RETURN encode_id( gen_num_id() );

CREATE TABLE ids (
  -- this table should not be updated directly by apps,
  --   though they are expected to read from it
  numid BIGINT unsigned NOT NULL PRIMARY KEY,
  id char(13) NOT NULL UNIQUE,
  prettyid varchar(64) DEFAULT NULL UNIQUE
) ENGINE=InnoDB;

CREATE TABLE mythings (
  numid BIGINT unsigned NOT NULL PRIMARY KEY,
  id char(13) NOT NULL UNIQUE,
  prettyid varchar(64) DEFAULT NULL UNIQUE,
  something varchar(255) DEFAULT NULL
) ENGINE=InnoDB;

CREATE TABLE mythings2ids (
  -- this table should not be updated directly by apps,
  --   though its ok if they read from it
  numid BIGINT unsigned NOT NULL PRIMARY KEY,
  CONSTRAINT FOREIGN KEY (numid)
    REFERENCES ids (numid)
    ON DELETE cascade
    ON UPDATE cascade,
  CONSTRAINT FOREIGN KEY (numid)
    REFERENCES mythings (numid)
    ON DELETE cascade
    ON UPDATE cascade
) ENGINE=InnoDB;

DELIMITER |
CREATE TRIGGER mythings_before_insert BEFORE INSERT ON mythings
  FOR EACH ROW BEGIN
    INSERT INTO ids (numid,id,prettyid) VALUES (NEW.numid, NEW.id, NEW.prettyid);
  END
|
CREATE TRIGGER mythings_after_insert AFTER INSERT ON mythings
  FOR EACH ROW BEGIN
   INSERT INTO mythings2ids (numid) VALUES (NEW.numid);
  END
|
CREATE TRIGGER mythings_before_update BEFORE UPDATE ON mythings
  FOR EACH ROW BEGIN
    IF NEW.numid != OLD.numid THEN
      CALL CANNOT_CHANGE_NUMID_AFTER_CREATION;
    END IF;
    IF NEW.id != OLD.id THEN
      CALL CANNOT_CHANGE_ID_AFTER_CREATION;
    END IF;
    -- NULL-safe equality: plain != yields NULL when OLD.prettyid is NULL,
    -- which would wrongly skip this branch
    IF NOT (NEW.prettyid <=> OLD.prettyid) THEN
      IF OLD.prettyid IS NOT NULL THEN
        CALL CANNOT_CHANGE_PRETTYID_AFTER_INIT;
      ELSE
        UPDATE ids SET prettyid = NEW.prettyid
          WHERE numid = NEW.numid LIMIT 1;
      END IF;
    END IF;
  END
|
CREATE TRIGGER mythings_after_delete AFTER DELETE ON mythings
  FOR EACH ROW BEGIN
   DELETE FROM ids WHERE numid = OLD.numid LIMIT 1;
  END
|
DELIMITER ;

-- SELECT gen_id() INTO @nextid;
-- INSERT INTO mythings (numid,id,prettyid,something)
--   VALUES (decode_id(@nextid),@nextid,
--       '2009/03/22/safe-id-names2','blah blah blah');

Some python

Python lacks built-in base36 encoding. The code below is based on a sample I found, nicer than my own attempts that used recursion…

import string
import random

__ALPHABET = string.digits + string.ascii_lowercase
__ALPHABET_REVERSE = dict((c, i) for (i, c) in enumerate(__ALPHABET))
__BASE = len(__ALPHABET)
__MAX = 18446744073709551615L  # 2**64 - 1
__MAXLEN = 13

def encode_id(n):
    s = []
    while True:
        n, r = divmod(n, __BASE)
        s.append(__ALPHABET[r])
        if n == 0: break
    while len(s) < __MAXLEN:
        s.append('0')
    return ''.join(reversed(s))

def decode_id(s):
    n = 0
    for c in s.lstrip('0'):
        n = n * __BASE + __ALPHABET_REVERSE[c]
    return n

def gen_id():
    return encode_id(random.randint(0, __MAX))

Diving into ruby on rails part 2

The coffee helped!

I’ve followed through the excellent getting started guide with no problems (though my demo site is about britney spears videos, not a blog), nipping out every now and then to check out reference documentation and rails source code (which is pretty tough to follow for now).

I’ve also installed modrails which worked as advertised (gotcha: they forget to mention you should also set up access permissions for your $railsapproot/public).

modrails performance, no database

With the same apache config I settled on for mod_wsgi, out of the box, performance is reasonable:

$ ab -k -n 10000 -c 100 http://127.0.0.1:81/
Requests per second:    508.93 [#/sec] (mean)
Time per request:       196.490 [ms] (mean)
Time per request:       1.965 [ms]
          (mean, across all concurrent requests)

With some very basic tuning:

PassengerHighPerformance on
RailsSpawnMethod smart
PassengerMaxPoolSize 30
PassengerPoolIdleTime 0
PassengerStatThrottleRate 300

I don’t see that much difference:

Requests per second:    533.85 [#/sec] (mean)
Time per request:       187.319 [ms] (mean)
Time per request:       1.873 [ms]
           (mean, across all concurrent requests)

The ruby processes take about 4-5% CPU per process, the httpd ones take about 0.6% per process. So while the overhead of ruby on rails is pretty significant, it’s really not shocking considering how much the framework does for every request.

The built-in mongrel server in development mode does about 40 req/s, so you really don’t want to use that as a guide for performance benchmarking.

modrails performance, with database

Using the sqlite3 database backend with a very simple page:

$ ab -k -n 10000 -c 100 http://127.0.0.1:81/videos/1/comments/3
Requests per second:    256.87 [#/sec] (mean)
Time per request:       389.302 [ms] (mean)
Time per request:       3.893 [ms] (mean, across all concurrent requests)

Let’s try mysql…

production:
  adapter: mysql
  encoding: utf8
  database: britneyweb
  pool: 30
  username: root
  password:
  socket: /tmp/mysql.sock
$ sudo gem install mysql -- \
  --with-mysql-config=/usr/local/mysql/bin/mysql_config
$ RAILS_ENV=production rake db:migrate
$ sudo apachectl restart
# create some sample data in production db...
$ mysql -u root britneyweb
mysql> analyze table comments;
mysql> analyze table tags;
mysql> analyze table videos;
mysql> show create table comments \G
*************************** 1. row ***************************
       Table: comments
Create Table: CREATE TABLE `comments` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `commenterName` varchar(255) DEFAULT NULL,
  `commenterUrl` varchar(255) DEFAULT NULL,
  `commenterEmail` varchar(255) DEFAULT NULL,
  `body` text,
  `video_id` int(11) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
$ 

Hmm, now the machine is swapping. Tuning down a bit, and then:

Requests per second:    250.00 [#/sec] (mean)
Time per request:       400.004 [ms] (mean)
Time per request:       4.000 [ms]
        (mean, across all concurrent requests)

I can’t really get it to go faster. It seems we are pretty much CPU-bound, with the vast majority of CPU going to ruby processes.

Adding page caching

Adding this tiny bit of code to the comments controller:

class CommentsController < ApplicationController
  ...
  caches_page :show
  ...

Helps a ‘bit’:

Requests per second:    4398.80 [#/sec] (mean)
Time per request:       22.733 [ms] (mean)
Time per request:       0.227 [ms]
        (mean, across all concurrent requests)

Now I need a cache sweeper:

class CommentsController < ApplicationController
  ...
  caches_page :show
  cache_sweeper :comment_sweeper, :only => [:create, :update, :edit, :destroy]
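The sweeper itself is a small observer class. Here’s a sketch of roughly what it looks like (in app/models/comment_sweeper.rb), assuming stock Rails 2.3 sweeper conventions and the nested videos/comments routes from earlier:

class CommentSweeper < ActionController::Caching::Sweeper
  observe Comment

  # expire the cached page whenever a comment is saved or destroyed
  def after_save(comment)
    expire_page :controller => 'comments', :action => 'show',
                :video_id => comment.video_id, :id => comment.id
  end
  alias_method :after_destroy, :after_save
end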

Enabling the memcached cache backend in config/environments/production.rb:

config.cache_store = :mem_cache_store, 'localhost',  \
   '192.168.1.1:11211', { :namespace => 'bwprod' }

…is not much faster (so bottleneck is probably elsewhere):

Requests per second:    4461.97 [#/sec] (mean)
Time per request:       22.412 [ms] (mean)
Time per request:       0.224 [ms]
      (mean, across all concurrent requests)

Lessons learned:

  • raw apache + modrails + rails can do a reasonable 500 req/s on my laptop when not connecting to a database, and around 250 req/s when connecting to sqlite3 or mysql for a simple page.
  • I shouldn’t really attempt to performance-tune modrails.
  • ActiveRecord seems good at chewing up CPU.
  • rails page caching makes things go fast, above 4000 req/s on my laptop.
  • rails + memcached is trivial to set up.

Diving (back) into ruby on rails

First lesson learned: if you’re like me and you bought a new mac recently that came with OS X 10.5, and you have installed the developer tools (from your install DVDs), then here are more likely-to-work instructions for installing ruby on rails on mac os x 10.5:

$ sudo gem update --system
$ sudo gem install activeresource
$ sudo gem update
$ rails path/to/your/new/application
$ cd path/to/your/new/application
$ ruby script/server

How the lesson was learned

I abandoned Ruby on Rails somewhere in its 1.x days because it was unstable, evolving too fast, and because the deployment architecture (FastCGI) was stupid. The latter seems to have gotten much better, and apparently my macbook even comes with rails installed. Rails is on version 2.3, which “sounds” pretty mature. So let’s dive in!

I go to the Ruby on Rails homepage and click the red get-started arrow. The download page scares me a bit with comments about upgrading to ruby 1.9, but I figure I should be able to do this:

sudo gem update rails
mkdir foo
rails foo

But I get this:

/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rubygems.rb:379:
in `report_activate_error': RubyGem version error: rake(0.7.3 not >= 0.8.3) (Gem::LoadError)
	from .../rubygems.rb:311:in `activate'
	from .../rubygems.rb:337:in `activate'
	from .../rubygems.rb:336:in `each'
	from .../rubygems.rb:336:in `activate'
	from .../rubygems.rb:65:in `active_gem_with_options'
	from .../rubygems.rb:50:in `gem'
	from /usr/bin/rails:18

Ploughing on looked like it would be depressing:

$ sudo gem update rake
$ rails foo
$ ... RubyGem version error: activesupport(1.4.4 not = 2.3.2) ...

Grr. Time for a bit of RTFM.

$ gem help
$ gem help commands
$ sudo gem update

It seems I’m now installing various not-yet-1.0 tools with impressive names such as acts_as_ferret and hpricot. I also got a bunch more errors:

ERROR:  Error installing RedCloth:
	RedCloth requires RubyGems version >= 1.2
ERROR:  While generating documentation for actionpack-2.3.2
... MESSAGE:   Unhandled special: Special: type=33, text="TODO"
ERROR:  While generating documentation for activerecord-2.3.2
... MESSAGE:   Unhandled special: Special: type=33, text="Fixtures"
ERROR:  While generating documentation for activesupport-2.3.2
... MESSAGE:   Unhandled special: Special: type=33, text="TODO"
ERROR:  While generating documentation for acts_as_ferret-0.4.3
... MESSAGE:   Unhandled special: Special: type=33, text="Ferret"
ERROR:  Error installing capistrano:
	capistrano requires RubyGems version >= 1.2
ERROR:  While generating documentation for daemons-1.0.10
... MESSAGE:   Unhandled special: Special: type=33, text="All"
ERROR:  Error installing fastthread:
	fastthread requires RubyGems version >= 1.2
ERROR:  While generating documentation for ferret-0.11.6
... MESSAGE:   Unhandled special: Special: type=33, text="Provides"
ERROR:  While generating documentation for highline-1.5.0
... MESSAGE:   Unhandled special: Special: type=33, text="A"
ERROR:  While generating documentation for hpricot-0.7
... MESSAGE:   Unhandled special: Special: type=33, text="Detect"
ERROR:  While generating documentation for libxml-ruby-1.1.2
... MESSAGE:   Unhandled special: Special: type=33, text="The"
ERROR:  While generating documentation for mongrel-1.1.5
... MESSAGE:   Unhandled special: Special: type=33, text="Alias"
ERROR:  Error installing net-sftp:
	net-sftp requires RubyGems version >= 1.2
ERROR:  Error installing net-ssh:
	net-ssh requires RubyGems version >= 1.2
ERROR:  While generating documentation for ruby-openid-2.1.4
... MESSAGE:   Unhandled special: Special: type=33, text="A"
ERROR:  While generating documentation for sqlite3-ruby-1.2.4
... MESSAGE:   Unhandled special: Special: type=33, text="This"

Even less impressive:

Gems updated: RedCloth, actionmailer, actionpack, activerecord,
activesupport, acts_as_ferret, capistrano, daemons, dnssd,
fastthread, ferret, highline, hpricot, libxml-ruby, mongrel,
net-sftp, net-ssh, ruby-openid, rubynode, sqlite3-ruby

It says it failed to install RedCloth, but then later on it says it updated it anyway! I wonder if I just b0rked my out-of-the-box ruby setup… Pressing on, according to the RubyGems docs I need to run

$ sudo gem update --system
...
RubyGems installed the following executables:
	/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/gem

If `gem` was installed by a previous RubyGems installation, you may need
to remove it by hand.
$ ls -l `which gem`
/usr/bin/gem ->
   ../../System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/gem

Will it be happier, now?

$ sudo gem update
Updating installed gems
...
Gems updated: RedCloth, net-ssh, net-sftp, net-scp, net-ssh-gateway,
capistrano, fastthread

That looks ok. Looks like I’ll be lacking some documentation for some of my gems, but I can live with that. Try again:

$ rails foo
/Library/Ruby/Site/1.8/rubygems.rb:636:in `report_activate_error':
Could not find RubyGem activeresource (= 2.3.2) (Gem::LoadError)
	from /Library/Ruby/Site/1.8/rubygems.rb:141:in `activate'
	from /Library/Ruby/Site/1.8/rubygems.rb:165:in `activate'
	from /Library/Ruby/Site/1.8/rubygems.rb:164:in `each'
	from /Library/Ruby/Site/1.8/rubygems.rb:164:in `activate'
	from /Library/Ruby/Site/1.8/rubygems.rb:49:in `gem'
	from /usr/bin/rails:18
$ gem list | grep active
activerecord (2.3.2, 1.15.6)
activesupport (2.3.2, 1.4.4)

Not impressed. Time to go find a list of dependencies. Oh, it’s the last one remaining. Duh.

$ sudo gem install activeresource
$ rails foo
      ...
      create  log/test.log
$ cd foo
$ ruby script/server
$ curl http://localhost:3000/

Yay! Success. Time taken to get rails 2.3.2 installed on OS X: ~40 mins. Current mood: feel like an idiot. Feel like ranting at the ruby/rails community about how to do dependency management that doesn’t suck, and about testing your installation instructions. Time for coffee.