stripping illegal characters out of xml in python

XML 1.0 does not allow all characters in unicode:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

It often trips up developers (like, today, me) who end up with perfectly valid unicode, containing valid characters like VT (\x0B) or ESC (\x1B), and suddenly they are producing invalid XML. A decent way to deal with this is to strip out the invalid characters. For example, this stack overflow post shows how to do this with perl:

$str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;

Unfortunately the equivalent does not quite work in Python, since \x{10000}-\x{10FFFF} needs to be expressed as \U00010000-\U0010FFFF, which not all versions of python accept as part of a regular expression character class.

So people end up doing messy-looking things in python. But I figured out that if I invert the character class, the biggest character I need to write is \uFFFF, which the python regex engine does accept. Yay:

import re

# xml 1.0 valid characters:
#    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# so, inverting that, "not in Char" is:
#       x0 - x8 | xB | xC | xE - x1F 
#       (most control characters, though TAB, CR, LF allowed)
#       | #xD800 - #xDFFF
#       (unicode surrogate characters)
#       | #xFFFE | #xFFFF |
#       (unicode end-of-plane non-characters)
#       >= 110000
#       that would be beyond unicode!!!
_illegal_xml_chars_RE = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')

def escape_xml_illegal_chars(val, replacement='?'):
    r"""Filter out characters that are illegal in XML.

    Looks for any character in val that is not allowed in XML
    and replaces it with replacement ('?' by default).

    >>> escape_xml_illegal_chars("foo \x0c bar")
    'foo ? bar'
    >>> escape_xml_illegal_chars("foo \x0c\x0c bar")
    'foo ?? bar'
    >>> escape_xml_illegal_chars("foo \x1b bar")
    'foo ? bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFF bar")
    u'foo ? bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar")
    u'foo ? bar'
    >>> escape_xml_illegal_chars(u"foo bar")
    u'foo bar'
    >>> escape_xml_illegal_chars(u"foo bar", "")
    u'foo bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar", "BLAH")
    u'foo BLAH bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar", " ")
    u'foo   bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar", "\x0c")
    u'foo \x0c bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar", replacement=" ")
    u'foo   bar'
    """
    return _illegal_xml_chars_RE.sub(replacement, val)

Learn to program with python

Someone asked me the other day where they should start if they wanted to learn how to program. I said “learn python”, followed by “or, maybe learn some HTML first”. They then asked me where to start. This, I was not actually so sure about, so I did some research. The result is a simple set of pointers to some web pages, videos and books that I can recommend:

I made an effort to write this targeted at an absolute beginner who has English as a second language. I hope it is useful for someone else too.

Web application platform technology choices

The hardest bit in the web application platform challenge is making reasonable choices. Here’s a stab at some of them…

Hosting models

I see these basic choices:

  1. LAMP virtual hosting. If you can build everything you need with mysql+php and you have few enough users that you need only one database server, this is by far the easiest and cheapest option.
  2. Application hosting. Code on github, project management with basecamp or hosted jira, build on AppEngine or Heroku. You don’t have to do your own infrastructure, but you’re limited in what you can build. Also comes with a large chance of lock-in.
  3. Managed hosting. Rent (virtual) servers with pre-installed operating systems and managed networking. Expensive for large deployments but you don’t need all web operations team skills and you have a lot of flexibility (famously, twitter do this).
  4. Dedicated hosting. Buy or rent servers, rent rackspace or build your own data center. You need network engineers and people that can handle hardware. Usually the only cost-effective option beyond a certain size.

Given our stated requirements, we are really only talking about option #4, but I wanted to mention the alternatives because they will make sense for a lot of people. Oh, and I think all the other options are these days called cloud computing 🙂

Hardware platform

I’m not really a hardware guy, normally I leave this kind of stuff to others. Anyone have any good hardware evaluation guides? Some things I do know:

  • Get at least two of everything.
  • Get quality switches. Many of the worst outages have something to do with blown-up switches, and since you usually have only a few, losing one during a traffic spike is uncool.
  • Get beefy database boxes. Scaling databases out is hard, but they scale up nicely without wasting resources.
  • Get beefy (hardware) load balancers. Going to more than 2 load balancers is complicated, and while the load balancers have spare capacity they can help with SSL, caching, etc.
  • Get beefy boxes to run your monitoring systems (remember, two of everything). In my experience most monitoring systems suffer from pretty crappy architectures, and so are real resource hogs.
  • Get hardware RAID (RAID 5 seems common) with a battery-backed write-through cache, for all storage systems. That is, unless you have some other redundancy architecture and you don’t need RAID for redundancy.
  • Don’t forget about hardware for backups. Do you need tape?

Other thoughts:

  • Appliances. I really like the idea. Things like the schooner appliances for mysql and memcache, or the kickfire appliance for mysql analytics. I have no firsthand experience with them (yet) though. I’m guessing oracle+sun is going to be big in this space.
  • SSD. It is obviously the future, but right now they seem to come with limited warranties, and they’re still expensive enough that you should only use them for data that will actually get hot.

Operating system

Choice #1: unix-ish or windows or both. The Microsoft Web Platform actually looks pretty impressive to me these days but I don’t know much about it. So I’ll go for unix-ish.

Choice #2: ubuntu or red hat or freebsd or opensolaris.

I think Ubuntu is currently the best of the debian-based linuxes. I somewhat prefer ubuntu to red hat, primarily because I really don’t like RPM. Unfortunately red hat comes with better training and certification programs, better hardware vendor support and better available support options.

FreeBSD and solaris have a whole bunch of advantages (zfs, zones/jails, smf, network stack, many-core, …) over linux that make linux seem like a useless toy, if it wasn’t for the fact that linux sees so much more use. This is important: linux has the largest array of pre-packaged software that works on it out of the box, linux runs on more hardware (like laptops…), and many more developers are used to linux.

One approach would be solaris for database (ZFS) and media (ZFS!) hosting, and linux for application hosting. The cost of that, of course, would be the complexity in having to manage two platforms. The question then is whether the gain in manageability offsets the price paid in complexity.

And so, red hat gains another (reluctant) customer.


Database

As much sympathy as I have for the NoSQL movement, the relational database is not dead, and it sure as hell is easier to manage. When dealing with a wide variety of applications by a wide variety of developers, and a lot of legacy software, I think a SQL database is still the default model to go with. There’s a large range of options there.

Choice #1: clustered or sharded. At some point some application will have more data than fits on one server, and it will have to be split. Either you use a fancy database that supports clustering (like Oracle or SQL Server), or you use some fancy clustering middleware (like continuent), or you teach your application to split up the data (using horizontal partitioning or sharding) and you use a more no-frills open source database (mysql or postgres).

I suspect that the additional cost of operating an oracle cluster may very well be worth paying for – besides not having to do application level clustering, the excellent management and analysis tools are worth it. I wish someone did a model/spreadsheet to prove it. Anyone?

However, it is much easier to find developers skilled with open source databases, and it is much easier for developers to run a local copy of their database for development. Again there’s a tradeoff.

The choice between mysql and postgres has a similar tradeoff. Postgres has a much more complete feature set, but mysql is slightly easier to get started with and has significantly easier-to-use replication features.

And so, mysql gains another (reluctant) customer.

With that choice made, I think it’s important to invest early on in providing some higher-level APIs so that while the storage engine might be InnoDB and the access to that storage engine might be MySQL, many applications are coded to talk to a more constrained API. Things like Amazon’s S3, SimpleDB and the Google AppEngine data store provide good examples of constrained APIs that are worth emulating.
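As a sketch of what such a constrained API might look like (the names here are hypothetical, not any real S3 or SimpleDB client), something as small as get/put/delete by key already covers a lot of applications:

```python
# A deliberately constrained storage API, S3-style: just buckets, keys
# and values. Hypothetical sketch; the point is that application code
# sees only get/put/delete, never SQL or InnoDB specifics.

class ConstrainedStore(object):
    """Key-value facade that could be backed by MySQL, S3, or a DHT."""

    def __init__(self):
        self._data = {}  # in-memory stand-in for the real backend

    def put(self, bucket, key, value):
        self._data[(bucket, key)] = value

    def get(self, bucket, key, default=None):
        return self._data.get((bucket, key), default)

    def delete(self, bucket, key):
        self._data.pop((bucket, key), None)

store = ConstrainedStore()
store.put("users", "42", {"name": "alice"})
print(store.get("users", "42"))
```

Swapping the dict for a MySQL table (or later, a distributed store) then never changes application code, only the facade.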

HTTP architecture

Apache HTTPD. Easiest choice so far. Its swiss army knife characteristic is quite important, and it’s what everyone knows. Things like nginx are pretty cool and can be used as the main web server, but I suspect most people that switch to them should’ve spent some time tuning httpd instead. Since I know how to do that… I’ll stick with what I know.

As easy as that choice is, the choice of what to put between HTTPD and the web seems to be harder than ever. The basic sanctioned architecture these days seems to use BGP load sharing to have the switches direct traffic at some fancy layer 7 load balancers where you terminate SSL and KeepAlive. Those fancy load balancers then may point at a layer of caching reverse proxies, which in turn point at the (httpd) app servers.

I’m going to assume we can afford a pair of F5 Big-IPs per datacenter. Since they can do caching, too, we might avoid building that reverse proxy layer until we need it (at which point we can evaluate squid, varnish, HAProxy, nginx and perlbal, with that evaluation showing we should go with Varnish 🙂 ).

Application architecture

Memcache is nearly everywhere, obviously. Or is it? If you’re starting mostly from scratch and most stuff can be AJAX, http caching in front of the frontends (see above) might be nearly enough.

Assuming a 3-tier (web, middleware, db) system, reasonable choices for the front-end layer might include PHP, WSGI+Django, and mod_perl. I still can’t see myself rolling out Ruby on Rails on a large scale. Reasonable middleware choices might include java servlets, unix daemons written in C/C++ and more mod_perl. I’d say Twisted would be an unreasonable but feasible choice 🙂

Communication between the layers could be REST/HTTP (probably going through the reverse proxy caches) but I’d like to try and make use of thrift. Latency is a bitch, and HTTP doesn’t help.

I’m not sure whether considering a 2-tier system (i.e. PHP direct to database, or perhaps PHP link against C/C++ modules that talk to the database) makes sense these days. I think the layered architecture is usually worth it, mostly for organizational reasons: you can have specialized backend teams and frontend teams.

If it was me personally doing the development, I’m pretty sure I would go 3-tier, with (mostly) mod_wsgi/python frontends using (mostly) thrift to connect to (mostly) daemonized python backends (to be re-written in faster/more concurrent languages as usage patterns dictate) that connect to a farm of (mostly) mysql databases using raw _mysql, with just about all caching in front of the frontend layer. I’m not so sure it’s easy to teach a large community of people that pattern; it’d be interesting to try 🙂

As for the more boring choice… PHP frontends with java and/or C/C++ backends with REST in the middle seems easier to teach and evangelize, and it’s also easier to patch up bad apps by sticking custom caching stuff (and, shudder, mod_rewrite) in the middle.


Messaging

If there’s anything obvious in today’s web architecture it is that deferred processing is absolutely key to low-latency user experiences.

The obvious way to do asynchronous work is by pushing jobs on queues. One hard choice at the moment is what messaging stack to use. Obvious contenders include:

  • Websphere MQ (the expensive incumbent)
  • ActiveMQ (the best-known open source system with stability issues)
  • OpenAMQ (AMQP backed by interesting startup)
  • 0MQ (brokerless messaging library by the same startup)
  • RabbitMQ (AMQP by another startup; erlang yuck)
  • MRG (or QPid, AMQP by red hat which is not exactly a startup).

A less obvious way to do asynchronous work is through a job architecture such as gearman, app engine cron or quartz, where the queue is not explicit but rather exists as a “pending connections” set of work.

I’m not sure what I would pick right now. I’d probably still stay safe and use AMQ with JMS and/or STOMP with JMS semantics. 2 months from now I might choose differently.

A short url-safe identifier scheme

Let’s say you’re building a central-database system that you may want to make into a distributed system later. Then you don’t want to tie yourself to serial numeric identifiers (like the ones that Ruby on Rails is full of).

What do distributed platforms do?

They leave the id-generation problem to the user (though they will provide details based on some very-unique number). IDs are strings (UTF-8 or ascii-safe), and can be quite long:

250 characters seems like a pretty large upper limit.

128 random bits should be unique enough for anybody

UUIDs are 128 bits and are encoded as 36 characters (32 base16 digits plus 4 dashes). The possibility of an identifier collision is really really tiny (random UUIDs have 122 random bits).

Unfortunately, UUIDs are ugly:

It is just not a nice url. It would be nice if we could take 128-bit numbers and encode them as base64, or maybe base62 or url-safe-base64, or maybe even as base36 for increased compatibility. A 128-bit number is 22 characters in base64, 25 characters in base36. You end up with:
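That arithmetic is easy to check (a sketch in modern Python, stdlib only):

```python
import base64
import uuid

# A random 128-bit number, e.g. from a UUID.
n = uuid.uuid4().int

def to_base36(num):
    """Encode a non-negative integer in base36 (0-9, a-z)."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    s = ""
    while num:
        num, r = divmod(num, 36)
        s = digits[r] + s
    return s or "0"

b36 = to_base36(n)

# url-safe base64: 16 bytes encode to 24 chars, 22 once '==' padding is dropped.
b64 = base64.urlsafe_b64encode(n.to_bytes(16, "big")).rstrip(b"=")

print(len(b36), len(b64))  # at most 25 base36 digits; always 22 base64 chars
```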

What about 64 bits?

If we went with 64-bit numbers, we’d sacrifice quite a bit of collision-prevention, but maybe not so much that it is scary on a per-application basis.

What is also interesting is that lots of software supports operations on 64-bit numbers a lot better than on 128-bit numbers. We would end up with 13 characters in base36 (11 in base64). I.e. in base36 that looks like this:

That seems kind-of good enough, for now. Having failed inserts into the database seems like a reasonable way to avoid identifier collision, especially if our application is rather neat REST (so a failed PUT can be re-tried pretty safely).

Moving to a distributed system safely is possible if we have some reasonable identifier versioning scheme (13 characters = version 0, 14 characters = scheme version 1-10, more characters = TBD). Then in our app we match our identifiers using ^[0-9a-z][0-9a-z-]{11,34}[0-9a-z]$ (a regex which will also match lowercase UUIDs).
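A quick sanity check of that pattern (note the upper bound of 34 on the middle part is what lets a 36-character UUID fit; the ids below are made up):

```python
import re

# Accepts 13-char base36 ids as well as longer, dash-containing ids
# such as lowercase UUIDs (36 chars total).
ID_RE = re.compile(r'^[0-9a-z][0-9a-z-]{11,34}[0-9a-z]$')

print(bool(ID_RE.match('0f5hk3x8z2a1q')))                         # 13-char base36 id
print(bool(ID_RE.match('f47ac10b-58cc-4372-a567-0e02b2c3d479')))  # lowercase UUID
print(bool(ID_RE.match('too-short')))                             # rejected
```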

Some ruby

def encode_id(n)
  return n.to_s(36).rjust(13,'0')
end

def decode_id(s)
  return s.to_i(36)
end

def gen_id()
  return encode_id( rand( 18446744073709551615 ) )
end

Some MySQL

Besides the above functions, some ideas on how to maintain some consistency for ids across data types (tables).

CREATE FUNCTION encode_id (n BIGINT UNSIGNED) RETURNS char(13) DETERMINISTIC
  RETURN LPAD( LOWER(CONV(n,10,36)), 13, '0');

CREATE FUNCTION decode_id (n char(13)) RETURNS BIGINT UNSIGNED DETERMINISTIC
  RETURN CONV(n,36,10);

CREATE FUNCTION gen_num_id () RETURNS BIGINT UNSIGNED
  RETURN FLOOR(RAND() * 184467440737095516);

CREATE FUNCTION gen_id () RETURNS char(13)
  RETURN encode_id( gen_num_id() );

CREATE TABLE ids (
  -- this table should not be updated directly by apps,
  --   though they are expected to read from it
  numid BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  id char(13) NOT NULL UNIQUE,
  prettyid varchar(64) DEFAULT NULL UNIQUE
);

CREATE TABLE mythings (
  numid BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  id char(13) NOT NULL UNIQUE,
  prettyid varchar(64) DEFAULT NULL UNIQUE,
  something varchar(255) DEFAULT NULL
);

CREATE TABLE mythings2ids (
  -- this table should not be updated directly by apps,
  --   though its ok if they read from it
  numid BIGINT UNSIGNED NOT NULL,
  FOREIGN KEY (numid)
    REFERENCES ids (numid)
    ON DELETE cascade
    ON UPDATE cascade,
  FOREIGN KEY (numid)
    REFERENCES mythings (numid)
    ON DELETE cascade
    ON UPDATE cascade
);

DELIMITER //

CREATE TRIGGER mythings_before_insert BEFORE INSERT ON mythings
  FOR EACH ROW
    INSERT INTO ids (numid,id,prettyid) VALUES (NEW.numid,, NEW.prettyid);
//

CREATE TRIGGER mythings_after_insert AFTER INSERT ON mythings
  FOR EACH ROW
    INSERT INTO mythings2ids (numid) VALUES (NEW.numid);
//

CREATE TRIGGER mythings_before_update BEFORE UPDATE ON mythings
  FOR EACH ROW
  BEGIN
    IF NEW.numid != OLD.numid THEN
      UPDATE ids SET numid = NEW.numid WHERE numid = OLD.numid LIMIT 1;
    END IF;
    IF != THEN
      UPDATE ids SET id = WHERE numid = NEW.numid LIMIT 1;
    END IF;
    IF NEW.prettyid != OLD.prettyid THEN
      IF OLD.prettyid IS NOT NULL THEN
        UPDATE ids SET prettyid = NEW.prettyid
          WHERE numid = NEW.numid LIMIT 1;
      END IF;
    END IF;
  END;
//

CREATE TRIGGER mythings_after_delete AFTER DELETE ON mythings
  FOR EACH ROW
    DELETE FROM ids WHERE numid = OLD.numid LIMIT 1;
//

DELIMITER ;

-- SELECT gen_id() INTO @nextid;
-- INSERT INTO mythings (numid,id,prettyid,something)
--   VALUES (decode_id(@nextid),@nextid,
--       '2009/03/22/safe-id-names2','blah blah blah');

Some python

Python lacks built-in base36 encoding. The code below is based on this sample; it is nicer than my own attempts, which used recursion…

import string
import random

__ALPHABET = string.digits + string.ascii_lowercase
__ALPHABET_REVERSE = dict((c, i) for (i, c) in enumerate(__ALPHABET))
__BASE = len(__ALPHABET)
__MAX = 18446744073709551615L
__MAXLEN = 13

def encode_id(n):
    s = []
    while True:
        n, r = divmod(n, __BASE)
        s.append(__ALPHABET[r])
        if n == 0:
            break
    while len(s) < __MAXLEN:
        s.append('0')
    return ''.join(reversed(s))

def decode_id(s):
    n = 0
    for c in s.lstrip('0'):
        n = n * __BASE + __ALPHABET_REVERSE[c]
    return n

def gen_id():
    return encode_id(random.randint(0, __MAX))

Getting a feel for the cost of using mysql with WSGI

So is the database the bottleneck?

Note that compared to the last post, I enabled KeepAlive, i.e. I added the -k flag to ab:

ab -k -n 10000 -c 100 http://localhost/hello.txt

I also toned down the MPM limits a bunch:

StartServers         16
MinSpareServers      16
MaxSpareServers     128
MaxClients           64
MaxRequestsPerChild   0

(Apache will now complain about reaching MaxClients, but given the db is on the same machine, things end up faster overall, this way)

And then I decided to just run ‘ab’ locally since I couldn’t really see that much difference vs running it over the network.

Then I proceeded to get myself a MySQL 5 server with a single InnoDB table inside with a single row in it that reads (key:1, value:'Hello world!'). I also set up a memcached server. Now, look at this…

What?              req/s   relative performance hit   CPU usage
hello.txt           9700                              httpd 2.2% per process
python string       7500   23%                        httpd 2% per process
memcached lookup    2100   23% * 72%                  memcached 10%, httpd 2% per process
_mysql              1400   23% * 82%                  MySQL 77%, httpd 1.2% per process
mysql                880   23% * 82% * 37%            MySQL 65%, httpd 4.5% per process
SQLAlchemy           700   23% * 82% * 37% * 20%      MySQL 19%, httpd 5.5% per process

Using a mysql database backend costs 82% in req/s compared to the “serve python string from a global variable” case. Totally expected of course, and we have also been taught how to deal with it (caching).

In this example, we’re not actually getting a lot out of using memcached – the memcached approach (I’m using python-memcached) is still 72% slower than having the in-process memory, though if you look at it the other way around, it is 50% faster than using the _mysql driver. Nevertheless, this shows that the mysql query cache is pretty fast, too, esp. if you have a 100% cache hit ratio 🙂

Using the SQLAlchemy ORM framework makes things 20% slower, and it gobbles up a lot of CPU. We also expected that, and we know how to deal with it (hand-optimize the SQL in performance-sensitive areas, or perhaps use a faster ORM, like storm).

But, did you know that you can take a 37% performance hit just by using the MySQLdb object oriented wrappers instead of the _mysql bindings directly? Looks like some optimizations should be possible there…

Now, in “real life”, imagine a 90% hit ratio on the memcache, and for the other 10% there’s some expensive ORM processing. Further imagine a 5% performance hit in the logic to decide between memcache and ORM layer. You’d end up doing about 1850 requests per second.
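The back-of-the-envelope arithmetic behind that estimate, taking the req/s numbers from the table above and mixing them as a simple weighted average:

```python
# 90% of requests hit memcache (2100 req/s), 10% go through the
# SQLAlchemy ORM path (700 req/s), and the routing logic costs 5%.
memcache_rps = 2100.0
orm_rps = 700.0

mixed_rps = 0.9 * memcache_rps + 0.1 * orm_rps  # 1960 req/s
estimate = mixed_rps * 0.95                     # minus the 5% routing hit

print(round(estimate))  # 1862, i.e. "about 1850 requests per second"
```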

This leads to Leo’s WSGI performance rule of thumb number 1: Even if your python webapp is really really really fast, it will still be about 5 times slower than apache serving up static files.

Getting a feel for the performance of mod_wsgi

Disclaimer: there’s a lot that’s wrong with my testing setup and/or methodology. Given the kind of concurrency I’m playing around with, I run into a whole bunch of operating system and TCP/IP limits. Though I’ve tweaked those a little bit (for example # of open file descriptors, TCP window size), I haven’t been particularly scientific about it. Though I did various warm-ups and system reboots, I wasn’t particularly good at keeping consistent timing or consistently checking that there were no lingering connections in TIME_WAIT or whatever.

System under test:

  • Server: Latest-gen macbook 2.4ghz intel core 2 duo
  • Test client: Mac Pro tower dual core G5
  • Switch: some crappy 1gbit whitelabel
  • Stock Apache on Mac OS X 10.5 (prefork MPM, 64bit)
  • Stock python 2.5 on Mac OS X 10.5
  • mod_wsgi 2.3
  • ApacheBench, Version 2.3, commands like ab -n 5000 -c 100

Out of the box:

  • about 2000-3000 req/s for static file
  • about 2000-2800 req/s for mod_wsgi in-process
  • about 1500-2500 req/s when using mod_wsgi in daemon mode (with errors beyond concurrency of about 250, for various settings of p, t)
  • concurrency=1000 makes ApacheBench start reporting lots of failures

Follows some MPM tuning, arriving to:

StartServers        100
MinSpareServers      10
MaxSpareServers     500
MaxClients          500
MaxRequestsPerChild   0

Results then become better (especially more stable):

  • about 5000 req/s for static file
  • With EnableMMAP off and EnableSendfile off, even concurrency=10 is already a problem for static file scenario, and req/s doesn’t go above 3000 req/s for concurrency>=5
  • about 4300 req/s for mod_wsgi in process
  • about 2700 req/s for mod_wsgi in daemon mode
  • concurrency=1000 still makes ApacheBench start reporting lots of failures

Some more data:

                    hello.txt           wsgitest        wsgitest    wsgitest    wsgitest
                                        in-process      p=2,t=10    p=2,t=100   p=20,t=10
concurrency 10      req/s:  4784
                    ms/req:    0.21

concurrency 100     req/s:  5081        4394            3026        3154        2827
                    ms/req:    0.20        0.23            0.33        0.32        0.35

concurrency 200     req/s:  5307        4449            2688        2988        2711
                    ms/req:    0.19        0.24            0.37        0.34        0.36

concurrency 500     req/s:  4885        4137            2779        3019        2738
                    ms/req:    0.21        0.24            0.36        0.33        0.36

hello.txt is a 13 byte file containing "Hello World!\n"
wsgitest is a really simple wsgi script spitting out "Hello World!\n"
concurrency is the argument to ab -c
p is number of processes for mod_wsgi daemon mode
t is number of threads for mod_wsgi daemon mode
ms/req is the mean time per request across all concurrent requests as reported by ab
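For reference, the wsgitest script was of the trivial hello-world kind; the exact script wasn't preserved, but a minimal stand-in looks like this (mod_wsgi looks for a callable named application in the target script):

```python
def application(environ, start_response):
    """Minimal WSGI app: always answer 200 OK with a fixed body."""
    body = b"Hello World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```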

Tentative conclusions:

  • With my hardware I have little chance of actually finding the limits of apache or mod_wsgi unless I spend a whole lot of time on much more carefully testing what I’m actually measuring
  • running mod_wsgi in-process is probably a good idea if you tweak the MPM for it
  • mod_wsgi on my laptop can probably easily serve over a billion requests/month after a little tuning
  • mod_wsgi on my laptop can deal with 500 “concurrent” users without errors

…so, in other words, mod_wsgi is most likely “fast enough” and most likely will not be a bottleneck if I build something with it. Not exactly surprising. But, more importantly, now I have some baseline numbers for req/s performance on my system, so that I can run some “performance smoke tests” against software I write.

Cloud computing – comparing Google AppEngine and Amazon Web Services

Seems like Cloud computing is the new hot topic these days! AWS has been around for a while, and google has now added billing features to App Engine, which was the last thing I was waiting for. So let’s try to compare these two.


Numbers in US dollars. Source for google numbers, Amazon numbers here and here.

                                Google       Amazon
                                AppEngine    S3 50/10*   EC2 Small**   S3 5000/1000***   EC2 Huge****
per CPU hour                    0.10         0           0             0                 0
per hour per running instance   0            0           0.10          0                 0.8
per GB bandwidth incoming       0.10         0.10        0.10          0.10              0.10
per GB bandwidth outgoing       0.12         0.17        0.17          0.10              0.10
per GB storage per month        0.15         0.15        0             0.12              0
per request incoming            0            0.00001     0             0.00001           0
per request outgoing            0            0.000001    0             0.000001          0
per email outgoing              0.0001       0           0             0                 0

*    up to 50TB storage & 10TB outbound traffic, after that gets cheaper
**   “small” instances, 160GB storage, up to 10TB outbound traffic, after that gets cheaper
***  far over 500TB storage & far over 150TB outbound traffic
**** “extra large”, 8 compute units, 1.6TB storage, over 150TB outbound traffic
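To make the table concrete, here is a sketch pricing a hypothetical month (the workload numbers are made up; the per-unit rates are the ones from the table above, in USD):

```python
# Hypothetical monthly workload: 1000 CPU/instance hours, 50 GB in,
# 200 GB out, 100 GB stored.

def appengine_cost(cpu_hours, gb_in, gb_out, gb_stored):
    # AppEngine bills for CPU actually used, not instances reserved.
    return cpu_hours * 0.10 + gb_in * 0.10 + gb_out * 0.12 + gb_stored * 0.15

def ec2_small_cost(instance_hours, gb_in, gb_out):
    # EC2 bills per running instance-hour; the 160GB local storage is included.
    return instance_hours * 0.10 + gb_in * 0.10 + gb_out * 0.17

print(appengine_cost(1000, 50, 200, 100))  # 144.0
print(ec2_small_cost(1000, 50, 200))       # 139.0
```

The numbers come out close for this workload; the real difference is that on AppEngine you only pay the 1000 hours if your code actually burns them, while on EC2 the instance is billed whether busy or idle.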

Apple, meet pear

It should be obvious the above table is really comparing incomparables – amazon gives you much more flexibility, more control, and more options. This still puts a lot of the responsibility for efficient resource utilization on the application developer. If you are not careful, you can have many cpu-hours and storage-hours available that you do not use. The google offering provides far less flexibility and fewer options, with the advantage that you can start for free, and you do not have to plan resource utilization as much – you pay for CPU hours used, not for CPU hours reserved.

There will still be a large number of applications for which AppEngine hosting simply is not an option (those that need a data store different from GFS, those that are not written in python, etc), where one might still consider AWS.

But what if you are writing a WSGI-compliant pure python application that you want to scale out, that does not need any (pyrex, swig, C) native optimization or bespoke caching, and whose data model supports using GFS-style persistence? How do the offerings compare, then?

Scaling with AWS

In the Amazon case you will have to do a lot of up-front infrastructure development yourself. You will need to figure out how to deploy EC2 instances with apache+mod_wsgi on them, automatically adding more instances under load and remove them again when they become unneeded. You will need to think about how to split off your large files (if any) so they are hosted on S3 (or the new Elastic Block Store, perhaps with Mongo DB on top, and some memcache obviously). You will need to think about and set up what application-level across-cluster monitoring you need. Etc etc. Lots of fun, but lots of work.

Once you do all that work, after you have your infrastructure up and running, you will enjoy tremendous flexibility when scaling – use of the right application patterns and optimization techniques means you have many ways to scale up as well as out. For example, you will soon figure out how to speed up your application using psyco, pyrex, and hand-crafted C. You will end up using a messaging pattern where it makes sense, using SQS or ActiveMQ, maybe some twisted etc etc.

By the end of all that you’ll be pretty well-versed in what it means to scale out a web application, and probably looking to set up your own hosting (which will be cheaper still than AWS). Your application will be somewhat aware of the infrastructure it runs on, which can be a good thing or a bad thing depending on how you set it up.

Scaling with AppEngine

In the AppEngine case, lots and lots of decisions have already been made for you. You will build your application following google’s manuals, monitor it using google’s tools (at least for quite a while), and not worry much about capacity planning (that is, until your site is discussed during a prime time TV show, your costs go through the roof, and you wish you had optimized for minimum bandwidth consumption) or advanced performance optimization techniques.

You won’t be able to build a very useful messaging infrastructure (better not be writing a MMOG or chat app) or streaming thingie or anything like that. So after you become wildly successful you’re going to need to augment your AppEngine-hosted systems, or migrate away from them completely.

On the upside, if your application never becomes massively popular, and you stay within the free usage quotas, the only thing you have wasted is your own time, and less of it at that, since you focused most of your time on building the application.

Scaling out an internet startup, 2009

All in all my advice is, if you’re planning to build the next fancy social 2.0 web 2.0 application, to seriously consider building on Google AppEngine. Be careful what APIs you use, and make sure to insulate yourself from their specifics – you don’t want to be tied at the hip to GFS, with no way to move away from it.

If AppEngine hosting is too expensive for your business model to work, get a new business model immediately. Both AppEngine and AWS are dirt cheap.

Once your application really takes off (and you start paying google some significant $s), try and see if you have some time to keep a version of it running on the AWS infrastructure, perhaps redirecting a few percent of your traffic there. If you can, that’s great, you now have options. At that point, one of your options to seriously consider is probably to set up your own dedicated infrastructure, starting with getting a competent ops guy or two, and then buying some server space (or some managed servers, maybe) and some bandwidth.

Can I do XML web services and SOAP?

The only reasonable python library for doing serious XML is lxml, a C extension module (built on libxml2 and libxslt) that is currently unsupported on AppEngine. So not yet with AppEngine, or at least not properly. With AWS, yes, you can 🙂

But I don’t like python!

Well, apparently these guys can cloud-compute your rails, and these guys can cloud-compute your php and java. But really you should just use python.