Skip to content

stripping illegal characters out of xml in python

March 17, 2011

XML 1.0 does not allow all characters in unicode:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

It often trips up developers (like, today, me) that end up having, say, valid unicode, with valid characters like VT (\x1B), or ESC (\x1B), and suddenly they are producing invalid XML. A decent way to deal with this is to strip out the invalid characters. For example this stack overflow post shows how to do this with perl:

$str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;

Unfortunately the equivalent does not quite work with Python, since \x{10000}-\x{10FFFF} needs to be expressed as \U00010000-\U0010FFFF which not all versions of python seem to accept as part of a regular expression character class.

So people end up doing messy-looking things in python. But I figured out that if I invert the character class, the biggest character I need to write is \uFFFF, which the python regex engine does acccept. Yay:

import re

# xml 1.0 valid characters:
#    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# so to invert that, not in Char ::
#       x0 - x8 | xB | xC | xE - x1F 
#       (most control characters, though TAB, CR, LF allowed)
#       | #xD800 - #xDFFF
#       (unicode surrogate characters)
#       | #xFFFE | #xFFFF |
#       (unicode end-of-plane non-characters)
#       >= 110000
#       that would be beyond unicode!!!
_illegal_xml_chars_RE = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')

def escape_xml_illegal_chars(val, replacement='?'):
    """Filter out characters that are illegal in XML.
    
    Looks for any character in val that is not allowed in XML
    and replaces it with replacement ('?' by default).
    
    >>> escape_illegal_chars("foo \x0c bar")
    'foo ? bar'
    >>> escape_illegal_chars("foo \x0c\x0c bar")
    'foo ?? bar'
    >>> escape_illegal_chars("foo \x1b bar")
    'foo ? bar'
    >>> escape_illegal_chars(u"foo \uFFFF bar")
    u'foo ? bar'
    >>> escape_illegal_chars(u"foo \uFFFE bar")
    u'foo ? bar'
    >>> escape_illegal_chars(u"foo bar")
    u'foo bar'
    >>> escape_illegal_chars(u"foo bar", "")
    u'foo bar'
    >>> escape_illegal_chars(u"foo \uFFFE bar", "BLAH")
    u'foo BLAH bar'
    >>> escape_illegal_chars(u"foo \uFFFE bar", " ")
    u'foo   bar'
    >>> escape_illegal_chars(u"foo \uFFFE bar", "\x0c")
    u'foo \x0c bar'
    >>> escape_illegal_chars(u"foo \uFFFE bar", replacement=" ")
    u'foo   bar'
    """
    return _illegal_xml_chars_RE.sub(replacement, val)
Advertisements

Comments are closed.

%d bloggers like this: