Truncating HTML with Python

On the 10gen events pages we show a table of sessions for our MongoDB Day conferences, some of which are quite crowded. To keep the table readable, we truncate each session description to about 150 characters and add a "Read More" link that expands to show the full description.

Our initial approach did this truncation on the client side, using JavaScript, but we found that this was difficult to get right, particularly when the description spanned several paragraphs or other block-level tags. So I wrote a quick HTMLParser subclass to do the truncation on the server side.

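import re
from html import escape
from html.parser import HTMLParser  # Python 3; the module was named HTMLParser on Python 2

# splitting on a capturing group keeps the words as well as the text between them
word_re = re.compile(r'(\w+)')

# void elements never get an end tag, so they must stay off the stack
VOID_TAGS = frozenset(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                       'input', 'link', 'meta', 'source', 'track', 'wbr'])

def format_attrs(attrs):
    # valueless attributes arrive as (name, None); values arrive unescaped
    parts = [k if v is None else '%s="%s"' % (k, escape(v, quote=True))
             for k, v in attrs]
    return ''.join(' ' + part for part in parts)

class HTMLAbbrev(HTMLParser):

    def __init__(self, maxlength, *args, **kwargs):
        # report entities through handle_entityref/handle_charref instead of
        # decoding them into the text, so they are re-emitted as written
        kwargs.setdefault('convert_charrefs', False)
        HTMLParser.__init__(self, *args, **kwargs)
        self.stack = []
        self.maxlength = maxlength
        self.length = 0
        self.done = False
        self.out = []

    def emit(self, thing, count=False):
        if count:
            self.length += len(thing)
        if self.length < self.maxlength:
            self.out.append(thing)
        elif not self.done:
            # trim trailing whitespace
            if self.out:
                self.out[-1] = self.out[-1].rstrip()

            # close out tags on the stack
            for tag in reversed(self.stack):
                self.out.append('</%s>' % tag)
            self.done = True

    def handle_starttag(self, tag, attrs):
        # emit before pushing, so a tag that was never written is never closed
        self.emit('<%s%s>' % (tag, format_attrs(attrs)))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and tag == self.stack[-1]:
            self.emit('</%s>' % tag)
            del self.stack[-1]
        else:
            raise Exception(
                'end tag %r does not match stack: %r' % (tag, self.stack))

    def handle_startendtag(self, tag, attrs):
        # self-closing tags have no separate end tag, so nothing goes on the stack
        self.emit('<%s%s/>' % (tag, format_attrs(attrs)))

    def handle_data(self, data):
        for word in word_re.split(data):
            self.emit(word, count=True)

    def handle_entityref(self, name):
        self.emit('&%s;' % name)

    def handle_charref(self, name):
        return self.handle_entityref('#%s' % name)

    def close(self):
        # let the base class flush any text it is still buffering
        HTMLParser.close(self)
        return ''.join(self.out)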

(You can download the source from this gist)

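HTMLAbbrev truncates to a target number of visible (i.e. rendered in the browser) characters. It always cuts at a word break: output stops before the first word or gap that would bring the count up to the target, so the visible text stays under it. For example, with a target of 10, <p>hello <b>big</b> world foo</p> becomes <p>hello <b>big</b></p>. It maintains a tag stack so that the emitted HTML is correctly closed and does not break the layout in the browser. Tags and their attributes don't count towards the length, so links and other markup don't use up the budget; entities such as &amp; are passed through without being counted either.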
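To make the word-break rule concrete, here is a minimal plain-text sketch of what handle_data and emit do between them. The helper name truncate_words is made up for this illustration and it does no HTML handling at all; it only shows how splitting on a capturing group yields words and gaps alternately, and how the running count decides where to stop.

```python
import re

# The capturing group makes re.split return the words as well as
# the text between them, so nothing from the input is lost.
word_re = re.compile(r'(\w+)')


def truncate_words(text, maxlength):
    """Hypothetical helper: the cut-off rule applied to plain text only."""
    out = []
    length = 0
    for piece in word_re.split(text):
        length += len(piece)
        if length >= maxlength:
            # this piece would reach the target, so stop before it
            break
        out.append(piece)
    # drop the gap left dangling before the cut
    return ''.join(out).rstrip()


print(word_re.split('hello big world'))
print(truncate_words('hello big world foo', 10))
```

The first call prints the alternating gaps and words (including the empty strings at either end), and the second prints the text cut back to the last word break under the target.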

We wrap the call to HTMLAbbrev in a Jinja template filter, like so:

@filter
def htmlabbrev(value, maxlen=150):
    parser = HTMLAbbrev(maxlen)
    parser.feed(value)
    return parser.close()

We then call the filter from our templates:

<div class="abbrev">
  {{session.description|htmlabbrev(120)}}
  <span class="ellipsis">...</span>
  <span class="more clickable">Read More</span>
</div>
<div class="full">
  {{session.description}}
  <span class="less clickable">Hide</span>
</div>

CSS hides the <div class="full"> element by default, and JavaScript toggles between displaying the two.