I've wanted to start writing a blog for the past few months. By which I mean, writing blog posts. But why focus on writing great content when you can write great software instead?

Allow me to introduce Plog, the software running late.am. Read on to see how Flask, Mongoengine, and MongoDB can work together to create something beautiful.

Data Model

The data model for Plog is about as simple as you'd expect:

class Post(db.Document):
    pubdate = db.DateTimeField(required=True)
    updated = db.DateTimeField()
    published = db.BooleanField(default=True)

    title = db.StringField(required=True)
    slug = db.StringField(required=True, unique=True)

    blurb = db.StringField(required=True)
    body = db.StringField(required=False)

    tags = db.ListField(db.StringField())

    views = db.IntField()

    _words = db.ListField(db.StringField())

    meta = {
        'allow_inheritance': False,
        'indexes': [
            {'fields': ['published', 'slug', 'pubdate']},
            {'fields': ['published', '_words', 'pubdate']},
        ],

    }

Setting allow_inheritance to False instructs Mongoengine not to add two bookkeeping fields _cls and _types to the documents. In cases where many types of documents share a collection, Mongoengine uses these fields to filter results only to the type (or subtypes) of document corresponding to the Python class you are querying for.

The first index will be used on virtually every page in the site. Individual post pages will always be queried using published and slug, and sorted by pubdate; the homepage and archive pages don't query by slug, but do sort by (and, in the case of archive pages, filter by) pubdate.

Search

The second index includes the _words field, which is automatically generated by Plog when a post is saved. Indexes on arrays in MongoDB are called "multi-key" indexes, since each element of the array is given an entry in the index. This allows efficient queries into array fields, looking for one or more of the values.

Plog's search is implemented using the $all operator against _words, which returns documents whose _words array contains each of the search terms:

posts = Post.objects(published=True, _words__all=query).order_by('-pubdate')

Plog doesn't support complex search use cases like stemming (although that would be easy to add using nltk or a similar tool), boolean queries, phrase searches, etc.
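
The code that fills _words isn't shown in this post, so here is a rough sketch of what such a step might look like, along with a plain-Python stand-in for the $all match. The names words_for and matches_all are hypothetical, and the tokenizing rule (lowercase runs of letters and digits) is an assumption rather than Plog's actual behavior:

```python
import re


def words_for(*texts):
    """Return the sorted, de-duplicated lowercase words found in the texts."""
    words = set()
    for text in texts:
        # treat a missing field (None) as empty text
        words.update(re.findall(r"[a-z0-9]+", (text or "").lower()))
    return sorted(words)


def matches_all(doc_words, query_words):
    """Mimic $all: every query word must appear in the document's words."""
    return set(query_words) <= set(doc_words)
```

Under these assumptions, the search box input would be run through the same words_for function before being passed as the query, so that both sides of the match are normalized the same way.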

Tag Clouds

Each post contains its own tags, but to generate a tag cloud we need to know the distribution of tags among all the posts in order to size each tag.

A naive solution would be to query for all published posts, and aggregate the tag information in the application logic. However, this requires sending all the posts over the network from the database to the application server, which (and I'm being very optimistic here) could be a lot of data to transmit.

A slightly better approach might use map-reduce to pre-aggregate the tag count information within the database, and generate a collection containing a mapping from tag name to tag count. Such a job might be triggered on a schedule, or each time a post is saved.

Rather than either of these, Plog uses MongoDB's atomic update operator $inc along with upserts to maintain tag count information when posts are saved. The counts for the tags on the previous version of the post are each decremented by one, and the counts for the tags on the new version of the post are each incremented by one:

if post.published:
    # decrement tagcloud count on all tags in the
    # previous version of the Post
    TagCloud.objects(tag__in=post.tags, count__gte=1).update(
        inc__count=-1, set__updated=datetime.utcnow())

for field in form:
    setattr(post, field.name, field.data)
post.slug = slug_for(title=post.title, pubdate=post.pubdate)
post.save()

if post.published:
    # then increment tagcloud count on all tags in
    # the current version of the Post
    for tag in post.tags:
        TagCloud.objects(tag=tag).update(
            inc__count=1, set__updated=datetime.utcnow(), upsert=True)

Note that since upsert will create at most one document, we have to iterate over the tags in the second set of updates; in the first, since we're only interested in modifying TagCloud objects that might already exist, we don't need an upsert, and can do a one-line update.
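
To see why the decrement-then-increment bookkeeping keeps the counts right, here is a small in-memory model of it. This is only a sketch: a Counter stands in for the TagCloud collection, and retag is a hypothetical helper, not part of Plog:

```python
from collections import Counter


def retag(counts, old_tags, new_tags):
    """Apply the decrement-then-increment bookkeeping to an in-memory Counter."""
    for tag in old_tags:
        # mirrors the count__gte=1 filter: never push a count below zero
        if counts[tag] >= 1:
            counts[tag] -= 1
    for tag in new_tags:
        # a missing key starts at 0, much like an upsert creating the document
        counts[tag] += 1
    return counts
```

A tag that stays on the post is decremented and then incremented, so its count is unchanged; a tag that is dropped keeps an entry with a count of zero, which is why the tag cloud query only looks at counts greater than zero.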

The TagCloud model is quite simple:

class TagCloud(db.Document):
    tag = db.StringField(primary_key=True)
    count = db.IntField()
    updated = db.DateTimeField()

A healthy sprinkling of math determines the "bucket" that each tag belongs in (bucket 1 holds the most frequent tags and bucket 6 the least frequent; the bucket is used to generate an h1 through h6 tag in the template):

@staticmethod
def get(sizes=6):
    tags = list(TagCloud.objects(count__gt=0).order_by('tag'))
    if not tags:
        return tags

    least = min(t.count for t in tags)
    most = max(t.count for t in tags)
    # named "spread" to avoid shadowing the built-in range()
    spread = max(most - least, 1)
    # cap at sizes - 1 steps so the most frequent tag lands in bucket 1, not 0
    scale = float(min(spread, sizes - 1))
    for t in tags:
        t.bucket = sizes - int(round(scale * (t.count - least) / spread))

    return tags
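
The bucket arithmetic is easier to follow on plain numbers. The sketch below pulls it out into a standalone function; bucket_for is a hypothetical name, and it assumes the scale is capped at sizes - 1 steps so that results stay within 1 through sizes:

```python
def bucket_for(count, least, most, sizes=6):
    """Map a tag count to a bucket from 1 (most frequent) to sizes (least)."""
    # avoid dividing by zero when every tag has the same count
    spread = max(most - least, 1)
    # at most sizes - 1 steps separate the top bucket from the bottom one
    scale = float(min(spread, sizes - 1))
    return sizes - int(round(scale * (count - least) / spread))
```

With counts running from 1 to 11, a tag used 11 times lands in bucket 1 and a tag used once lands in bucket 6. When every tag has the same count, they all end up in bucket 6.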

Atom Feed

Flask has built-in support for generating Atom XML feeds (or, to be more accurate, Werkzeug, the WSGI library beneath Flask, does):

feed = AtomFeed(
    title='late.am',
    feed_url=url_for('feed', _external=True),
    author={'name': 'Dan Crosta', 'email': 'dcrosta@late.am'},
    icon=url_for('static', filename='mug.png', _external=True),
    generator=('plog', 'https://github.com/dcrosta/plog', '0.1'),
)

posts = Post.objects(published=True).order_by('-pubdate')
for post in posts[:20]:
    feed.add(
        title=post.title,
        # body is optional, so fall back to an empty string
        content=markdown(post.blurb + '\n' + (post.body or '')),
        content_type='html',
        author={'name': 'Dan Crosta', 'email': 'dcrosta@late.am'},
        url=url_for('post', slug=post.slug, _external=True),
        id=url_for('permalink', post_id=post.pk, _external=True),
        published=post.pubdate,
        # assumes updated may be unset on a post that was never edited
        updated=post.updated or post.pubdate)

Other Goodies

As a web developer, obviously, I dislike writing HTML. That's why Plog uses Markdown, powered by the markdown2 module. markdown2 supports code syntax highlighting using Pygments, which was an added bonus.