Hello, Plog
I've wanted to start writing a blog for the past few months. By which I mean, writing blog posts. But why focus on writing great content when you can write great software instead?
Allow me to introduce Plog, the software running late.am. Read on to see how Flask, Mongoengine, and MongoDB can work together to create something beautiful.
Data Model
The data model for Plog is about as simple as you'd expect:
class Post(db.Document):
    pubdate = db.DateTimeField(required=True)
    updated = db.DateTimeField()
    published = db.BooleanField(default=True)
    title = db.StringField(required=True)
    slug = db.StringField(required=True, unique=True)
    blurb = db.StringField(required=True)
    body = db.StringField(required=False)
    tags = db.ListField(db.StringField())
    views = db.IntField()
    _words = db.ListField(db.StringField())

    meta = {
        'allow_inheritance': False,
        'indexes': [
            {'fields': ['published', 'slug', 'pubdate']},
            {'fields': ['published', '_words', 'pubdate']},
        ],
    }
Setting allow_inheritance to False instructs Mongoengine not to add two bookkeeping fields, _cls and _types, to the documents. In cases where many types of documents share a collection, Mongoengine uses these fields to filter results to only the type (or subtypes) of document corresponding to the Python class you are querying for.
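For illustration (this isn't code from Plog), here is roughly what those fields do when inheritance is enabled:

class Content(db.Document):
    meta = {'allow_inheritance': True}
    title = db.StringField()

class Video(Content):
    # stored in the same collection as Content, distinguished by
    # the _cls and _types bookkeeping fields
    url = db.StringField()

# querying the subclass implicitly filters on the bookkeeping fields,
# roughly: db.content.find({'_types': 'Content.Video'})
videos = Video.objects(title='Intro')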
The first index will be used on virtually every page in the site. Individual post pages are always queried using published and slug, and sorted by pubdate; the homepage and archive pages don't query by slug, but do sort by (and, in the case of archive pages, filter by) pubdate.
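In query form, those access patterns look roughly like this (a sketch rather than Plog's actual view code; slug, start, and end are assumed to come from the URL route):

# single post page: matches the (published, slug, pubdate) index
post = Post.objects(published=True, slug=slug).first()

# homepage: all published posts, newest first
posts = Post.objects(published=True).order_by('-pubdate')

# archive page: additionally filters on a pubdate range
archive = Post.objects(published=True,
                       pubdate__gte=start,
                       pubdate__lt=end).order_by('-pubdate')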
Search
The second index includes the _words field, which is automatically generated by Plog when a post is saved. Indexes on arrays in MongoDB are called "multi-key" indexes, since each element of the array is given an entry in the index. This allows efficient queries into array fields, looking for one or more of the values.
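The generation of _words isn't shown in this post; one way to do it (a sketch, my assumption rather than Plog's actual implementation) is to rebuild the field from the post's text whenever it is saved:

import re

WORD_RE = re.compile(r'[a-z0-9]+')

class Post(db.Document):
    # ... fields and meta exactly as above ...

    def save(self, *args, **kwargs):
        # rebuild the multi-key search field from the post's text
        text = ' '.join([self.title, self.blurb, self.body or ''])
        self._words = sorted(set(WORD_RE.findall(text.lower())))
        return super(Post, self).save(*args, **kwargs)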
Plog's search is implemented using the $all operator against _words, which returns documents whose _words array contains each of the search terms:
posts = Post.objects(published=True, _words__all=query).order_by('-pubdate')
Plog doesn't support complex search use cases like stemming (although that would be easy to add using nltk or a similar tool), boolean queries, phrase searches, etc.
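As a rough illustration of the stemming idea (not part of Plog; assumes nltk is installed), both the stored words and the incoming search terms could be stemmed, so that, say, "querying" matches "query":

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_words(words):
    # normalize words before storing or searching
    return [stemmer.stem(w) for w in words]

# at save time: post._words = stem_words(raw_words)
# at search time:
posts = Post.objects(published=True,
                     _words__all=stem_words(query)).order_by('-pubdate')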
Tag Clouds
Each post contains its own tags, but to generate a tag cloud we need to know the distribution of tags among all the posts, in order to size each tag relative to the others.
A naive solution would be to query for all published posts and aggregate the tag information in the application logic. However, this requires sending all the posts over the network from the database to the application server, which (and I'm being very optimistic here) could be a lot of data to transmit.
A slightly better approach might use map-reduce to pre-aggregate the tag count information within the database, generating a collection that maps tag names to tag counts. Such a job might be triggered on a schedule, or each time a post is saved.
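For the curious, such a map-reduce job might look roughly like this with pymongo (a sketch, not code from Plog; mongo_db is assumed to be a pymongo Database handle, and the collection and output names are invented):

from bson.code import Code

mapper = Code('''
    function () {
        this.tags.forEach(function (tag) {
            emit(tag, 1);
        });
    }
''')

reducer = Code('''
    function (key, values) {
        return Array.sum(values);
    }
''')

# writes {_id: <tag>, value: <count>} documents into "tag_counts"
mongo_db.post.map_reduce(mapper, reducer, 'tag_counts',
                         query={'published': True})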
Rather than either of these, Plog uses MongoDB's atomic update operator $inc along with upserts to maintain tag count information when posts are saved. The counts for the tags on the previous version of the post are each decremented by one, and the counts for the tags on the new version of the post are each incremented by one:
if post.published:
    # decrement tagcloud count on all tags in the
    # previous version of the Post
    TagCloud.objects(tag__in=post.tags, count__gte=1).update(
        inc__count=-1, set__updated=datetime.utcnow())

for field in form:
    setattr(post, field.name, field.data)
post.slug = slug_for(title=post.title, pubdate=post.pubdate)
post.save()

if post.published:
    # then increment tagcloud count on all tags in
    # the current version of the Post
    for tag in post.tags:
        TagCloud.objects(tag=tag).update(
            inc__count=1, set__updated=datetime.utcnow(), upsert=True)
Note that since an upsert will create at most one document, we have to iterate over the tags in the second set of updates; in the first, since we're only interested in modifying TagCloud objects that might already exist, we don't need an upsert, and can do a one-line update.
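Under the hood, each of those upserts corresponds to a plain MongoDB update; in pymongo terms it looks roughly like this (the collection name is an assumption, and since tag is the primary key, Mongoengine stores it as _id):

mongo_db.tag_cloud.update(
    {'_id': tag},
    {'$inc': {'count': 1},
     '$set': {'updated': datetime.utcnow()}},
    upsert=True)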
The TagCloud model is quite simple:
class TagCloud(db.Document):
    tag = db.StringField(primary_key=True)
    count = db.IntField()
    updated = db.DateTimeField()
A healthy sprinkling of math determines the "bucket" that each tag belongs in (with bucket 1 holding the tags occurring most frequently, and bucket 6 the least frequent -- the bucket is used to generate an h1 through h6 tag in the template):
@staticmethod
def get(sizes=6):
    tags = list(TagCloud.objects(count__gt=0).order_by('tag'))
    if not tags:
        return tags

    least = min(t.count for t in tags)
    most = max(t.count for t in tags)
    range = max(most - least, 1)
    scale = float(min(range, sizes))

    for t in tags:
        # more frequent tags land in lower-numbered (larger) buckets
        t.bucket = sizes - int(round(scale * (t.count - least) / range))
    return tags
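To make the arithmetic concrete (with invented numbers): given tags with counts 2, 3, and 4, least is 2, most is 4, range is 2, and scale is 2.0, so the three tags land in buckets 6, 5, and 4 respectively -- the count-4 tag renders largest.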
Atom Feed
Flask has built-in support for generating Atom XML feeds (or, to be more accurate, Werkzeug, the WSGI library beneath Flask, does):
feed = AtomFeed(
    title='late.am',
    feed_url=url_for('feed', _external=True),
    author={'name': 'Dan Crosta', 'email': 'dcrosta@late.am'},
    icon=url_for('static', filename='mug.png', _external=True),
    generator=('plog', 'https://github.com/dcrosta/plog', '0.1'),
)

posts = Post.objects(published=True).order_by('-pubdate')
for post in posts[:20]:
    feed.add(
        title=post.title,
        content=markdown(post.blurb + '\n' + post.body),
        content_type='html',
        author={'name': 'Dan Crosta', 'email': 'dcrosta@late.am'},
        url=url_for('post', slug=post.slug, _external=True),
        id=url_for('permalink', post_id=post.pk, _external=True),
        published=post.pubdate,
        updated=post.updated)
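AtomFeed lives in werkzeug.contrib.atom and can render itself directly as a response; a minimal sketch of the surrounding Flask view (my assumption about how Plog wires it up) might look like:

from flask import Flask
from werkzeug.contrib.atom import AtomFeed

app = Flask(__name__)

@app.route('/feed')
def feed():
    atom = AtomFeed(title='late.am', url='http://late.am/')
    # ... add entries exactly as shown above ...
    # get_response() produces an application/atom+xml response
    return atom.get_response()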
Other Goodies
As a web developer, obviously, I dislike writing HTML. That's why Plog uses Markdown, powered by the markdown2 module. markdown2 supports code syntax highlighting using pygments, which was an added bonus.
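The markdown() helper used in the feed code above might be wired up roughly like this (the exact extras are my assumption -- check markdown2's docs for the ones you need):

import markdown2

def markdown(text):
    # convert Markdown to HTML; the extra enables pygments-backed
    # syntax highlighting for fenced code blocks
    return markdown2.markdown(text, extras=['fenced-code-blocks'])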