Thanks to a series of recent posts on the SvN blog, I've been thinking more about my little Python A/B testing framework, Dabble.

I built Dabble to A/B test (sometimes also called "split test") features on 10gen.com. Following the advice of a blog post I've since lost track of, Dabble configures A/B test parameters entirely in code, follows procedure for independent testing, and generally works without much of a hassle.

Dabble from 10,000 Feet

Aside from a little necessary configuration, Dabble is composed of two user-facing classes: ABTest and ABParameter. The former represents the thing you're testing -- for instance, the design of a signup page or a call-to-action form; the latter is where you actually implement the variants which make up the tested features.

Here's an example, using a class-based Python framework:

class Signup(page):
    path = '/signup'

    abest = ABTest('signup_button', ['red', 'green'], ['show', 'signup'])
    color = ABParameter('signup_button', ['#FF0000', '#00FF00'])

    def get(self):
        # do some stuff
        self.abtest.record('show')
        return render('signup.html, color=self.color)

    def post(self):
        # validate the submission
        # save the user's information
        self.abtest.record('signup')

And in your template:

<form action="/signup" method="post">
  <!-- form goes here -->
  <input type="submit" style="background-color:{{color}}"/>
</form>

That's it.

Wait, what?

Yup, that's it. No if/else, no ugly duplication of code across templates. Just one template variable.

A/B testing is really that simple. Ruby's got 7 Minute Abs, and now Python's got 30 second Dabbles... or something... I guess I'll have to work on that.

Behind the Scenes

What magic makes this possible? First, let's take a look at the parent class for ABTest and ABParameter, aptly named AB:

class AB(object):
    # these are set by the configure() function
    _id_provider = None
    _storage = None

    # track the number of alternatives for each
    # named test; helps prevent errors where some
    # parameters have more alts than others
    __n_per_test = {}

    def __init__(self, test_name, alternatives):
        if test_name not in AB.__n_per_test:
            AB.__n_per_test[test_name] = len(alternatives)
        if len(alternatives) != AB.__n_per_test[test_name]:
            raise Exception('Wrong number of alternatives')

        self.test_name = test_name
        self.alternatives = alternatives

    @property
    def identity(self):
        return hash(self._id_provider.get_identity())

    @property
    def alternative(self):
        alternative = self._storage.get_alternative(
            self.identity, self.test_name)

        if alternative is None:
            alternative = random.randrange(
                len(self.alternatives))
            self._storage.set_alternative(
                self.identity, self.test_name, alternative)

        return alternative

The alternative property here does most of the heavy lifting, it consults with the storage object (which, in 10gen's case, is backed by MongoDB, of course) to find out if the current user already has been assigned an alternative; if not, it creates and stores one for the user. The identity property consults an IdentityProvider to identify the user; this might be implemented with cookies for sites that don't have logins, or might simply be a user ID for sites that do.

Next up is ABParameter:

class ABParameter(AB):
    # a descriptor object which can be used to vary parameters
    # in a class definition according to A/B testing rules.
    #
    # each viewer who views the given class will always be
    # consistently shown the Nth choice from among the
    # alternatives, even between different attributes in the
    # class, so long as the name is the same between them

    def __get__(self, instance, owner):
        return self.alternatives[self.alternative]

Python will call __get__ with a reference to the instance ("self") and the owning class (in our example, the Signup class), and return the value returned by this function when the attribute to which this descriptor is bound is accessed.

The storage backends (currently MongoDB and filesystem are supported) are responsible for storing results from the tests, test configurations, and user-to-alternative assignments. The code there is not particularly exciting, so I won't show it here.

Collecting Results

Dabble's storage object has a report method which summarizes the recorded results in a dictionary like:

{
    'test_name': 'signup_button',
    'results': [
        {
            'alternative': 'red',
            'funnel': [{
                'stage': ('show', 'signup'),
                'attempted': 1813,
                'converted': 18
            }],
        },
        {
            'alternative': 'green',
            'funnel': [{
                'stage': ('show', 'signup'),
                'attempted': 1838,
                'converted': 32
            }]
        }
    ]
}

Each alternative's results has a funnel element, which shows the number of users who attempted (i.e. the first step was recorded) and converted (i.e. the second step was recorded) for each successive pair of steps as defined in the ABTest. Here we can see that about the same number of people attempted the show-signup stage for each alternative, but nearly twice as many people signed up with the green button.

What's Next?

Here are a few areas where I'd like to improve Dabble:

  • Support non class-based views web frameworks (Django, Flask, etc)
  • Build IdentityProvider support for common web frameworks
  • DRY reporting (currently the two backends duplicate much of the same functionality in the report method)
  • Support other back-ends for storing results

You can help! fork the repository on GitHub, use it in your application, report or fix bugs, contribute improvements, etc.