Algebraic Data Types, pt 1: What are they?

In Python, every data type you can define is a class. This class will usually have one or more attributes and some methods that know how to use those attributes (though in Python sometimes you take the less object oriented route and use the attributes directly). Usually there’s a specific set of attributes that you want every member of the class to have — for example, a Point class might have attributes x and y (which should be numbers) or a Matrix class might have data (a list of lists, or something), width, and height (more numbers) — but Python doesn’t really provide any way to enforce this except convention.

C has structs, which if you squint are kinda-sorta vaguely similar to Python’s classes. A struct has some set of attributes (in C you’d call them fields), though unlike Python a struct definition contains a specification of what fields can and must exist. Structs also can’t have methods defined, but you can kind of fake it by not exposing the struct definition so that people can only manipulate the struct by using your functions.

C has another thing, though, to which Python doesn’t really have an equivalent: the enumeration type. In C, you can declare a new type (call it TrafficLight, since that’s what a lot of tutorials seem to do) and say just that it contains three distinct values: red, yellow, and green. These are a bit like lisp keywords in that they can be compared in constant time and contain no data other than which value they are.1 Unlike keywords, each enumeration type has its own set of values — you can’t use a TrafficLight in a place that expects a FileMode (whose values are probably read, write, and append, and maybe rw and some other things). Enums are a very good solution to the problem of “there are exactly seven possibilities for this piece of information and I’d like to distinguish between them”, and it can be kind of frustrating when you want them and you’re working in a language that doesn’t have them. But enums are limited in their own way: they can’t store any data. So for example, if you wanted your FileMode data type to have a field for the encoding of the text you’re working with… you have to add a lot of different enum values, like rw_ascii and rw_utf8 and append_ascii and append_utf8 and so on.

Algebraic data types are a bit like a combination of the two approaches. An ADT definition looks a bit like this:

data FilePosition = Start | End | Offset Int

where the |s separate different alternative forms for values of this type to take. Within each alternative, the first word is called a “constructor” which is a bit like an enum, and all following words are pieces of data that that specific constructor holds — in this case, our Offset constructor holds a single piece of Int data. So a FilePosition is either Start (the beginning of the file), End (the end of the file), or Offset x (which is x characters into the file). Like an enum, if you have a FilePosition you can determine in constant time what constructor it uses, and every FilePosition will use exactly one of these three constructors. But like a struct or Python class, FilePositions can hold data too. And the data held in an ADT value depends on the particular constructor used — if you know the constructor, you (and the compiler) know what kind of data is stored there.

ADT values are consumed in a style commonly called “pattern matching”, but I prefer to think about it as more like “case analysis”. I love it because it makes a lot of things explicit that you already had to be keeping track of in your head anyway. For example, let’s think about binary trees. In Python you might define a tree type like this:

class Tree:
    __init__(self, value=None, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right
    def preorder(self):
        # First, the tree might be empty
        if self.value is None: return []
            # Either the left or right subtrees might be empty
            if self.left is None: left_pre = []
            else: left_pre = self.left.preorder()
            if self.right is None: right_pre = []
            else: right_pre = self.right.preorder()
            return left_pre + [self.value] + right_pre

We have a lot of “if” statements there because there are a lot of possibilities. Let’s look at an equivalent in Haskell:

-- we could make trees that hold any type if we wanted, but for
-- simplicity let's stick to integers.
data Tree = Empty | Branch Int Tree Tree
preorder :: Tree -> [Int]
preorder Empty = []
preorder (Branch val left right) = preorder left ++ [val] ++ preorder right

Because we’re using algebraic data types, there’s no need to do any manual conditional checks to see what kind of tree we have — we know that a tree is in either one of two states (it’s empty or it has a value, a left tree and a right tree (either of which could be empty)), and we can tell what kind of tree we have just by looking at the constructor. We also don’t have to handle a bunch of unnecessary possibilities that exist in the python program — for example, is there a difference between Tree(3, None, None) and Tree(3, Tree(None,None,None), Tree(None,None,None))? In Haskell, we don’t need to make a decision: we only have one way to represent a tree containing just 3 (Branch 3 Empty Empty).

Algebraic data types will spoil you quickly. They’re such a nice way to represent data, and case analysis/pattern matching is such a nice way to consume it, that programming without them when you need them can start to feel like a chore.

  1. Except, of course, that this is C and enum values are just integers — so you could use them as integers if you wanted to. 

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s