
Should I be using pandas?

My PhD consisted partly of doing very hands-on neuroscience experiments, and partly of analyzing the “medium-sized data” sets produced by those experiments. I say “medium-sized” because the data was large enough to require care, but not so large that it couldn’t fit comfortably on a single disk in raw form, and fit in memory once partially processed.

At the start I was using Matlab for analysis, which seemed to be the weapon of choice for almost all neuroscientists (as far as I could tell)…. uhhhh, let me just give you an xkcd-inspired graph….

me and matlab

…so needless to say, I switched to python+numpy about a third of the way into the PhD and didn’t look back*.  This wasn’t too difficult to do as one of my colleagues had made the switch a few months earlier – I was offered a “starter pack” of functions for doing IO and maybe a few words of advice.

Although I did have a bit of extra-curricular experience with python, I’d not yet done any serious exploration of the language’s features or ecosystem.  Both of these (especially the second) took time: early on I had fun hacking around with magic methods to make a dictionary-like memoizing property-thing; weird mask-able lists; and something not quite as good as IPython’s autoreload feature. Later I even tried to write a system for un-parallelising plotting commands  (that I named “monobrow”).  However, I did eventually settle on an overall analysis design that looks a bit more sensible, and I’m going to describe bits of it here (it doesn’t include any parallelism unfortunately).

I’m not going to describe the actual analysis I do in detail; instead I’ll give a simplified analogy: pretend that I have a stack of about 1k images, and I need to run a handful of algorithms on each image separately, and then summarize the resulting metrics across all the images.  Also, let’s say that these 1k images were from 10 different food-obsessed-photographers, who went on separate holidays of varying lengths, but followed a similar pic-taking pattern each day: photos of breakfast, photos of lunch, photos of dinner.

My actual data wasn’t to do with images, photographers, food, or holidays, but in terms of organizing the analysis, my task was similar.  I ended up with something encapsulated in two main classes: one at the level of the individual “image”, and one at the level of the “entire-dataset”.  I’ll describe the “image” class briefly and then go into a bit more detail for the “entire-dataset” class.

The “image” class (aka Trial)

The analogized description I gave above doesn’t quite tell the full story; although there were some parts of the algorithms that had to be run on whole images, much of the computation was focused on one or more manually identified regions within the images.

The time needed to read an image off disk and do the initial processing wasn’t insignificant (nor was the space required to keep it in memory), so I hoped to reuse a single instance of each image when analyzing sub-regions within it.  However, since this was an optimization, I wanted to hide the details from the final top-level analysis script.  My solution required a bit of cooperation between the two classes: the “image” class would have a caching/memoization system, and the “entire-dataset” class would know to reuse instances of the image class across each of the subregions within an image; it would also know how to ask the image class to clear different bits of its cache.

image_id  subregion_id  image_instance
  0             91     [img #8712fc12a00]
  0             92     [img #8712fc12a00]
  0             93     [img #8712fc12a00]
  1             97     [img #1283efd7800]
  1             98     [img #1283efd7800]

The “image” class (which was actually called “Trial”) looked like this:


Each of the blocky things corresponds to a mixin contained in a separate file.  The foundation of the class is the caching mixin, which is used by the mixins within the “core api” cage as well as (potentially) by the more exotic stuff above.  I’ll describe the overall picture first, and then come back to give a bit more detail specifically on the caching.

The most important design decision was to separate the random-bits-and-bobs from the “core api”.  The idea is that by very clearly defining a core API, you can commit to keeping one section of the code in good working order even if other bits of the code are a mess.  If a core function is not quite general, fast, or readable enough, you have to promise to work on it “in situ” rather than inlining it into whatever complex analysis you are currently working on.  Python is good at letting you gradually add options to existing functions, so that you slowly build the value of your interface and its implementation. Drawing the line around the core also makes it easier to share the exotic stuff with other people: you either provide them with a copy of your well-maintained code, or you just give them a well-maintained description of the interface.

The tricky thing I found when defining the “core api” was that there were actually several parts to it: I wanted to separate the caching interface, the IO, and the basic analysis, but it took me a few iterations before I realized that I could create the divisions, and exactly where they should be.  With the current design, I can have separate IO implementations for different file formats, yet still slot that under the top-most part of the core.  I can also play with different systems of caching, or use a no-cache implementation, without having to worry about refactoring lots of core and non-core files.
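To make that layering concrete, here is a minimal sketch of how such a class might be assembled from mixins (the names here are illustrative, not taken from the actual code):

```python
class CacheMixin(object):
    """Foundation: the memoization machinery."""

class IOMixin(object):
    """File-format-specific reading; swappable per format."""

class CoreAnalysisMixin(object):
    """The well-maintained core-api functions."""

class ExoticMixin(object):
    """Messier, experimental analyses built on top of the core."""

class Trial(ExoticMixin, CoreAnalysisMixin, IOMixin, CacheMixin):
    """Each mixin lives in its own file; swapping the IO or caching
    implementation is just an edit to this base-class list."""
```

Because the divisions are expressed as separate base classes, a no-cache variant or a second file format becomes a one-line change at the point of composition.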

So, let’s look at memoization/caching in more detail. The main two methods of the TrialCache mixin are as follows:

def _cache_has(self, kind, names, key=None):
    names = (names,) if not isinstance(names,
                            (list, tuple, set)) else names
    # update/create the list of names for the given kind
    self._cache_name_lists[kind] = set(names)
    # initialize any of the attributes if they are missing
    for name_ii in names:
        if not hasattr(self, name_ii):
            setattr(self, name_ii, None if key is None else {})
    # return False if any of the attrs are not already computed
    for name_ii in names:
        attr = getattr(self, name_ii)
        if attr is None or (key is not None and key not in attr):
            return False
    return True

def _clear_cache(self, keep=(), drop=()):
    if len(keep) and len(drop):
        raise ValueError("use 'keep' or 'drop', not both.")
    if any(" " in k for k in tuple(keep) + tuple(drop)):
        raise NotImplementedError()
    if len(keep):
        kinds = [k for k in self._cache_name_lists.keys() if
                 all(kk not in keep for kk in k.split(" "))]
    elif len(drop):
        kinds = [k for k in self._cache_name_lists.keys() if
                 any(kk in drop for kk in k.split(" "))]
    else:
        kinds = list(self._cache_name_lists.keys())
    for k in kinds:
        for name in self._cache_name_lists[k]:
            if hasattr(self, name):
                delattr(self, name)

_cache_has is a slightly unusual function insofar as its name should be interpreted as a statement of intent as well as a question: the caller is saying “I intend for the cache to contain such-and-such from now on, but does it have it yet?”

Within another mixin, you use it as follows:

def red_count(self):
    if not self._cache_has('color_data', '_red_count'):
        self._red_count = compute_red_count(self._data, self._params)
    return self._red_count

def smoothing(self, val):
    self._params['smoothing'] = val
    self._clear_cache(drop=('color_data',))  # invalidate dependent results

The idea is that the computation of _red_count is guarded by the _cache_has intent/test, so that redundant re-computing can be avoided, yet we can still invalidate stored results when we need to, i.e. when we change the smoothing setting. You could have many computed things that depend on the smoothing setting and invalidate them all with a single command: you can define as many groupings as you need and either ask to invalidate specific groupings or invalidate all except specific groupings.

Note that the code above accepts either a single string or a list of strings for most arguments. It also has some built-in support for “lazy” dictionaries (via the key argument) that I haven’t demonstrated here.
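To see the groupings in action, here is a condensed, self-contained sketch of the same idea (simplified: no key/lazy-dict support, and all names are invented for this demo):

```python
class CacheDemo(object):
    """Condensed sketch of the TrialCache idea (not the real code)."""
    def __init__(self):
        self._cache_name_lists = {}

    def _cache_has(self, kind, names):
        names = (names,) if isinstance(names, str) else tuple(names)
        self._cache_name_lists[kind] = set(names)   # register the grouping
        for name in names:
            if not hasattr(self, name):
                setattr(self, name, None)
        return all(getattr(self, name) is not None for name in names)

    def _clear_cache(self, keep=(), drop=()):
        if keep and drop:
            raise ValueError("use 'keep' or 'drop', not both.")
        if keep:        # clear every grouping NOT mentioned in keep
            kinds = [k for k in self._cache_name_lists if k not in keep]
        elif drop:      # clear only the groupings mentioned in drop
            kinds = [k for k in self._cache_name_lists if k in drop]
        else:           # clear everything
            kinds = list(self._cache_name_lists)
        for k in kinds:
            for name in self._cache_name_lists[k]:
                if hasattr(self, name):
                    delattr(self, name)

    def red_count(self):
        if not self._cache_has('color_data', '_red_count'):
            self._red_count = 42        # pretend this is expensive
        return self._red_count

    def area(self):
        if not self._cache_has('geometry', '_area'):
            self._area = 7              # a second, separate grouping
        return self._area

t = CacheDemo()
t.red_count(); t.area()
t._clear_cache(drop=('color_data',))    # invalidates _red_count only
assert not hasattr(t, '_red_count') and t._area == 7
t._clear_cache(keep=('geometry',))      # clears everything except geometry
assert t._area == 7 and t.red_count() == 42
```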

The code is not perfect: in retrospect it might have been better to hold all the cached items within an actual “cache” container, rather than attaching them to the self object directly and only managing them via lists of the attribute names. Also, I might want to (optionally) expose the caching as a decorator, which is a common paradigm for simple memoizing. If I was being really serious I might investigate some kind of data-flow caching system that lets you define a hierarchy of data dependencies, but that’s probably a lot more complicated…indeed that’s roughly what my ambitious C++ project, “Venomous“, is hoping to do. Anyway, all I wanted to point out is that with a few lines of mixin methods you can create a pretty powerful caching system that other mixins can use in a very compact way.

We’ll come back to this caching system in a bit, but next time from the point of view of the “entire-dataset” class….

The “entire-dataset” class (aka ExpSet)

I don’t have a pretty picture for this class unfortunately, so I’ll just have to describe it in words.

For several months I was using a class that was a lot like the one I’m about to describe but it was cobbled together out of a strange mixture of pure-python lists, numpy arrays and weird home-made concoctions. Switching to pandas wasn’t all that simple, but it made things a lot better.

My impression is that pandas is viewed quite differently by different groups of users: finance/economics-y people might be in the habit of thinking that every dataframe has a timestamp column, with rows that can be aggregated by hour or by day. Geo-people might hold equivalent beliefs about GPS coordinates (I’m guessing). To me, pandas is simply a package that adds labels to the rows and columns of a numpy-like 2D array, and ensures that the labels stick with the rows and columns as you index into and make copies of the array. And, importantly for my use-case, those labels can be hierarchical…

…well, in theory they can be hierarchical, but in practice the support for MultiIndex isn’t perfect (yet): the syntax for indexing/slicing and joining these kinds of dataframes is not exactly pleasant or easy to remember.
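For anyone who hasn’t met pandas in this role, here is a toy example of labels, including hierarchical ones, sticking to rows (the column and index names are invented to match the analogy above):

```python
import pandas as pd

# hierarchical row labels, matching the image/subregion structure above
idx = pd.MultiIndex.from_tuples(
    [(0, 91), (0, 92), (0, 93), (1, 97), (1, 98)],
    names=['image_id', 'subregion_id'])
df = pd.DataFrame({'redness': [0.2, 0.5, 0.9, 0.1, 0.7]}, index=idx)

sub = df.loc[0]              # all subregions of image 0; labels come along
assert list(sub.index) == [91, 92, 93]
assert df.loc[(1, 97), 'redness'] == 0.1
```

The labels survive slicing and copying, which is most of what my usage relied on; the pain only starts with fancier MultiIndex slicing and joins.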

This lack of whole-hearted support for hierarchical indexing led me to encapsulate my main dataframe(s) inside a class of my own. The idea being that I could patch the syntactical gaps more easily and come up with a library of tools for doing slightly more complex things. I settled on having two separate dataframes, which I named simply df1 and df2. The first had one row for each “region of interest” within each image, and the second had one row for each image. The first thing I did was write a method to take the strange csv-tables I had and parse them out into the two dataframes, including instantiating all the images. (I didn’t mention this explicitly earlier, but the image class is designed to be entirely lazily loaded – the filename is merely recorded at construction, nothing is read off disk.)
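The lazy-loading idea can be sketched in a few lines (illustrative names; the real class records rather more than a filename):

```python
class LazyImage(object):
    """Construction only records the filename; nothing touches the disk
    until the data is first requested."""
    def __init__(self, filename):
        self.filename = filename
        self._data = None

    @property
    def data(self):
        if self._data is None:               # first access triggers the read
            self._data = self._read_from_disk()
        return self._data

    def _read_from_disk(self):
        return "pixels-of-" + self.filename  # placeholder for real IO

img = LazyImage("holiday_0001.png")
assert img._data is None                     # nothing loaded yet
assert img.data == "pixels-of-holiday_0001.png"
```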

Once I had the basic pair of dataframes I could start thinking about iterating over the rows, running the algorithms on the images. This is where the title of the post comes in: should I be using pandas for this sort of thing, given that pandas is supposed to be designed for performing operations in clever aggregate ways?

Well, it might look strange to the other users of Pandas, but I think the paradigm I’ve adopted is absolutely valid and should perhaps be more widely encouraged.

Before I show you an example of iterating over the rows of the dataframe, let me give you (most of) the code for the RowIndexer class I came up with:

from contextlib import contextmanager

class RowIndexer(object):
    def __init__(self, obj, idx, **kwargs):
        object.__setattr__(self, '_obj', obj)
        object.__setattr__(self, '_idx', idx)
        object.__setattr__(self, '_with_column_sub', None)
        for k, v in kwargs.iteritems():
            object.__setattr__(self, k, v)

    def __getattr__(self, key):
        original_key = key
        if self._with_column_sub is not None:
            sub_idx = self._with_column_sub.index('[attr]')
            key = self._with_column_sub[:sub_idx] + (key,) + \
                    self._with_column_sub[sub_idx+1:]
        # try a simple column indexing, then index names,
        # then index names without sub
        try:
            return self._obj.get_value(self._idx, key)
        except KeyError:
            pass
        try:
            return self._idx[list(self._obj.index.names).index(key)]
        except ValueError:
            pass
        if original_key == key:
            raise KeyError("Could not find key '" + str(key) +
                           "' in column or index names.")
        try:
            return self._idx[list(self._obj.index.names)
                             .index(original_key)]
        except ValueError:
            raise KeyError("Could not find key '" + str(key) +
                           "' or '" + str(original_key) +
                           "' in column or index names.")

    def __setattr__(self, key, value):
        if self._with_column_sub is not None:
            sub_idx = self._with_column_sub.index('[attr]')
            key = self._with_column_sub[:sub_idx] + (key,) \
                    + self._with_column_sub[sub_idx+1:]
        elif self._obj.columns.nlevels > 1:
            key = (key,) + ('',) * (self._obj.columns.nlevels-1)
        try:
            self._obj.set_value(self._idx, key, value)
        except TypeError:
            # dtype clash: coerce the column to object and retry
            self._obj[key] = self._obj[key].get_values().astype(object)
            self._obj.set_value(self._idx, key, value)

    @contextmanager
    def column_sub(self, *args, **kwargs):
        n_attr = args.count('[attr]')
        if not args or n_attr > 1 or n_attr == len(args):
            raise ValueError("you must specify at most one "
                             "'[attr]' spot, and at least "
                             "one sub label.")
        padding = kwargs.pop('padding', "")
        if kwargs:
            raise ValueError("unrecognised kwargs")
        # establish total levels required
        if '[attr]' not in args:
            args = args + ('[attr]',)
        target_nlevels = len(args)
        _pad_columns_to_level(self._obj, target_nlevels, padding)
        object.__setattr__(self, '_with_column_sub', tuple(args))
        try:
            yield self
        finally:
            object.__setattr__(self, '_with_column_sub', None)

Instances of this class are yielded by a generator in ExpSet as the user iterates over the rows of df1 and/or df2. The main idea is that assigning to attributes on the instance actually results in values being written to columns within a specific row of a dataframe.

for subregion, image_inst in dataset.iter():     
    subregion.redness = image_inst.get_redness(subregion.shape)

Here subregion is a RowIndexer instance that points to a different row on each iteration of the loop, and image_inst is the instance of the “image” class relevant for the given row. Inside the generator we make sure to yield all the subregions for a specific image before moving on to the next. When we are ready to switch images, we clear some/all of the cache for the old image, thus keeping the total memory usage manageable. Exactly which bits of the cache we want to clear depends on (a) what we are likely to want in future if/when we loop over the data again; (b) what was expensive to compute; and (c) how large things are in memory. These three points are common to pretty much any caching system, and the optimal choice is rarely an easy thing to pin down: here we don’t use any automatic heuristics, instead we allow the user to specify which bits of data to delete using the fine-grained caching machinery outlined earlier. For example, the user may delete the raw whole-image, but keep the down-sampled, filtered version.

The nice feature of this system is that the Image class is agnostic of the way in which you iterate: it will do its best to cache stuff and reuse anything from the cache when possible. Also, the user’s loop (i.e. the one shown here) can ignore the details of caching: they don’t even need to know that one image instance is shared across multiple subregions. If the user does want to customize the cache-clearing behavior, they simply provide an extra argument to the iterator. The iterator can also take care of displaying progress reports.
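Here is a toy version of that iteration contract: rows grouped by image share one instance, and each image’s cache is cleared exactly once, after its last subregion has been yielded (everything here is a stand-in, not the real ExpSet code):

```python
class FakeTrial(object):
    """Stand-in for the image/Trial class."""
    def __init__(self):
        self.n_cache_clears = 0
    def _clear_cache(self, drop=()):
        self.n_cache_clears += 1

def iter_rows(rows, trials, drop=('raw_image',)):
    """Yield (row, trial) pairs; rows are assumed grouped by image_id.
    When we move on from an image we clear (some of) its cache."""
    prev_id = None
    for row in rows:
        image_id = row['image_id']
        if prev_id is not None and image_id != prev_id:
            trials[prev_id]._clear_cache(drop=drop)
        prev_id = image_id
        yield row, trials[image_id]
    if prev_id is not None:
        trials[prev_id]._clear_cache(drop=drop)

trials = {0: FakeTrial(), 1: FakeTrial()}
rows = [{'image_id': 0, 'subregion_id': 91},
        {'image_id': 0, 'subregion_id': 92},
        {'image_id': 1, 'subregion_id': 97}]
seen = [t for _, t in iter_rows(rows, trials)]
assert seen[0] is seen[1]        # one instance shared per image
assert trials[0].n_cache_clears == 1 and trials[1].n_cache_clears == 1
```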

Try doing that in Matlab!!

An important feature of the RowIndexer class is that you don’t need to declare new columns ahead of time: they are automatically added to the parent dataframe as needed. The class also has a couple of bonus features, one of which is the ability to handle and create multi-indexed columns using a with-block:

subregion.n_people = image_inst.n_people(subregion.shape)
for zone in ['upper', 'lower']:
    with subregion.column_sub('[attr]', zone), image_inst.mask(zone):
       subregion.n_flowers = image_inst.n_flowers(subregion.shape)
       subregion.n_clouds = image_inst.n_clouds(subregion.shape)
# the parent dataframe now has columns:
#  n_people |   n_flowers   |   n_clouds    |
#           | upper | lower | upper | lower |

All of this lets you do some fairly complex meta-analysis (still using caching, remember) in just a few clear lines of high-level code.

The final thing I want to discuss is a bit like the RowIndexer, but it applies to the whole dataframe, or in fact to the whole custom class in which the pair of dataframes are held.

As before, I’ll give the implementation first, but you don’t have to look at it too closely, especially as it depends on a few things that I haven’t included:

# method of "entire-dataset" class...
@contextmanager
def view(self, **kwargs):
    original_df1, original_df2, original_mode = (
                           self.df1, self.df2, self.mode)
    sub_df1, sub_df2, sub_mode = self._get_sliced(**kwargs)
    self.df1, self.df2, self.mode = sub_df1, sub_df2, sub_mode
    self.df1.is_copy = self.df2.is_copy = None
    try:
        yield self
    finally:
        # restore the originals, then propagate changes back into them
        self.mode = original_mode
        self.df1 = original_df1
        self.df2 = original_df2
        if sub_mode != original_mode:
            sub_df1, sub_df2 = self._get_as_mode(
                         src=sub_mode, dest=original_mode,
                         df1=sub_df1, df2=sub_df2)
        _update(original_df1, sub_df1)
        _update(original_df2, sub_df2)

And we can use it as follows:

with dataset.view(meal='breakfast', holiday_type='european'): 
   for subregion, image_inst in dataset:
       subregion.juice = image_inst.has_juice(subregion.shape)  

The view method does two things: firstly, it exposes a nice way of slicing the dataset using keyword arguments that can be handled intelligently behind the scenes (using all the gruesome pandas syntax); and secondly it acts as a context-manager that gives us a filtered pseudo-view onto the whole dataset: once the with-block exits, any changes made are propagated back into the original dataset. When using raw pandas, some types of indexing generate copies and some types generate views: certainly if you have complex filtering requirements it is rare that you will end up with a view back onto the original dataframe. (For many users this may only be a memory problem, rather than a logical one, but given that we want to write into the rows of the dataframe we need to have some kind of system such as this.) In fact, in my ExpSet class, not only can you slice the data in interesting ways with this syntax, you can also pivot it into different forms, do some computation, and then have it revert back to the original form when the with-block exits.

Identical slicing/pivoting functionality is provided in a second method called copy; as the name suggests, this is not intended for a with-block, but simply gives a do-what-you-want-with clone of the original dataset.

That’s pretty much all I wanted to say about the “entire-dataset” class. The two things I wish it did (better) were outputting nice HTML tables and providing a simple means of running loops in parallel. Pandas does do HTML tables, but I had to fork it in order to put images in the table (I did submit a pull request, but in the end it got rejected in favor of something slightly different). Having interactive tables that you can filter and sort in the browser would also be great (I think there are some tools built on top of pandas that do this to some extent, but I never tried them).

As with previous posts, this took a while to get onto the page. If you found it interesting, or you want to argue about something, please do post in the comments.

*except when helping other people with Matlab, ‘cos I’m such a nice person. Edit Jan-2016: I’ve just come across last year’s Stack Overflow survey in which it was found that two thirds of Matlab-using respondents did not want to continue using it, which put it near the top of the “dreaded” languages list.


Non-trivial vectorizations

For a more general discussion on optimizing, see my previous post on perfectly timed optimization.

Before we get started, I should say that the following stories are likely to be of most interest to Matlab or Numpy users.  In compiled languages, the general lack of overheads will probably do a pretty good job of making fast programs, as will any loop optimization passes.  Auto-vectorization is getting better over time, but is unlikely to work for the examples here (and in fact using these methods in a compiled language may be a pessimization). However, these tricks do become relevant again when one switches to computing on the GPU.

Example 1.

Imagine we have an array giving times for 10k events, and our task is to compute a histogram giving all the pairwise time differences. Or rather, we must compute just the lower end of that histogram, where time differences are within some specified window.


Some raw-python code might look like the following. Note that for simplicity I assume the desired histogram bins are one time-unit wide…

def make_histogram(times, window):
    hist = [0]*window
    for ii, time_ii in enumerate(times):
        for time_jj in times[ii+1:]:
            delta = time_jj - time_ii
            if delta >= window:
                break
            hist[delta] += 1
    return hist

In words: for each event, loop forward over the subsequent events, storing time-diffs in the histogram until you reach the end of the window. You could improve this slightly by pulling the delta>window check into a separate loop from the hist[delta]++ line: create a last_jj variable that gets incremented zero or more steps each ii-iteration.
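That tweak might look something like the following sketch (it assumes, as everywhere in this example, that times is sorted):

```python
def make_histogram_fast(times, window):
    hist = [0]*window
    last_jj = 0  # one past the last event within `window` of times[ii]
    for ii, time_ii in enumerate(times):
        if last_jj < ii + 1:
            last_jj = ii + 1
        # advance the pointer; it never needs to move backwards
        while last_jj < len(times) and times[last_jj] - time_ii < window:
            last_jj += 1
        for time_jj in times[ii+1:last_jj]:
            hist[time_jj - time_ii] += 1  # delta < window is now guaranteed
    return hist
```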

Right, let’s begin vectorizing….

Firstly, for each ii let’s find the number of subsequent events in the window. The approach we use may not strictly speaking be using SIMD under the covers, but in the Matlab/Numpy world we could still call it a “pseudo-vectorized” primitive.

counts = times.searchsorted(times + window, side='left') \
                             - arange(len(times)) - 1

# which would let us do something like...
hist = zeros(window, dtype=int)
for start_ii, (time_ii, count_ii) in enumerate(
                                 zip(times, counts)):
    deltas = times[start_ii+1:start_ii+1+count_ii] - time_ii
    hist += bincount(deltas, minlength=window)

searchsorted and bincount are part of Numpy; the equivalents in Matlab are histc (use the second output) and accumarray. Observe that by knowing the counts in advance we were able to pull the delta calculation out of the inner loop, and then as a result we could pull the increments out of the inner loop too.

The second step is a bit trickier: we want to pull the delta calculation out of the ii loop entirely. This calculation is just a long list of times[jj] - times[ii], so what we’ll need is two sets of indices, that look something like:

all_ii = [0 0 0 0 1 1 1 1 1 3 3 4 4 4 4 ...]
all_jj = [0 1 2 3 0 1 2 3 4 0 1 0 1 2 3 ...] + all_ii + 1

that is, we want to repeat each ii the number of times specified by counts[ii], and all_jj is formed from all_ii by the addition of mini-ranges of length counts[ii]. In Numpy there is a function for generating all_ii, called repeat. In Matlab we have to build the function ourselves:

function ret = repeat(n)
    n_mask = logical(n);
    n_inds = find(n);
    n_inds(2:end) = n_inds(2:end) - n_inds(1:end-1);
    n_cumsum = ones('single');
    n_cumsum(2:length(n)+1) = cumsum(n)+1; 
    ret = zeros(n_cumsum(end)-1, 1, 'single');
    ret(n_cumsum(n_mask)) = n_inds; 
           % note: numel(n_mask) = numel(n_cumsum)-1
    ret = cumsum(ret);

For the multiple ranges in the construction of all_jj, we can use the following in Numpy (this translates to Matlab fairly easily, but be careful with one/zero-based indexing):

def multi_arange(n):     
    n_mask = n.astype(bool)
    n_cumsum = cumsum(n)
    ret = ones(n_cumsum[-1]+1, dtype=int)
    ret[n_cumsum[n_mask]] -= n[n_mask] 
    ret[0] -= 1
    return cumsum(ret)[:-1]

If you weren’t already a fan of cumsum, I hope you are warming to it – look how we used it twice in one function here (and we used it twice in the Matlab version of repeat)!

Ok, that’s pretty much it. We can now get rid of the ii-loop, and we’re done:

def make_histogram(times, window):
    counts = times.searchsorted(times + window, side='left')\
                                 - arange(len(times)) -1
    all_ii = repeat(arange(len(counts)), counts)
    all_jj = multi_arange(counts) + all_ii + 1
    deltas = times[all_jj] - times[all_ii]
    return bincount(deltas, minlength=window)

Note that this optimization only makes sense if the counts in each window are guaranteed to be fairly small, and the total deltas array fits easily in memory. If either of those things is not true you might want to stick with the intermediate version, or wrap some batching mechanism around the final version.
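A batching wrapper might look roughly like this sketch (multi_arange is repeated from above for completeness; the batch size is arbitrary):

```python
import numpy as np

def multi_arange(n):
    # repeated from above, for completeness
    n_mask = n.astype(bool)
    n_cumsum = np.cumsum(n)
    ret = np.ones(n_cumsum[-1] + 1, dtype=int)
    ret[n_cumsum[n_mask]] -= n[n_mask]
    ret[0] -= 1
    return np.cumsum(ret)[:-1]

def make_histogram_batched(times, window, batch=4096):
    """Build the deltas for a slice of events at a time, so the temporary
    arrays stay bounded (the batch size here is arbitrary)."""
    hist = np.zeros(window, dtype=int)
    counts = times.searchsorted(times + window, side='left') \
        - np.arange(len(times)) - 1
    for start in range(0, len(times), batch):
        stop = min(start + batch, len(times))
        c = counts[start:stop]
        all_ii = np.repeat(np.arange(start, stop), c)
        all_jj = multi_arange(c) + all_ii + 1
        hist += np.bincount(times[all_jj] - times[all_ii],
                            minlength=window)
    return hist
```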

Obviously, if you are going to do this sort of thing yourself, make sure you test properly as you go along and add assertions about array shapes, types, etc. The two helper functions I showed should also be given some documentation: just showing a sample input/output should be enough here. Note that the examples I’ve given here are simplified versions of the code I actually use – the stuff here is fine for simple integer cases, but I recommend testing it properly yourself if you actually intend to use it!

The multi_arange function, and some other goodies are available in the numpy groupies package that I co-authored with @ml31415 on github.

Example 2.

Imagine we have a list of 60k (x, y) coordinates that correspond to the path someone walked. Imagine also, that there are errors in that path: at certain points the person rotated by some amount that was not reflected in the path data. Our task is to take the 2D array of xy data, together with a 1D array providing the rotation errors at each step, and generate a corrected 2D array of xy data.

Let’s picture a simple example in which the person took just 50 steps (rather than 60k), and recorded two abrupt rotations. Our task could be to correct for 3 additional rotations, or we might need to correct for a series of very small rotations over multiple steps:


The function I wrote to solve this wasn’t actually needed in any analysis in the end, but I had fun writing it, not least because I got to name it cumrot!

I’m not going to give an un-vectorized code sample because I hope the above picture already describes the problem well enough (and I never tried writing it un-vectorized so I’m not sure what that would look like anyway).

Right, let’s get to it then… (edit: you may want to skip the interesting-looking stuff and read the “oh dear” section at the end.)

To begin, you need to know that a rotation of th radians about a point (x,y) can be expressed in matrix form as:

\left( \begin{array}{ccc} \cos(th) & -\sin(th) & [x - x \cos(th) + y \sin(th)] \\ \sin(th) & \cos(th) & [y - x \sin(th) - y \cos(th)] \\ 0 & 0 & 1 \end{array} \right)

So if we had just one rotation error, we could apply this rotation to all the data from that point forward. Even if we had a small number of corrections we could loop through each of them and apply them one after the other, i.e. this would suit the “with 3 rotations” example above. However, it would be pretty inefficient to do this for the “with gradual rotation” example, because there are many small rotations needing to be corrected for (and even that was just meant to be a simple small example).

Instead, let’s imagine performing the rotations in reverse order, i.e. starting with the last one (but still applying the rotation to all the data from that point in the forwards direction) and then working back towards the first. If we do the rotations in this reversed order, then each pivot point is unaltered at the moment we need to create its rotation matrix. Thus, we can produce all the rotation matrices ahead of time. Which is a nice start.

But we can do better than that: observe that each point is being multiplied by all the preceding rotation matrices, however we could (by virtue of associativity) equivalently say that we multiply each point by a single composite-transformation matrix.
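We can sanity-check that claim numerically: build the 3x3 homogeneous matrix from the formula above and confirm that applying two rotations one after the other matches applying their single pre-multiplied composite:

```python
import numpy as np

def rot_about(th, x, y):
    """The 3x3 homogeneous-coordinate matrix from the formula above:
    a rotation of th radians about the point (x, y)."""
    c, s = np.cos(th), np.sin(th)
    return np.array([[c, -s, x - x*c + y*s],
                     [s,  c, y - x*s - y*c],
                     [0.0, 0.0, 1.0]])

p = np.array([2.0, 1.0, 1.0])    # a point in homogeneous coordinates
B = rot_about(-0.7, 1.0, 2.0)    # one rotation
A = rot_about(0.3, 0.0, 0.0)     # a second rotation
# applying B then A equals applying the single composite matrix A.B
assert np.allclose(A.dot(B.dot(p)), A.dot(B).dot(p))
```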

What does that give us? Well, *if* we can work out the necessary composite transformation matrices, then we can vectorize the second step, i.e. the multi-matrix-multiply, as follows:

x_out = a*x + b*y + c;
y_out = d*x + e*y + f;

where a b c d e f are all 60k-long 1D arrays, that together define the 60k separate transformation matrices of the form:

\left( \begin{array}{ccc} a & b & c \\ d & e & f \\ 0 & 0 & 1 \end{array} \right)

This is weird, right? Rather than using Numpy/Matlab to do matrix multiplies for us we’ve hand-written one. But ours is way better, because it’s actually 60k simultaneous multiplies! Ok, so the question is: how do we get the composite matrices needed for the multiply, above?

Well, it’s helpful that we’ve just seen how hand-writing matrix-multiplies sometimes makes sense. What we need to spot now is that we’re actually attempting to do a “cumulative” matrix multiply, i.e. each of the 60k steps is either an identity matrix or it’s a rotation matrix, and what we want is to do a cumulative matrix multiply along the length of this 60k stack of matrices. If you’re not familiar yet with basic cumulative sums (i.e. cumsum), then you’re going to be a bit out of your depth here. Even if you are a frequent user of cumsum you may want to go to the Wikipedia page, which has a nice description of how it can be computed in parallel. I reproduce the Wikipedia diagram here:

‘Prefix sum 16’, David Eppstein. (source)
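Before tackling matrices, the doubling pattern is easier to digest on plain scalars. This toy inclusive scan (power-of-two length assumed) uses the same strided up-sweep as the matrix loop below, plus a down-sweep to fill in the remaining prefixes:

```python
import numpy as np

def inclusive_scan(x):
    """Toy scalar analogue of the pattern used for the matrices below;
    assumes len(x) is a power of two."""
    x = np.asarray(x).copy()
    logn = len(x).bit_length() - 1
    for log_s in range(logn):               # up-sweep (as in the diagram)
        s, s2 = 2**log_s, 2**(log_s + 1)
        x[s2-1::s2] += x[s-1::s2]
    for log_s in range(logn - 2, -1, -1):   # down-sweep fills the gaps
        s, s2 = 2**log_s, 2**(log_s + 1)
        tgt = x[s2+s-1::s2]                 # a view: updates x in place
        tgt += x[s2-1::s2][:len(tgt)]
    return x
```

The matrix code below corresponds to the first (up-sweep) of these two loops, with the scalar addition replaced by the hand-written 3x3 multiply.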

Ok. We know how to vectorize a bunch of matrix multiplies, and the above diagram shows us the pattern we need to do them in, so all that’s left is to write the code. Below, I show the first of the two main loops. In preparation for this loop, we zero-padded x and y to ensure they are of power-two length, and we computed a b c d e f for all the individual rotation matrices (see the matrix above with cos(th), sin(th) etc.).

for log_s in range(logn):
    s = 2**log_s
    s2 = 2*s
    # pick out the right-hand values
    a_r, b_r, c_r, d_r, e_r, f_r = a[s2-1::s2].copy(), \
               b[s2-1::s2].copy(), c[s2-1::s2].copy(), \
               d[s2-1::s2].copy(), e[s2-1::s2].copy(), \
               f[s2-1::s2].copy()
    # and the left-hand values
    a_l, b_l, c_l, d_l, e_l, f_l = a[s-1::s2], \
            b[s-1::s2], c[s-1::s2], d[s-1::s2], \
            e[s-1::s2], f[s-1::s2]

    # store in right-hand value locations
    a[s2-1::s2] = a_l*a_r + b_l*d_r
    b[s2-1::s2] = a_l*b_r + b_l*e_r
    c[s2-1::s2] = a_l*c_r + b_l*f_r + c_l
    d[s2-1::s2] = d_l*a_r + e_l*d_r
    e[s2-1::s2] = d_l*b_r + e_l*e_r
    f[s2-1::s2] = d_l*c_r + e_l*f_r + f_l

Note that I translated this from Matlab to Numpy without checking the results, but it’s probably correct?! More importantly, note that we have to explicitly copy the right-hand values because we’re going to modify them in place bit by bit. Finally, note that although I was quite pleased with this when I wrote it, there is still room for improvement: you could use a single 2D array for a–f, which would simplify the “picking” bits and might let you fuse together some of the actual assignment bits. For the case when most of the matrices are just the identity matrix, you could further optimize this by performing the cumulative multiply on only the non-identity subset of the matrices, and then indexing into the result to expand it out to the appropriate length.

That’s all I wanted to say for this example. If you really want to see the complete solution then let me know (I have written it in Matlab). But hopefully you have enough here to write it yourself.

Oh dear! As I reached the end of writing about this, it occurred to me that there was in fact a much simpler way of doing the calculation: firstly, compute the cumulative sum for the extra rotation data; then convert the (x,y) data into a series of (dx, dy) steps; next, turn these Cartesian steps into polar steps and add the corresponding cumulative rotation angle for each step; finally, convert the steps back into Cartesian space and perform a cumulative sum in x and y.

This alternative method requires slightly more trigonometry (cosine and sine at the end, in addition to the arctan at the start), but the massive benefit is that all the hand-written matrix multiplies are replaced with standard cumulative sums: one for angle, one for x, and one for y. The original solution may be slightly better in terms of resistance to floating point errors (but I’m not going to examine that in detail now). It would also probably be faster if the number of rotations is small (and we implement the suggested optimization for that scenario). I wonder whether there was a reason I wrote the original function with matrix multiplies? I notice that the code included the option to apply the cumulative transformation to some secondary position data, so perhaps that had something to do with it, but unfortunately I no longer remember the actual details of the intended use-case.
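As a rough sketch of this simpler method (the function name and argument conventions are mine, not from the original code):

```python
import numpy as np

def apply_cumulative_rotations(xy, theta):
    """Apply a cumulative rotation to a trajectory using only cumsums.

    xy    : (n, 2) array of coordinates
    theta : (n-1,) array of extra rotation (radians) introduced at each step
    """
    cum_theta = np.cumsum(theta)                # cumulative rotation per step
    d = np.diff(xy, axis=0)                     # Cartesian steps (dx, dy)
    r = np.hypot(d[:, 0], d[:, 1])              # polar: step lengths
    phi = np.arctan2(d[:, 1], d[:, 0]) + cum_theta  # step angles, rotated
    out = np.empty_like(xy, dtype=float)
    out[0] = xy[0]
    # back to Cartesian steps, then re-integrate with cumsums
    out[1:, 0] = xy[0, 0] + np.cumsum(r * np.cos(phi))
    out[1:, 1] = xy[0, 1] + np.cumsum(r * np.sin(phi))
    return out
```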

Example 3.

In contrast with the previous examples, I am going to give a rough explanation of the neuroscience motivating this task (because it would be rather abstract and contrived without it).

Neurons, roughly speaking, behave like Poisson processes: at any given moment they each have a “true” rate, but we can only observe that rate indirectly by counting the number of discrete “spikes” (aka action potentials) they generate. For example, if a neuron generates 4 spikes in one second and 6 the next second we might (using an appropriate formula) estimate that the “true” underlying rate was about 5 Hz.

For many neurons, the “true” rate correlates with something we can observe in the outside world, where the word “correlate” here is meant in the general sense, i.e. some sort of statistically observed coupling. For example, the rates of some neurons are modulated by the exact location of the organism within its environment, i.e. a particular neuron might have a rate of 4Hz each time the person/animal goes to a spot near the window, but have a rate of approximately zero elsewhere.

Our task here is to analyse the data from 50 of these place-sensitive neurons: we will get a stack of 50 “ratemaps” that tell us the average rate of each cell in every square centimeter of an experimental arena, and we will get a list of the spike times for each of the 50 cells over the course of a 20 minute experiment. The question is: how can we use the neuron spike data and ratemaps to estimate the location of the animal? Since the animal is moving around, we need to produce a separate estimate for every second, or let’s say every quarter of a second.


The calculation we need to do (which can be found in a number of proper academic papers) is, in essence, quite simple: (1) count the number of spikes for each cell in each time window; (2) for each of the 1 centimeter-square bins in the arena, calculate the probability that the observed counts could have occurred; (3) either stop there, i.e. provide the likelihood-map for each time window as the output, or find the location of the maximum likelihood for each time window, and just give that as the output.

For a single 1-cm bin, at (x,y) in a single window, t, the calculation is:

L(x,y,t) = \prod_{i=1}^{50} \frac{rate_i(x,y)^{counts_i(t)} e^{-rate_i(x,y)}}{counts_i(t)!}

If real maths seems like hard work, then just make sure you don’t stare too closely. It’s the programming that we’re here for, so you don’t need to worry about the details of the equation really.

Right, and we’re off…

To begin with, I need to make a point that is going to be obvious to people with certain backgrounds (but is apparently not obvious to many neuroscientists):

\prod_i x_i=exp(\sum_i log(x_i))

This means we can re-write the equation for L(x,y,t) as a sum, in which the terms are nicely separated:

log(L(x,y,t)) = -\sum_i rate_i(x,y) + \sum_i [counts_i(t) log(rate_i(x,y))] -\sum_i log(counts_i(t)!)

Now, we can examine each term individually and come up with a way of vectorizing over the whole of x, y, and t.

The first term is easy:

# ratemaps is a 3D array of shape: [100 cm x 100cm x 50 neurons]
term_1 = sum(ratemaps, axis=-1)

Once we have the counts, the last term is also pretty easy. (For the counts we probably need to use bincount, either looping over cells, or using ravel_multi_index as I do in numpy-groupies.) We can even manage a “bonus” optimization because we know we’re dealing with small integers:

# counts is a 2D array of shape: [4800 windows x 50 neurons]
log_k_factorial = log(factorial(arange(max(counts)+1)))  # factorial from scipy.special
term_3 = sum(log_k_factorial[counts], axis=-1)

The bonus optimization was that we build a tiny lookup-table of log-factorials and then simply index into that. This is a real help as computing logs is seriously slow.
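Incidentally, the counting step itself might look something like the following (a sketch using the looping-over-cells bincount option; the names and the clipping detail are mine):

```python
import numpy as np

def count_spikes(spike_times, duration, window=0.25):
    """Bin each cell's spike times (in seconds) into fixed-width windows.

    spike_times : list of 1D arrays, one per cell
    Returns an int array of shape (n_windows, n_cells).
    """
    n_windows = int(round(duration / window))
    counts = np.empty((n_windows, len(spike_times)), dtype=np.int64)
    for i, t in enumerate(spike_times):
        # clip so a spike at exactly `duration` lands in the last window
        idx = np.minimum((np.asarray(t) / window).astype(np.int64), n_windows - 1)
        counts[:, i] = np.bincount(idx, minlength=n_windows)
    return counts
```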

The middle term is the only interesting one. Notice that it’s roughly of the form \sum_i a_i b_i. Whenever you see something of this form (and it occurs surprisingly often in my experience), you should immediately say to yourself “I bet that can be re-written as a matrix multiply”. I’m not going to try and describe the exact circumstances in which it is possible to turn sums into matrix multiplies; just remember that it’s worth looking out for them. Note that if you need to transpose a large array in order to do this it may not be worth it in Matlab, but it’s free to do so in Numpy. Numpy also lets you do n-dimensional dot products, whereas Matlab only does 1D or 2D; however, by reshaping your ND arrays appropriately you should be able to re-imagine an ND multiply as a Matlab-friendly 2D multiply.

# counts is a 2D array of shape: [4800 windows x 50 neurons]
# ratemaps is a 3D array of shape: [100cm x 100cm x 50 neurons]
term_2 = dot(log(ratemaps), counts[:,:,nax]) # nax = np.newaxis

All that’s left is to add the three terms together, with the appropriate broadcasting in Numpy or bsxfun in Matlab, and we’re done. The result is about 6 lines long, and doesn’t look that horrible (well it’s ok in Numpy, but a bit uglier in Matlab). It’s also amazingly fast given how much number crunching it’s doing!
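Putting it all together, a Numpy sketch of the whole thing might look like the following (the names are mine; I use counts.T rather than the newaxis version so the result has no trailing singleton axis, add a tiny eps to dodge log(0) in empty bins, and build the log-factorial table with a cumulative sum of logs — all details you may want to adapt):

```python
import numpy as np

def decode(ratemaps, counts):
    """Maximum-likelihood position decoding (a sketch, not the original code).

    ratemaps : (H, W, n_cells) mean rate per spatial bin per neuron
    counts   : (n_windows, n_cells) integer spike counts per time window
    Returns the (row, col) index of the most likely bin for each window.
    """
    eps = 1e-10  # avoid log(0) where a ratemap is exactly zero
    term_1 = ratemaps.sum(axis=-1)                       # (H, W)
    # lookup table of log(k!) for small integer counts
    kmax = counts.max()
    log_k_fact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, kmax + 1)))))
    term_3 = log_k_fact[counts].sum(axis=-1)             # (n_windows,)
    term_2 = np.dot(np.log(ratemaps + eps), counts.T)    # (H, W, n_windows)
    log_L = -term_1[:, :, None] + term_2 - term_3        # broadcasts to (H, W, n_windows)
    flat = log_L.reshape(-1, log_L.shape[-1])            # flatten the spatial axes
    return np.unravel_index(flat.argmax(axis=0), log_L.shape[:2])
```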

There may be some mistakes in this third example as I wasn’t as careful as with the previous two.

Example 4

Note that I added this example a few days after writing the original post, although the code actually predates that of the other examples.

This task is interesting in that we manage to *partially* vectorize a loop even though we can’t go the whole way. The trick is to perform a kind of manual profile-guided optimization.

The task is relatively simple: in an experiment, an LED traced out a trajectory in 2D space and we now have an array of 60k (x,y) coordinates corresponding to that trajectory. The problem is that in a few places the camera doing the tracking made an error, and spuriously recorded that the LED “jumped” over a large distance in an unrealistically short time.

speed filter

The filtering algorithm we need to implement is roughly the following:

def filter_speed(xy, max_speed):
    # dist is diff then hypot (though the sqrt is redundant here)
    def dist(p, q):
        return hypot(*(q - p))
    is_good = zeros(len(xy), dtype=bool)
    is_good[0] = True
    ii = 0
    while ii < len(xy) - 1:
        # scan forward from the last good point for the first
        # point reachable at a plausible speed...
        for jj in range(ii + 1, len(xy)):
            if dist(xy[ii], xy[jj]) < (jj - ii) * max_speed:
                is_good[jj] = True
                ii = jj  # ...and continue from there
                break
        else:
            break  # nothing reachable: we're done
    # now we're supposed to interpolate where is_good is False
    return is_good

Notice how each iteration of the ii loop depends on the previous iteration: we need to know which points are good and which are bad before we can make claims about subsequent points. (Look at the diagram if this isn’t immediately clear.) This kind of problem is vaguely similar to a regex search, where you need to scan along a long string looking for instances of a complex pattern; however, I believe that regexes are generally implemented with a state machine and lookup table, which is not something that makes sense here.

In python I opted to compile this (or something pretty similar) with Cython. Job done: easy 10x speed up. And that is probably the right approach to take, really. However, some months earlier, back in Matlab, I was more reluctant to write a MEX file and instead I spent a bit of effort vectorizing. My approach is probably ridiculous in all but the most obscure situations, but I’ll include it here as I think it’s fairly interesting.

So how can we vectorize this loop, when each iteration clearly depends on the previous one? The answer (as I already mentioned) is that we can only partially vectorize, but we can still get a pretty decent payoff. The observation we need to make is that, in general, the data is of reasonably high “quality”, meaning there won’t be that many errors (if there are a lot of errors then something needs fixing in the experimental setup so as to bring the quality of the data back to standard). Thus, if we vectorize the sections of the loop with no errors, or with very simple errors, we can jump over those sections when we encounter them in the loop, and dwell only on the difficult bits.

I’m not going to give the full code, because it’s quite long and rather ugly, but I’ll try and give you a flavor of it (as I said, don’t try this at home unless you really need to). Note that I’ve translated to python/numpy here:

# Is a step from xy[t] to xy[t+1] slow enough?
step1_is_good = dist(xy[:-1], xy[1:]) < 1*max_speed

# If not, is a step from xy[t] to xy[t+2] slow enough?
bad_idx, = (~step1_is_good).nonzero()
step2_is_good = dist(xy[bad_idx], xy[bad_idx+2]) < 2*max_speed

# do similar for t+3 (not shown), then summarize...
step_size = step1_is_good.astype(uint8) # i.e. = 1
step_size[bad_idx[step2_is_good]] = 2 
# and similar for 3, everything else is set to 4

Right, this is just the preparation. What we now have is a vector that might look something like the following:


We can now make some basic observations:

  • wherever we see the pattern 111X, or 311X, or 321X, or 221X, or 211X, or 212X, we know that X will be “stepped on” unless step_size=4 comes into play in this region. (i.e. we are interested in the pattern [321][21][1]X.)
  • wherever we see the pattern 112X, or 113X, or 131X, or 132X, we know that X will *not* be stepped on unless step_size=4 comes into play. (i.e. we are interested in the pattern [1][13][12]X)
  • we can’t always be sure whether such triple-prefixes will or will not cause X to be stepped on, e.g. in 231X it depends on whether or not the 2 is stepped on.

Ok, let’s assume we have created an array that tells us which of the three categories each element falls into. Now, if we convert step_size to step_to_idx, i.e. an absolute target address, we can do something interesting. For example, step_size=13121... becomes step_to_idx=14355.... The interesting thing is that we can now summarize chunks of this step_to_idx array: wherever there is a contiguous block of known steps and/or known no-steps, we can make every element point to the end of the block, saying: “if, during the loop execution, you find yourself stepping on to this element, then please go to the end of the block, because I’ve already dealt with the details from here up to that point“. In the loop we may “land” on such points either because of a combination of [123][123][123]X that we couldn’t deal with in advance, or because a 4 came into play.
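The relative-to-absolute conversion itself is a one-liner; here is the example from the text:

```python
import numpy as np

step_size = np.array([1, 3, 1, 2, 1])
# each element's absolute target = its own index + its step size
step_to_idx = np.arange(len(step_size)) + step_size
print(step_to_idx)  # -> [1 4 3 5 5]
```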

reduced looping

The black lines in the “loop stage”, above, show how the loop might eventually iterate through the data. Note how in some regions we read and write step data for the first time, in some regions we have to modify existing step data, and some regions we can skip right over.

So now, because we expect that most of the data is pretty much error-free, we should find that the loop requires far fewer iterations to get through all the data.

Unless someone in the comments is really keen to see it, I’m not going to provide any further code. I’ll just say that all the pre-loop preparation is done using a few calls to nonzero (aka find in Matlab) and one call to cumsum. The loop itself is nothing special really.

That’s it for this post. It took a long time to write, but hopefully it contained some interesting bits and bobs. Do leave a comment if you have something to say!

The next post I write might be aimed at absolute beginners rather than whoever the target audience was for this post: I’d like to do a basic indexing tutorial in which you learn to “paint” flags of the world in 2D arrays.