Tag: vectorization

Flags of the world (indexing tutorial)

The code examples for this tutorial are currently only given for Python/Numpy. Conversion to Matlab is not difficult, but would be confusing for beginners (who are the target audience here). I may get round to doing the extra work at some point soon.

This tutorial is aimed at absolute beginners. Ideally get an experienced person to setup Python* for you because it is liable to be an off-putting hurdle otherwise. (When you are an expert yourself, you will have plenty of opportunities to repay your debt to the world!)

Before we get to the actual tutorial, you need a tiny bit of preparatory code that you don’t need to understand. Copy-paste the following few lines into your interpreter/script (ask your nominated “experienced person” how you do this, and then keep them around for another minute or so just to check that everything is working as it should):

def show_flag(data, colors=['#ff0000','#ffffff','#0000ff'],
              title="my flag", fignum=1):
    from matplotlib.colors import ListedColormap
    from matplotlib.pylab import matshow, gca, clf
    cm = ListedColormap(colors)
    if fignum is not None:
        clf()
    matshow(data, cmap=cm, fignum=fignum)
    gca().axis('off')
    gca().set_title(title)

That function will let us display 2D arrays as colorful flags. Here is a very simple example to start with:

flag = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 2, 2, 2],
    [0, 0, 0, 0, 2, 2, 2]
  ]
show_flag(flag)

flag

Notice how the flag contains just the numbers 0 1 2. Inside the show_flag function these numbers are converted to colors (rather like the color-by-number pictures you may have done as a child). By default, 0 maps to red, 1 maps to white and 2 maps to blue. (I chose those colors because they’re pretty common for flags. I’ll show you how to customize the color mapping later on.)

Right, there’s one more thing you need before we get started:

# Use one (or both) of the following two lines:
from numpy import * # simple, but frowned upon
import numpy as np # better, but sometimes annoying

I will explain this briefly because it’s worth understanding, but I’ll have to use “scare-quotes” a bit as the explanation is rather hand-wavy: basically, Python is not a good language for doing heavy number crunching. If you try adding/subtracting/multiplying individual numbers thousands/millions/billions of times there will be a huge amount of “overhead” and you will get very slow (and complicated-looking) code. However, Python does allow, and encourage, the use of “plugins”. These plugins are free to do special kinds of “magic behind the scenes”, so long as they present a “Python-friendly exterior” that the user (i.e. you) can interact with in Python.

Numpy is one of these “plugins”, but it’s a rather special one in so far as it provides such significant “number-crunching functionality” that no other plugin can really compete with it. If you want to do number crunching in python you basically *need* to use Numpy. Most plugins that provide advanced number-crunching tools build on top of Numpy rather than replace it. (Not all plugins are about “number-crunching”, they do all kinds of things, like helping generate HTML for webpages etc.)

So now I can explain import numpy as np. This says to Python: “I have a plugin called ‘numpy’ and I’d like to use it here please. And I would like to refer to it as ‘np’ for short.” You can then use commands like np.ones(40), and python will “delegate” to Numpy to do its magic. If you use the alternative form from numpy import * then you are saying to python: “I have a plugin called ‘numpy’ and I’d like to use it here please. And I’m not going to tell you explicitly when I’m trying to use it – if I mention something from inside numpy, assume that is indeed what I meant.” This second version is sometimes easier because you don’t have to type np. all over the place, but it can also confuse your text editor and may introduce subtle bugs or confuse readers of your code. To keep the code in this tutorial compact, I’ll use the from numpy import * version.

Ok, finally we are ready to start…

The French Tricolour

H, W = 50, 70
# remember that our defaults are: red=0, white=1, blue=2
flag = zeros([H, W])
flag[:, :W/3] = 2 
flag[:, W/3:2*W/3] = 1
show_flag(flag, title="Vive la France")

flag

Try it out and check that it works. Can you make sense of the code?

We begun by creating a 2D array of shape 50 x 70. In numpy we refer to the dimensions of a multidimensional array as “axis 0”, “axis 1”, “axis 2”, “axis 3” etc. Here axis 0 is of length 50 and axis 1 is of length 70. You can see in the flag that the 50 and 70 correspond to the vertical and horizontal dimensions respectively (because the image of the flag is wider than it is tall). The array we started with is all zero, which gets mapped to red when we display it. There are other functions for initializing arrays but this is all we need for now.

The line flag[:, :W/3] = 2 is setting the blue section of the flag. It does this by indexing into the whole of axis 0 (the vertical), but only the first third of axis 1 (the horizontal).

The next line flag[:, W/3:2*W/3] = 1 is similar, setting the white portion of the flag. Here the relevant section spans from the end of the first third, to the end of the second third.

That was all we had to do in this example.

Vlag van Nederland

flag

Have a go at doing this yourself. Note that you may have to guess a little bit for some of the syntax.

Expand to view my suggested solution…

H, W = 50, 70
# remember that our defaults are: red=0, white=1, blue=2
flag = zeros([H, W])
flag[H/3:2*H/3, :] = 1
flag[2*H/3:, :] = 2 # set last third to blue
show_flag(flag, title="Van Gogh would love numpy")

This looks pretty similar to the French version, but we’re indexing into axis 0 rather than axis 1. Note that here we don’t need to set the first third to blue, but we do need to set the last third. Hopefully you were able to guess the necessary syntax.

Easy, right?

Les Quatre Bandes of Mauritius

flag

You should be able to make the array for this yourself, but I need to tell you how to customize the color mapping:

show_flag(flag, title="What's the capital of Mauritius?",
          colors=['#00A551', '#EA2839', '#1A206D', '#FFD500'])

I found the four color codes using the colorpicker in Chrome devtools – google for help if you’ve never done that kind of thing.

My solution looks like this:

H, W = 50, 70
flag = zeros([H, W])
flag[:H/4, :] = 1
flag[H/4:2*H/4,:] = 2
flag[2*H/4:3*H/4,:] = 3
show_flag(flag, title="What's the capital of Mauritius?",
          colors=['#00A551', '#EA2839', '#1A206D', '#FFD500'])

Norway

flag

There shouldn’t be anything new here really. But if you want to get obsessive you can spend more time fiddling with the exact proportions.

My solution looks like this:

H, W = 100, 140
flag = zeros([H, W])
flag[:, W*0.3:W*0.48] = 2
flag[H*0.33:H*0.56, :] = 2
flag[:, W*0.35:W*0.44] = 1
flag[H*0.39:H*0.5, :] = 1
show_flag(flag, title="How deep is your fjord?",
          colors=['#EF2B2D', '#002868', '#ffffff'])

Note that I doubled W and H to increase resolution a bit. Also note that we are now using non-integer indices for the arrays, or rather we are not explicitly using integers: numpy will round our floating point numbers down to the nearest integer before doing anything with them.

The Jamaican Cross

flag

How are we going to deal with those diagonal lines? Well, hopefully you already are familiar with the basic equation of a straight line y = mx + c. Here we are essentially going to use that, but we’ll treat it as an inequality rather than an equality, i.e. for a given y at a specific x horizontal position, we want to know is it above or below the line? The code is now a bit more complicated:

from numpy import newaxis as nax
H, W = 100, 140
x = arange(W)[nax, :] 
y = arange(H)[:, nax]
flag =  y > x*0.75 - W*0.1
show_flag(flag, title="Everything's gonna be alright",
          colors=['#000000', '#FED100']) 
    # we'll also need '#009B3A' (green) later

flag

The first line is just a convenient way to shorten the word newaxis to nax, we can then use it on the third and fourth lines: arange(W) creates a 1D array of the numbers [0, 1, 2, ..., W-1]. When we put [nax, :] after that, we are changing the shape of this 1D array to make it a 2D array of shape 1 x W. This means we can treat the x array as a single horizontal “row” in the flag, giving the x-coordinate of each column. Equivalently, the y array is a column, giving the y-coordinates of each row. Once we’ve setup these two variables, we are able to combine them in interesting ways.

By “setup” I mean that we’ve created two arrays with the same number of axes (i.e. 2), and the axes lengths “agree” with each other. What does it mean for two arrays to have axes lengths that “agree”? Well, it means specifically that each axis either has a length of 1, or it has exactly the same length as in the other array. For example 10 x 1 x 17 agrees with 10 x 23 x 1, but it doesn’t agree with 17 x 1 x 10. Once we have agreement, numpy can automatically “broadcast” any operations (such as adding, subtracting, comparing) out to match the full length in all axes.

TODO: a diagram of broadcasting would help a lot here

So here, we broadcast out the row x and the column y with the “greater-than” operation, and the result is a 2D array of trues and falses, telling us which elements in the array are above/below the line. When we display the flag the trues are converted to 1s and the falses to 0s (and then the 0s and 1s are mapped to colors as with previous flags). Note (if you’d hadn’t already) that the 0-end of the vertical axis is at the top of the flag and the H-end is at the bottom. If you want to flip this round you can add the line gca().invert_yaxis() at the end of your show_flag function (and then re-evaluate the function to update it).

We are not done yet, but hopefully you can now see how you might go about constructing the flag. If you get stuck, don’t feel too bad jumping to the suggested solution as I haven’t explained everything that you’re going to need to use here.

from numpy import newaxis as nax
H, W = 100, 140
flag = zeros([H, W])
x = arange(W)[nax, :] 
y = arange(H)[:, nax]
yellow_part = ((y > x*0.75 - W*0.1) & (y < x*0.75 + W*0.1)) 
flag[yellow_part] = 2
yellow_part = ((y > W*0.6 - x*0.75) & (y < W*0.8 - x*0.75)) 
flag[yellow_part] = 2
green_part = ((y >= x*0.75 + W*0.1) & (y >= W*0.8 - x*0.75))
flag[green_part] = 1
green_part = ((y <= x*0.75 - W*0.1) & (y <= W*0.6 - x*0.75))
flag[green_part] = 1
show_flag(flag, title="Everything's gonna be alright",
          colors=['#000000', '#009B3A', '#FED100'])

Right, that’s a little bit messy, but it’s largely unavoidable. There are two new things here.

Firstly, we are using & to combine pairs of 2D arrays: the combined array has the same shape as original two arrays (numpy would let us use broadcast again here if we needed to) and its elements are true where the elements of both arrays were true, and false elsewhere. With this mechanism we are able to say: “I want the band where y is greater than something *and* less than something else.“.

The second new thing is that we are indexing into flag using a 2D array of trues and falses rather than bands of indices. In addition to these two methods of indexing there are one (or two?) more that we haven’t covered yet.

Nisshōki of Japan

flag

This uses trues and falses like we did in Jamaica, but now we need to create a circle.

from numpy import newaxis as nax
H, W = 100, 140
flag = hypot(arange(-W/2, W/2)[nax, :],
             arange(-H/2, H/2)[:, nax]) > 30
show_flag(flag, title="Pythonic haiku? Not really.",
          colors=['#ff0000', '#ffffff']) 

I Galanolefki of Greece

flag

Although you could just use the basic stuff we learnt at the start, here we create all the stripes with a single command.

H, W = 100, 140
flag = zeros([H, W])
is_stripe = arange(H) % 22 > 10
flag[is_stripe, :] = 1
flag[:55, :50] = 0
flag[:55, 20:30] = 1
flag[22:33, :50] = 1
show_flag(flag, title="'flag' is not of Greek origin",
          colors=['#0D5EAF', '#ffffff']) 

* I use the Spyder IDE as distributed with anaconda, but you could possibly use an IPython notebook if you prefer. The only modules we need for this tutorial are Numpy and Matplotlib. I use Python 2.7, but everything here is probably more or less fine in 3.x.

This tutorial took a good while to write, what with all the explanations of how to use numpy and the actual messing around to make the flags. If you found it useful let me know in the comments. If I have time I may add some more examples to deal with stepped indices, reverse indices (maybe?), and perhaps suggest how custom-written functions can be used to draw complicated shapes like circles or stars…and then those functions can be reused for later flags.

Advertisements

Non-trivial vectorizations

For a more general discussion on optimizing, see my previous post on perfectly timed optimization.

Before we get started, I should say that the following stories are likely to be of most interest to Matlab or Numpy users.  In compiled languages, the general lack of overheads will probably do a pretty good job of making fast programs, as will any loop optimization passes.  Auto-vectorization is getting better over time, but is unlikely to work for the examples here (and in fact using these methods in a compiled language may be a pessimization). However, these tricks do become relevant again when one switches to computing on the GPU.

Example 1.

Imagine we have an array giving times for 10k events, and our task is to compute a histogram giving all the pairwise time differences. Or rather, we must compute just the lower end of that histogram, where time differences are within some specified window.

tac

Some raw-python code might look like the following. Note that for simplicity I assume the desired histogram bins are one time-unit wide…

def make_histogram(times, window):
    hist = [0]*window
    for ii, time_ii in enumerate(times):
        for time_jj in times[ii+1:]:
            delta = time_jj - time_ii
            if delta >= window:
                break
            hist[delta] += 1
    return hist

In words: for each event, loop forward over the subsequent events, storing time-diffs in the histogram until you reach the end of the window. You could improve this slightly by pulling the delta>window check into a separate loop from the hist[delta]++ line: create a last_jj variable that gets incremented zero or more steps each ii-iteration.

Right, lets begin vectorizing….

Firstly, for each ii lets find the number of subsequent events in the window. The approach we use may not strictly speaking be using SIMD under the covers, but in the Matlab/Numpy world we could still call it a “pseudo-vectorized” primitive.

counts = times.searchsorted(times + window, side='left')\
                             - arange(len(times)) -1

# which would let us do something like...
for start_ii, (time_ii, count_ii) in enumerate(
                                 zip(times, counts)):
    deltas = times[start_ii+1:start_ii+1+count_ii] - time_ii
    hist += bincount(deltas, minlength=window)

searchsorted and bincount are part of Numpy, the equivalents in Matlab are histc (use the second output) and accumarray. Observe that by knowing the counts in advance we were able to pull the delta calculation out of the inner loop, and then as a result we could pull the increments out of the inner loop too.

The second step is a bit trickier, we want to pull the delta calculation out of the ii loop entirely. This calculation is just a long list of times[jj] - times[ii], so what we’ll need is two sets of indices, that look something like:

all_ii = [0 0 0 0 1 1 1 1 1 3 3 4 4 4 4 ...]
all_jj = [0 1 2 3 0 1 2 3 4 0 1 0 1 2 3 ...] + all_ii + 1

that is we want to repeat each ii the number of times specified by counts[ii], and all_jj is formed from all_iis by the addition of mini-ranges of length counts[ii]. In Numpy there is a function for generating all_ii, called repeat. In Matlab we have to build the function ourselves:

function ret = repeat(n)
    n_mask = logical(n);
    n_inds = find(n);
    n_inds(2:end) = n_inds(2:end) - n_inds(1:end-1);
    n_cumsum = ones('single');
    n_cumsum(2:length(n)+1) = cumsum(n)+1; 
    ret = zeros(n_cumsum(end)-1, 1, 'single');
    ret(n_cumsum(n_mask)) = n_inds; 
           % note: numel(n_mask) = numel(n_cumsum)-1
    ret = cumsum(ret);
end

For the multiple ranges in the construction of all_jj, we can use the following in Numpy (this translates to Matlab fairly easily, but be careful with one/zero-based indexing):

def multi_arange(n):     
    n_mask = n.astype(bool)
    n_cumsum = cumsum(n)
    ret = ones(n_cumsum[-1]+1, dtype=int)
    ret[n_cumsum[n_mask]] -= n[n_mask] 
    ret[0] -= 1
    return cumsum(ret)[:-1]

If you weren’t already a fan of cumsum, I hope you are warming to it – look how we used it twice in one function here (and we used it twice in the Matlab version of repeat)!

Ok, that’s pretty much it. We can now get rid of the ii-loop, and we’re done:

def make_histogram(times, window):
    counts = times.searchsorted(times + window, side='left')\
                                 - arange(len(times)) -1
    all_ii = repeat(arange(len(counts)), counts)
    all_jj = multi_arange(counts) + all_ii + 1
    deltas = times[all_jj] - times[all_ii]
    return bincount(deltas, minlength=window)

Note that this optimization only makes sense if the counts in each window is guaranteed to be fairly small, and the total deltas array fits easily in memory. If either of those things is not true you might want to stick with the intermediate version, or wrap some batching mechanism around the final version.

Obviously, if you are going to do this sort of thing yourself, make sure you are testing properly as you go along and provide assertions about array shapes and types etc. The two helper functions I showed should also be given some documentation: just showing a sample input/output should be enough here. Note that the examples I’ve given here are simplified versions of the actual code I use normally – the stuff here is fine for simple integer cases, but I recommend testing it properly yourself if you actually intend to use it!

The multi_arange function, and some other goodies are available in the numpy groupies package that I co-authored with @ml31415 on github.

Example 2.

Imagine we have a list of 60k (x, y) coordinates that correspond to the path someone walked. Imagine also, that there are errors in that path: at certain points the person rotated by some amount that was not reflected in the path data. Our task is to take the 2D array of xy data, together with a 1D array providing the rotation errors at each step, and generate a corrected 2D array of xy data.

Lets picture a simple example in which the person took just 50 steps (rather than 60k), and recorded two abrupt rotations. Our task could be to correct for 3 additional rotations, or we might need to correct for a series of very small rotations over multiple steps:

cumrot

The function I wrote to solve this wasn’t actually needed in any analysis in the end, but I had fun writing it, not least because I got to name it cumrot!

I’m not going to give an un-vectorized code sample because I hope the above picture already describes the problem well enough (and I never tried writing it un-vectorized so I’m not sure what that would look like anyway).

Right, let’s get to it then… (edit: you may want to skip the interesting-looking stuff and read the “oh dear” section at the end.)

To begin, you need to know that a rotation of th radians about a point (x,y) can be expressed in matrix form as:

\left( \begin{array}{ccc} cos(th) & -sin(th) & [x - x cos(th) + y sin(th)] \\ sin(th) & cos(th) & [y - x sin(th) - y cos(th)] \\ 0  & 0 & 1 \end{array} \right)

So if we had just one rotation error, we could apply this rotation to all the data from that point forward. Even if we had a small number of corrections we could loop through each of them and apply them one after the other, i.e. this would suit the “with 3 rotations” example above. However, it would be pretty inefficient to do this for the “with gradual rotation” example, because there are many small rotations needing to be corrected for (and even that was just meant to be a simple small example).

Instead, lets imagine performing the rotations in reverse order, i.e. starting with the last one (but still applying the rotation to all the data from that point in the forwards direction) and then working back towards the first. If we do the rotations in this reversed order, then each pivot point is un-altered at the moment we need to create its rotation matrix. Thus, we can produce all the rotation matrices ahead of time. Which is a nice start.

But we can do better than that: observe that each point is being multiplied by all the preceding rotation matrices, however we could (by virtue of associativity) equivalently say that we multiply each point by a single composite-transformation matrix.

What does that give us? Well, *if* we can work out the necessary composite transformation matrices, then we can vectorize the second step, i.e. the multi-matrix-multiply, as follows:

x_out = a*x + b*y + c;
y_out = d*x + e*y + f;

where a b c d e f are all 60k-long 1D arrays, that together define the 60k separate transformation matrices of the form:

\left( \begin{array}{ccc} a & b & c \\ d & e & f \\ 0 & 0 & 1 \end{array} \right)

This is weird, right? Rather than using Numpy/Matlab to do matrix multiplies for us we’ve hand-written one. But ours is way better, because it’s actually 60k simultaneous multiplies! Ok, so the question is: how do we get the composite matrices needed for the multiply, above?

Well, it’s helpful that we’ve just seen how hand-writing matrix-multiplies sometimes makes sense. What we need to spot now is that we’re actually attempting to do a “cumulative” matrix multiply, i.e. each of the 60k steps is either an identity matrix or it’s a rotation matrix, and what we want is to do a cumulative matrix multiply along the length of this 60k stack of matrices. If you’re not familiar yet with basic cumulative sums (i.e. cumsum), then you’re going to be a bit out of your depth here. Even if you are a frequent user of cumsum you may want to go the the wikipedia page that has a nice description of how it can be computed in parallel. I reproduce the wikipedia diagram here:

prefixsum
‘Prefix sum 16’, David Eppstein. (source)

Ok. We know how to vectorize a bunch of matrix multiplies, and the above diagram shows us the pattern we need to do them in, so all that’s left is to write the code. Below, I show the first of the two main loops. In preparation for this loop, we zero-padded x and y to ensure they are of power-two length, and we computed a b c d e f for all the individual rotation matrices (see the matrix above with cos(th), sin(th) etc.).

for log_s in range(logn):
    s = 2**log_s
    s2 = 2*s
    
    # pick out the right-hand values. 
    a_r, b_r, c_r, d_r, e_r, f_r = a[s2-1::s2].copy(), \
               b[s2-1::s2].copy(), c[s2-1::s2].copy(), \
               d[s2-1::s2].copy(), e[s2-1::s2].copy(), \
               f[s2-1::s2].copy()
    
    # and the left-hand values
    a_l, b_l, c_l, d_l, e_l, f_l = a[s-1::s2], \
            b[s-1::s2], c[s-1::s2], d[s-1::s2],\
            e[s-1::s2], f[s-1::s2]

    # store in right hand value locations
    a[s2-1::s2] = a_l*a_r + b_l*d_r    
    b[s2-1::s2] = a_l*b_r + b_l*e_r;
    c[s2-1::s2] = a_l*c_r + b_l*f_r + c_l;
    d[s2-1::s2] = d_l*a_r + e_l*d_r;
    e[s2-1::s2] = d_l*b_r + e_l*e_r;
    f[s2-1::s2] = d_l*c_r + e_l*f_r + f_l;

Note that I translated this from Matlab to Numpy without checking the results, but it’s probably correct?! More importantly, note that we have to explicitly copy the RHS values because we’re going to modify them in place bit by bit. Finally, note that although I was quite pleased with this when I wrote it, there is still room for improvement: you could use a single 2D array for abcdef, which would simplify the “picking” bits and might let you fuse together some of the actual assignment bits. For the case when most of the matrices are the just the identity matrix, you could further optimize this by performing the cumulative multiply on only the non-identity subset of the matrices, and then index into the result to expand it out to the appropriate length.

That’s all I wanted to say for this example. If you really want to see the complete solution then let me know (I have written it in Matlab). But hopefully you have enough here to write it yourself.

Oh dear! As I reached the end of writing about this, it occurred to me that there was in fact a much simpler way of doing the calculation: firstly, compute the cumulative sum for the extra rotation data; then convert the (x,y) data into a series of (dx, dy) steps; next, turn these Cartesian steps into polar steps and add the corresponding cumulative rotation angle for each step; finally, convert the steps back into Cartesian space and perform a cumulative sum in x and y.

This alternative method method requires slightly more trigonometry (cosine and sine at the end in addition to the arctan at the start), but the massive benefit is that all the hand-written matrix multiplies are replaced with standard cumulative sums: one for angle, one for x and one for y. The original solution may be slightly better in terms of resistance to floating point errors (but I’m not going to examine that in detail now). It would also probably be faster if the number of rotations is small (and we implement the suggested optimization for that scenario). I wonder whether there was a reason I wrote the original function with matrix multiplies? I notice that the code included the option to apply the cumulative transformation to some secondary position data, so perhaps that had something to do with it, but unfortunately I no longer remember the actual details of the intended use-case.

Example 3.

In contrast with the previous examples, I am going to give a rough explanation of the neuroscience motivating this task (because it would be rather abstract and contrived without it).

Neurons, roughly speaking, behave like Poisson processes: at any given moment they each have a “true” rate, but we can only observe that rate indirectly by counting the number of discrete “spikes” (aka action potentials) they generate. For example, if a neuron generates 4 spikes in one second and 6 the next second we might (using an appropriate formula) estimate that the “true” underlying rate was about 5 Hz.

For many neurons, the “true” rate correlates with something we can observe in the outside world, where the word “correlate” here is meant in the general sense, i.e. some sort of statistically observed coupling. For example, the rates of some neurons are modulated by the exact location of the organism within its environment, i.e. a particular neuron might have a rate of 4Hz each time the person/animal goes to a spot near the window, but have a rate of approximately zero elsewhere.

Our task here is to analyse the data from 50 of these place-sensitive neurons: we will get a stack of 50 “ratemaps” that tells us the average rate for each of the cells in every square centimeter of an experimental arena. And we will get a list of the spike times for each of the 50 cells over the course of a 20 minute experiment. The question is: how can we use the neuron spike data and ratemaps to estimate the location of the animal? Since the animal is moving around, we need to produce a separate estimate for every second, or lets say every quarter of a second.

stack

The calculation we need to do (which can be found in a number of proper academic papers) is, in essence, quite simple: (1) count the number of spikes for each cell in each time window; (2) for each of the 1 centimeter-square bins in the arena, calculate the probability that the observed counts could have occurred; (3) either stop there, i.e. provide the likelihood-map for each time window as the output, or find the location of the maximum likelihood for each time window, and just give that as the output.

For a single 1-cm bin, at (x,y) in a single window, t, the calculation is:

L(x,y,t) = \prod_{i=0}^{50} \frac{rate_i(x,y)^{counts_i(t)} e^{-rate_i(x,y)}}{counts_i(t)!}

If real maths seems like hard work, then just make sure you don’t stare too closely. It’s the programming that we’re here for, so you don’t need to worry about the details of the equation really.

Right, and we’re off…

To begin with, I need to make a point that is going to be obvious to people with certain backgrounds (but is apparently not obvious to many neuroscientists):

\prod_i x_i=exp(\sum_i log(x_i))

This means we can re-write the equation for L(x,y,t) as a sum, in which the terms are nicely separated:

log(L(x,y,t)) = -\sum_i rate_i(x,y) + \sum_i [counts_i(t) log(rate_i(x,y))] -\sum_i log(counts_i(t)!)

Now, we can examine each term individually and come up with a way of vectorizing over the whole of x, y, and t.

The first term is easy:

# ratemaps is a 3D array of shape: [100 cm x 100cm x 50 neurons]
term_1 = sum(ratemaps, axis=-1)

Once we have the counts, the last term is also pretty easy. (For the counts we probably need to use bincount, either looping over cells, or using ravel_multi_index as I do in numpy-groupies.) We can even manage a “bonus” optimization because we know we’re dealing with small integers:

# counts is a 2D array of shape: [4800 windows x 50 neurons]
log_k_factorial = log(factorial(arange(max(counts)+1)))   
term_3 = sum(log_k_factorial[counts], axis=-1)

The bonus optimization was that we build a tiny lookup-table of log-factorials and then simply index into that. This is a real help as computing logs is seriously slow.

The middle term is the only interesting one. Notice that it’s roughly of the form \sum_i a_i b_i. Whenever you see something of this form (and it occurs surprisingly often in my experience) then you should immediately say to yourself “I bet that can be re-written as a matrix-multiply”. I’m not going to try and describe the exact circumstances in which it is possible to turn sums into matrix multiplies, but just remember that it’s worth looking out for them Note that if you need to transpose a large array in order to do this it may not be worth it in Matlab, but it’s free to do so in Numpy. Numpy also lets you do n-dimensional dot products, whereas Matlab just does 1 or 2D – however, by reshaping your ND arrays appropriately you should be able to re-imagine an ND multiply as a Matlab-friendly 2D multiply).

# counts is a 2D array of shape: [4800 windows x 50 neurons]
# ratemaps is a 3D array of shape: [100cm x 100cm x 50 
term_2 = dot(log(ratemaps), counts[:,:,nax]) # nax = np.newaxis

All that’s left is to add the three terms together, with the appropriate broadcasting in Numpy or bsxfun in Matlab, and we’re done. The result is about 6 lines long, and doesn’t look that horrible (well it’s ok in Numpy, but a bit uglier in Matlab). It’s also amazingly fast given how much number crunching it’s doing!

There may be some mistakes in this third example as I wasn’t as careful as with the previous two.

Example 4

Note that I added this example a few days after writing the original post, although the code actually predates that of the other examples.

This task is interesting in that we manage to *partially* vectorize a loop even though we can’t go the whole way. The trick is to perform a kind of manual profile guided optimization.

The task is relatively simple: in an experiment, an LED traced out a trajectory in 2D space and we now have an array of 60k (x,y) coordinates corresponding to that trajectory. The problem is that in a few places the camera doing the tracking made an error, and spuriously recorded that the LED “jumped” over a large distance in an unrealistically short time.

speed filter

The filtering algorithm we need to implement is roughly the following:

def filter_speed(xy, max_speed):
   is_good = zeros(len(xy), dtype=bool)
   ii = 0
   while ii < len(xy):
       is_good[ii] = True 
       from_xy = xy[ii]
       for jj, to_xy in enumerate(xy[ii+1:]):
           # dist is diff then hypot (though sqrt is redundant)
           if dist(from_xy, to_xy) < (jj-ii)*max_speed:
              ii = jj
              break
   # now we're supposed to interpolate where is_good is False

Notice how each iteration of the ii loop depends on the previous iteration: we need to know which points are good and which are bad before we can make claims about subsequent points. (Look at the diagram if this isn’t immediately clear.) This kind of problem is vaguely similar to a regex search where you need to scan along a long string looking for instances a complex pattern: however, I believe that regexes are generally implemented with a state machine and lookup table, which is not something that makes sense here.

In python I opted to compile this (or something pretty similar) with Cython. Job done: easy 10x speed up. And that is probably the right approach to take, really. However, some months earlier, back in Matlab, I was more reluctant to write a MEX file and instead I spent a bit of effort vectorizing. My approach is probably ridiculous in all but the most obscure situations, but I’ll include it here as I think it’s fairly interesting.

So how can we vectorize this loop, when each iteration clearly depends on the previous iteration? The answer (as I already mentioned) is that we can only partially vectorize, but that we can still get a pretty decent payoff. The observation we need to make is that, in general, the data is of a reasonably high “quality”, meaning there won’t be that many errors (if there are a lot of errors then something needs to be fixed in the experimental setup so as to bring the quality of the data back to standard). Thus if we vectorize the sections of loop with no errors, or very simple errors, we can jump over those sections when we encounter them in the loop, and dwell on only the difficult bits.

I’m not going to give the full code, because it’s quite long and rather ugly, but I’ll try and give you a flavor of it (as I said, don’t try this at home unless you really need to). Note that I’ve translated to python/numpy here:

# Is a step from xy[t] to xy[t+1] slow enough?
step1_is_good = dist(xy[:-1], xy[1:]) < 1*max_speed

# If not, is a step from xy[t] to xy[t+2] slow enough?
bad_idx, _ = (~step1_is_good).nonzero()
step2_is_good = dist(xy[bad_idx], xy[bad_idx+2]) < 2*max_speed

# do similar for t+3 (not shown), then summarize...
step_size = step1_is_good.astype(uint8) # i.e. = 1
step_size[bad_idx[step2_is_good]] = 2 
# and similar for 3, everything else is set to 4

Right, this is just the preparation. What we now have is a vector that might look something like the following:

1111111211211241111211221321111124121211111111123111121...

We can now make some basic observations:

  • wherever we see the pattern 111X, or 311X, or 321X, or 221X, or 211X, or 212X, we know that X will be “stepped on” unless step_size=4 comes into play in this region. (i.e. we are interested in the pattern [321][21][1]X.)
  • wherever we see the pattern 112X, or 113X, or 131X, or 132X, we know that X will *not* be stepped on unless step_size=4 comes into play. (i.e. we are interested in the pattern [1][13][12]X)
  • we can’t always be sure whether such triple-prefixes will or will not cause X to be stepped on, e.g. in 231X it depends on whether or not the 2 is stepped on.

Ok, lets assume we have created an array that tells us which of the three categories each element falls in to. And, now if we convert step_size to step_to_idx, i.e. an absolute target address, we can do something interesting. For example, step_size=13121... becomes step_to_idx=14355.... The interesting thing is that we can now summarize chunks of this step_to_idx array: wherever there is a contiguous block of known steps and/or known no-steps, we can make every element point to the end of the block, saying: “if,during the loop execution, you find yourself stepping on to this element, then please go to the end of the block, because I’ve already dealt with the details from here up to that point“. In the loop we may “land” on such points either because of a combination of [123][123][123]X that we couldn’t deal with in advance, or because a 4 came in to play.

reduced looping

The black lines in the “loop stage”, above, show how the loop might eventually iterate through the data. Note how some regions we read from and write step data to for the first time, some regions had step data, that we have to modify, and some regions we can skip right over.

So now, because we expect that most of the data is pretty much error-free, we should find that the loop requires far fewer iterations to get through all the data.

Unless someone in the comments is really keen to see it, I’m not going to provide any further code. I’ll just say that all the pre-loop preparation is done using a few calls to nonzero (aka find in Matlab) and one call to cumsum. The loop itself is nothing special really.

That’s it for this post. It took a long time to write, but hopefully it contained some interesting bits and bobs. Do leave a comment if you have something to say!

The next post I write might be aimed at absolute beginners rather than whoever the target audience was for this post: I’d like to do a basic indexing tutorial in which you learn to “paint” flags of the world in 2D arrays.