# Non-trivial vectorizations

For a more general discussion on optimizing, see my previous post on perfectly timed optimization.

Before we get started, I should say that the following stories are likely to be of most interest to Matlab or Numpy users.  In compiled languages, the general lack of overheads will probably do a pretty good job of making fast programs, as will any loop optimization passes.  Auto-vectorization is getting better over time, but is unlikely to work for the examples here (and in fact using these methods in a compiled language may be a pessimization). However, these tricks do become relevant again when one switches to computing on the GPU.

### Example 1.

Imagine we have an array giving times for 10k events, and our task is to compute a histogram giving all the pairwise time differences. Or rather, we must compute just the lower end of that histogram, where time differences are within some specified window.

Some raw-python code might look like the following. Note that for simplicity I assume the desired histogram bins are one time-unit wide…

def make_histogram(times, window):
hist = [0]*window
for ii, time_ii in enumerate(times):
for time_jj in times[ii+1:]:
delta = time_jj - time_ii
if delta >= window:
break
hist[delta] += 1
return hist


In words: for each event, loop forward over the subsequent events, storing time-diffs in the histogram until you reach the end of the window. You could improve this slightly by pulling the delta>window check into a separate loop from the hist[delta]++ line: create a last_jj variable that gets incremented zero or more steps each ii-iteration.

Right, lets begin vectorizing….

Firstly, for each ii lets find the number of subsequent events in the window. The approach we use may not strictly speaking be using SIMD under the covers, but in the Matlab/Numpy world we could still call it a “pseudo-vectorized” primitive.

counts = times.searchsorted(times + window, side='left')\
- arange(len(times)) -1

# which would let us do something like...
for start_ii, (time_ii, count_ii) in enumerate(
zip(times, counts)):
deltas = times[start_ii+1:start_ii+1+count_ii] - time_ii
hist += bincount(deltas, minlength=window)


searchsorted and bincount are part of Numpy, the equivalents in Matlab are histc (use the second output) and accumarray. Observe that by knowing the counts in advance we were able to pull the delta calculation out of the inner loop, and then as a result we could pull the increments out of the inner loop too.

The second step is a bit trickier, we want to pull the delta calculation out of the ii loop entirely. This calculation is just a long list of times[jj] - times[ii], so what we’ll need is two sets of indices, that look something like:

all_ii = [0 0 0 0 1 1 1 1 1 3 3 4 4 4 4 ...]
all_jj = [0 1 2 3 0 1 2 3 4 0 1 0 1 2 3 ...] + all_ii + 1


that is we want to repeat each ii the number of times specified by counts[ii], and all_jj is formed from all_iis by the addition of mini-ranges of length counts[ii]. In Numpy there is a function for generating all_ii, called repeat. In Matlab we have to build the function ourselves:

function ret = repeat(n)
n_inds = find(n);
n_inds(2:end) = n_inds(2:end) - n_inds(1:end-1);
n_cumsum = ones('single');
n_cumsum(2:length(n)+1) = cumsum(n)+1;
ret = zeros(n_cumsum(end)-1, 1, 'single');
ret = cumsum(ret);
end


For the multiple ranges in the construction of all_jj, we can use the following in Numpy (this translates to Matlab fairly easily, but be careful with one/zero-based indexing):

def multi_arange(n):
n_cumsum = cumsum(n)
ret = ones(n_cumsum[-1]+1, dtype=int)
ret[0] -= 1
return cumsum(ret)[:-1]


If you weren’t already a fan of cumsum, I hope you are warming to it – look how we used it twice in one function here (and we used it twice in the Matlab version of repeat)!

Ok, that’s pretty much it. We can now get rid of the ii-loop, and we’re done:

def make_histogram(times, window):
counts = times.searchsorted(times + window, side='left')\
- arange(len(times)) -1
all_ii = repeat(arange(len(counts)), counts)
all_jj = multi_arange(counts) + all_ii + 1
deltas = times[all_jj] - times[all_ii]
return bincount(deltas, minlength=window)


Note that this optimization only makes sense if the counts in each window is guaranteed to be fairly small, and the total deltas array fits easily in memory. If either of those things is not true you might want to stick with the intermediate version, or wrap some batching mechanism around the final version.

Obviously, if you are going to do this sort of thing yourself, make sure you are testing properly as you go along and provide assertions about array shapes and types etc. The two helper functions I showed should also be given some documentation: just showing a sample input/output should be enough here. Note that the examples I’ve given here are simplified versions of the actual code I use normally – the stuff here is fine for simple integer cases, but I recommend testing it properly yourself if you actually intend to use it!

The multi_arange function, and some other goodies are available in the numpy groupies package that I co-authored with @ml31415 on github.

### Example 2.

Imagine we have a list of 60k (x, y) coordinates that correspond to the path someone walked. Imagine also, that there are errors in that path: at certain points the person rotated by some amount that was not reflected in the path data. Our task is to take the 2D array of xy data, together with a 1D array providing the rotation errors at each step, and generate a corrected 2D array of xy data.

Lets picture a simple example in which the person took just 50 steps (rather than 60k), and recorded two abrupt rotations. Our task could be to correct for 3 additional rotations, or we might need to correct for a series of very small rotations over multiple steps:

The function I wrote to solve this wasn’t actually needed in any analysis in the end, but I had fun writing it, not least because I got to name it cumrot!

I’m not going to give an un-vectorized code sample because I hope the above picture already describes the problem well enough (and I never tried writing it un-vectorized so I’m not sure what that would look like anyway).

Right, let’s get to it then… (edit: you may want to skip the interesting-looking stuff and read the “oh dear” section at the end.)

To begin, you need to know that a rotation of th radians about a point (x,y) can be expressed in matrix form as:

$\left( \begin{array}{ccc} cos(th) & -sin(th) & [x - x cos(th) + y sin(th)] \\ sin(th) & cos(th) & [y - x sin(th) - y cos(th)] \\ 0 & 0 & 1 \end{array} \right)$

So if we had just one rotation error, we could apply this rotation to all the data from that point forward. Even if we had a small number of corrections we could loop through each of them and apply them one after the other, i.e. this would suit the “with 3 rotations” example above. However, it would be pretty inefficient to do this for the “with gradual rotation” example, because there are many small rotations needing to be corrected for (and even that was just meant to be a simple small example).

Instead, lets imagine performing the rotations in reverse order, i.e. starting with the last one (but still applying the rotation to all the data from that point in the forwards direction) and then working back towards the first. If we do the rotations in this reversed order, then each pivot point is un-altered at the moment we need to create its rotation matrix. Thus, we can produce all the rotation matrices ahead of time. Which is a nice start.

But we can do better than that: observe that each point is being multiplied by all the preceding rotation matrices, however we could (by virtue of associativity) equivalently say that we multiply each point by a single composite-transformation matrix.

What does that give us? Well, *if* we can work out the necessary composite transformation matrices, then we can vectorize the second step, i.e. the multi-matrix-multiply, as follows:

x_out = a*x + b*y + c;
y_out = d*x + e*y + f;


where a b c d e f are all 60k-long 1D arrays, that together define the 60k separate transformation matrices of the form:

$\left( \begin{array}{ccc} a & b & c \\ d & e & f \\ 0 & 0 & 1 \end{array} \right)$

This is weird, right? Rather than using Numpy/Matlab to do matrix multiplies for us we’ve hand-written one. But ours is way better, because it’s actually 60k simultaneous multiplies! Ok, so the question is: how do we get the composite matrices needed for the multiply, above?

Well, it’s helpful that we’ve just seen how hand-writing matrix-multiplies sometimes makes sense. What we need to spot now is that we’re actually attempting to do a “cumulative” matrix multiply, i.e. each of the 60k steps is either an identity matrix or it’s a rotation matrix, and what we want is to do a cumulative matrix multiply along the length of this 60k stack of matrices. If you’re not familiar yet with basic cumulative sums (i.e. cumsum), then you’re going to be a bit out of your depth here. Even if you are a frequent user of cumsum you may want to go the the wikipedia page that has a nice description of how it can be computed in parallel. I reproduce the wikipedia diagram here:

Ok. We know how to vectorize a bunch of matrix multiplies, and the above diagram shows us the pattern we need to do them in, so all that’s left is to write the code. Below, I show the first of the two main loops. In preparation for this loop, we zero-padded x and y to ensure they are of power-two length, and we computed a b c d e f for all the individual rotation matrices (see the matrix above with cos(th), sin(th) etc.).

for log_s in range(logn):
s = 2**log_s
s2 = 2*s

# pick out the right-hand values.
a_r, b_r, c_r, d_r, e_r, f_r = a[s2-1::s2].copy(), \
b[s2-1::s2].copy(), c[s2-1::s2].copy(), \
d[s2-1::s2].copy(), e[s2-1::s2].copy(), \
f[s2-1::s2].copy()

# and the left-hand values
a_l, b_l, c_l, d_l, e_l, f_l = a[s-1::s2], \
b[s-1::s2], c[s-1::s2], d[s-1::s2],\
e[s-1::s2], f[s-1::s2]

# store in right hand value locations
a[s2-1::s2] = a_l*a_r + b_l*d_r
b[s2-1::s2] = a_l*b_r + b_l*e_r;
c[s2-1::s2] = a_l*c_r + b_l*f_r + c_l;
d[s2-1::s2] = d_l*a_r + e_l*d_r;
e[s2-1::s2] = d_l*b_r + e_l*e_r;
f[s2-1::s2] = d_l*c_r + e_l*f_r + f_l;


Note that I translated this from Matlab to Numpy without checking the results, but it’s probably correct?! More importantly, note that we have to explicitly copy the RHS values because we’re going to modify them in place bit by bit. Finally, note that although I was quite pleased with this when I wrote it, there is still room for improvement: you could use a single 2D array for abcdef, which would simplify the “picking” bits and might let you fuse together some of the actual assignment bits. For the case when most of the matrices are the just the identity matrix, you could further optimize this by performing the cumulative multiply on only the non-identity subset of the matrices, and then index into the result to expand it out to the appropriate length.

That’s all I wanted to say for this example. If you really want to see the complete solution then let me know (I have written it in Matlab). But hopefully you have enough here to write it yourself.

Oh dear! As I reached the end of writing about this, it occurred to me that there was in fact a much simpler way of doing the calculation: firstly, compute the cumulative sum for the extra rotation data; then convert the (x,y) data into a series of (dx, dy) steps; next, turn these Cartesian steps into polar steps and add the corresponding cumulative rotation angle for each step; finally, convert the steps back into Cartesian space and perform a cumulative sum in x and y.

This alternative method method requires slightly more trigonometry (cosine and sine at the end in addition to the arctan at the start), but the massive benefit is that all the hand-written matrix multiplies are replaced with standard cumulative sums: one for angle, one for x and one for y. The original solution may be slightly better in terms of resistance to floating point errors (but I’m not going to examine that in detail now). It would also probably be faster if the number of rotations is small (and we implement the suggested optimization for that scenario). I wonder whether there was a reason I wrote the original function with matrix multiplies? I notice that the code included the option to apply the cumulative transformation to some secondary position data, so perhaps that had something to do with it, but unfortunately I no longer remember the actual details of the intended use-case.

### Example 3.

In contrast with the previous examples, I am going to give a rough explanation of the neuroscience motivating this task (because it would be rather abstract and contrived without it).

Neurons, roughly speaking, behave like Poisson processes: at any given moment they each have a “true” rate, but we can only observe that rate indirectly by counting the number of discrete “spikes” (aka action potentials) they generate. For example, if a neuron generates 4 spikes in one second and 6 the next second we might (using an appropriate formula) estimate that the “true” underlying rate was about 5 Hz.

For many neurons, the “true” rate correlates with something we can observe in the outside world, where the word “correlate” here is meant in the general sense, i.e. some sort of statistically observed coupling. For example, the rates of some neurons are modulated by the exact location of the organism within its environment, i.e. a particular neuron might have a rate of 4Hz each time the person/animal goes to a spot near the window, but have a rate of approximately zero elsewhere.

Our task here is to analyse the data from 50 of these place-sensitive neurons: we will get a stack of 50 “ratemaps” that tells us the average rate for each of the cells in every square centimeter of an experimental arena. And we will get a list of the spike times for each of the 50 cells over the course of a 20 minute experiment. The question is: how can we use the neuron spike data and ratemaps to estimate the location of the animal? Since the animal is moving around, we need to produce a separate estimate for every second, or lets say every quarter of a second.

The calculation we need to do (which can be found in a number of proper academic papers) is, in essence, quite simple: (1) count the number of spikes for each cell in each time window; (2) for each of the 1 centimeter-square bins in the arena, calculate the probability that the observed counts could have occurred; (3) either stop there, i.e. provide the likelihood-map for each time window as the output, or find the location of the maximum likelihood for each time window, and just give that as the output.

For a single 1-cm bin, at (x,y) in a single window, t, the calculation is:

$L(x,y,t) = \prod_{i=0}^{50} \frac{rate_i(x,y)^{counts_i(t)} e^{-rate_i(x,y)}}{counts_i(t)!}$

If real maths seems like hard work, then just make sure you don’t stare too closely. It’s the programming that we’re here for, so you don’t need to worry about the details of the equation really.

Right, and we’re off…

To begin with, I need to make a point that is going to be obvious to people with certain backgrounds (but is apparently not obvious to many neuroscientists):

$\prod_i x_i=exp(\sum_i log(x_i))$

This means we can re-write the equation for $L(x,y,t)$ as a sum, in which the terms are nicely separated:

$log(L(x,y,t)) = -\sum_i rate_i(x,y) + \sum_i [counts_i(t) log(rate_i(x,y))] -\sum_i log(counts_i(t)!)$

Now, we can examine each term individually and come up with a way of vectorizing over the whole of x, y, and t.

The first term is easy:

# ratemaps is a 3D array of shape: [100 cm x 100cm x 50 neurons]
term_1 = sum(ratemaps, axis=-1)


Once we have the counts, the last term is also pretty easy. (For the counts we probably need to use bincount, either looping over cells, or using ravel_multi_index as I do in numpy-groupies.) We can even manage a “bonus” optimization because we know we’re dealing with small integers:

# counts is a 2D array of shape: [4800 windows x 50 neurons]
log_k_factorial = log(factorial(arange(max(counts)+1)))
term_3 = sum(log_k_factorial[counts], axis=-1)


The bonus optimization was that we build a tiny lookup-table of log-factorials and then simply index into that. This is a real help as computing logs is seriously slow.

The middle term is the only interesting one. Notice that it’s roughly of the form $\sum_i a_i b_i$. Whenever you see something of this form (and it occurs surprisingly often in my experience) then you should immediately say to yourself “I bet that can be re-written as a matrix-multiply”. I’m not going to try and describe the exact circumstances in which it is possible to turn sums into matrix multiplies, but just remember that it’s worth looking out for them Note that if you need to transpose a large array in order to do this it may not be worth it in Matlab, but it’s free to do so in Numpy. Numpy also lets you do n-dimensional dot products, whereas Matlab just does 1 or 2D – however, by reshaping your ND arrays appropriately you should be able to re-imagine an ND multiply as a Matlab-friendly 2D multiply).

# counts is a 2D array of shape: [4800 windows x 50 neurons]
# ratemaps is a 3D array of shape: [100cm x 100cm x 50
term_2 = dot(log(ratemaps), counts[:,:,nax]) # nax = np.newaxis


All that’s left is to add the three terms together, with the appropriate broadcasting in Numpy or bsxfun in Matlab, and we’re done. The result is about 6 lines long, and doesn’t look that horrible (well it’s ok in Numpy, but a bit uglier in Matlab). It’s also amazingly fast given how much number crunching it’s doing!

There may be some mistakes in this third example as I wasn’t as careful as with the previous two.

### Example 4

Note that I added this example a few days after writing the original post, although the code actually predates that of the other examples.

This task is interesting in that we manage to *partially* vectorize a loop even though we can’t go the whole way. The trick is to perform a kind of manual profile guided optimization.

The task is relatively simple: in an experiment, an LED traced out a trajectory in 2D space and we now have an array of 60k (x,y) coordinates corresponding to that trajectory. The problem is that in a few places the camera doing the tracking made an error, and spuriously recorded that the LED “jumped” over a large distance in an unrealistically short time.

The filtering algorithm we need to implement is roughly the following:

def filter_speed(xy, max_speed):
is_good = zeros(len(xy), dtype=bool)
ii = 0
while ii < len(xy):
is_good[ii] = True
from_xy = xy[ii]
for jj, to_xy in enumerate(xy[ii+1:]):
# dist is diff then hypot (though sqrt is redundant)
if dist(from_xy, to_xy) < (jj-ii)*max_speed:
ii = jj
break
# now we're supposed to interpolate where is_good is False


Notice how each iteration of the ii loop depends on the previous iteration: we need to know which points are good and which are bad before we can make claims about subsequent points. (Look at the diagram if this isn’t immediately clear.) This kind of problem is vaguely similar to a regex search where you need to scan along a long string looking for instances a complex pattern: however, I believe that regexes are generally implemented with a state machine and lookup table, which is not something that makes sense here.

In python I opted to compile this (or something pretty similar) with Cython. Job done: easy 10x speed up. And that is probably the right approach to take, really. However, some months earlier, back in Matlab, I was more reluctant to write a MEX file and instead I spent a bit of effort vectorizing. My approach is probably ridiculous in all but the most obscure situations, but I’ll include it here as I think it’s fairly interesting.

So how can we vectorize this loop, when each iteration clearly depends on the previous iteration? The answer (as I already mentioned) is that we can only partially vectorize, but that we can still get a pretty decent payoff. The observation we need to make is that, in general, the data is of a reasonably high “quality”, meaning there won’t be that many errors (if there are a lot of errors then something needs to be fixed in the experimental setup so as to bring the quality of the data back to standard). Thus if we vectorize the sections of loop with no errors, or very simple errors, we can jump over those sections when we encounter them in the loop, and dwell on only the difficult bits.

I’m not going to give the full code, because it’s quite long and rather ugly, but I’ll try and give you a flavor of it (as I said, don’t try this at home unless you really need to). Note that I’ve translated to python/numpy here:

# Is a step from xy[t] to xy[t+1] slow enough?
step1_is_good = dist(xy[:-1], xy[1:]) < 1*max_speed

# If not, is a step from xy[t] to xy[t+2] slow enough?

# do similar for t+3 (not shown), then summarize...
step_size = step1_is_good.astype(uint8) # i.e. = 1
# and similar for 3, everything else is set to 4


Right, this is just the preparation. What we now have is a vector that might look something like the following:

1111111211211241111211221321111124121211111111123111121...


We can now make some basic observations:

• wherever we see the pattern 111X, or 311X, or 321X, or 221X, or 211X, or 212X, we know that X will be “stepped on” unless step_size=4 comes into play in this region. (i.e. we are interested in the pattern [321][21][1]X.)
• wherever we see the pattern 112X, or 113X, or 131X, or 132X, we know that X will *not* be stepped on unless step_size=4 comes into play. (i.e. we are interested in the pattern [1][13][12]X)
• we can’t always be sure whether such triple-prefixes will or will not cause X to be stepped on, e.g. in 231X it depends on whether or not the 2 is stepped on.

Ok, lets assume we have created an array that tells us which of the three categories each element falls in to. And, now if we convert step_size to step_to_idx, i.e. an absolute target address, we can do something interesting. For example, step_size=13121... becomes step_to_idx=14355.... The interesting thing is that we can now summarize chunks of this step_to_idx array: wherever there is a contiguous block of known steps and/or known no-steps, we can make every element point to the end of the block, saying: “if,during the loop execution, you find yourself stepping on to this element, then please go to the end of the block, because I’ve already dealt with the details from here up to that point“. In the loop we may “land” on such points either because of a combination of [123][123][123]X that we couldn’t deal with in advance, or because a 4 came in to play.

The black lines in the “loop stage”, above, show how the loop might eventually iterate through the data. Note how some regions we read from and write step data to for the first time, some regions had step data, that we have to modify, and some regions we can skip right over.

So now, because we expect that most of the data is pretty much error-free, we should find that the loop requires far fewer iterations to get through all the data.

Unless someone in the comments is really keen to see it, I’m not going to provide any further code. I’ll just say that all the pre-loop preparation is done using a few calls to nonzero (aka find in Matlab) and one call to cumsum. The loop itself is nothing special really.

That’s it for this post. It took a long time to write, but hopefully it contained some interesting bits and bobs. Do leave a comment if you have something to say!

The next post I write might be aimed at absolute beginners rather than whoever the target audience was for this post: I’d like to do a basic indexing tutorial in which you learn to “paint” flags of the world in 2D arrays.