Month: November 2015

Waveform GUI: neuroscience in Chrome

During my PhD I used bundles of electrodes to record the voltage fluctuations generated by neurons in the brain.   I’m not going to describe the details of the science here, instead I want to focus on the graphical user interface I ended up building.

As of this afternoon, I now have a video demo to show off the application to researchers in my field; the demo may also be of vague interest to the lay web-developer and/or lay scientist, although I’d be surprised if either understands much of what is being demonstrated.

Motivation and design principles/features

I created this application because I was so intensely irritated by the old piece of software, which I would otherwise have had to use.  Every time I clicked the tiny buttons in that other program and waited for it to respond to my actions I felt like I was in some kind of UX purgatory where the devil was laughing over my shoulder, hoping that I would snap and lash out at something.  (That is roughly the image I had in mind for a large part of my PhD.)

Anyway.  This application is a lot better: rather than having to press 10 buttons (or maybe it was 7, I did count them once) in order to get a single plot, here you can get a whole page of plots using just ctrl+a followed by a single drag and drop.   Compare eating your morning cereal with a toothpick, versus wolfing it down with a soup ladle in each hand.

The initial loading is not the only thing that’s easier: all the various files are automatically organised in a graphical table that lets you jump between any two data sets with a single click.   And although these data sets can each be 10MB+, I worked hard to ensure that they typically load and display in less than about 400ms.

You can also change various parameters with a single click: no need to force a redraw as everything is automatically re-rendered.  This minimal-clicking rule applies to the cluster-assignment editing too: merging is done with one drag-drop; splitting with two clicks; and swapping can be done entirely from the keyboard when the mouse is in position.  The greatest triumph of this anti-hand-eye-coordination crusade was the “invention” of the spacebar+click paradigm:  when you hold down space (which is by far the easiest key to hit) various bits of the UI become clickable, with clicking usually causing the thing to be copied or closed/deleted.   Normally these actions would require clicking through a tiny drop-down menu, toolbar button or, some sort of mouse action followed by multi-key shortcut (not that I have anything against keyboard shortcuts, but this is even easier).

The nice thing about building your own application is that as you discover usage patterns, you can add tools to make your life even easier.  For example, I added a tool for copying plots to the clipboard so that I could make notes with absolutely minimal effort: the plots come with a caption, so there’s very little left to say. Another example: when running experiments I often needed to know the time that the previous experiment had finished: you could work this out manually from the start time and duration info in the header, but by creating a little info widget to display this information automatically, suddenly experiments were that little bit less tedious.  Also, in cases where I had to choose a default from two or more clashing files, I was able to get the program to automatically use the most recently modified file (i.e. rather than one higher up the alphabet, or one matching some preset filename pattern); the user could override the default with a single click, if needed, but in most cases the most recently modified file is the one of interest.

In summary, the main aims were to optimized away redundant clicking, and redundant accuracy in pointing.   Jumping between different datasets had to be easy to request, and fast to execute.  And it had to be possible to put tons of plots on the screen all at once.

How does that help?

Now that data exploration requires a lot less time and effort, the experimenter can spend more resources on other aspects of research. They can also increase the volume of data they collect, and be generally more thorough in their data exploration: the more you shown on screen in one go, the easier it is to spot interesting features in the data, both as a novice and as an experienced user.

A bit about implementation

The application is built in Chrome, using JS and HTML, with most of the file IO and plots being done in WebWorkers off the main thread.   The wave data itself, which can often be over 10MB (as shown in the demo), is sent down to the GPU for rendering with WebGL: I wrote a couple of fairly simple kernels to do this.

The application evolved in fits and starts over the course of a couple of years.  At various stages I spent quite a bit of effort staring at the profiler in Chrome to try and squeeze as much as possible out of the hardware.  Unlike with a commercial webapp, my life was relatively simple in that I was able to focused exclusively on Chrome and assume that users have a reasonably powerful computer.   The profiler really helped when, for example, I needed to prepare the waveform data for the GPU.  The operation I had to perform was (in terms of memory layout) a bit like a matrix transpose: using Chrome’s profiler I discovered that this operation was taking several hundred milliseconds, and that if I re-wrote the loop I could reduce the execution time by a factor of about 4: the trick was to read from memory contiguously even though this meant writing out non-contiguously.  I was also able to play with a loop converting from Int8 to Uint8:

var Int8ToUint8 = function(A){
    // Takes an int8array, A, adds 128 to each element
    // and views it as a uint8 array, in place.
    // We are XORing each byte with 128, which in hex
    // is we do it with 4 bytes at a time.
    A = new Uint32Array(A.buffer); 
    for(var i=0;i<A.length;i++)
        A[i] ^= 0x80808080;
    return new Uint8Array(A.buffer);

This takes about 10ms for the typical 10MB of data that the program deals with.

There were a lot of ways to actually organize the drawing on the GPU. I ended up doing the following:

render waves

In words: we draw all the lines from (t, y(t)) to (t+1, y(t+1)) for all waves at fixed t, before moving on to the next t. This requires us to rearrange and (almost) duplicate the y data, before moving it to the GPU (which is what I mentioned above), but once on the GPU it is incredibly easy to use. The cluster assignment data is also duplicated, but there’s only two copies of it per wave, rather than one for every single point on the wave; this makes it fast to update if we modify clusters.

The first stage of the kernel is to count the number of lines crossing every pixel, we use GL_FUNC_ADD as the blend equation mode:

attribute float voltage; // y-data for line at t and t+1
attribute vec2 wave_xy_offset; // position of group on canvas
uniform mediump float t_x_offset; // x offset for t
attribute float is_t_plus_one; // 0 1 0 1 0 1 0 1 ...
const mediump float delta_t_x_offset = 2./512.;
uniform highp vec4 count_mode_color; // small number
varying lowp vec4 v_col;
void main(void) {
    v_col = count_mode_color; // in palette-mode this is trickier
    gl_Position.x = wave_xy_offset.x + (t_x_offset + 
    gl_Position.y = wave_xy_offset.y + voltage*y_factor;
    gl_Position[3] = 1.;

See, it’s not that complicated, right? I tried doing it a few different ways before this, but the simplicity of this method seems to help with performance.

Reflections on browser-based science

The browser is an amazing tool for building complex user interfaces quickly, especially if you’re already familiar with a good framework (such as polymer, which I started using towards the end of this project).
In terms of parallelism support: you have basic multithreading with WebWorkers (although I recommend using a helper library such as my BridegedWorker); you have basic GPU rendering, and to some extent compute (I managed to write a distance matrix computation in WebGL at one point, but it was unpleasant and I never fully understood all the rules about precision); you have basic compiler optimizations through V8….but all those things are decidedly “basic”, meaning that you have to work hard to get good performance if you need it, and ultimately there are serious limitations in quite how well you can really utilize the hardware. For example, there’s currently no SIMD support or true shared memory/atomics on the web, though that is due to change soon. Perhaps when it does we will get some decent numerical libraries, like those that we take for granted in Matlab/Numpy/C++ etc.

In summary, I’m fairly confident that the browser is going to become a better and better place to be manipulating serious data sets in a user-friendly way. (This is not a novel observation, but I wanted to say it in light of my own experience.)

Should I be using pandas?

My PhD consisted partly of doing very hands-on neuroscience experiments, and partly of analyzing the “medium-sized data” sets produced by those experiements. I say “medium-sized” because the data was large enough to require care, but not so large that it couldn’t fit comfortably on a single disk in raw form, and fit in memory once partially processed.

At the start I was using Matlab for analysis, which seemed to be the weapon of choice for almost all neuroscientists (as far as I could tell)…. uhhhh, let me just give you an xkcd-inspired graph….

me and matlab

…so needless to say, I switched to python+numpy about a third of the way into the PhD and didn’t look back*.  This wasn’t too difficult to do as one of my colleagues had made the switch a few months earlier – I was offered a “starter pack” of functions for doing IO and maybe a few words of advice.

Although I did have a bit of extra-curricular experience with python, I’d not yet done any serious exploration of the language’s features or ecosystem.  Both of these (especially the second) took time: early on I had fun hacking around with magic methods to make a dictionary-like memoizing property-thing; weird mask-able lists; and something not quite as good as IPython’s autoreload feature. Later I even tried to write a system for un-parallelising plotting commands  (that I named “monobrow”).  However, I did eventually settle on an overall analysis design that looks a bit more sensible, and I’m going to describe bits of it here (it doesn’t include any parallelism unfortunately).

I’m not going to describe the actual analysis I do in detail, instead I’ll give a simplified analogy: pretend that I have a stack of about 1k images, and I need to run a handful of algorithms on each image separately, and then summarize the resulting metrics across all the images.  Also, lets say that these 1k images were from 10 different food-obsessed-photographers, who went on separate holidays of varying lengths, but followed a similar pic-taking pattern each day: photos of breakfast, photos of lunch, photos of dinner.

My actual data wasn’t to do with images, photographers, food, or holidays, but in terms of organizing the analysis, my task was similar.  I ended up with something encapsulated in two main classes: one at the level of the individual “image”, and one at the level of the “entire-dataset”.  I’ll describe the “image” class briefly and then go into a bit more detail for the “entire-dataset” class.

The “image” class (aka Trial)

The analogized description I gave above doesn’t quite tell the full story; although there were some parts of the algorithms that had to be run on whole images, much of the computation was focused on one or more manually identified regions within the images.

The time needed to read an image off disk and do the initial processing wasn’t insignificant (nor was the space required to keep it in memory), so I hoped to reuse a single instance of each image when analyzing sub-regions within it.  However, since this was an optimization, I wanted to hide the details from the final top-level analysis script.  My solution required a bit of cooperation between the two classes: the “image” class would have a caching/memoization system, and the “entire-dataset” class would know to reuse instances of the image class across each of the subregions within an image, it would also know how to ask the image class to clear different bits of its cache.

image_id  subregion_id  image_instance
  0             91     [img #8712fc12a00]
  0             92     [img #8712fc12a00]
  0             93     [img #8712fc12a00]
  1             97     [img #1283efd7800]
  1             98     [img #1283efd7800]

The “image” class (which was actually called “Trial”) looked like this:


Each of the blocky things corresponds to a mixin contained in a separate file.  The foundation of the class is the caching mixin, which is used by the mixins within the “core api” cage as well as (potentially) by the more exotic stuff above.  I’ll describe the overall picture first, and then come back to give a bit more detail specifically on the caching.

The most important design decision was to separate the random-bits-and-bobs from the “core api”.  The idea is that by very clearly defining a core API, you can commit to keeping one section of the code in good working order even if other bits of the code are a mess.  If a core function is not quite general, fast, or readable enough, you have to promise to work on it “in situ” rather than inlining it into whatever complex analysis you are currently working on.  Python is good at letting you gradually add options to existing functions, so that you slowly build the value of your interface and its implementation. Drawing the line around the core also makes it easier to share the exotic stuff with other people: you either provide them with a copy of your well-maintained code, or you just give them a well-maintained description of the interface.

The tricky thing I found when defining the “core api” was that there were actually several parts to it: I wanted to separate the caching interface, the IO, and the basic analysis, but it took me a few iterations before I realized that I could create the divisions, and exactly where they should be.  With the current design, I can have separate IO implementations for different file formats, yet still slot that under the top-most part of the core.  I can also play with different systems of caching, or use a no-cache implementation, without having to worry about refactoring lots of core and non-core files.

So, let’s look at memoization/caching in more detail. The main two methods of the TrialCache mixin are as follows:

def _cache_has(self, kind, names, key=None):
    names = (names,) if not isinstance(names,
                            (list, tuple, set)) else names
    # update/create the list of names for the given kind
    if kind in self._cache_name_lists:
        self._cache_name_lists[kind] = set(names)
    # initialize any of the attributes if they are missing
    for name_ii in names:                 
        if not hasattr(self, name_ii):
            setattr(self, name_ii, None if key is None else {})
    # return False if any of the attrs are not already computed
    for name_ii in names:                 
        attr = getattr(self, name_ii)
        if attr is None or (key is not None and key not in attr):
            return False
    return True

def _clear_cache(self, keep=(), drop=()):
    if len(keep) and len(drop):
        raise ValueError("use 'keep' or 'drop', not both.")            
    if any(" " in k for k in keep + drop):
        raise NotImplementedError()
    if len(keep):
        kinds = [k for k in self._cache_name_lists.keys() if 
                 all(kk in keep for kk in k.split(" "))]
    elif len(drop):
        kinds = [k for k in self._cache_name_lists.keys() if
                 any(kk in drop for kk in k.split(" "))]
        kinds = self._cache_name_lists.keys()
    for k in kinds:
        for name in self._cache_name_lists[k]:
            if hasattr(self, name):
                delattr(self, name)

cache_has is a slightly unusual function in so far as its name should be interpreted as a statement of intent as well as a question: the caller is saying “I intended for the cache to contain such-and-such from now on, but does it have it yet?

Within another mixin, you use it as follows:

def red_count(self):
    if not self._cache_has('color_data', '_red_count'):
        self._red_count = compute_red_count(self._data, self._params)
    return self._red_count

def smoothing(self, val):
    self._params['smoothing'] = val

The idea is that the computation of _red_count is guarded by the _cache_has intent/test, so that redundant re-computing can be avoided, yet we can still invalidate stored results when we need to, i.e. when we change the smoothing setting. You could have many computed things that depend on the smoothing setting and invalidate them all with a single command: you can define as many groupings as you need and either ask to invalidate specific groupings or invalidate all except specific groupings.

Note that the code above supports the use of lists of strings and/or strings for most arguments. It also has some builtin support for “lazy” dictionaries that I haven’t demonstrated here.

The code is not perfect: in retrospect it might have been better to hold all the cached items within an actual “cache” container, rather than attaching them to the self object directly and only managing them via lists of the attribute names. Also, I might want to (optionally) expose the caching as a decorator, which is a common paradigm for simple memoizing. If I was being really serious I might investigate some kind of data-flow caching system that lets you define a hierarchy of data dependencies, but that’s probably a lot more complicated…indeed that’s roughly what my ambitious C++ project, “Venomous“, is hoping to do. Anyway, all I wanted to point out is that with a few lines of mixin methods you can create a pretty powerful caching system that other mixins can use in a very compact way.

We’ll come back to this caching system in a bit, but next time from the point of view of the “entire-dataset” class….

The “entire-dataset” class (aka ExpSet)

I don’t have a pretty picture for this class unfortunately, so I’ll just have to describe it in words.

For several months I was using a class that was a lot like the one I’m about to describe but it was cobbled together out of a strange mixture of pure-python lists, numpy arrays and weird home-made concoctions. Switching to pandas wasn’t all that simple, but it made things a lot better.

My impression is that pandas is viewed quite differently by different groups of users: finance/economics-y people might be in the habit of thinking that every dataframe has a timestamp column, with rows that can be aggregated by hour or by day. Geo-people might hold equivalent beliefs about GPS coordinates (I’m guessing). To me, pandas is simply a package that adds labels to the rows and columns of a numpy-like 2D array, and ensures that the labels stick with the rows and columns as you index into and make copies of the array. And, importantly for my usage-case, those labels can be hierarchical…

…well, in theory they can be hierarchical, but in practice the support for MultiIndex isn’t perfect (yet): the syntax for indexing/slicing and joining these kinds of dataframes is not exactly pleasant or easy to remember.

This lack of whole-hearted support for hierarchical indexing led me to encapsulate my main dataframe(s) inside a class of my own. The idea being that I could patch the syntactical gaps more easily and come up with a library of tools for doing slightly more complex things. I settled on having two separate dataframes, which I named simply df1 and df2. The first had one row for each “region of interest” within each image, and the second had one row for each image. The first thing I did was write a method to take the strange csv-tables I had and parse them out into the two dataframes, including instantiating all the images. (I didn’t mention this explicitly earlier, but the image class is designed to be entirely lazily loaded – the filename is merely recorded at construction, nothing is read off disk.)

Once I had the basic pair of dataframes I could start thinking about iterating over the rows, running the algorithms on the images. This is where the title of the post comes in: should I be using pandas for this sort of thing, given that pandas is supposed to be designed for performing operations in clever aggregate ways?

Well, it might look strange to the other users of Pandas, but I think the paradigm I’ve adopted is absolutely valid and should perhaps be more widely encouraged.

Before I show you an example of iterating over the rows of the dataframe, let me give you (most of) the code for the RowIndexer class I came up with:

class RowIndexer(object):
    def __init__(self, obj, idx, **kwargs):
        object.__setattr__(self, '_obj', obj)
        object.__setattr__(self, '_idx', idx)
        object.__setattr__(self, '_with_column_sub', None)
        for k, v in kwargs.iteritems():
            object.__setattr__(self, k, v)

    def __getattr__(self, key):
        original_key = key
        if self._with_column_sub is not None:
            sub_idx = self._with_column_sub.index('[attr]')
            key = self._with_column_sub[:sub_idx] + (key,) + \

        # try a simple column indexing, then index names,
        # then index names without sub
            return self._obj.get_value(self._idx, key)
            return self._idx[list(self._obj.index.names).index(key)]
        if original_key == key:
            raise KeyError("Could not find key '" + str(key) +
                            "' in column or index names.")
            return self._idx[list(self._obj.index.names)\
        raise KeyError("Could not find key '" + str(key) 
                       + " or " + str(original_key) +
                       "' in column or index names.")
    def __setattr__(self, key, value):
        if self._with_column_sub is not None:
            sub_idx = self._with_column_sub.index('[attr]')
            key = self._with_column_sub[:sub_idx] + (key,) \
                    + self._with_column_sub[sub_idx+1:]
        elif self._obj.columns.nlevels > 1:
            key = (key,) + ('',) * (self._obj.columns.nlevels-1)
            self._obj.set_value(self._idx, key, value)
        except TypeError:
            self._obj[key] = self._obj[key].get_values()\
            self._obj.set_value(self._idx, key, value)
    def column_sub(self, *args, **kwargs):
        if not args or len([a for a in args if a =='[attr]']
                           ) > len(args)-1:
            raise ValueError("you must specify at most one "
                             "'[attr]' spot, and at least "
                             "one sub label.")
        if 'padding' in kwargs:
            padding = kwargs['padding']
            if len(padding) > 1:
                raise ValueError("urecorgnised wkargs")
            padding = ""
            if kwargs:
                raise ValueError("unrecognised kwargs")
        # establish total levels required                     
        if '[attr]' not in args:
            args = args + ('[attr]',)
        target_nlevels = len(args)
        _pad_columns_to_level(self._obj, target_nlevels, padding)
        object.__setattr__(self, '_with_column_sub', tuple(args))
            object.__setattr__(self, '_with_column_sub', None)

Instances of this class are yielded by a generator in ExpSet as the user iterates over the rows of df1 and/or df2. The main idea is that assigning to attributes on the instance actually results in values being written to columns within a specific row of a dataframe.

for subregion, image_inst in dataset.iter():     
    subregion.redness = image_inst.get_redness(subregion.shape)

Here subregion is a RowIndexer instance that points to a different row on each iteration of the loop, and image_inst is the instance of the “image” class relevant for the given row. Inside the generator we make sure to yield all the subregions for a specific image before moving on to the next. When we are ready to switch images, we clear some/all of the cache for the old image, thus keeping the total memory usage manageable. Exactly which bits of the cache we want to clear depends on (a) what we are likely to want in future if/when we loop over the data again; (b) what was expensive to compute; and (c) how large things are in memory. These three points are common to pretty much any caching system, and the optimal choice is rarely an easy thing to pin down: here we don’t use any automatic heuristics, instead we allow the user to specify which bits of data to delete using the fine-grained caching machinery outlined earlier. For example, the user may delete the raw whole-image, but keep the down-sampled, filtered version.

The nice feature of this system is that the Image class is agnostic of the way in which you iterate: it will do its best to cache stuff and reuse anything from the cache when possible. Also, the user’s loop (i.e. the one shown here) can ignore the details of caching: they don’t even need to know that one image instance is shared across multiple subregions. If the user does want to customize the cache-clearing behavior, he simply provides an extra argument to the iterator. The iterator can also take care of displaying progress reports.

Try doing that in Matlab!!

An important feature of the RowIndexer class is that you don’t need to declare new columns ahead of time: they are automatically added to the parent dataframe as needed. The class also has a couple of bonus features, one of which is the ability to handle and create multi-indexed columns using a with-block:

subregion.n_people = image_inst.n_people(subregion.shape)
for zone in ['upper', 'lower']:
    with subregion.column_sub('[attr]', zone), image_inst.mask(zone):
       subregion.n_flowers = image_inst.n_flowers(subregion.shape)
       subregion.n_clouds = image_inst.n_clouds(subregion.shape)
# the parent dataframe now has columns: 
#  n_people |    flowers    |     clouds    |
#           | upper | lower | upper | lower |

All of this lets you do some fairly complex meta-analysis (still using caching, remember) in just a few clear lines of high-level code.

The final thing I want to discuss is a bit like the RowIndexer, but it applies to the whole dataframe, or in fact to the whole custom class in which the pair of dataframes are held.

As before, I’ll give the implementation first, but you don’t have to look at it too closely, especially as it depends on a few things that I haven’t included:

# method of "entire-dataset" class...
def view(self, **kwargs):
    original_df1, original_df2, original_mode = (
                           self.df1, self.df2, self.mode)
    sub_df1, sub_df2, sub_mode = self._get_sliced(**kwargs)
    self.df1, self.df2, self.mode = sub_df1, sub_df2, sub_mode
    self.df1.is_copy = self.df2.is_copy = None 
    except Exception:
        self.mode = original_mode
        self.df1 = original_df1     
        self.df2 = original_df2
        if sub_mode != original_mode:
            sub_df1, sub_df2 = self._get_as_mode(
                         src=sub_mode, dest=original_mode, 
                         df1=sub_df1, df2=sub_df2)
            _update(original_df1, sub_df1)            
            _update(original_df2, sub_df2)    

And we can use it as follows:

with dataset.view(meal='breakfast', holiday_type='european'): 
   for subregion, image_inst in dataset:
       subregion.juice = image_inst.has_juice(subregion.shape)  

The view method does two things: firstly, it exposes a nice way of slicing the dataset using keywargs that can be handled intelligently behind the scenes (using all the gruesome pandas syntax); and secondly it acts as a context-manager that gives us a filtered psuedo-view onto the whole dataset: once the with-block exits, any changes made are propagated back into the original dataset. When using raw pandas, some types of indexing generate copies and some types generate views: certainly if you have complex filtering requirements it is rare that you will end up with a view back onto the original dataframe. (For many users this may only be memory problem, rather than a logical-one, but given that we want to write into the rows of the dataframe we need to have some kind of system such as this.) In fact, in my ExpSet class, not only can you slice the data in interesting ways with this sytnax, you can also pivot it into different forms, do some computation, and then have it revert back to the original form when the with-block exits.

Identical slicing/pivoting functionality is provided in a second method called copy, as the name suggests, this is not intended for a with-block, but simply gives a do-what-you-want-with clone of the original dataset.

That’s pretty much all I wanted to say about the “entire-dataset” class. The two things I wish it did (better) were outputting nice HTML tables and providing a simple means of running loops in parallel. Pandas does do HTML tables, but I had to fork it in order to put images in the table (I did submit a pull request, but in the end it got rejected in favor of something slightly different). Having interactive tables that you can filter and sort in the browser would also be great (I think there are some tools built on top of pandas that do this to some extent, but I never tried them).

As with previous posts, this took a while to get onto the page. If you found it interesting, or you want to argue about something, please do post in the comments.

*except when helping other people with Matlab, ‘cos I’m such a nice person. Edit Jan-2016: I’ve just come across last year’s Stack Overflow survey in which it was found that two thirds of Matlab-using respondents did not want to continue using it, which put it near the top of the “dreaded” languages list.

Flags of the world (indexing tutorial)

The code examples for this tutorial are currently only given for Python/Numpy. Conversion to Matlab is not difficult, but would be confusing for beginners (who are the target audience here). I may get round to doing the extra work at some point soon.

This tutorial is aimed at absolute beginners. Ideally get an experienced person to setup Python* for you because it is liable to be an off-putting hurdle otherwise. (When you are an expert yourself, you will have plenty of opportunities to repay your debt to the world!)

Before we get to the actual tutorial, you need a tiny bit of preparatory code that you don’t need to understand. Copy-paste the following few lines into your interpreter/script (ask your nominated “experienced person” how you do this, and then keep them around for another minute or so just to check that everything is working as it should):

def show_flag(data, colors=['#ff0000','#ffffff','#0000ff'],
              title="my flag", fignum=1):
    from matplotlib.colors import ListedColormap
    from matplotlib.pylab import matshow, gca, clf
    cm = ListedColormap(colors)
    if fignum is not None:
    matshow(data, cmap=cm, fignum=fignum)

That function will let us display 2D arrays as colorful flags. Here is a very simple example to start with:

flag = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 2, 2, 2],
    [0, 0, 0, 0, 2, 2, 2]


Notice how the flag contains just the numbers 0 1 2. Inside the show_flag function these numbers are converted to colors (rather like the color-by-number pictures you may have done as a child). By default, 0 maps to red, 1 maps to white and 2 maps to blue. (I chose those colors because they’re pretty common for flags. I’ll show you how to customize the color mapping later on.)

Right, there’s one more thing you need before we get started:

# Use one (or both) of the following two lines:
from numpy import * # simple, but frowned upon
import numpy as np # better, but sometimes annoying

I will explain this briefly because it’s worth understanding, but I’ll have to use “scare-quotes” a bit as the explanation is rather hand-wavy: basically, Python is not a good language for doing heavy number crunching. If you try adding/subtracting/multiplying individual numbers thousands/millions/billions of times there will be a huge amount of “overhead” and you will get very slow (and complicated-looking) code. However, Python does allow, and encourage, the use of “plugins”. These plugins are free to do special kinds of “magic behind the scenes”, so long as they present a “Python-friendly exterior” that the user (i.e. you) can interact with in Python.

Numpy is one of these “plugins”, but it’s a rather special one in so far as it provides such significant “number-crunching functionality” that no other plugin can really compete with it. If you want to do number crunching in python you basically *need* to use Numpy. Most plugins that provide advanced number-crunching tools build on top of Numpy rather than replace it. (Not all plugins are about “number-crunching”, they do all kinds of things, like helping generate HTML for webpages etc.)

So now I can explain import numpy as np. This says to Python: “I have a plugin called ‘numpy’ and I’d like to use it here please. And I would like to refer to it as ‘np’ for short.” You can then use commands like np.ones(40), and python will “delegate” to Numpy to do its magic. If you use the alternative form from numpy import * then you are saying to python: “I have a plugin called ‘numpy’ and I’d like to use it here please. And I’m not going to tell you explicitly when I’m trying to use it – if I mention something from inside numpy, assume that is indeed what I meant.” This second version is sometimes easier because you don’t have to type np. all over the place, but it can also confuse your text editor and may introduce subtle bugs or confuse readers of your code. To keep the code in this tutorial compact, I’ll use the from numpy import * version.

Ok, finally we are ready to start…

The French Tricolour

H, W = 50, 70
# remember that our defaults are: red=0, white=1, blue=2
flag = zeros([H, W])
flag[:, :W/3] = 2 
flag[:, W/3:2*W/3] = 1
show_flag(flag, title="Vive la France")


Try it out and check that it works. Can you make sense of the code?

We begun by creating a 2D array of shape 50 x 70. In numpy we refer to the dimensions of a multidimensional array as “axis 0”, “axis 1”, “axis 2”, “axis 3” etc. Here axis 0 is of length 50 and axis 1 is of length 70. You can see in the flag that the 50 and 70 correspond to the vertical and horizontal dimensions respectively (because the image of the flag is wider than it is tall). The array we started with is all zero, which gets mapped to red when we display it. There are other functions for initializing arrays but this is all we need for now.

The line flag[:, :W/3] = 2 is setting the blue section of the flag. It does this by indexing into the whole of axis 0 (the vertical), but only the first third of axis 1 (the horizontal).

The next line flag[:, W/3:2*W/3] = 1 is similar, setting the white portion of the flag. Here the relevant section spans from the end of the first third, to the end of the second third.

That was all we had to do in this example.

Vlag van Nederland


Have a go at doing this yourself. Note that you may have to guess a little bit for some of the syntax.

Expand to view my suggested solution…

H, W = 50, 70
# remember that our defaults are: red=0, white=1, blue=2
flag = zeros([H, W])
flag[H/3:2*H/3, :] = 1
flag[2*H/3:, :] = 2 # set last third to blue
show_flag(flag, title="Van Gogh would love numpy")

This looks pretty similar to the French version, but we’re indexing into axis 0 rather than axis 1. Note that here we don’t need to set the first third to blue, but we do need to set the last third. Hopefully you were able to guess the necessary syntax.

Easy, right?

Les Quatre Bandes of Mauritius


You should be able to make the array for this yourself, but I need to tell you how to customize the color mapping:

show_flag(flag, title="What's the capital of Mauritius?",
          colors=['#00A551', '#EA2839', '#1A206D', '#FFD500'])

I found the four color codes using the colorpicker in Chrome devtools – google for help if you’ve never done that kind of thing.

My solution looks like this:

H, W = 50, 70
flag = zeros([H, W])
flag[:H/4, :] = 1
flag[H/4:2*H/4,:] = 2
flag[2*H/4:3*H/4,:] = 3
show_flag(flag, title="What's the capital of Mauritius?",
          colors=['#00A551', '#EA2839', '#1A206D', '#FFD500'])



There shouldn’t be anything new here really. But if you want to get obsessive you can spend more time fiddling with the exact proportions.

My solution looks like this:

H, W = 100, 140
flag = zeros([H, W])
flag[:, W*0.3:W*0.48] = 2
flag[H*0.33:H*0.56, :] = 2
flag[:, W*0.35:W*0.44] = 1
flag[H*0.39:H*0.5, :] = 1
show_flag(flag, title="How deep is your fjord?",
          colors=['#EF2B2D', '#002868', '#ffffff'])

Note that I doubled W and H to increase resolution a bit. Also note that we are now using non-integer indices for the arrays, or rather we are not explicitly using integers: numpy will round our floating point numbers down to the nearest integer before doing anything with them.

The Jamaican Cross


How are we going to deal with those diagonal lines? Well, hopefully you already are familiar with the basic equation of a straight line y = mx + c. Here we are essentially going to use that, but we’ll treat it as an inequality rather than an equality, i.e. for a given y at a specific x horizontal position, we want to know is it above or below the line? The code is now a bit more complicated:

from numpy import newaxis as nax
H, W = 100, 140
x = arange(W)[nax, :] 
y = arange(H)[:, nax]
flag =  y > x*0.75 - W*0.1
show_flag(flag, title="Everything's gonna be alright",
          colors=['#000000', '#FED100']) 
    # we'll also need '#009B3A' (green) later


The first line is just a convenient way to shorten the word newaxis to nax, we can then use it on the third and fourth lines: arange(W) creates a 1D array of the numbers [0, 1, 2, ..., W-1]. When we put [nax, :] after that, we are changing the shape of this 1D array to make it a 2D array of shape 1 x W. This means we can treat the x array as a single horizontal “row” in the flag, giving the x-coordinate of each column. Equivalently, the y array is a column, giving the y-coordinates of each row. Once we’ve setup these two variables, we are able to combine them in interesting ways.

By “setup” I mean that we’ve created two arrays with the same number of axes (i.e. 2), and the axes lengths “agree” with each other. What does it mean for two arrays to have axes lengths that “agree”? Well, it means specifically that each axis either has a length of 1, or it has exactly the same length as in the other array. For example 10 x 1 x 17 agrees with 10 x 23 x 1, but it doesn’t agree with 17 x 1 x 10. Once we have agreement, numpy can automatically “broadcast” any operations (such as adding, subtracting, comparing) out to match the full length in all axes.

TODO: a diagram of broadcasting would help a lot here

So here, we broadcast out the row x and the column y with the “greater-than” operation, and the result is a 2D array of trues and falses, telling us which elements in the array are above/below the line. When we display the flag the trues are converted to 1s and the falses to 0s (and then the 0s and 1s are mapped to colors as with previous flags). Note (if you’d hadn’t already) that the 0-end of the vertical axis is at the top of the flag and the H-end is at the bottom. If you want to flip this round you can add the line gca().invert_yaxis() at the end of your show_flag function (and then re-evaluate the function to update it).

We are not done yet, but hopefully you can now see how you might go about constructing the flag. If you get stuck, don’t feel too bad jumping to the suggested solution as I haven’t explained everything that you’re going to need to use here.

from numpy import newaxis as nax
H, W = 100, 140
flag = zeros([H, W])
x = arange(W)[nax, :] 
y = arange(H)[:, nax]
yellow_part = ((y > x*0.75 - W*0.1) & (y < x*0.75 + W*0.1)) 
flag[yellow_part] = 2
yellow_part = ((y > W*0.6 - x*0.75) & (y < W*0.8 - x*0.75)) 
flag[yellow_part] = 2
green_part = ((y >= x*0.75 + W*0.1) & (y >= W*0.8 - x*0.75))
flag[green_part] = 1
green_part = ((y <= x*0.75 - W*0.1) & (y <= W*0.6 - x*0.75))
flag[green_part] = 1
show_flag(flag, title="Everything's gonna be alright",
          colors=['#000000', '#009B3A', '#FED100'])

Right, that’s a little bit messy, but it’s largely unavoidable. There are two new things here.

Firstly, we are using & to combine pairs of 2D arrays: the combined array has the same shape as original two arrays (numpy would let us use broadcast again here if we needed to) and its elements are true where the elements of both arrays were true, and false elsewhere. With this mechanism we are able to say: “I want the band where y is greater than something *and* less than something else.“.

The second new thing is that we are indexing into flag using a 2D array of trues and falses rather than bands of indices. In addition to these two methods of indexing there are one (or two?) more that we haven’t covered yet.

Nisshōki of Japan


This uses trues and falses like we did in Jamaica, but now we need to create a circle.

from numpy import newaxis as nax
H, W = 100, 140
flag = hypot(arange(-W/2, W/2)[nax, :],
             arange(-H/2, H/2)[:, nax]) > 30
show_flag(flag, title="Pythonic haiku? Not really.",
          colors=['#ff0000', '#ffffff']) 

I Galanolefki of Greece


Although you could just use the basic stuff we learnt at the start, here we create all the stripes with a single command.

H, W = 100, 140
flag = zeros([H, W])
is_stripe = arange(H) % 22 > 10
flag[is_stripe, :] = 1
flag[:55, :50] = 0
flag[:55, 20:30] = 1
flag[22:33, :50] = 1
show_flag(flag, title="'flag' is not of Greek origin",
          colors=['#0D5EAF', '#ffffff']) 

* I use the Spyder IDE as distributed with anaconda, but you could possibly use an IPython notebook if you prefer. The only modules we need for this tutorial are Numpy and Matplotlib. I use Python 2.7, but everything here is probably more or less fine in 3.x.

This tutorial took a good while to write, what with all the explanations of how to use numpy and the actual messing around to make the flags. If you found it useful let me know in the comments. If I have time I may add some more examples to deal with stepped indices, reverse indices (maybe?), and perhaps suggest how custom-written functions can be used to draw complicated shapes like circles or stars…and then those functions can be reused for later flags.

Non-trivial vectorizations

For a more general discussion on optimizing, see my previous post on perfectly timed optimization.

Before we get started, I should say that the following stories are likely to be of most interest to Matlab or Numpy users.  In compiled languages, the general lack of overheads will probably do a pretty good job of making fast programs, as will any loop optimization passes.  Auto-vectorization is getting better over time, but is unlikely to work for the examples here (and in fact using these methods in a compiled language may be a pessimization). However, these tricks do become relevant again when one switches to computing on the GPU.

Example 1.

Imagine we have an array giving times for 10k events, and our task is to compute a histogram giving all the pairwise time differences. Or rather, we must compute just the lower end of that histogram, where time differences are within some specified window.


Some raw-python code might look like the following. Note that for simplicity I assume the desired histogram bins are one time-unit wide…

def make_histogram(times, window):
    hist = [0]*window
    for ii, time_ii in enumerate(times):
        for time_jj in times[ii+1:]:
            delta = time_jj - time_ii
            if delta >= window:
            hist[delta] += 1
    return hist

In words: for each event, loop forward over the subsequent events, storing time-diffs in the histogram until you reach the end of the window. You could improve this slightly by pulling the delta>window check into a separate loop from the hist[delta]++ line: create a last_jj variable that gets incremented zero or more steps each ii-iteration.

Right, lets begin vectorizing….

Firstly, for each ii lets find the number of subsequent events in the window. The approach we use may not strictly speaking be using SIMD under the covers, but in the Matlab/Numpy world we could still call it a “pseudo-vectorized” primitive.

counts = times.searchsorted(times + window, side='left')\
                             - arange(len(times)) -1

# which would let us do something like...
for start_ii, (time_ii, count_ii) in enumerate(
                                 zip(times, counts)):
    deltas = times[start_ii+1:start_ii+1+count_ii] - time_ii
    hist += bincount(deltas, minlength=window)

searchsorted and bincount are part of Numpy, the equivalents in Matlab are histc (use the second output) and accumarray. Observe that by knowing the counts in advance we were able to pull the delta calculation out of the inner loop, and then as a result we could pull the increments out of the inner loop too.

The second step is a bit trickier, we want to pull the delta calculation out of the ii loop entirely. This calculation is just a long list of times[jj] - times[ii], so what we’ll need is two sets of indices, that look something like:

all_ii = [0 0 0 0 1 1 1 1 1 3 3 4 4 4 4 ...]
all_jj = [0 1 2 3 0 1 2 3 4 0 1 0 1 2 3 ...] + all_ii + 1

that is we want to repeat each ii the number of times specified by counts[ii], and all_jj is formed from all_iis by the addition of mini-ranges of length counts[ii]. In Numpy there is a function for generating all_ii, called repeat. In Matlab we have to build the function ourselves:

function ret = repeat(n)
    n_mask = logical(n);
    n_inds = find(n);
    n_inds(2:end) = n_inds(2:end) - n_inds(1:end-1);
    n_cumsum = ones('single');
    n_cumsum(2:length(n)+1) = cumsum(n)+1; 
    ret = zeros(n_cumsum(end)-1, 1, 'single');
    ret(n_cumsum(n_mask)) = n_inds; 
           % note: numel(n_mask) = numel(n_cumsum)-1
    ret = cumsum(ret);

For the multiple ranges in the construction of all_jj, we can use the following in Numpy (this translates to Matlab fairly easily, but be careful with one/zero-based indexing):

def multi_arange(n):     
    n_mask = n.astype(bool)
    n_cumsum = cumsum(n)
    ret = ones(n_cumsum[-1]+1, dtype=int)
    ret[n_cumsum[n_mask]] -= n[n_mask] 
    ret[0] -= 1
    return cumsum(ret)[:-1]

If you weren’t already a fan of cumsum, I hope you are warming to it – look how we used it twice in one function here (and we used it twice in the Matlab version of repeat)!

Ok, that’s pretty much it. We can now get rid of the ii-loop, and we’re done:

def make_histogram(times, window):
    counts = times.searchsorted(times + window, side='left')\
                                 - arange(len(times)) -1
    all_ii = repeat(arange(len(counts)), counts)
    all_jj = multi_arange(counts) + all_ii + 1
    deltas = times[all_jj] - times[all_ii]
    return bincount(deltas, minlength=window)

Note that this optimization only makes sense if the counts in each window is guaranteed to be fairly small, and the total deltas array fits easily in memory. If either of those things is not true you might want to stick with the intermediate version, or wrap some batching mechanism around the final version.

Obviously, if you are going to do this sort of thing yourself, make sure you are testing properly as you go along and provide assertions about array shapes and types etc. The two helper functions I showed should also be given some documentation: just showing a sample input/output should be enough here. Note that the examples I’ve given here are simplified versions of the actual code I use normally – the stuff here is fine for simple integer cases, but I recommend testing it properly yourself if you actually intend to use it!

The multi_arange function, and some other goodies are available in the numpy groupies package that I co-authored with @ml31415 on github.

Example 2.

Imagine we have a list of 60k (x, y) coordinates that correspond to the path someone walked. Imagine also, that there are errors in that path: at certain points the person rotated by some amount that was not reflected in the path data. Our task is to take the 2D array of xy data, together with a 1D array providing the rotation errors at each step, and generate a corrected 2D array of xy data.

Lets picture a simple example in which the person took just 50 steps (rather than 60k), and recorded two abrupt rotations. Our task could be to correct for 3 additional rotations, or we might need to correct for a series of very small rotations over multiple steps:


The function I wrote to solve this wasn’t actually needed in any analysis in the end, but I had fun writing it, not least because I got to name it cumrot!

I’m not going to give an un-vectorized code sample because I hope the above picture already describes the problem well enough (and I never tried writing it un-vectorized so I’m not sure what that would look like anyway).

Right, let’s get to it then… (edit: you may want to skip the interesting-looking stuff and read the “oh dear” section at the end.)

To begin, you need to know that a rotation of th radians about a point (x,y) can be expressed in matrix form as:

\left( \begin{array}{ccc} cos(th) & -sin(th) & [x - x cos(th) + y sin(th)] \\ sin(th) & cos(th) & [y - x sin(th) - y cos(th)] \\ 0  & 0 & 1 \end{array} \right)

So if we had just one rotation error, we could apply this rotation to all the data from that point forward. Even if we had a small number of corrections we could loop through each of them and apply them one after the other, i.e. this would suit the “with 3 rotations” example above. However, it would be pretty inefficient to do this for the “with gradual rotation” example, because there are many small rotations needing to be corrected for (and even that was just meant to be a simple small example).

Instead, lets imagine performing the rotations in reverse order, i.e. starting with the last one (but still applying the rotation to all the data from that point in the forwards direction) and then working back towards the first. If we do the rotations in this reversed order, then each pivot point is un-altered at the moment we need to create its rotation matrix. Thus, we can produce all the rotation matrices ahead of time. Which is a nice start.

But we can do better than that: observe that each point is being multiplied by all the preceding rotation matrices, however we could (by virtue of associativity) equivalently say that we multiply each point by a single composite-transformation matrix.

What does that give us? Well, *if* we can work out the necessary composite transformation matrices, then we can vectorize the second step, i.e. the multi-matrix-multiply, as follows:

x_out = a*x + b*y + c;
y_out = d*x + e*y + f;

where a b c d e f are all 60k-long 1D arrays, that together define the 60k separate transformation matrices of the form:

\left( \begin{array}{ccc} a & b & c \\ d & e & f \\ 0 & 0 & 1 \end{array} \right)

This is weird, right? Rather than using Numpy/Matlab to do matrix multiplies for us we’ve hand-written one. But ours is way better, because it’s actually 60k simultaneous multiplies! Ok, so the question is: how do we get the composite matrices needed for the multiply, above?

Well, it’s helpful that we’ve just seen how hand-writing matrix-multiplies sometimes makes sense. What we need to spot now is that we’re actually attempting to do a “cumulative” matrix multiply, i.e. each of the 60k steps is either an identity matrix or it’s a rotation matrix, and what we want is to do a cumulative matrix multiply along the length of this 60k stack of matrices. If you’re not familiar yet with basic cumulative sums (i.e. cumsum), then you’re going to be a bit out of your depth here. Even if you are a frequent user of cumsum you may want to go the the wikipedia page that has a nice description of how it can be computed in parallel. I reproduce the wikipedia diagram here:

‘Prefix sum 16’, David Eppstein. (source)

Ok. We know how to vectorize a bunch of matrix multiplies, and the above diagram shows us the pattern we need to do them in, so all that’s left is to write the code. Below, I show the first of the two main loops. In preparation for this loop, we zero-padded x and y to ensure they are of power-two length, and we computed a b c d e f for all the individual rotation matrices (see the matrix above with cos(th), sin(th) etc.).

for log_s in range(logn):
    s = 2**log_s
    s2 = 2*s
    # pick out the right-hand values. 
    a_r, b_r, c_r, d_r, e_r, f_r = a[s2-1::s2].copy(), \
               b[s2-1::s2].copy(), c[s2-1::s2].copy(), \
               d[s2-1::s2].copy(), e[s2-1::s2].copy(), \
    # and the left-hand values
    a_l, b_l, c_l, d_l, e_l, f_l = a[s-1::s2], \
            b[s-1::s2], c[s-1::s2], d[s-1::s2],\
            e[s-1::s2], f[s-1::s2]

    # store in right hand value locations
    a[s2-1::s2] = a_l*a_r + b_l*d_r    
    b[s2-1::s2] = a_l*b_r + b_l*e_r;
    c[s2-1::s2] = a_l*c_r + b_l*f_r + c_l;
    d[s2-1::s2] = d_l*a_r + e_l*d_r;
    e[s2-1::s2] = d_l*b_r + e_l*e_r;
    f[s2-1::s2] = d_l*c_r + e_l*f_r + f_l;

Note that I translated this from Matlab to Numpy without checking the results, but it’s probably correct?! More importantly, note that we have to explicitly copy the RHS values because we’re going to modify them in place bit by bit. Finally, note that although I was quite pleased with this when I wrote it, there is still room for improvement: you could use a single 2D array for abcdef, which would simplify the “picking” bits and might let you fuse together some of the actual assignment bits. For the case when most of the matrices are the just the identity matrix, you could further optimize this by performing the cumulative multiply on only the non-identity subset of the matrices, and then index into the result to expand it out to the appropriate length.

That’s all I wanted to say for this example. If you really want to see the complete solution then let me know (I have written it in Matlab). But hopefully you have enough here to write it yourself.

Oh dear! As I reached the end of writing about this, it occurred to me that there was in fact a much simpler way of doing the calculation: firstly, compute the cumulative sum for the extra rotation data; then convert the (x,y) data into a series of (dx, dy) steps; next, turn these Cartesian steps into polar steps and add the corresponding cumulative rotation angle for each step; finally, convert the steps back into Cartesian space and perform a cumulative sum in x and y.

This alternative method method requires slightly more trigonometry (cosine and sine at the end in addition to the arctan at the start), but the massive benefit is that all the hand-written matrix multiplies are replaced with standard cumulative sums: one for angle, one for x and one for y. The original solution may be slightly better in terms of resistance to floating point errors (but I’m not going to examine that in detail now). It would also probably be faster if the number of rotations is small (and we implement the suggested optimization for that scenario). I wonder whether there was a reason I wrote the original function with matrix multiplies? I notice that the code included the option to apply the cumulative transformation to some secondary position data, so perhaps that had something to do with it, but unfortunately I no longer remember the actual details of the intended use-case.

Example 3.

In contrast with the previous examples, I am going to give a rough explanation of the neuroscience motivating this task (because it would be rather abstract and contrived without it).

Neurons, roughly speaking, behave like Poisson processes: at any given moment they each have a “true” rate, but we can only observe that rate indirectly by counting the number of discrete “spikes” (aka action potentials) they generate. For example, if a neuron generates 4 spikes in one second and 6 the next second we might (using an appropriate formula) estimate that the “true” underlying rate was about 5 Hz.

For many neurons, the “true” rate correlates with something we can observe in the outside world, where the word “correlate” here is meant in the general sense, i.e. some sort of statistically observed coupling. For example, the rates of some neurons are modulated by the exact location of the organism within its environment, i.e. a particular neuron might have a rate of 4Hz each time the person/animal goes to a spot near the window, but have a rate of approximately zero elsewhere.

Our task here is to analyse the data from 50 of these place-sensitive neurons: we will get a stack of 50 “ratemaps” that tells us the average rate for each of the cells in every square centimeter of an experimental arena. And we will get a list of the spike times for each of the 50 cells over the course of a 20 minute experiment. The question is: how can we use the neuron spike data and ratemaps to estimate the location of the animal? Since the animal is moving around, we need to produce a separate estimate for every second, or lets say every quarter of a second.


The calculation we need to do (which can be found in a number of proper academic papers) is, in essence, quite simple: (1) count the number of spikes for each cell in each time window; (2) for each of the 1 centimeter-square bins in the arena, calculate the probability that the observed counts could have occurred; (3) either stop there, i.e. provide the likelihood-map for each time window as the output, or find the location of the maximum likelihood for each time window, and just give that as the output.

For a single 1-cm bin, at (x,y) in a single window, t, the calculation is:

L(x,y,t) = \prod_{i=0}^{50} \frac{rate_i(x,y)^{counts_i(t)} e^{-rate_i(x,y)}}{counts_i(t)!}

If real maths seems like hard work, then just make sure you don’t stare too closely. It’s the programming that we’re here for, so you don’t need to worry about the details of the equation really.

Right, and we’re off…

To begin with, I need to make a point that is going to be obvious to people with certain backgrounds (but is apparently not obvious to many neuroscientists):

\prod_i x_i=exp(\sum_i log(x_i))

This means we can re-write the equation for L(x,y,t) as a sum, in which the terms are nicely separated:

log(L(x,y,t)) = -\sum_i rate_i(x,y) + \sum_i [counts_i(t) log(rate_i(x,y))] -\sum_i log(counts_i(t)!)

Now, we can examine each term individually and come up with a way of vectorizing over the whole of x, y, and t.

The first term is easy:

# ratemaps is a 3D array of shape: [100 cm x 100cm x 50 neurons]
term_1 = sum(ratemaps, axis=-1)

Once we have the counts, the last term is also pretty easy. (For the counts we probably need to use bincount, either looping over cells, or using ravel_multi_index as I do in numpy-groupies.) We can even manage a “bonus” optimization because we know we’re dealing with small integers:

# counts is a 2D array of shape: [4800 windows x 50 neurons]
log_k_factorial = log(factorial(arange(max(counts)+1)))   
term_3 = sum(log_k_factorial[counts], axis=-1)

The bonus optimization was that we build a tiny lookup-table of log-factorials and then simply index into that. This is a real help as computing logs is seriously slow.

The middle term is the only interesting one. Notice that it’s roughly of the form \sum_i a_i b_i. Whenever you see something of this form (and it occurs surprisingly often in my experience) then you should immediately say to yourself “I bet that can be re-written as a matrix-multiply”. I’m not going to try and describe the exact circumstances in which it is possible to turn sums into matrix multiplies, but just remember that it’s worth looking out for them Note that if you need to transpose a large array in order to do this it may not be worth it in Matlab, but it’s free to do so in Numpy. Numpy also lets you do n-dimensional dot products, whereas Matlab just does 1 or 2D – however, by reshaping your ND arrays appropriately you should be able to re-imagine an ND multiply as a Matlab-friendly 2D multiply).

# counts is a 2D array of shape: [4800 windows x 50 neurons]
# ratemaps is a 3D array of shape: [100cm x 100cm x 50 
term_2 = dot(log(ratemaps), counts[:,:,nax]) # nax = np.newaxis

All that’s left is to add the three terms together, with the appropriate broadcasting in Numpy or bsxfun in Matlab, and we’re done. The result is about 6 lines long, and doesn’t look that horrible (well it’s ok in Numpy, but a bit uglier in Matlab). It’s also amazingly fast given how much number crunching it’s doing!

There may be some mistakes in this third example as I wasn’t as careful as with the previous two.

Example 4

Note that I added this example a few days after writing the original post, although the code actually predates that of the other examples.

This task is interesting in that we manage to *partially* vectorize a loop even though we can’t go the whole way. The trick is to perform a kind of manual profile guided optimization.

The task is relatively simple: in an experiment, an LED traced out a trajectory in 2D space and we now have an array of 60k (x,y) coordinates corresponding to that trajectory. The problem is that in a few places the camera doing the tracking made an error, and spuriously recorded that the LED “jumped” over a large distance in an unrealistically short time.

speed filter

The filtering algorithm we need to implement is roughly the following:

def filter_speed(xy, max_speed):
   is_good = zeros(len(xy), dtype=bool)
   ii = 0
   while ii < len(xy):
       is_good[ii] = True 
       from_xy = xy[ii]
       for jj, to_xy in enumerate(xy[ii+1:]):
           # dist is diff then hypot (though sqrt is redundant)
           if dist(from_xy, to_xy) < (jj-ii)*max_speed:
              ii = jj
   # now we're supposed to interpolate where is_good is False

Notice how each iteration of the ii loop depends on the previous iteration: we need to know which points are good and which are bad before we can make claims about subsequent points. (Look at the diagram if this isn’t immediately clear.) This kind of problem is vaguely similar to a regex search where you need to scan along a long string looking for instances a complex pattern: however, I believe that regexes are generally implemented with a state machine and lookup table, which is not something that makes sense here.

In python I opted to compile this (or something pretty similar) with Cython. Job done: easy 10x speed up. And that is probably the right approach to take, really. However, some months earlier, back in Matlab, I was more reluctant to write a MEX file and instead I spent a bit of effort vectorizing. My approach is probably ridiculous in all but the most obscure situations, but I’ll include it here as I think it’s fairly interesting.

So how can we vectorize this loop, when each iteration clearly depends on the previous iteration? The answer (as I already mentioned) is that we can only partially vectorize, but that we can still get a pretty decent payoff. The observation we need to make is that, in general, the data is of a reasonably high “quality”, meaning there won’t be that many errors (if there are a lot of errors then something needs to be fixed in the experimental setup so as to bring the quality of the data back to standard). Thus if we vectorize the sections of loop with no errors, or very simple errors, we can jump over those sections when we encounter them in the loop, and dwell on only the difficult bits.

I’m not going to give the full code, because it’s quite long and rather ugly, but I’ll try and give you a flavor of it (as I said, don’t try this at home unless you really need to). Note that I’ve translated to python/numpy here:

# Is a step from xy[t] to xy[t+1] slow enough?
step1_is_good = dist(xy[:-1], xy[1:]) < 1*max_speed

# If not, is a step from xy[t] to xy[t+2] slow enough?
bad_idx, _ = (~step1_is_good).nonzero()
step2_is_good = dist(xy[bad_idx], xy[bad_idx+2]) < 2*max_speed

# do similar for t+3 (not shown), then summarize...
step_size = step1_is_good.astype(uint8) # i.e. = 1
step_size[bad_idx[step2_is_good]] = 2 
# and similar for 3, everything else is set to 4

Right, this is just the preparation. What we now have is a vector that might look something like the following:


We can now make some basic observations:

  • wherever we see the pattern 111X, or 311X, or 321X, or 221X, or 211X, or 212X, we know that X will be “stepped on” unless step_size=4 comes into play in this region. (i.e. we are interested in the pattern [321][21][1]X.)
  • wherever we see the pattern 112X, or 113X, or 131X, or 132X, we know that X will *not* be stepped on unless step_size=4 comes into play. (i.e. we are interested in the pattern [1][13][12]X)
  • we can’t always be sure whether such triple-prefixes will or will not cause X to be stepped on, e.g. in 231X it depends on whether or not the 2 is stepped on.

Ok, lets assume we have created an array that tells us which of the three categories each element falls in to. And, now if we convert step_size to step_to_idx, i.e. an absolute target address, we can do something interesting. For example, step_size=13121... becomes step_to_idx=14355.... The interesting thing is that we can now summarize chunks of this step_to_idx array: wherever there is a contiguous block of known steps and/or known no-steps, we can make every element point to the end of the block, saying: “if,during the loop execution, you find yourself stepping on to this element, then please go to the end of the block, because I’ve already dealt with the details from here up to that point“. In the loop we may “land” on such points either because of a combination of [123][123][123]X that we couldn’t deal with in advance, or because a 4 came in to play.

reduced looping

The black lines in the “loop stage”, above, show how the loop might eventually iterate through the data. Note how some regions we read from and write step data to for the first time, some regions had step data, that we have to modify, and some regions we can skip right over.

So now, because we expect that most of the data is pretty much error-free, we should find that the loop requires far fewer iterations to get through all the data.

Unless someone in the comments is really keen to see it, I’m not going to provide any further code. I’ll just say that all the pre-loop preparation is done using a few calls to nonzero (aka find in Matlab) and one call to cumsum. The loop itself is nothing special really.

That’s it for this post. It took a long time to write, but hopefully it contained some interesting bits and bobs. Do leave a comment if you have something to say!

The next post I write might be aimed at absolute beginners rather than whoever the target audience was for this post: I’d like to do a basic indexing tutorial in which you learn to “paint” flags of the world in 2D arrays.

Perfectly-timed optimization

I was going to write about specific vectorization tricks in Numpy and Matlab, but this is what I ended up with instead.  Stay tuned for the actual vectorization stuff.

Neuroscientists, and possibly biologists more generally, seem to take great pride in producing slow code.  They may even quote you Knuth’s famous line:

…premature optimization is the root of all evil.    -D. Knuth, 1974.

Having just gone and read the first section of that paper and done a bit of follow up on Wikipedia, I can now argue back: Knuth was writing at a time when people had only recently realized that programs should be structured as a series of functions and use such (seemingly basic) constructs as for/while loops and if/else blocks.  This was in contrast to the old style in which everything was basically a series of low-level jump instructions.  In this context, he pointed out: (1) programmers are typically poor at guessing which bit of their code is the bottleneck;  (2) generally 97% of the code doesn’t merit any kind of optimization; and (3), in his time, optimizing code usually meant taking something already spaghetti-like and turning it into – umm…- unpalatable spaghetti?

Those three arguments were of course valid, and remain so in many modern-day contexts, however programming has moved on since the 60s and 70s, and the original three arguments do not always hold. (Knuth hoped for a “Utopia 84” or “Newspeak” to appear within ten years of his article – C++ just about made the deadline, but it took a few more years to reach Java, Python, and the later improved versions of C++.)

So, lets talk about scientific analysis – the sort of thing people write in Matlab, Python (i.e. numpy), or some funky C++ library (I’ve never used Fortran so I’ll leave that to one side). In all cases the user is provided with amazingly compact syntax for doing fairly complicated things. For example, in python we can do:

# import stuff from numpy, and newaxis as nax
r = hypot(arange(-10,11)[:, nax], arange(-10,11)[nax, :])

Which means: position the range [-10, -9, ..., 9, 10] in the first two dimensions and then compute the hypotenuse at each point in the full two-dimensional space, giving:


I doubt whether that little snippet would have been a bottleneck in anyone’s analysis, so Knuth would have us ignore it from an optimization point-of-view.  However, if we know how to write and read this code why should we not do it like this?  It probably takes less effort than a pair of repmat‘s (my least favorite Matlab function – use bsxfun) and manual sqrt(x**2 + y**2). In fact, for some numbers you will get better results using hypot because it rearranges the equation to prevent overflow/underflow.  I certainly prefer this one-liner to anything more verbose, because the intention here is to make a “single thing” (by which I roughly mean “a pattern that can easily be understood at a glance with the appropriate plotting command”).  However, despite being concise and fast, I practically never seen this sort of code in scientific analysis, and get complaints when I write it myself.

If this really were a bottleneck, we would want to go further than this one-liner, using one or both of the following common tricks: (a) pre-compute once rather than re-compute every time; (b) take advantage of symmetry to do half, a quarter, or even an eighth of the work, possibly even requiring downstream code to use the minimal slice directly rather than expecting the full 2D array to be in memory.  Both of these optimizations complicate matters a little, for example caching can be dangerous in Numpy if you don’t explicitly set write=false (see setflags*) and you may need to handle arbitrary sizes of array, which could get messy. However, if this truly is taking up a serious percentage of run time it would probably be worth the 5 minutes effort needed to do this given that the speed payoff is as good as guaranteed.

However, to come back to the three points from Knuth….

It is almost certainly true it’s hard to predict the bottlenecks of large programs that involve lots of little bits of data or specialized algorithms. However if we are talking about basic linear algebra or scatter/gather operations (i.e. indexing) then it’s actually pretty easy to get a decent intuition.   Replacing “simple” code, with (“scary”) vectorized commands is almost always going to make that section of code 2-10 times faster and it will usually make it easier to read, so long as you get past the “scariness”.  For situations where things are already part-vectorized, or otherwise more complicated, it may be worth running the profiler to see what’s going on exactly – in modern IDEs, running a profiler requires minimal effort and can provide a wealth of useful information about control flow as well as bottlenecks.  Most people accept that the profiler is a useful tool when you’re really stuck for speed, however I would advocate using it right from the moment that you’ve finished drafting your function (and you have some representative data to run it with).   That is the time when:

  1. You have the best understanding of the science/logic behind the analysis. If you do need to optimize it later on it will require more effort to resurrect all the knowledge/thoughts you had previously.
  2. You understand your reasoning for drafting it a particular way. You may have made fundamental errors, or not landed on the absolutely most intuitive, or simplest, algorithmic form.  Thus, examining the function from a slightly different perspective (i.e. performance) may lead you to make corrections or readability improvements: redundant arithmetic can hurt both performance and readability.
  3. You can discover mistakes without fear of embarrassment or worse. Nobody likes discovering bugs hidden in old code, so optimizing functions months after they were written can be a nerve-wracking undertaking. In other words, doing the optimization early on can be a method of avoiding grief (although it may also be a recipe for allowing bugs to go unchecked in the longer term).
  4. You have the freedom to change exactly what the function does without affecting other code. As soon as you (or worse, other people!) start using the function, you lock yourself into accepting/providing particular inputs and outputs (python’s function call syntax helps but it’s not a magic bullet).  If you look at the function from a performance perspective, you may discover that the function can be made more general without significant performance degradation, or the opposite: made much faster for specific cases.  Either way it’s much easier to establish these facts early on and design with them in mind.
  5. You have the greatest to gain by improving performance. This is perhaps the most important point of all.  Often you will write a function, then gradually construct an analysis pipeline around it over the following days and weeks, and then finally actually run the whole thing on some real data/parameters.  By the time you realize that performance would have been helpful, you’ve already spent several seconds/minutes/hours/days staring at the screen, not doing anything particularly useful.  Also, every time you launch your slow concoction during development you have to work hard to preempt possible bugs so as to avoid wasting large amounts of time. You also have to be more selective in your choice of inputs, which means you might miss interesting results or waste effort thinking through stuff that the computer could be doing for you.

To wrap up then, I would say the following…

Bottlenecks can often be easily guessed. Many non-bottlenecks can be optimized with minimal effort and minimal impact on readability.  The profiler is very easy to use and can quickly help when you can’t guess, or are new to the language.  Optimizing early can lead to writing more interesting analyses further down the line, or free up time and mental effort for doing other stuff.  So, don’t wait to optimize: do it now!

*apparently write checks are implemented rather badly behind the scenes so this isn’t a proper guarantee.

This was my first post on wordpress. It took a lot longer to write than I’d expected and wasn’t particularly well structured.  I need to get better on both fronts if I am to do this regularly! Anyway, if you found it interesting, or want to argue back, please do so in the comments. As I said at the start, stay tuned for actual vecotrization tips.