Tag: python

# Should I be using pandas?

My PhD consisted partly of doing very hands-on neuroscience experiments, and partly of analyzing the “medium-sized data” sets produced by those experiements. I say “medium-sized” because the data was large enough to require care, but not so large that it couldn’t fit comfortably on a single disk in raw form, and fit in memory once partially processed.

At the start I was using Matlab for analysis, which seemed to be the weapon of choice for almost all neuroscientists (as far as I could tell)…. uhhhh, let me just give you an xkcd-inspired graph….

…so needless to say, I switched to python+numpy about a third of the way into the PhD and didn’t look back*.  This wasn’t too difficult to do as one of my colleagues had made the switch a few months earlier – I was offered a “starter pack” of functions for doing IO and maybe a few words of advice.

Although I did have a bit of extra-curricular experience with python, I’d not yet done any serious exploration of the language’s features or ecosystem.  Both of these (especially the second) took time: early on I had fun hacking around with magic methods to make a dictionary-like memoizing property-thing; weird mask-able lists; and something not quite as good as IPython’s autoreload feature. Later I even tried to write a system for un-parallelising plotting commands  (that I named “monobrow”).  However, I did eventually settle on an overall analysis design that looks a bit more sensible, and I’m going to describe bits of it here (it doesn’t include any parallelism unfortunately).

I’m not going to describe the actual analysis I do in detail, instead I’ll give a simplified analogy: pretend that I have a stack of about 1k images, and I need to run a handful of algorithms on each image separately, and then summarize the resulting metrics across all the images.  Also, lets say that these 1k images were from 10 different food-obsessed-photographers, who went on separate holidays of varying lengths, but followed a similar pic-taking pattern each day: photos of breakfast, photos of lunch, photos of dinner.

My actual data wasn’t to do with images, photographers, food, or holidays, but in terms of organizing the analysis, my task was similar.  I ended up with something encapsulated in two main classes: one at the level of the individual “image”, and one at the level of the “entire-dataset”.  I’ll describe the “image” class briefly and then go into a bit more detail for the “entire-dataset” class.

## The “image” class (aka Trial)

The analogized description I gave above doesn’t quite tell the full story; although there were some parts of the algorithms that had to be run on whole images, much of the computation was focused on one or more manually identified regions within the images.

The time needed to read an image off disk and do the initial processing wasn’t insignificant (nor was the space required to keep it in memory), so I hoped to reuse a single instance of each image when analyzing sub-regions within it.  However, since this was an optimization, I wanted to hide the details from the final top-level analysis script.  My solution required a bit of cooperation between the two classes: the “image” class would have a caching/memoization system, and the “entire-dataset” class would know to reuse instances of the image class across each of the subregions within an image, it would also know how to ask the image class to clear different bits of its cache.

image_id  subregion_id  image_instance
0             91     [img #8712fc12a00]
0             92     [img #8712fc12a00]
0             93     [img #8712fc12a00]
1             97     [img #1283efd7800]
1             98     [img #1283efd7800]
...


The “image” class (which was actually called “Trial”) looked like this:

Each of the blocky things corresponds to a mixin contained in a separate file.  The foundation of the class is the caching mixin, which is used by the mixins within the “core api” cage as well as (potentially) by the more exotic stuff above.  I’ll describe the overall picture first, and then come back to give a bit more detail specifically on the caching.

The most important design decision was to separate the random-bits-and-bobs from the “core api”.  The idea is that by very clearly defining a core API, you can commit to keeping one section of the code in good working order even if other bits of the code are a mess.  If a core function is not quite general, fast, or readable enough, you have to promise to work on it “in situ” rather than inlining it into whatever complex analysis you are currently working on.  Python is good at letting you gradually add options to existing functions, so that you slowly build the value of your interface and its implementation. Drawing the line around the core also makes it easier to share the exotic stuff with other people: you either provide them with a copy of your well-maintained code, or you just give them a well-maintained description of the interface.

The tricky thing I found when defining the “core api” was that there were actually several parts to it: I wanted to separate the caching interface, the IO, and the basic analysis, but it took me a few iterations before I realized that I could create the divisions, and exactly where they should be.  With the current design, I can have separate IO implementations for different file formats, yet still slot that under the top-most part of the core.  I can also play with different systems of caching, or use a no-cache implementation, without having to worry about refactoring lots of core and non-core files.

So, let’s look at memoization/caching in more detail. The main two methods of the TrialCache mixin are as follows:

def _cache_has(self, kind, names, key=None):
names = (names,) if not isinstance(names,
(list, tuple, set)) else names

# update/create the list of names for the given kind
if kind in self._cache_name_lists:
self._cache_name_lists[kind].update(set(names))
else:
self._cache_name_lists[kind] = set(names)

# initialize any of the attributes if they are missing
for name_ii in names:
if not hasattr(self, name_ii):
setattr(self, name_ii, None if key is None else {})

# return False if any of the attrs are not already computed
for name_ii in names:
attr = getattr(self, name_ii)
if attr is None or (key is not None and key not in attr):
return False
return True

def _clear_cache(self, keep=(), drop=()):
if len(keep) and len(drop):
raise ValueError("use 'keep' or 'drop', not both.")
if any(" " in k for k in keep + drop):
raise NotImplementedError()

if len(keep):
kinds = [k for k in self._cache_name_lists.keys() if
all(kk in keep for kk in k.split(" "))]
elif len(drop):
kinds = [k for k in self._cache_name_lists.keys() if
any(kk in drop for kk in k.split(" "))]
else:
kinds = self._cache_name_lists.keys()

for k in kinds:
for name in self._cache_name_lists[k]:
if hasattr(self, name):
delattr(self, name)


cache_has is a slightly unusual function in so far as its name should be interpreted as a statement of intent as well as a question: the caller is saying “I intended for the cache to contain such-and-such from now on, but does it have it yet?

Within another mixin, you use it as follows:

@property
def red_count(self):
if not self._cache_has('color_data', '_red_count'):
self._red_count = compute_red_count(self._data, self._params)
return self._red_count

@smoothing.setter
def smoothing(self, val):
self._params['smoothing'] = val
self._clear_cache(drop=('color_data',))


The idea is that the computation of _red_count is guarded by the _cache_has intent/test, so that redundant re-computing can be avoided, yet we can still invalidate stored results when we need to, i.e. when we change the smoothing setting. You could have many computed things that depend on the smoothing setting and invalidate them all with a single command: you can define as many groupings as you need and either ask to invalidate specific groupings or invalidate all except specific groupings.

Note that the code above supports the use of lists of strings and/or strings for most arguments. It also has some builtin support for “lazy” dictionaries that I haven’t demonstrated here.

The code is not perfect: in retrospect it might have been better to hold all the cached items within an actual “cache” container, rather than attaching them to the self object directly and only managing them via lists of the attribute names. Also, I might want to (optionally) expose the caching as a decorator, which is a common paradigm for simple memoizing. If I was being really serious I might investigate some kind of data-flow caching system that lets you define a hierarchy of data dependencies, but that’s probably a lot more complicated…indeed that’s roughly what my ambitious C++ project, “Venomous“, is hoping to do. Anyway, all I wanted to point out is that with a few lines of mixin methods you can create a pretty powerful caching system that other mixins can use in a very compact way.

We’ll come back to this caching system in a bit, but next time from the point of view of the “entire-dataset” class….

## The “entire-dataset” class (aka ExpSet)

I don’t have a pretty picture for this class unfortunately, so I’ll just have to describe it in words.

For several months I was using a class that was a lot like the one I’m about to describe but it was cobbled together out of a strange mixture of pure-python lists, numpy arrays and weird home-made concoctions. Switching to pandas wasn’t all that simple, but it made things a lot better.

My impression is that pandas is viewed quite differently by different groups of users: finance/economics-y people might be in the habit of thinking that every dataframe has a timestamp column, with rows that can be aggregated by hour or by day. Geo-people might hold equivalent beliefs about GPS coordinates (I’m guessing). To me, pandas is simply a package that adds labels to the rows and columns of a numpy-like 2D array, and ensures that the labels stick with the rows and columns as you index into and make copies of the array. And, importantly for my usage-case, those labels can be hierarchical…

…well, in theory they can be hierarchical, but in practice the support for MultiIndex isn’t perfect (yet): the syntax for indexing/slicing and joining these kinds of dataframes is not exactly pleasant or easy to remember.

This lack of whole-hearted support for hierarchical indexing led me to encapsulate my main dataframe(s) inside a class of my own. The idea being that I could patch the syntactical gaps more easily and come up with a library of tools for doing slightly more complex things. I settled on having two separate dataframes, which I named simply df1 and df2. The first had one row for each “region of interest” within each image, and the second had one row for each image. The first thing I did was write a method to take the strange csv-tables I had and parse them out into the two dataframes, including instantiating all the images. (I didn’t mention this explicitly earlier, but the image class is designed to be entirely lazily loaded – the filename is merely recorded at construction, nothing is read off disk.)

Once I had the basic pair of dataframes I could start thinking about iterating over the rows, running the algorithms on the images. This is where the title of the post comes in: should I be using pandas for this sort of thing, given that pandas is supposed to be designed for performing operations in clever aggregate ways?

Well, it might look strange to the other users of Pandas, but I think the paradigm I’ve adopted is absolutely valid and should perhaps be more widely encouraged.

Before I show you an example of iterating over the rows of the dataframe, let me give you (most of) the code for the RowIndexer class I came up with:

class RowIndexer(object):
def __init__(self, obj, idx, **kwargs):
object.__setattr__(self, '_obj', obj)
object.__setattr__(self, '_idx', idx)
object.__setattr__(self, '_with_column_sub', None)
for k, v in kwargs.iteritems():
object.__setattr__(self, k, v)

def __getattr__(self, key):
original_key = key
if self._with_column_sub is not None:
sub_idx = self._with_column_sub.index('[attr]')
key = self._with_column_sub[:sub_idx] + (key,) + \
self._with_column_sub[sub_idx+1:]

# try a simple column indexing, then index names,
# then index names without sub
try:
return self._obj.get_value(self._idx, key)
except:
pass
try:
return self._idx[list(self._obj.index.names).index(key)]
except:
pass
if original_key == key:
raise KeyError("Could not find key '" + str(key) +
"' in column or index names.")
try:
return self._idx[list(self._obj.index.names)\
.index(original_key)]
except:
pass
raise KeyError("Could not find key '" + str(key)
+ " or " + str(original_key) +
"' in column or index names.")

def __setattr__(self, key, value):
if self._with_column_sub is not None:
sub_idx = self._with_column_sub.index('[attr]')
key = self._with_column_sub[:sub_idx] + (key,) \
+ self._with_column_sub[sub_idx+1:]
elif self._obj.columns.nlevels > 1:
key = (key,) + ('',) * (self._obj.columns.nlevels-1)
try:
self._obj.set_value(self._idx, key, value)
except TypeError:
self._obj[key] = self._obj[key].get_values()\
.astype(object)
self._obj.set_value(self._idx, key, value)

@contextmanager
def column_sub(self, *args, **kwargs):
if not args or len([a for a in args if a =='[attr]']
) > len(args)-1:
raise ValueError("you must specify at most one "
"'[attr]' spot, and at least "
"one sub label.")
raise ValueError("urecorgnised wkargs")
else:
if kwargs:
raise ValueError("unrecognised kwargs")

# establish total levels required
if '[attr]' not in args:
args = args + ('[attr]',)
target_nlevels = len(args)

object.__setattr__(self, '_with_column_sub', tuple(args))
try:
yield
finally:
object.__setattr__(self, '_with_column_sub', None)


Instances of this class are yielded by a generator in ExpSet as the user iterates over the rows of df1 and/or df2. The main idea is that assigning to attributes on the instance actually results in values being written to columns within a specific row of a dataframe.

for subregion, image_inst in dataset.iter():
subregion.redness = image_inst.get_redness(subregion.shape)


Here subregion is a RowIndexer instance that points to a different row on each iteration of the loop, and image_inst is the instance of the “image” class relevant for the given row. Inside the generator we make sure to yield all the subregions for a specific image before moving on to the next. When we are ready to switch images, we clear some/all of the cache for the old image, thus keeping the total memory usage manageable. Exactly which bits of the cache we want to clear depends on (a) what we are likely to want in future if/when we loop over the data again; (b) what was expensive to compute; and (c) how large things are in memory. These three points are common to pretty much any caching system, and the optimal choice is rarely an easy thing to pin down: here we don’t use any automatic heuristics, instead we allow the user to specify which bits of data to delete using the fine-grained caching machinery outlined earlier. For example, the user may delete the raw whole-image, but keep the down-sampled, filtered version.

The nice feature of this system is that the Image class is agnostic of the way in which you iterate: it will do its best to cache stuff and reuse anything from the cache when possible. Also, the user’s loop (i.e. the one shown here) can ignore the details of caching: they don’t even need to know that one image instance is shared across multiple subregions. If the user does want to customize the cache-clearing behavior, he simply provides an extra argument to the iterator. The iterator can also take care of displaying progress reports.

Try doing that in Matlab!!

An important feature of the RowIndexer class is that you don’t need to declare new columns ahead of time: they are automatically added to the parent dataframe as needed. The class also has a couple of bonus features, one of which is the ability to handle and create multi-indexed columns using a with-block:

subregion.n_people = image_inst.n_people(subregion.shape)
for zone in ['upper', 'lower']:
subregion.n_flowers = image_inst.n_flowers(subregion.shape)
subregion.n_clouds = image_inst.n_clouds(subregion.shape)
# the parent dataframe now has columns:
#  n_people |    flowers    |     clouds    |
#           | upper | lower | upper | lower |


All of this lets you do some fairly complex meta-analysis (still using caching, remember) in just a few clear lines of high-level code.

The final thing I want to discuss is a bit like the RowIndexer, but it applies to the whole dataframe, or in fact to the whole custom class in which the pair of dataframes are held.

As before, I’ll give the implementation first, but you don’t have to look at it too closely, especially as it depends on a few things that I haven’t included:

# method of "entire-dataset" class...
def view(self, **kwargs):
original_df1, original_df2, original_mode = (
self.df1, self.df2, self.mode)
sub_df1, sub_df2, sub_mode = self._get_sliced(**kwargs)
self.df1, self.df2, self.mode = sub_df1, sub_df2, sub_mode
self.df1.is_copy = self.df2.is_copy = None
try:
yield
except Exception:
raise
finally:
self.mode = original_mode
self.df1 = original_df1
self.df2 = original_df2

if sub_mode != original_mode:
sub_df1, sub_df2 = self._get_as_mode(
src=sub_mode, dest=original_mode,
df1=sub_df1, df2=sub_df2)
_update(original_df1, sub_df1)
_update(original_df2, sub_df2)


And we can use it as follows:

with dataset.view(meal='breakfast', holiday_type='european'):
for subregion, image_inst in dataset:
subregion.juice = image_inst.has_juice(subregion.shape)


The view method does two things: firstly, it exposes a nice way of slicing the dataset using keywargs that can be handled intelligently behind the scenes (using all the gruesome pandas syntax); and secondly it acts as a context-manager that gives us a filtered psuedo-view onto the whole dataset: once the with-block exits, any changes made are propagated back into the original dataset. When using raw pandas, some types of indexing generate copies and some types generate views: certainly if you have complex filtering requirements it is rare that you will end up with a view back onto the original dataframe. (For many users this may only be memory problem, rather than a logical-one, but given that we want to write into the rows of the dataframe we need to have some kind of system such as this.) In fact, in my ExpSet class, not only can you slice the data in interesting ways with this sytnax, you can also pivot it into different forms, do some computation, and then have it revert back to the original form when the with-block exits.

Identical slicing/pivoting functionality is provided in a second method called copy, as the name suggests, this is not intended for a with-block, but simply gives a do-what-you-want-with clone of the original dataset.

That’s pretty much all I wanted to say about the “entire-dataset” class. The two things I wish it did (better) were outputting nice HTML tables and providing a simple means of running loops in parallel. Pandas does do HTML tables, but I had to fork it in order to put images in the table (I did submit a pull request, but in the end it got rejected in favor of something slightly different). Having interactive tables that you can filter and sort in the browser would also be great (I think there are some tools built on top of pandas that do this to some extent, but I never tried them).

As with previous posts, this took a while to get onto the page. If you found it interesting, or you want to argue about something, please do post in the comments.

*except when helping other people with Matlab, ‘cos I’m such a nice person. Edit Jan-2016: I’ve just come across last year’s Stack Overflow survey in which it was found that two thirds of Matlab-using respondents did not want to continue using it, which put it near the top of the “dreaded” languages list.

# Flags of the world (indexing tutorial)

The code examples for this tutorial are currently only given for Python/Numpy. Conversion to Matlab is not difficult, but would be confusing for beginners (who are the target audience here). I may get round to doing the extra work at some point soon.

This tutorial is aimed at absolute beginners. Ideally get an experienced person to setup Python* for you because it is liable to be an off-putting hurdle otherwise. (When you are an expert yourself, you will have plenty of opportunities to repay your debt to the world!)

Before we get to the actual tutorial, you need a tiny bit of preparatory code that you don’t need to understand. Copy-paste the following few lines into your interpreter/script (ask your nominated “experienced person” how you do this, and then keep them around for another minute or so just to check that everything is working as it should):

def show_flag(data, colors=['#ff0000','#ffffff','#0000ff'],
title="my flag", fignum=1):
from matplotlib.colors import ListedColormap
from matplotlib.pylab import matshow, gca, clf
cm = ListedColormap(colors)
if fignum is not None:
clf()
matshow(data, cmap=cm, fignum=fignum)
gca().axis('off')
gca().set_title(title)


That function will let us display 2D arrays as colorful flags. Here is a very simple example to start with:

flag = [
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 2, 2, 2],
[0, 0, 0, 0, 2, 2, 2]
]
show_flag(flag)


Notice how the flag contains just the numbers 0 1 2. Inside the show_flag function these numbers are converted to colors (rather like the color-by-number pictures you may have done as a child). By default, 0 maps to red, 1 maps to white and 2 maps to blue. (I chose those colors because they’re pretty common for flags. I’ll show you how to customize the color mapping later on.)

Right, there’s one more thing you need before we get started:

# Use one (or both) of the following two lines:
from numpy import * # simple, but frowned upon
import numpy as np # better, but sometimes annoying


I will explain this briefly because it’s worth understanding, but I’ll have to use “scare-quotes” a bit as the explanation is rather hand-wavy: basically, Python is not a good language for doing heavy number crunching. If you try adding/subtracting/multiplying individual numbers thousands/millions/billions of times there will be a huge amount of “overhead” and you will get very slow (and complicated-looking) code. However, Python does allow, and encourage, the use of “plugins”. These plugins are free to do special kinds of “magic behind the scenes”, so long as they present a “Python-friendly exterior” that the user (i.e. you) can interact with in Python.

Numpy is one of these “plugins”, but it’s a rather special one in so far as it provides such significant “number-crunching functionality” that no other plugin can really compete with it. If you want to do number crunching in python you basically *need* to use Numpy. Most plugins that provide advanced number-crunching tools build on top of Numpy rather than replace it. (Not all plugins are about “number-crunching”, they do all kinds of things, like helping generate HTML for webpages etc.)

So now I can explain import numpy as np. This says to Python: “I have a plugin called ‘numpy’ and I’d like to use it here please. And I would like to refer to it as ‘np’ for short.” You can then use commands like np.ones(40), and python will “delegate” to Numpy to do its magic. If you use the alternative form from numpy import * then you are saying to python: “I have a plugin called ‘numpy’ and I’d like to use it here please. And I’m not going to tell you explicitly when I’m trying to use it – if I mention something from inside numpy, assume that is indeed what I meant.” This second version is sometimes easier because you don’t have to type np. all over the place, but it can also confuse your text editor and may introduce subtle bugs or confuse readers of your code. To keep the code in this tutorial compact, I’ll use the from numpy import * version.

Ok, finally we are ready to start…

## The French Tricolour

H, W = 50, 70
# remember that our defaults are: red=0, white=1, blue=2
flag = zeros([H, W])
flag[:, :W/3] = 2
flag[:, W/3:2*W/3] = 1
show_flag(flag, title="Vive la France")


Try it out and check that it works. Can you make sense of the code?

We begun by creating a 2D array of shape 50 x 70. In numpy we refer to the dimensions of a multidimensional array as “axis 0”, “axis 1”, “axis 2”, “axis 3” etc. Here axis 0 is of length 50 and axis 1 is of length 70. You can see in the flag that the 50 and 70 correspond to the vertical and horizontal dimensions respectively (because the image of the flag is wider than it is tall). The array we started with is all zero, which gets mapped to red when we display it. There are other functions for initializing arrays but this is all we need for now.

The line flag[:, :W/3] = 2 is setting the blue section of the flag. It does this by indexing into the whole of axis 0 (the vertical), but only the first third of axis 1 (the horizontal).

The next line flag[:, W/3:2*W/3] = 1 is similar, setting the white portion of the flag. Here the relevant section spans from the end of the first third, to the end of the second third.

That was all we had to do in this example.

## Vlag van Nederland

Have a go at doing this yourself. Note that you may have to guess a little bit for some of the syntax.

Expand to view my suggested solution…

H, W = 50, 70
# remember that our defaults are: red=0, white=1, blue=2
flag = zeros([H, W])
flag[H/3:2*H/3, :] = 1
flag[2*H/3:, :] = 2 # set last third to blue
show_flag(flag, title="Van Gogh would love numpy")


This looks pretty similar to the French version, but we’re indexing into axis 0 rather than axis 1. Note that here we don’t need to set the first third to blue, but we do need to set the last third. Hopefully you were able to guess the necessary syntax.

Easy, right?

## Les Quatre Bandes of Mauritius

You should be able to make the array for this yourself, but I need to tell you how to customize the color mapping:

show_flag(flag, title="What's the capital of Mauritius?",
colors=['#00A551', '#EA2839', '#1A206D', '#FFD500'])


I found the four color codes using the colorpicker in Chrome devtools – google for help if you’ve never done that kind of thing.

My solution looks like this:

H, W = 50, 70
flag = zeros([H, W])
flag[:H/4, :] = 1
flag[H/4:2*H/4,:] = 2
flag[2*H/4:3*H/4,:] = 3
show_flag(flag, title="What's the capital of Mauritius?",
colors=['#00A551', '#EA2839', '#1A206D', '#FFD500'])


## Norway

There shouldn’t be anything new here really. But if you want to get obsessive you can spend more time fiddling with the exact proportions.

My solution looks like this:

H, W = 100, 140
flag = zeros([H, W])
flag[:, W*0.3:W*0.48] = 2
flag[H*0.33:H*0.56, :] = 2
flag[:, W*0.35:W*0.44] = 1
flag[H*0.39:H*0.5, :] = 1
show_flag(flag, title="How deep is your fjord?",
colors=['#EF2B2D', '#002868', '#ffffff'])


Note that I doubled W and H to increase resolution a bit. Also note that we are now using non-integer indices for the arrays, or rather we are not explicitly using integers: numpy will round our floating point numbers down to the nearest integer before doing anything with them.

## The Jamaican Cross

How are we going to deal with those diagonal lines? Well, hopefully you already are familiar with the basic equation of a straight line y = mx + c. Here we are essentially going to use that, but we’ll treat it as an inequality rather than an equality, i.e. for a given y at a specific x horizontal position, we want to know is it above or below the line? The code is now a bit more complicated:

from numpy import newaxis as nax
H, W = 100, 140
x = arange(W)[nax, :]
y = arange(H)[:, nax]
flag =  y > x*0.75 - W*0.1
show_flag(flag, title="Everything's gonna be alright",
colors=['#000000', '#FED100'])
# we'll also need '#009B3A' (green) later


The first line is just a convenient way to shorten the word newaxis to nax, we can then use it on the third and fourth lines: arange(W) creates a 1D array of the numbers [0, 1, 2, ..., W-1]. When we put [nax, :] after that, we are changing the shape of this 1D array to make it a 2D array of shape 1 x W. This means we can treat the x array as a single horizontal “row” in the flag, giving the x-coordinate of each column. Equivalently, the y array is a column, giving the y-coordinates of each row. Once we’ve setup these two variables, we are able to combine them in interesting ways.

By “setup” I mean that we’ve created two arrays with the same number of axes (i.e. 2), and the axes lengths “agree” with each other. What does it mean for two arrays to have axes lengths that “agree”? Well, it means specifically that each axis either has a length of 1, or it has exactly the same length as in the other array. For example 10 x 1 x 17 agrees with 10 x 23 x 1, but it doesn’t agree with 17 x 1 x 10. Once we have agreement, numpy can automatically “broadcast” any operations (such as adding, subtracting, comparing) out to match the full length in all axes.

TODO: a diagram of broadcasting would help a lot here


So here, we broadcast out the row x and the column y with the “greater-than” operation, and the result is a 2D array of trues and falses, telling us which elements in the array are above/below the line. When we display the flag the trues are converted to 1s and the falses to 0s (and then the 0s and 1s are mapped to colors as with previous flags). Note (if you’d hadn’t already) that the 0-end of the vertical axis is at the top of the flag and the H-end is at the bottom. If you want to flip this round you can add the line gca().invert_yaxis() at the end of your show_flag function (and then re-evaluate the function to update it).

We are not done yet, but hopefully you can now see how you might go about constructing the flag. If you get stuck, don’t feel too bad jumping to the suggested solution as I haven’t explained everything that you’re going to need to use here.

from numpy import newaxis as nax
H, W = 100, 140
flag = zeros([H, W])
x = arange(W)[nax, :]
y = arange(H)[:, nax]
yellow_part = ((y > x*0.75 - W*0.1) & (y < x*0.75 + W*0.1))
flag[yellow_part] = 2
yellow_part = ((y > W*0.6 - x*0.75) & (y < W*0.8 - x*0.75))
flag[yellow_part] = 2
green_part = ((y >= x*0.75 + W*0.1) & (y >= W*0.8 - x*0.75))
flag[green_part] = 1
green_part = ((y <= x*0.75 - W*0.1) & (y <= W*0.6 - x*0.75))
flag[green_part] = 1
show_flag(flag, title="Everything's gonna be alright",
colors=['#000000', '#009B3A', '#FED100'])


Right, that’s a little bit messy, but it’s largely unavoidable. There are two new things here.

Firstly, we are using & to combine pairs of 2D arrays: the combined array has the same shape as original two arrays (numpy would let us use broadcast again here if we needed to) and its elements are true where the elements of both arrays were true, and false elsewhere. With this mechanism we are able to say: “I want the band where y is greater than something *and* less than something else.“.

The second new thing is that we are indexing into flag using a 2D array of trues and falses rather than bands of indices. In addition to these two methods of indexing there are one (or two?) more that we haven’t covered yet.

## Nisshōki of Japan

This uses trues and falses like we did in Jamaica, but now we need to create a circle.

from numpy import newaxis as nax
H, W = 100, 140
flag = hypot(arange(-W/2, W/2)[nax, :],
arange(-H/2, H/2)[:, nax]) > 30
show_flag(flag, title="Pythonic haiku? Not really.",
colors=['#ff0000', '#ffffff'])


## I Galanolefki of Greece

Although you could just use the basic stuff we learnt at the start, here we create all the stripes with a single command.

H, W = 100, 140
flag = zeros([H, W])
is_stripe = arange(H) % 22 > 10
flag[is_stripe, :] = 1
flag[:55, :50] = 0
flag[:55, 20:30] = 1
flag[22:33, :50] = 1
show_flag(flag, title="'flag' is not of Greek origin",
colors=['#0D5EAF', '#ffffff'])


* I use the Spyder IDE as distributed with anaconda, but you could possibly use an IPython notebook if you prefer. The only modules we need for this tutorial are Numpy and Matplotlib. I use Python 2.7, but everything here is probably more or less fine in 3.x.

This tutorial took a good while to write, what with all the explanations of how to use numpy and the actual messing around to make the flags. If you found it useful let me know in the comments. If I have time I may add some more examples to deal with stepped indices, reverse indices (maybe?), and perhaps suggest how custom-written functions can be used to draw complicated shapes like circles or stars…and then those functions can be reused for later flags.

# Non-trivial vectorizations

For a more general discussion on optimizing, see my previous post on perfectly timed optimization.

Before we get started, I should say that the following stories are likely to be of most interest to Matlab or Numpy users.  In compiled languages, the general lack of overheads will probably do a pretty good job of making fast programs, as will any loop optimization passes.  Auto-vectorization is getting better over time, but is unlikely to work for the examples here (and in fact using these methods in a compiled language may be a pessimization). However, these tricks do become relevant again when one switches to computing on the GPU.

### Example 1.

Imagine we have an array giving times for 10k events, and our task is to compute a histogram giving all the pairwise time differences. Or rather, we must compute just the lower end of that histogram, where time differences are within some specified window.

Some raw-python code might look like the following. Note that for simplicity I assume the desired histogram bins are one time-unit wide…

def make_histogram(times, window):
hist = [0]*window
for ii, time_ii in enumerate(times):
for time_jj in times[ii+1:]:
delta = time_jj - time_ii
if delta >= window:
break
hist[delta] += 1
return hist


In words: for each event, loop forward over the subsequent events, storing time-diffs in the histogram until you reach the end of the window. You could improve this slightly by pulling the delta>window check into a separate loop from the hist[delta]++ line: create a last_jj variable that gets incremented zero or more steps each ii-iteration.

Right, lets begin vectorizing….

Firstly, for each ii lets find the number of subsequent events in the window. The approach we use may not strictly speaking be using SIMD under the covers, but in the Matlab/Numpy world we could still call it a “pseudo-vectorized” primitive.

counts = times.searchsorted(times + window, side='left')\
- arange(len(times)) -1

# which would let us do something like...
for start_ii, (time_ii, count_ii) in enumerate(
zip(times, counts)):
deltas = times[start_ii+1:start_ii+1+count_ii] - time_ii
hist += bincount(deltas, minlength=window)


searchsorted and bincount are part of Numpy, the equivalents in Matlab are histc (use the second output) and accumarray. Observe that by knowing the counts in advance we were able to pull the delta calculation out of the inner loop, and then as a result we could pull the increments out of the inner loop too.

The second step is a bit trickier, we want to pull the delta calculation out of the ii loop entirely. This calculation is just a long list of times[jj] - times[ii], so what we’ll need is two sets of indices, that look something like:

all_ii = [0 0 0 0 1 1 1 1 1 3 3 4 4 4 4 ...]
all_jj = [0 1 2 3 0 1 2 3 4 0 1 0 1 2 3 ...] + all_ii + 1


that is we want to repeat each ii the number of times specified by counts[ii], and all_jj is formed from all_iis by the addition of mini-ranges of length counts[ii]. In Numpy there is a function for generating all_ii, called repeat. In Matlab we have to build the function ourselves:

function ret = repeat(n)
n_inds = find(n);
n_inds(2:end) = n_inds(2:end) - n_inds(1:end-1);
n_cumsum = ones('single');
n_cumsum(2:length(n)+1) = cumsum(n)+1;
ret = zeros(n_cumsum(end)-1, 1, 'single');
ret = cumsum(ret);
end


For the multiple ranges in the construction of all_jj, we can use the following in Numpy (this translates to Matlab fairly easily, but be careful with one/zero-based indexing):

def multi_arange(n):
n_cumsum = cumsum(n)
ret = ones(n_cumsum[-1]+1, dtype=int)
ret[0] -= 1
return cumsum(ret)[:-1]


If you weren’t already a fan of cumsum, I hope you are warming to it – look how we used it twice in one function here (and we used it twice in the Matlab version of repeat)!

Ok, that’s pretty much it. We can now get rid of the ii-loop, and we’re done:

def make_histogram(times, window):
counts = times.searchsorted(times + window, side='left')\
- arange(len(times)) -1
all_ii = repeat(arange(len(counts)), counts)
all_jj = multi_arange(counts) + all_ii + 1
deltas = times[all_jj] - times[all_ii]
return bincount(deltas, minlength=window)


Note that this optimization only makes sense if the counts in each window is guaranteed to be fairly small, and the total deltas array fits easily in memory. If either of those things is not true you might want to stick with the intermediate version, or wrap some batching mechanism around the final version.

Obviously, if you are going to do this sort of thing yourself, make sure you are testing properly as you go along and provide assertions about array shapes and types etc. The two helper functions I showed should also be given some documentation: just showing a sample input/output should be enough here. Note that the examples I’ve given here are simplified versions of the actual code I use normally – the stuff here is fine for simple integer cases, but I recommend testing it properly yourself if you actually intend to use it!

The multi_arange function, and some other goodies are available in the numpy groupies package that I co-authored with @ml31415 on github.

### Example 2.

Imagine we have a list of 60k (x, y) coordinates that correspond to the path someone walked. Imagine also, that there are errors in that path: at certain points the person rotated by some amount that was not reflected in the path data. Our task is to take the 2D array of xy data, together with a 1D array providing the rotation errors at each step, and generate a corrected 2D array of xy data.

Lets picture a simple example in which the person took just 50 steps (rather than 60k), and recorded two abrupt rotations. Our task could be to correct for 3 additional rotations, or we might need to correct for a series of very small rotations over multiple steps:

The function I wrote to solve this wasn’t actually needed in any analysis in the end, but I had fun writing it, not least because I got to name it cumrot!

I’m not going to give an un-vectorized code sample because I hope the above picture already describes the problem well enough (and I never tried writing it un-vectorized so I’m not sure what that would look like anyway).

Right, let’s get to it then… (edit: you may want to skip the interesting-looking stuff and read the “oh dear” section at the end.)

To begin, you need to know that a rotation of th radians about a point (x,y) can be expressed in matrix form as:

$\left( \begin{array}{ccc} cos(th) & -sin(th) & [x - x cos(th) + y sin(th)] \\ sin(th) & cos(th) & [y - x sin(th) - y cos(th)] \\ 0 & 0 & 1 \end{array} \right)$

So if we had just one rotation error, we could apply this rotation to all the data from that point forward. Even if we had a small number of corrections we could loop through each of them and apply them one after the other, i.e. this would suit the “with 3 rotations” example above. However, it would be pretty inefficient to do this for the “with gradual rotation” example, because there are many small rotations needing to be corrected for (and even that was just meant to be a simple small example).

Instead, lets imagine performing the rotations in reverse order, i.e. starting with the last one (but still applying the rotation to all the data from that point in the forwards direction) and then working back towards the first. If we do the rotations in this reversed order, then each pivot point is un-altered at the moment we need to create its rotation matrix. Thus, we can produce all the rotation matrices ahead of time. Which is a nice start.

But we can do better than that: observe that each point is being multiplied by all the preceding rotation matrices, however we could (by virtue of associativity) equivalently say that we multiply each point by a single composite-transformation matrix.

What does that give us? Well, *if* we can work out the necessary composite transformation matrices, then we can vectorize the second step, i.e. the multi-matrix-multiply, as follows:

x_out = a*x + b*y + c;
y_out = d*x + e*y + f;


where a b c d e f are all 60k-long 1D arrays, that together define the 60k separate transformation matrices of the form:

$\left( \begin{array}{ccc} a & b & c \\ d & e & f \\ 0 & 0 & 1 \end{array} \right)$

This is weird, right? Rather than using Numpy/Matlab to do matrix multiplies for us we’ve hand-written one. But ours is way better, because it’s actually 60k simultaneous multiplies! Ok, so the question is: how do we get the composite matrices needed for the multiply, above?

Well, it’s helpful that we’ve just seen how hand-writing matrix-multiplies sometimes makes sense. What we need to spot now is that we’re actually attempting to do a “cumulative” matrix multiply, i.e. each of the 60k steps is either an identity matrix or it’s a rotation matrix, and what we want is to do a cumulative matrix multiply along the length of this 60k stack of matrices. If you’re not familiar yet with basic cumulative sums (i.e. cumsum), then you’re going to be a bit out of your depth here. Even if you are a frequent user of cumsum you may want to go the the wikipedia page that has a nice description of how it can be computed in parallel. I reproduce the wikipedia diagram here:

Ok. We know how to vectorize a bunch of matrix multiplies, and the above diagram shows us the pattern we need to do them in, so all that’s left is to write the code. Below, I show the first of the two main loops. In preparation for this loop, we zero-padded x and y to ensure they are of power-two length, and we computed a b c d e f for all the individual rotation matrices (see the matrix above with cos(th), sin(th) etc.).

for log_s in range(logn):
s = 2**log_s
s2 = 2*s

# pick out the right-hand values.
a_r, b_r, c_r, d_r, e_r, f_r = a[s2-1::s2].copy(), \
b[s2-1::s2].copy(), c[s2-1::s2].copy(), \
d[s2-1::s2].copy(), e[s2-1::s2].copy(), \
f[s2-1::s2].copy()

# and the left-hand values
a_l, b_l, c_l, d_l, e_l, f_l = a[s-1::s2], \
b[s-1::s2], c[s-1::s2], d[s-1::s2],\
e[s-1::s2], f[s-1::s2]

# store in right hand value locations
a[s2-1::s2] = a_l*a_r + b_l*d_r
b[s2-1::s2] = a_l*b_r + b_l*e_r;
c[s2-1::s2] = a_l*c_r + b_l*f_r + c_l;
d[s2-1::s2] = d_l*a_r + e_l*d_r;
e[s2-1::s2] = d_l*b_r + e_l*e_r;
f[s2-1::s2] = d_l*c_r + e_l*f_r + f_l;


Note that I translated this from Matlab to Numpy without checking the results, but it’s probably correct?! More importantly, note that we have to explicitly copy the RHS values because we’re going to modify them in place bit by bit. Finally, note that although I was quite pleased with this when I wrote it, there is still room for improvement: you could use a single 2D array for abcdef, which would simplify the “picking” bits and might let you fuse together some of the actual assignment bits. For the case when most of the matrices are the just the identity matrix, you could further optimize this by performing the cumulative multiply on only the non-identity subset of the matrices, and then index into the result to expand it out to the appropriate length.

That’s all I wanted to say for this example. If you really want to see the complete solution then let me know (I have written it in Matlab). But hopefully you have enough here to write it yourself.

Oh dear! As I reached the end of writing about this, it occurred to me that there was in fact a much simpler way of doing the calculation: firstly, compute the cumulative sum for the extra rotation data; then convert the (x,y) data into a series of (dx, dy) steps; next, turn these Cartesian steps into polar steps and add the corresponding cumulative rotation angle for each step; finally, convert the steps back into Cartesian space and perform a cumulative sum in x and y.

This alternative method method requires slightly more trigonometry (cosine and sine at the end in addition to the arctan at the start), but the massive benefit is that all the hand-written matrix multiplies are replaced with standard cumulative sums: one for angle, one for x and one for y. The original solution may be slightly better in terms of resistance to floating point errors (but I’m not going to examine that in detail now). It would also probably be faster if the number of rotations is small (and we implement the suggested optimization for that scenario). I wonder whether there was a reason I wrote the original function with matrix multiplies? I notice that the code included the option to apply the cumulative transformation to some secondary position data, so perhaps that had something to do with it, but unfortunately I no longer remember the actual details of the intended use-case.

### Example 3.

In contrast with the previous examples, I am going to give a rough explanation of the neuroscience motivating this task (because it would be rather abstract and contrived without it).

Neurons, roughly speaking, behave like Poisson processes: at any given moment they each have a “true” rate, but we can only observe that rate indirectly by counting the number of discrete “spikes” (aka action potentials) they generate. For example, if a neuron generates 4 spikes in one second and 6 the next second we might (using an appropriate formula) estimate that the “true” underlying rate was about 5 Hz.

For many neurons, the “true” rate correlates with something we can observe in the outside world, where the word “correlate” here is meant in the general sense, i.e. some sort of statistically observed coupling. For example, the rates of some neurons are modulated by the exact location of the organism within its environment, i.e. a particular neuron might have a rate of 4Hz each time the person/animal goes to a spot near the window, but have a rate of approximately zero elsewhere.

Our task here is to analyse the data from 50 of these place-sensitive neurons: we will get a stack of 50 “ratemaps” that tells us the average rate for each of the cells in every square centimeter of an experimental arena. And we will get a list of the spike times for each of the 50 cells over the course of a 20 minute experiment. The question is: how can we use the neuron spike data and ratemaps to estimate the location of the animal? Since the animal is moving around, we need to produce a separate estimate for every second, or lets say every quarter of a second.

The calculation we need to do (which can be found in a number of proper academic papers) is, in essence, quite simple: (1) count the number of spikes for each cell in each time window; (2) for each of the 1 centimeter-square bins in the arena, calculate the probability that the observed counts could have occurred; (3) either stop there, i.e. provide the likelihood-map for each time window as the output, or find the location of the maximum likelihood for each time window, and just give that as the output.

For a single 1-cm bin, at (x,y) in a single window, t, the calculation is:

$L(x,y,t) = \prod_{i=0}^{50} \frac{rate_i(x,y)^{counts_i(t)} e^{-rate_i(x,y)}}{counts_i(t)!}$

If real maths seems like hard work, then just make sure you don’t stare too closely. It’s the programming that we’re here for, so you don’t need to worry about the details of the equation really.

Right, and we’re off…

To begin with, I need to make a point that is going to be obvious to people with certain backgrounds (but is apparently not obvious to many neuroscientists):

$\prod_i x_i=exp(\sum_i log(x_i))$

This means we can re-write the equation for $L(x,y,t)$ as a sum, in which the terms are nicely separated:

$log(L(x,y,t)) = -\sum_i rate_i(x,y) + \sum_i [counts_i(t) log(rate_i(x,y))] -\sum_i log(counts_i(t)!)$

Now, we can examine each term individually and come up with a way of vectorizing over the whole of x, y, and t.

The first term is easy:

# ratemaps is a 3D array of shape: [100 cm x 100cm x 50 neurons]
term_1 = sum(ratemaps, axis=-1)


Once we have the counts, the last term is also pretty easy. (For the counts we probably need to use bincount, either looping over cells, or using ravel_multi_index as I do in numpy-groupies.) We can even manage a “bonus” optimization because we know we’re dealing with small integers:

# counts is a 2D array of shape: [4800 windows x 50 neurons]
log_k_factorial = log(factorial(arange(max(counts)+1)))
term_3 = sum(log_k_factorial[counts], axis=-1)


The bonus optimization was that we build a tiny lookup-table of log-factorials and then simply index into that. This is a real help as computing logs is seriously slow.

The middle term is the only interesting one. Notice that it’s roughly of the form $\sum_i a_i b_i$. Whenever you see something of this form (and it occurs surprisingly often in my experience) then you should immediately say to yourself “I bet that can be re-written as a matrix-multiply”. I’m not going to try and describe the exact circumstances in which it is possible to turn sums into matrix multiplies, but just remember that it’s worth looking out for them Note that if you need to transpose a large array in order to do this it may not be worth it in Matlab, but it’s free to do so in Numpy. Numpy also lets you do n-dimensional dot products, whereas Matlab just does 1 or 2D – however, by reshaping your ND arrays appropriately you should be able to re-imagine an ND multiply as a Matlab-friendly 2D multiply).

# counts is a 2D array of shape: [4800 windows x 50 neurons]
# ratemaps is a 3D array of shape: [100cm x 100cm x 50
term_2 = dot(log(ratemaps), counts[:,:,nax]) # nax = np.newaxis


All that’s left is to add the three terms together, with the appropriate broadcasting in Numpy or bsxfun in Matlab, and we’re done. The result is about 6 lines long, and doesn’t look that horrible (well it’s ok in Numpy, but a bit uglier in Matlab). It’s also amazingly fast given how much number crunching it’s doing!

There may be some mistakes in this third example as I wasn’t as careful as with the previous two.

### Example 4

Note that I added this example a few days after writing the original post, although the code actually predates that of the other examples.

This task is interesting in that we manage to *partially* vectorize a loop even though we can’t go the whole way. The trick is to perform a kind of manual profile guided optimization.

The task is relatively simple: in an experiment, an LED traced out a trajectory in 2D space and we now have an array of 60k (x,y) coordinates corresponding to that trajectory. The problem is that in a few places the camera doing the tracking made an error, and spuriously recorded that the LED “jumped” over a large distance in an unrealistically short time.

The filtering algorithm we need to implement is roughly the following:

def filter_speed(xy, max_speed):
is_good = zeros(len(xy), dtype=bool)
ii = 0
while ii < len(xy):
is_good[ii] = True
from_xy = xy[ii]
for jj, to_xy in enumerate(xy[ii+1:]):
# dist is diff then hypot (though sqrt is redundant)
if dist(from_xy, to_xy) < (jj-ii)*max_speed:
ii = jj
break
# now we're supposed to interpolate where is_good is False


Notice how each iteration of the ii loop depends on the previous iteration: we need to know which points are good and which are bad before we can make claims about subsequent points. (Look at the diagram if this isn’t immediately clear.) This kind of problem is vaguely similar to a regex search where you need to scan along a long string looking for instances a complex pattern: however, I believe that regexes are generally implemented with a state machine and lookup table, which is not something that makes sense here.

In python I opted to compile this (or something pretty similar) with Cython. Job done: easy 10x speed up. And that is probably the right approach to take, really. However, some months earlier, back in Matlab, I was more reluctant to write a MEX file and instead I spent a bit of effort vectorizing. My approach is probably ridiculous in all but the most obscure situations, but I’ll include it here as I think it’s fairly interesting.

So how can we vectorize this loop, when each iteration clearly depends on the previous iteration? The answer (as I already mentioned) is that we can only partially vectorize, but that we can still get a pretty decent payoff. The observation we need to make is that, in general, the data is of a reasonably high “quality”, meaning there won’t be that many errors (if there are a lot of errors then something needs to be fixed in the experimental setup so as to bring the quality of the data back to standard). Thus if we vectorize the sections of loop with no errors, or very simple errors, we can jump over those sections when we encounter them in the loop, and dwell on only the difficult bits.

I’m not going to give the full code, because it’s quite long and rather ugly, but I’ll try and give you a flavor of it (as I said, don’t try this at home unless you really need to). Note that I’ve translated to python/numpy here:

# Is a step from xy[t] to xy[t+1] slow enough?
step1_is_good = dist(xy[:-1], xy[1:]) < 1*max_speed

# If not, is a step from xy[t] to xy[t+2] slow enough?

# do similar for t+3 (not shown), then summarize...
step_size = step1_is_good.astype(uint8) # i.e. = 1
# and similar for 3, everything else is set to 4


Right, this is just the preparation. What we now have is a vector that might look something like the following:

1111111211211241111211221321111124121211111111123111121...


We can now make some basic observations:

• wherever we see the pattern 111X, or 311X, or 321X, or 221X, or 211X, or 212X, we know that X will be “stepped on” unless step_size=4 comes into play in this region. (i.e. we are interested in the pattern [321][21][1]X.)
• wherever we see the pattern 112X, or 113X, or 131X, or 132X, we know that X will *not* be stepped on unless step_size=4 comes into play. (i.e. we are interested in the pattern [1][13][12]X)
• we can’t always be sure whether such triple-prefixes will or will not cause X to be stepped on, e.g. in 231X it depends on whether or not the 2 is stepped on.

Ok, lets assume we have created an array that tells us which of the three categories each element falls in to. And, now if we convert step_size to step_to_idx, i.e. an absolute target address, we can do something interesting. For example, step_size=13121... becomes step_to_idx=14355.... The interesting thing is that we can now summarize chunks of this step_to_idx array: wherever there is a contiguous block of known steps and/or known no-steps, we can make every element point to the end of the block, saying: “if,during the loop execution, you find yourself stepping on to this element, then please go to the end of the block, because I’ve already dealt with the details from here up to that point“. In the loop we may “land” on such points either because of a combination of [123][123][123]X that we couldn’t deal with in advance, or because a 4 came in to play.

The black lines in the “loop stage”, above, show how the loop might eventually iterate through the data. Note how some regions we read from and write step data to for the first time, some regions had step data, that we have to modify, and some regions we can skip right over.

So now, because we expect that most of the data is pretty much error-free, we should find that the loop requires far fewer iterations to get through all the data.

Unless someone in the comments is really keen to see it, I’m not going to provide any further code. I’ll just say that all the pre-loop preparation is done using a few calls to nonzero (aka find in Matlab) and one call to cumsum. The loop itself is nothing special really.

That’s it for this post. It took a long time to write, but hopefully it contained some interesting bits and bobs. Do leave a comment if you have something to say!

The next post I write might be aimed at absolute beginners rather than whoever the target audience was for this post: I’d like to do a basic indexing tutorial in which you learn to “paint” flags of the world in 2D arrays.