Should I be using pandas?

My PhD consisted partly of doing very hands-on neuroscience experiments, and partly of analyzing the “medium-sized data” sets produced by those experiments. I say “medium-sized” because the data was large enough to require care, but not so large that it couldn’t fit comfortably on a single disk in raw form, and fit in memory once partially processed.

At the start I was using Matlab for analysis, which seemed to be the weapon of choice for almost all neuroscientists (as far as I could tell)…. uhhhh, let me just give you an xkcd-inspired graph….

[figure: me and matlab]

…so needless to say, I switched to python+numpy about a third of the way into the PhD and didn’t look back*.  This wasn’t too difficult to do as one of my colleagues had made the switch a few months earlier – I was offered a “starter pack” of functions for doing IO and maybe a few words of advice.

Although I did have a bit of extra-curricular experience with python, I’d not yet done any serious exploration of the language’s features or ecosystem.  Both of these (especially the second) took time: early on I had fun hacking around with magic methods to make a dictionary-like memoizing property-thing; weird mask-able lists; and something not quite as good as IPython’s autoreload feature. Later I even tried to write a system for un-parallelising plotting commands  (that I named “monobrow”).  However, I did eventually settle on an overall analysis design that looks a bit more sensible, and I’m going to describe bits of it here (it doesn’t include any parallelism unfortunately).

I’m not going to describe the actual analysis I do in detail; instead I’ll give a simplified analogy: pretend that I have a stack of about 1k images, and I need to run a handful of algorithms on each image separately, and then summarize the resulting metrics across all the images.  Also, let’s say that these 1k images were from 10 different food-obsessed photographers, who went on separate holidays of varying lengths, but followed a similar pic-taking pattern each day: photos of breakfast, photos of lunch, photos of dinner.

My actual data wasn’t to do with images, photographers, food, or holidays, but in terms of organizing the analysis, my task was similar.  I ended up with something encapsulated in two main classes: one at the level of the individual “image”, and one at the level of the “entire-dataset”.  I’ll describe the “image” class briefly and then go into a bit more detail for the “entire-dataset” class.

The “image” class (aka Trial)

The analogized description I gave above doesn’t quite tell the full story; although there were some parts of the algorithms that had to be run on whole images, much of the computation was focused on one or more manually identified regions within the images.

The time needed to read an image off disk and do the initial processing wasn’t insignificant (nor was the space required to keep it in memory), so I hoped to reuse a single instance of each image when analyzing sub-regions within it.  However, since this was an optimization, I wanted to hide the details from the final top-level analysis script.  My solution required a bit of cooperation between the two classes: the “image” class would have a caching/memoization system, and the “entire-dataset” class would know to reuse instances of the image class across each of the subregions within an image; it would also know how to ask the image class to clear different bits of its cache.

image_id  subregion_id  image_instance
  0             91     [img #8712fc12a00]
  0             92     [img #8712fc12a00]
  0             93     [img #8712fc12a00]
  1             97     [img #1283efd7800]
  1             98     [img #1283efd7800]
...

The “image” class (which was actually called “Trial”) looked like this:

[figure: trial_diagram]

Each of the blocky things corresponds to a mixin contained in a separate file.  The foundation of the class is the caching mixin, which is used by the mixins within the “core api” cage as well as (potentially) by the more exotic stuff above.  I’ll describe the overall picture first, and then come back to give a bit more detail specifically on the caching.

The most important design decision was to separate the random-bits-and-bobs from the “core api”.  The idea is that by very clearly defining a core API, you can commit to keeping one section of the code in good working order even if other bits of the code are a mess.  If a core function is not quite general, fast, or readable enough, you have to promise to work on it “in situ” rather than inlining it into whatever complex analysis you are currently working on.  Python is good at letting you gradually add options to existing functions, so that you slowly build the value of your interface and its implementation. Drawing the line around the core also makes it easier to share the exotic stuff with other people: you either provide them with a copy of your well-maintained code, or you just give them a well-maintained description of the interface.

The tricky thing I found when defining the “core api” was that there were actually several parts to it: I wanted to separate the caching interface, the IO, and the basic analysis, but it took me a few iterations before I realized that I could create the divisions, and exactly where they should be.  With the current design, I can have separate IO implementations for different file formats, yet still slot that under the top-most part of the core.  I can also play with different systems of caching, or use a no-cache implementation, without having to worry about refactoring lots of core and non-core files.
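
To make the layering concrete, the assembled class ends up looking something like the sketch below (Trial and TrialCache are the names used in this post; the other mixin names are made up purely for illustration):

class TrialCache(object):   pass  # the caching foundation (two methods shown below)
class SomeFormatIO(object): pass  # one of several interchangeable IO backends
class CoreAnalysis(object): pass  # the well-defined, well-maintained "core api"
class ExoticStuff(object):  pass  # less polished, analysis-specific extras

class Trial(ExoticStuff, CoreAnalysis, SomeFormatIO, TrialCache):
    """Swapping SomeFormatIO for a different file format, or TrialCache for a
    no-cache implementation, shouldn't require touching the other layers."""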

So, let’s look at memoization/caching in more detail. The main two methods of the TrialCache mixin are as follows:

def _cache_has(self, kind, names, key=None):
    names = (names,) if not isinstance(names,
                            (list, tuple, set)) else names
        
    # update/create the list of names for the given kind
    if kind in self._cache_name_lists:
        self._cache_name_lists[kind].update(set(names))        
    else:
        self._cache_name_lists[kind] = set(names)
        
    # initialize any of the attributes if they are missing
    for name_ii in names:                 
        if not hasattr(self, name_ii):
            setattr(self, name_ii, None if key is None else {})
        
    # return False if any of the attrs are not already computed
    for name_ii in names:                 
        attr = getattr(self, name_ii)
        if attr is None or (key is not None and key not in attr):
            return False
    return True

def _clear_cache(self, keep=(), drop=()):
    if len(keep) and len(drop):
        raise ValueError("use 'keep' or 'drop', not both.")            
    if any(" " in k for k in keep + drop):
        raise NotImplementedError()
        
    if len(keep):
        kinds = [k for k in self._cache_name_lists.keys() if 
                 not all(kk in keep for kk in k.split(" "))]
    elif len(drop):
        kinds = [k for k in self._cache_name_lists.keys() if
                 any(kk in drop for kk in k.split(" "))]
    else:
        kinds = self._cache_name_lists.keys()
            
    for k in kinds:
        for name in self._cache_name_lists[k]:
            if hasattr(self, name):
                delattr(self, name)

_cache_has is a slightly unusual function insofar as its name should be interpreted as a statement of intent as well as a question: the caller is saying “I intend for the cache to contain such-and-such from now on, but does it have it yet?”

Within another mixin, you use it as follows:

@property
def red_count(self):
    if not self._cache_has('color_data', '_red_count'):
        self._red_count = compute_red_count(self._data, self._params)
    return self._red_count

@property
def smoothing(self):
    # the matching getter (needed for the setter below to attach to)
    return self._params['smoothing']

@smoothing.setter
def smoothing(self, val):
    self._params['smoothing'] = val
    self._clear_cache(drop=('color_data',))

The idea is that the computation of _red_count is guarded by the _cache_has intent/test, so that redundant re-computing can be avoided, yet we can still invalidate stored results when we need to, i.e. when we change the smoothing setting. You could have many computed things that depend on the smoothing setting and invalidate them all with a single command: you can define as many groupings as you need and either ask to invalidate specific groupings or invalidate all except specific groupings.
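
For example (a made-up extension of the snippet above, with compute_sharpness as a stand-in), a second grouping can be invalidated, or spared, independently of the first:

@property
def sharpness(self):
    if not self._cache_has('focus_data', '_sharpness'):
        self._sharpness = compute_sharpness(self._data, self._params)
    return self._sharpness

# a single call then invalidates a whole grouping...
#   self._clear_cache(drop=('focus_data',))   # deletes _sharpness only
# ...or everything except the named grouping(s):
#   self._clear_cache(keep=('color_data',))   # deletes _sharpness, keeps _red_count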

Note that _cache_has and _clear_cache accept either single strings or lists/tuples of strings for most arguments. They also have some built-in support for “lazy” dictionaries that I haven’t demonstrated here.
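
Something along those lines is what the key argument makes possible (a simplified illustration rather than code from the project; compute_histogram is a stand-in):

def channel_histogram(self, channel):
    # one dict attribute, lazily created by _cache_has, caches a result per key
    if not self._cache_has('color_data', '_channel_hist', key=channel):
        self._channel_hist[channel] = compute_histogram(self._data, channel)
    return self._channel_hist[channel]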

The code is not perfect: in retrospect it might have been better to hold all the cached items within an actual “cache” container, rather than attaching them to the self object directly and only managing them via lists of the attribute names. Also, I might want to (optionally) expose the caching as a decorator, which is a common paradigm for simple memoizing. If I were being really serious I might investigate some kind of data-flow caching system that lets you define a hierarchy of data dependencies, but that’s probably a lot more complicated… indeed, that’s roughly what my ambitious C++ project, “Venomous”, is hoping to do. Anyway, all I wanted to point out is that with a few lines of mixin methods you can create a pretty powerful caching system that other mixins can use in a very compact way.
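
As a rough sketch of the decorator idea (cached here is made up, not part of the actual code, but it layers straight on top of _cache_has):

from functools import wraps

def cached(kind, attr_name):
    """Turn a zero-argument method into a property whose result is stored
    on attr_name and registered with the caching system under 'kind'."""
    def decorator(func):
        @wraps(func)
        def getter(self):
            if not self._cache_has(kind, attr_name):
                setattr(self, attr_name, func(self))
            return getattr(self, attr_name)
        return property(getter)
    return decorator

# hypothetical usage, equivalent to the red_count property shown earlier:
# @cached('color_data', '_red_count')
# def red_count(self):
#     return compute_red_count(self._data, self._params)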

We’ll come back to this caching system in a bit, but next time from the point of view of the “entire-dataset” class….

The “entire-dataset” class (aka ExpSet)

I don’t have a pretty picture for this class unfortunately, so I’ll just have to describe it in words.

For several months I was using a class that was a lot like the one I’m about to describe but it was cobbled together out of a strange mixture of pure-python lists, numpy arrays and weird home-made concoctions. Switching to pandas wasn’t all that simple, but it made things a lot better.

My impression is that pandas is viewed quite differently by different groups of users: finance/economics-y people might be in the habit of thinking that every dataframe has a timestamp column, with rows that can be aggregated by hour or by day. Geo-people might hold equivalent beliefs about GPS coordinates (I’m guessing). To me, pandas is simply a package that adds labels to the rows and columns of a numpy-like 2D array, and ensures that the labels stick with the rows and columns as you index into and make copies of the array. And, importantly for my use case, those labels can be hierarchical…

…well, in theory they can be hierarchical, but in practice the support for MultiIndex isn’t perfect (yet): the syntax for indexing/slicing and joining these kinds of dataframes is not exactly pleasant or easy to remember.
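
To give a flavor of what I mean (this is plain pandas, nothing specific to my code), here are three ways of pulling all the “breakfast” rows out of a toy frame with a two-level row index; none of them are hard, but none of them are particularly memorable either:

import numpy as np
import pandas as pd

# a toy frame with a two-level row index, analogous to the holiday example
idx = pd.MultiIndex.from_product([['alice', 'bob'],
                                  ['breakfast', 'lunch', 'dinner']],
                                 names=['photographer', 'meal'])
df = pd.DataFrame({'redness': np.arange(6.0)}, index=idx)

breakfasts = df.xs('breakfast', level='meal')           # drops the 'meal' level
breakfasts = df.loc[(slice(None), 'breakfast'), :]      # keeps both levels
breakfasts = df.loc[pd.IndexSlice[:, 'breakfast'], :]   # ditto, via IndexSlice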

This lack of whole-hearted support for hierarchical indexing led me to encapsulate my main dataframe(s) inside a class of my own. The idea was that I could patch the syntactical gaps more easily and come up with a library of tools for doing slightly more complex things. I settled on having two separate dataframes, which I named simply df1 and df2. The first had one row for each “region of interest” within each image, and the second had one row for each image. The first thing I did was write a method to take the strange csv-tables I had and parse them out into the two dataframes, including instantiating all the images. (I didn’t mention this explicitly earlier, but the image class is designed to be entirely lazily loaded – the filename is merely recorded at construction, and nothing is read off disk.)
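
Very roughly, the construction step ends up doing something like the sketch below (simplified, with made-up column names; trial_factory stands in for the lazy “image”/Trial constructor):

import pandas as pd

def build_dataframes(records, trial_factory):
    """records: one dict per subregion, parsed out of the csv-tables."""
    # df1: one row per subregion, indexed by (image_id, subregion_id)
    df1 = pd.DataFrame(records).set_index(['image_id', 'subregion_id'])
    # df2: one row per image, holding a single lazily-constructed instance
    # that is shared by all of that image's subregions
    df2 = (df1.reset_index()
              .drop_duplicates('image_id')
              .set_index('image_id')[['filename']])
    df2['image_instance'] = [trial_factory(fn) for fn in df2['filename']]
    return df1, df2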

Once I had the basic pair of dataframes I could start thinking about iterating over the rows, running the algorithms on the images. This is where the title of the post comes in: should I be using pandas for this sort of thing, given that pandas is supposed to be designed for performing operations in clever aggregate ways?

Well, it might look strange to other users of pandas, but I think the paradigm I’ve adopted is absolutely valid and should perhaps be more widely encouraged.

Before I show you an example of iterating over the rows of the dataframe, let me give you (most of) the code for the RowIndexer class I came up with:

class RowIndexer(object):
    def __init__(self, obj, idx, **kwargs):
        object.__setattr__(self, '_obj', obj)
        object.__setattr__(self, '_idx', idx)
        object.__setattr__(self, '_with_column_sub', None)
        for k, v in kwargs.items():
            object.__setattr__(self, k, v)

    def __getattr__(self, key):
        original_key = key
        if self._with_column_sub is not None:
            sub_idx = self._with_column_sub.index('[attr]')
            key = self._with_column_sub[:sub_idx] + (key,) + \
                             self._with_column_sub[sub_idx+1:]

        # try a simple column indexing, then index names,
        # then index names without sub
        try:
            return self._obj.get_value(self._idx, key)
        except Exception:
            pass
        try:
            return self._idx[list(self._obj.index.names).index(key)]
        except Exception:
            pass
        if original_key == key:
            raise KeyError("Could not find key '" + str(key) +
                            "' in column or index names.")
        try:
            return self._idx[list(self._obj.index.names)\
                       .index(original_key)]
        except Exception:
            pass
        raise KeyError("Could not find key '" + str(key) +
                       "' or '" + str(original_key) +
                       "' in column or index names.")
            
    def __setattr__(self, key, value):
        if self._with_column_sub is not None:
            sub_idx = self._with_column_sub.index('[attr]')
            key = self._with_column_sub[:sub_idx] + (key,) \
                    + self._with_column_sub[sub_idx+1:]
        elif self._obj.columns.nlevels > 1:
            key = (key,) + ('',) * (self._obj.columns.nlevels-1)
        try:
            self._obj.set_value(self._idx, key, value)
        except TypeError:
            self._obj[key] = self._obj[key].get_values()\
                                           .astype(object)
            self._obj.set_value(self._idx, key, value)
            
    @contextmanager
    def column_sub(self, *args, **kwargs):
        n_attr = args.count('[attr]')
        if not args or n_attr > 1 or n_attr == len(args):
            raise ValueError("you must specify at most one "
                             "'[attr]' spot, and at least "
                             "one sub label.")
        if 'padding' in kwargs:
            padding = kwargs['padding']
            if len(kwargs) > 1:
                raise ValueError("unrecognised kwargs")
        else:
            padding = ""
            if kwargs:
                raise ValueError("unrecognised kwargs")
                
        # establish total levels required                     
        if '[attr]' not in args:
            args = args + ('[attr]',)
        target_nlevels = len(args)
            
        _pad_columns_to_level(self._obj, target_nlevels, padding)
                                
        object.__setattr__(self, '_with_column_sub', tuple(args))
        try:
            yield
        finally:
            object.__setattr__(self, '_with_column_sub', None)

Instances of this class are yielded by a generator in ExpSet as the user iterates over the rows of df1 and/or df2. The main idea is that assigning to attributes on the instance actually results in values being written to columns within a specific row of a dataframe.

for subregion, image_inst in dataset.iter():     
    subregion.redness = image_inst.get_redness(subregion.shape)

Here subregion is a RowIndexer instance that points to a different row on each iteration of the loop, and image_inst is the instance of the “image” class relevant for the given row. Inside the generator we make sure to yield all the subregions for a specific image before moving on to the next. When we are ready to switch images, we clear some/all of the cache for the old image, thus keeping the total memory usage manageable. Exactly which bits of the cache we want to clear depends on (a) what we are likely to want in future if/when we loop over the data again; (b) what was expensive to compute; and (c) how large things are in memory. These three points are common to pretty much any caching system, and the optimal choice is rarely an easy thing to pin down: here we don’t use any automatic heuristics, instead we allow the user to specify which bits of data to delete using the fine-grained caching machinery outlined earlier. For example, the user may delete the raw whole-image, but keep the down-sampled, filtered version.

The nice feature of this system is that the image class is agnostic about the way in which you iterate: it will do its best to cache stuff and reuse anything from the cache when possible. Also, the user’s loop (i.e. the one shown here) can ignore the details of caching: they don’t even need to know that one image instance is shared across multiple subregions. If the user does want to customize the cache-clearing behavior, they simply provide an extra argument to the iterator. The iterator can also take care of displaying progress reports.
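
To make that concrete, the generator inside the “entire-dataset” class behaves roughly like the sketch below (heavily simplified; the clear_on_switch argument and the dataframe layout are stand-ins for the real options):

def iter(self, clear_on_switch=('raw_image',)):
    """Yield a (RowIndexer, image_instance) pair for each subregion,
    visiting all the subregions of one image before moving on."""
    previous = None
    for idx in self.df1.index:                 # idx = (image_id, subregion_id)
        image_inst = self.df2.loc[idx[0], 'image_instance']
        if previous is not None and previous is not image_inst:
            # done with the previous image: drop the user-specified cache bits
            previous._clear_cache(drop=clear_on_switch)
        previous = image_inst
        yield RowIndexer(self.df1, idx), image_inst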

Try doing that in Matlab!!

An important feature of the RowIndexer class is that you don’t need to declare new columns ahead of time: they are automatically added to the parent dataframe as needed. The class also has a couple of bonus features, one of which is the ability to handle and create multi-indexed columns using a with-block:

subregion.n_people = image_inst.n_people(subregion.shape)
for zone in ['upper', 'lower']:
    with subregion.column_sub('[attr]', zone), image_inst.mask(zone):
        subregion.n_flowers = image_inst.n_flowers(subregion.shape)
        subregion.n_clouds = image_inst.n_clouds(subregion.shape)
# the parent dataframe now has columns:
#  n_people |   n_flowers   |   n_clouds    |
#           | upper | lower | upper | lower |

All of this lets you do some fairly complex meta-analysis (still using caching, remember) in just a few clear lines of high-level code.

The final thing I want to discuss is a bit like the RowIndexer, but it applies to the whole dataframe, or in fact to the whole custom class in which the pair of dataframes are held.

As before, I’ll give the implementation first, but you don’t have to look at it too closely, especially as it depends on a few things that I haven’t included:

# method of "entire-dataset" class...
@contextmanager
def view(self, **kwargs):
    original_df1, original_df2, original_mode = (
                           self.df1, self.df2, self.mode)
    sub_df1, sub_df2, sub_mode = self._get_sliced(**kwargs)
    self.df1, self.df2, self.mode = sub_df1, sub_df2, sub_mode
    self.df1.is_copy = self.df2.is_copy = None 
    try:
        yield
    except Exception:
        raise
    finally:
        self.mode = original_mode
        self.df1 = original_df1     
        self.df2 = original_df2
            
        if sub_mode != original_mode:
            sub_df1, sub_df2 = self._get_as_mode(
                         src=sub_mode, dest=original_mode, 
                         df1=sub_df1, df2=sub_df2)
            _update(original_df1, sub_df1)            
            _update(original_df2, sub_df2)    

And we can use it as follows:

with dataset.view(meal='breakfast', holiday_type='european'):
    for subregion, image_inst in dataset:
        subregion.juice = image_inst.has_juice(subregion.shape)

The view method does two things: firstly, it exposes a nice way of slicing the dataset using keyword arguments that can be handled intelligently behind the scenes (using all the gruesome pandas syntax); and secondly it acts as a context-manager that gives us a filtered pseudo-view onto the whole dataset: once the with-block exits, any changes made are propagated back into the original dataset. When using raw pandas, some types of indexing generate copies and some types generate views: certainly if you have complex filtering requirements it is rare that you will end up with a view back onto the original dataframe. (For many users this may only be a memory problem, rather than a logical one, but given that we want to write into the rows of the dataframe we need some kind of system such as this.) In fact, in my ExpSet class, not only can you slice the data in interesting ways with this syntax, you can also pivot it into different forms, do some computation, and then have it revert back to the original form when the with-block exits.
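
I’m not showing _get_sliced, but the keyword-to-pandas translation at its heart might look something like this (a crude, flat-column sketch; the real method also handles the df1/df2 pairing and the pivoting “modes”):

import pandas as pd

def slice_by_kwargs(df, **kwargs):
    # build a boolean mask from the keyword arguments, matching each name
    # against an index level first and an ordinary column second
    mask = pd.Series(True, index=df.index)
    for name, value in kwargs.items():
        if name in df.index.names:
            mask &= (df.index.get_level_values(name) == value)
        else:
            mask &= (df[name] == value)
    return df[mask]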

Identical slicing/pivoting functionality is provided in a second method called copy; as the name suggests, this is not intended for a with-block, but simply gives a do-what-you-want-with clone of the original dataset.
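
For example (hypothetical usage, mirroring the view example above):

# a fully detached subset, safe to modify without affecting the original dataset
breakfast_data = dataset.copy(meal='breakfast', holiday_type='european')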

That’s pretty much all I wanted to say about the “entire-dataset” class. The two things I wish it did (better) were outputting nice HTML tables and providing a simple means of running loops in parallel. Pandas does do HTML tables, but I had to fork it in order to put images in the table (I did submit a pull request, but in the end it got rejected in favor of something slightly different). Having interactive tables that you can filter and sort in the browser would also be great (I think there are some tools built on top of pandas that do this to some extent, but I never tried them).

As with previous posts, this took a while to get onto the page. If you found it interesting, or you want to argue about something, please do post in the comments.

*except when helping other people with Matlab, ‘cos I’m such a nice person. Edit Jan-2016: I’ve just come across last year’s Stack Overflow survey in which it was found that two thirds of Matlab-using respondents did not want to continue using it, which put it near the top of the “dreaded” languages list.
