May 24, 2013

The Nipy blog

Unscientific programming

We various of the Berkeley NIPY crowd were in a Mexican restaurant yesterday lunchtime. We discussed the IPython notebook and scientific programming and reproducible research.

One question struck me as fundamental: do scientists have to learn to be software engineers?

I think the common answer to this is "No - scientific coding is different". The "No" camp points out that scientists use code in a different way to software engineers. Science is exploration, and so is scientific coding. Scientific code is provisional, often changing. There is no specfication, there is no production system that must just work. We are trying stuff out, seeing what works, adjusting to our better understanding of the data and the ideas.

Maybe the scientist's idea of the canonical software engineer is someone who writes and hosts a web application with some database behind it.

Accepting the "No", we look at things that software engineers should learn:

  1. Version control. Essential for a software engineer, apparently, because they need to be able to rollback their changes, and keep track of released and development versions. Desirable maybe for a scientist, but not essential, because the code never goes into production, and we always use the latest version of the code.
  2. Testing. Essential for a software engineer, otherwise they may release code that corrupts a database of financial transactions or delivers the wrong item to your address. Desirable maybe for a scientist, but not essential because the nature of the code changes so often that it would only slow things down to write exhaustive tests, only for the code to change track and make the tests obsolete.
  3. Code review. Essential for a software engineer, because they work in teams with other software engineers. This is their job, to write code, and they can learn from each other. Desirable maybe for a scientist, but not essential because each scientist owns their own problem and so only they can understand their code. Explaining the code is time-consuming and slows progress in developing new ideas. Others in the lab are not software engineers, so they are not trained to review code, they will not enjoy it and they will not do a good job.
  4. Releases. Essential for a software engineer because they may be paid to provide code that others can use. The release labels code that can be used. The software engineer can and should make sure the code works as advertised. Desirable maybe for a scientist, but not essential, because the code changes often, and the code may be written to solve a very specific problem that could not be of much interest to another researcher. If the other researcher is interested, they should make the effort to work out how to use the code, that is not the job of the author-scientist, they are not paid to support software, but to produce science.
  5. Documentation. Essential for a software engineer because they want their code to be used correctly by other people. Desirable but not essential for the scientist because scientific code is not for distribution except to other scientists who must take responsibility for the code themselves.

I think that these set of beliefs lie at the heart of the problem for reproducible science.

The beliefs rest on the following model of writing scientific code:

A folk model of scientific code

A scientist named A writes some code. They check the code. They confirm it is working. Most likely this means the code is free of major errors.

Another scientist named B is reading some results published by A. They can assume that the results are free of major errors.

Living with the folk model

Now everything makes sense. Version control, testing, code review, releases, documentation are great, of course, but if we can assume there are no major errors, then - why would B want to see my code? If B gets my code, why would she need to understand it? Why would she need to know what version it was, or how that related to my paper? She should assume that there are no major errors.

Rejecting the folk model

The folk model is not just too simple, it is flat-out wrong. We make mistakes all the time. All the time. If you know that, then you know that B needs to check your stuff. They can't do science without checking your stuff. If they need to check your stuff, you need to write your code for production. Then, nothing is optional. We all need; version control, testing, code review, releases, documentation.

by Matthew Brett at May 24, 2013 03:00 PM

The ubiquity of error

I was talking to an esteemed colleague about six months ago about how easy it was to believe that we do not make errors, unless we check.

He said that he had noticed this change when his students and research assistants started to use their own computer programs to analyze data. Before this, when he and others added up a column of numbers, they would check several times that they were correct. After, the student or RA would analyze some data with a program they had written themselves, and my friend would say "are you sure the program is right" and they would say "yes I'm sure". My friend would then go through the results and check; they would often find mistakes. He thought that there was something about using a computer that made it easy to believe that the result was right, even if the process of getting the result had the same chance of error as a simpler task like adding numbers.

To quote David Donoho:

In my own experience, error is ubiquitous in scientific computing, and one needs to work very diligently and energetically to eliminate it. One needs a very clear idea of what has been done in order to know where to look for likely sources of error. I often cannot really be sure what a student or colleague has done from his/her own presentation, and in fact often his/her description does not agree with my own understanding of what has been done, once I look carefully at the scripts. Actually, I find that researchers quite generally forget what they have done and misrepresent their computations.

Computing results are now being presented in a very loose, “breezy” way—in journal articles, in conferences, and in books. All too often one simply takes computations at face value. This is spectacularly against the evidence of my own experience. I would much rather that at talks and in referee reports, the possibility of such error were seriously examined.

-- David L. Donoho (2010). An invitation to reproducible computational research. Biostatistics Volume 11, Issue 3 Pp. 385-388

There is something about computational science that seems to make the idea of error less worrying or important:

In stark contrast to the sciences relying on deduction or empiricism, computational science is far less visibly concerned with the ubiquity of error. At conferences and in publications, it’s now completely acceptable for a researcher to simply say, “here is what I did, and here are my results.” Presenters devote almost no time to explaining why the audience should believe that they found and corrected errors in their computations. The presentation’s core isn’t about the struggle to root out error — as it would be in mature fields — but is instead a sales pitch: an enthusiastic presentation of ideas and a breezy demo of an implementation. Computational science has nothing like the elaborate mechanisms of formal proof in mathematics or meta-analysis in empirical science. Many users of scientific computing aren’t even trying to follow a systematic, rigorous discipline that would in principle allow others to verify the claims they make. How dare we imagine that computational science, as routinely practiced, is reliable!

-- David L. Donoho, Arian Maleki, Inam Ur Rahman, Morteza Shahram and Victoria Stodden (2009) Reproducible Research in Computational Harmonic Analysis. Computing in Science and Engineering 11(1) pp 8-18

by Matthew Brett at May 24, 2013 03:00 PM

May 23, 2013

Continuum Analytics

Accelerating Python Libraries with Numba (Part 2)

Welcome. This post is part of a series of Continuum Analytics Open Notebooks showcasing our projects, products, and services.

In this Continuum Open Notebook, you’ll learn more about how Numba works and how it reduces your programming effort, and see that it achieves comparable performance to C and Cython over a range of benchmarks.

by Aron Ahmadia at May 23, 2013 09:57 AM

May 21, 2013

Continuum Analytics news

Continuum Analytics Launches Full-Featured, In-Browser Data Analytics Environment

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, today announced the release of Wakari version 1.0, an easy-to-use, cloud-based, collaborative Python environment for analyzing, exploring and visualizing large data sets

by Corinna Bahr at May 21, 2013 10:00 AM

May 20, 2013

Enthought

Enthought awarded $1M DOE SBIR grant to develop open-source Python HPC framework

We are excited to announce that Enthought is undertaking a multi-year project to bring the strengths of NumPy to high-performance distributed computing.  The goal is to provide a more intuitive and user-friendly interface to both distributed array computing and to high-performance parallel libraries.  We will release the project as open source, providing another tool in [...]

by mtranby at May 20, 2013 04:11 PM

Philip Herron

Woo SunShine!

Forgot to mention i am in San Francisco at the moment with Work just took this picture for the lulz:

whoop1

by redbrain at May 20, 2013 04:41 AM

Dealing with Flack

So this is kind of a rant/political type post. When i started my project GCCPY as soon as it was announced as part of Google Summer of code 2010. A person who i should probably leave nameless nearly immediately jumped on #gcc irc.oftc.net, and started super questioning me on why the hell i would approach python in this way. To the point where he said that i should not bother and just contribute to PyPy. And the moderators in the channel had to kind of step in because it was just way too much.

So ok fair enough, then everything was fairly quiet i get the odd email from interested people then i noticed my project was on reddit http://www.reddit.com/r/Python/comments/1dc0tl/python_frontend_to_gcc_xpost_from_rgcc/

Same guy on there generally a negative slate on my project again. Partly i don’t blame them since my wiki is not that detailed. But i tend to feel people don’t think for themselves and just start believing people from established projects are rock stars, to the point were new projects are pointless to them.

Everyone is entiled to their opinion but what i will say is, think for yourself. And _dont_ compare a project like GCCPY to PyPy. Event just look at PyPy’s main page they have over 25k in donations!! What do i have, my spare time at work and doing everything from scratch in C where as pypy is basically python and jit reusing libpython.so. So before you start slating something as not worth its and cant do x,y,z. I have my own opinion that i believe this is a valid way of implementing python. Yes its _no_ where near python 2 complete! Make it a full time job it would be done in a year. I will say from my benchmarks which i want to post at the end of the year at python con IE; gccpy is looking truly amazing.

Think for yourself reddit!

by redbrain at May 20, 2013 04:38 AM

May 19, 2013

Fabian Pedregosa

Numerical optimizers for Logistic Regression

Following a challenge proposed by Gael to my group I compared several implementations of Logistic Regression. The task was to implement a Logistic Regression model using standard optimization tools from scipy.optimize and compare them against state of the art implementations such as LIBLINEAR.

In this blog post I'll write down all the implementation details of this model, in the hope that not only the conclusions but also the process would be useful for future comparisons and benchmarks.

Function evaluation

The loss function for the $\ell_2$-regularized logistic regression, i.e. the function to be minimized is

$$ \mathcal{L}(w, \lambda, X, y) = - \sum_{i=1}^n \log(\phi(y_i w^T X_i)) + \lambda w^T w $$

where $\phi(t) = 1. / (1 + \exp(-t))$ is the logistic function, $\lambda w^T w$ is the regularization term and $X, y$ is the input data, with $X \in \mathbb{R}^{n \times p}$ and $y \in \{-1, 1\}^n$. However, this formulation is not great from a practical standpoint. Even for not unlikely values of $t$ such as $t = -100$, $\exp(100)$ will overflow, assigning the loss an (erroneous) value of $+\infty$. For this reason 1, we evaluate $\log(\phi(t))$ as

$$ \log(\phi(t)) = \begin{cases} - \log(1 + \exp(-t)) \text{ if } t > 0 \\ t - \log(1 + \exp(t)) \text{ if } t \leq 0\\ \end{cases} $$

The gradient of the loss function is given by

$$ \nabla_w \mathcal{L} = \sum_{i=1}^n y_i X_i (\phi(y_i w^T X_i) - 1) + \lambda w $$

Similarly, the logistic function $\phi$ used here can be computed in a more stable way using the formula

$$ \phi(t) = \begin{cases} 1 / (1 + \exp(-t)) \text{ if } t > 0 \\ \exp(t) / (1 + \exp(t)) \text{ if } t \leq 0\\ \end{cases} $$

Finally, we will also need the Hessian for some second-order methods, which is given by

$$ \nabla_w ^2 \mathcal{L} = X^T D X + \lambda I $$

where $I$ is the identity matrix and $D$ is a diagonal matrix given by $D_{ii} = \phi(y_i w^T X_i)(1 - \phi(y_i w^T X_i))$.

In Python, these function can be written as

def phi(t):
    # logistic function, returns 1 / (1 + exp(-t))
    idx = t > 0
    out = np.empty(t.size, dtype=np.float)
    out[idx] = 1. / (1 + np.exp(-t[idx]))
    exp_t = np.exp(t[~idx])
    out[~idx] = exp_t / (1. + exp_t)
    return out

def loss(w, X, y, alpha):
    # logistic loss function, returns Sum{-log(phi(t))}
    z = X.dot(w)
    yz = y * z
    idx = yz > 0
    out = np.zeros_like(yz)
    out[idx] = np.log(1 + np.exp(-yz[idx]))
    out[~idx] = (-yz[~idx] + np.log(1 + np.exp(yz[~idx])))
    out = out.sum() + .5 * alpha * w.dot(w)
    return out

def gradient(w, X, y, alpha):
    # gradient of the logistic loss
    z = X.dot(w)
    z = phi(y * z)
    z0 = (z - 1) * y
    grad = X.T.dot(z0) + alpha * w
    return grad

def hessian(s, w, X, y, alpha):
    # hessian of the logistic loss
    z = X.dot(w)
    z = phi(y * z)
    d = z * (1 - z)
    wa = d * X.dot(s)
    Hs = X.T.dot(wa)
    out = Hs + alpha * s
    return  out

Benchmarks

I tried several methods to estimate this $\ell_2$-regularized logistic regression. There is one first-order method (that is, it only makes use of the gradient and not of the Hessian), Conjugate Gradient whereas all the others are Quasi-Newton methods. The method I tested are:

  • CG = Conjugate Gradient as implemented in scipy.optimize.fmin_cg
  • TNC = Truncated Newton as implemented in scipy.optimize.fmin_tnc
  • BFGS = Broyden–Fletcher–Goldfarb–Shanno method, as implemented in scipy.optimize.fmin_bfgs.
  • L-BFGS = Limited-memory BFGS as implemented in scipy.optimize.fmin_l_bfgs_b. Contrary to the BFGS algorithm, which is written in Python, this one wraps a C implementation.
  • Trust Region = Trust Region Newton method 1. This is the solver used by LIBLINEAR that I've wrapped to accept any Python function in the package pytron

To assure the most accurate results across implementations, all timings were collected by callback functions that were called from the algorithm on each iteration. Finally, I plot the maximum absolute value of the gradient (=the infinity norm of the gradient) with respect to time.

The synthetic data used in the benchmarks was generated as described in 2 and consists primarily of the design matrix $X$ being Gaussian noise, the vector of coefficients is drawn also from a Gaussian distribution and the explained variable $y$ is generated as $y = \text{sign}(X w)$. We then perturb matrix $X$ by adding Gaussian noise with covariance 0.8. The number of samples and features was fixed to $10^4$ and $10^3$ respectively. The penalization parameter $\lambda$ was fixed to 1.

In this setting variables are typically uncorrelated and most solvers perform decently:

Benchmark Logistic

Here, the Trust Region and L-BFGS solver perform almost equally good, with Conjugate Gradient and Truncated Newton falling shortly behind. I was surprised by the difference between BFGS and L-BFGS, I would have thought that when memory was not an issue both algorithms should perform similarly.

To make things more interesting, we now make the design to be slightly more correlated. We do so by adding a constant term of 1 to the matrix $X$ and adding also a column vector of ones this matrix to account for the intercept. These are the results:

Benchmark Logistic

Here, we already see that second-order methods dominate over first-order methods (well, except for BFGS), with Trust Region clearly dominating the picture but with TNC not far behind.

Finally, if we force the matrix to be even more correlated (we add 10. to the design matrix $X$), then we have:

Benchmark Logistic

Here, the Trust-Region method has the same timing as before, but all other methods have got substantially worse.The Trust Region method, unlike the other methods is surprisingly robust to correlated designs.

To sum up, the Trust Region method performs extremely well for optimizing the Logistic Regression model under different conditionings of the design matrix. The LIBLINEAR software uses this solver and thus has similar performance, with the sole exception that the evaluation of the logistic function and its derivatives is done in C++ instead of Python. In practice, however, due to the small number of iterations of this solver I haven't seen any significant difference.


  1. A similar development can be found in the source code of LIBLINEAR, and is probably also used elsewhere. 

  2. "A comparison of numerical optimizers for logistic regression", P. Minka, URL 

  3. "Newton's Method for Large Bound-Constrained Optimization Problems", Chih-Jen Lin, Jorge J. More URL 

  4. IPython Notebook to reproduce the benchmarks source 

by Fabian Pedregosa at May 19, 2013 10:00 PM

Jake Vanderplas

A Javascript Viewer for Matplotlib Animations

I'll cut to the chase: here's what I've created: a javascript-based animation viewer, with hooks to embed it in IPython. It's best viewed in a modern browser (unfortunately Firefox does not currently qualify as "modern" due to its lack of HTML5 support)

In [1]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type &aposhelp(pylab)&apos.
In [2]:
# get the JSAnimation import at https://github.com/jakevdp/JSAnimation
from JSAnimation import examples
examples.basic_animation()
Out[2]:


Once Loop Reflect

I think the result is pretty good, if I do say so myself.

The Background

Last week, Fernando Perez visited UW to give a talk for the eScience institute. Over lunch we were discussing the possibility of building a Javascript-based animation viewer which could be embedded in IPython notebooks. I had written a short hack to embed mp4 movies in IPython, which works quite well: Michael Kuhlen of Berkeley ran with the idea and made this notebook, which embeds a 3D rendering of orbits within an N-body simulation.

The problem with this mp4 approach is that it requires installation of ffmpeg or mencoder with the proper video codec libraries. What we wanted was something that only requires Python and a web browser: something that could use Javascript to display frames rendered by Matplotlib.

Above you see the result of a week's worth of evenings hacking on Python, html, and Javascript -- my first real foray into the latter. The result is a small python package, available on my github page: https://github.com/jakevdp/JSAnimation. See the README and examples on that page for details of how this can be used.

So what's going on here?

You can dig into the code to see how it works, but here's the short version:

The package adds an IPython representation hook to the animation object, similar to the one I showed here. When the animation is displayed, IPython calls the new HTMLWriter to convert the animation to an embeddable html document. This writer is capable of saving any animation to a stand-alone HTML file, with the frames either embedded or in a separate directory: this stand-alone file is created, read-in, and embedded into the document as raw HTML.

For IPython, the animation creates frames that are embedded directly in the HTML source via the base-64 representation. A base-64 representation is a standard way of encoding binary data to a normal string of text, which looks like this:

In [3]:
fig, ax = plt.subplots()
ax.plot(random.rand(100))
# write the figure to a temporary file, and encode the results to base64
import tempfile
with tempfile.NamedTemporaryFile(suffix='.png') as f:
    fig.savefig(f.name)
    data = open(f.name, 'rb').read().encode('base64')
    
# close the figure and display the data
plt.close(fig)
print data[:460]
iVBORw0KGgoAAAANSUhEUgAAAbAAAAEgCAYAAADVKCZpAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAAIABJREFUeJztnXt0XVW977877zZN2iZ9QJNAgYQ+KLRotaCi4VFLKxQF
PNRzREdFrGgH6pFzz7l6zj3gHVaq5w4Pw3od9QEHAQvq8VpQCMgjqC1thVYKtEJ5tKTpO02aNGne
6/7xy+xee+251l6POddj799njIw2yc7ea6+91vrO7/f3m3OlDMMwwDAMwzAJoyjqDWAYhmEYP7CA
MQzDMImEBYxhGIZJJCxgDMMwTCJhAWMYhmESCQsYwzAMk0hYwBiGYZhEwgLGMAzDJBIWMIZhGCaR
sIAxDMMwiYQFjGEYhkkkLGAMwzBMImEBYxiGYRIJCxjDMAyTSFjAGIZhmETCAsYwDMMkEhYwhmE

We only print the first few sections of the data, as it is a rather large string. The magic of this is that contained in that string is all the information needed to reconstruct the original PNG image. We can see that directly by inserting the string into an HTML image tag, which results in a frame embedded in the document itself:

In [4]:
from IPython.display import HTML
HTML('<img src="data:image/png;base64,{0}">'.format(data))
Out[4]:

This sort of thing is similar to what goes on in the background every time you use embedded figures in an IPython notebook.

By embedding all the frames this way, we're able to use Javascript to switch between them at a given frame-rate using the javascript setInterval() function. The rest is just straightforward javascript event handling.

Embedding Your Own Animation

If you'd like to use this to create your own animation, you can follow the suggestions in the animation tutorial. To embed your animation in the notebook, import IPython_display from the JSAnimation package. This will add the _repr_html_ method to the animation class, so that creating the animation will lead to it being displayed via the Javascript embedding.

Here is an example showing how the above animation was created:

In [5]:
from matplotlib import animation
from JSAnimation import IPython_display

fig = plt.figure()
ax = plt.axes(xlim=(0, 10), ylim=(-2, 2))
line, = ax.plot([], [], lw=2)

def init():
    line.set_data([], [])
    return line,

def animate(i):
    x = np.linspace(0, 10, 1000)
    y = np.cos(i * 0.02 * np.pi) * np.sin(x - i * 0.02 * np.pi)
    line.set_data(x, y)
    return line,

animation.FuncAnimation(fig, animate, init_func=init,
                        frames=100, interval=30)
Out[5]:


Once Loop Reflect

Remaining Issues

One of my goals in this was to make it relatively lightweight: For this reason, it doesn't depend on JQuery, JQuery-UI, and other nice packages that might be suited for this type of application.

For that reason, the frame slider uses the HTML5 slider element, which is not yet supported by all browsers. In particular, if you're using older versions of IE or Firefox, the frame dragger will appear as an ugly numerical input box. Making this compatible with non-HTML5-compliant browsers would be possible, but would require a lot more javascript hacking.

Second, this does not scale well to large animations. Because each frame is individually embedded in the document, the size of the notebook can become very large very quickly. Typical web-ready video formats involve a lot of image compression. This tends to be easy for video: most frames look very similar to the last, with just a few changes. Implementing this sort of compression in Javascript would certainly be possible, but is well beyond my minimal Javascript hacking abilities. It would be very cool if someone could run with this idea and create what would amount to a Javascript video codec to reduce the size of these embedded animations. I think it could be done with some effort.

As it is, though, I think this is a pretty neat widget and will definitely find good uses.

I hope you've found this useful, and thanks for reading! This was a fun one.

This post was written in the IPython notebook: The notebook can be downloaded here, or viewed statically here

by Jake Vanderplas at May 19, 2013 02:00 PM

May 16, 2013

Continuum Analytics

Wakari and Financial Health Checks

Accrual Ratios are an important index for assessing the performance and continued viability of every publicly traded company. Using Wakari, TR CONNECT, and Thomson Reuters Financial Data I calculated accrual ratios for a variety of companies and explored news events, which may correlate with strong signals in the data. You can easily download and edit the notebook used in this post into your Wakari account.

by Benjamin Zaitlen at May 16, 2013 01:45 PM

May 13, 2013

The Nipy blog

Nipy World blog moved from there to here

The Nipy World blog used to be on Blogspot at http://nipyworld.blogspot.com. I'm restarting the blog here because it's easier and more fun to write the blog using Pelican. I hope that it will be easier for my fellow NIPistas to add their own posts using Github.

by Matthew Brett at May 13, 2013 02:42 AM

Nipy World blog moved from there to here

The Nipy World blog used to be on Blogspot at http://nipyworld.blogspot.com. I'm restarting the blog here because it's easier and more fun to write the blog using Pelican. I hope that it will be easier for my fellow NIPistas to add their own posts using Github.

by Matthew Brett at May 13, 2013 02:42 AM

Jake Vanderplas

Embedding Matplotlib Animations in IPython Notebooks

I've spent a lot of time on this blog working with matplotlib animations (see the basic tutorial here, as well as my examples of animating a quantum system, an optical illusion, the Lorenz system in 3D, and recreating Super Mario). Up until now, I've not have not combined the animations with IPython notebooks. The problem is that so far the integration of IPython with matplotlib is entirely static, while animations are by their nature dynamic. There are some efforts in the IPython and matplotlib development communities to remedy this, but it's still not an ideal setup.

I had an idea the other day about how one might get around this limitation in the case of animations. By creating a function which saves an animation and embeds the binary data into an HTML string, you can fairly easily create automatically-embedded animations within a notebook.

The Animation Display Function

As usual, we'll start by enabling the pylab inline mode to make the notebook play well with matplotlib.

In [1]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type &aposhelp(pylab)&apos.

Now we'll create a function that will save an animation and embed it in an html string. Note that this will require ffmpeg or mencoder to be installed on your system. For reasons entirely beyond my limited understanding of video encoding details, this also requires using the libx264 encoding for the resulting mp4 to be properly embedded into HTML5.

In [2]:
from tempfile import NamedTemporaryFile

VIDEO_TAG = """<video controls>
 <source src="data:video/x-m4v;base64,{0}" type="video/mp4">
 Your browser does not support the video tag.
</video>"""

def anim_to_html(anim):
    if not hasattr(anim, '_encoded_video'):
        with NamedTemporaryFile(suffix='.mp4') as f:
            anim.save(f.name, fps=20, extra_args=['-vcodec', 'libx264'])
            video = open(f.name, "rb").read()
        anim._encoded_video = video.encode("base64")
    
    return VIDEO_TAG.format(anim._encoded_video)

With this HTML function in place, we can use IPython's HTML display tools to create a function which will show the video inline:

In [3]:
from IPython.display import HTML

def display_animation(anim):
    plt.close(anim._fig)
    return HTML(anim_to_html(anim))

Example of Embedding an Animation

The result looks something like this -- we'll use a basic animation example taken from my earlier Matplotlib Animation Tutorial post:

In [4]:
from matplotlib import animation

# First set up the figure, the axis, and the plot element we want to animate
fig = plt.figure()
ax = plt.axes(xlim=(0, 2), ylim=(-2, 2))
line, = ax.plot([], [], lw=2)

# initialization function: plot the background of each frame
def init():
    line.set_data([], [])
    return line,

# animation function.  This is called sequentially
def animate(i):
    x = np.linspace(0, 2, 1000)
    y = np.sin(2 * np.pi * (x - 0.01 * i))
    line.set_data(x, y)
    return line,

# call the animator.  blit=True means only re-draw the parts that have changed.
anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=100, interval=20, blit=True)

# call our new function to display the animation
display_animation(anim)
Out[4]:
Your browser does not support the video tag.

Making the Embedding Automatic

We can go a step further and use IPython's display hooks to automatically represent animation objects with the correct HTML. We'll simply set the _repr_html_ member of the animation base class to our HTML converter function:

In [5]:
animation.Animation._repr_html_ = anim_to_html

Now simply creating an animation will lead to it being automatically embedded in the notebook, without any further function calls:

In [6]:
animation.FuncAnimation(fig, animate, init_func=init,
                        frames=100, interval=20, blit=True)
Out[6]:
Your browser does not support the video tag.

So simple! I hope you'll find this little hack useful!

This post was created entirely in IPython notebook. Download the raw notebook here, or see a static view on nbviewer.

by Jake Vanderplas at May 13, 2013 02:00 AM

May 12, 2013

Mathieu Blondel

Large-scale sparse multiclass classification

I’m thrilled to announce that my paper “Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification” (published in the Machine Learning journal) is now online: PDF, BibTeX [*].

Abstract

Over the past decade, l1 regularization has emerged as a powerful way to learn classifiers with implicit feature selection. More recently, mixed-norm (e.g., l1/l2) regularization has been utilized as a way to select entire groups of features. In this paper, we propose a novel direct multiclass formulation specifically designed for large-scale and high-dimensional problems such as document classification. Based on a multiclass extension of the squared hinge loss, our formulation employs l1/l2 regularization so as to force weights corresponding to the same features to be zero across all classes, resulting in compact and fast-to-evaluate multiclass models. For optimization, we employ two globally-convergent variants of block coordinate descent, one with line search (Tseng and Yun, 2009) and the other without (Richtárik and Takáč, 2012). We present the two variants in a unified manner and develop the core components needed to efficiently solve our formulation. The end result is a couple of block coordinate descent algorithms specifically tailored to our multiclass formulation. Experimentally, we show that block coordinate descent performs favorably to other solvers such as FOBOS, FISTA and SpaRSA. Furthermore, we show that our formulation obtains very compact multiclass models and outperforms l1/l2- regularized multiclass logistic regression in terms of training speed, while achieving comparable test accuracy.

Code

The code of the proposed multiclass method is available in my Python/Cython machine learning library, lightning. Below is an example of how to use it on the News20 dataset.

from sklearn.datasets import fetch_20newsgroups_vectorized
from lightning.primal_cd import CDClassifier
 
bunch = fetch_20newsgroups_vectorized(subset="all")
X = bunch.data
y = bunch.target
 
clf = CDClassifier(penalty="l1/l2",
                   loss="squared_hinge",
                   multiclass=True,
                   max_iter=20,
                   alpha=1e-4,
                   C=1.0 / X.shape[0],
                   tol=1e-3)
clf.fit(X, y)
# accuracy
print clf.score(X, y) 
# percentage of selected features
print clf.n_nonzero(percentage=True)

To use the variant without line search (as presented in the paper), add the max_steps=0 option to CDClassifier.

Data

I also released the Amazon7 dataset used in the paper. It contains 1,362,109 reviews of Amazon products. Each review may belong to one of 7 categories (apparel, book, dvd, electronics, kitchen & housewares, music, video) and is represented as a 262,144-dimensional vector. It is, to my knowledge, one of the largest publically available multiclass classification dataset.

[*] The final publication is available here.

by Mathieu at May 12, 2013 12:52 PM

May 11, 2013

Matthieu Brucher

Annoucement: scikits.optimization 0.3

I’m please to announce a new version for scikits.optimization. The main focus of this iteration was to finish usual unconstrained optimization algorithms.

Changelog

  • Fixes on the Simplex state implementation
  • Added several Quasi-Newton steps (BFGS, rank 1 update…)

The scikit can be installed with pip/easy_install or downloaded from PyPI

Old announces:

Buy Me a Coffee!



Other Amount:



Your Email Address :



by Matt at May 11, 2013 06:58 AM

May 08, 2013

Continuum Analytics news

Continuum Analytics Releases Anaconda v1.5

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, announced today the release of the latest version of Anaconda, its premium collection of libraries for Python. Anaconda enables big data management, analysis, and cross-platform visualization for business intelligence, scientific analysis, engineering, machine learning, and more. The latest release, version 1.5, includes the addition of netCDF4, as well as updates to Python, NumPy, SciPy, IPython, MatPlotLib, Pandas, and Cython.

by Corinna Bahr at May 08, 2013 03:00 PM

Jake Vanderplas

Migrating from Octopress to Pelican

After nine months on Octopress, I've decided to move on.

I should start by saying that Octopress is a great platform for static blogging: it's powerful, flexible, well-supported, well-integrated with GitHub pages, and has tools and plugins to do just about anything you might imagine. There's only one problem:

It's written in Ruby.

Now I don't have anything against Ruby per se. However, it was starting to seem a bit awkward that a blog called Pythonic Perambulations was built with Ruby, especially given the availability of so many excellent Python-based static site generators (Hyde, Nikola, and Pelican in particular).

Additionally, a few things with Octopress were starting to become difficult: first, I wanted a way to easily insert IPython notebooks into posts. Sure, I developed a hackish solution to notebooks in Octopress which had worked well enough for a while, but a cleaner method would have involved digging into the Ruby source code and writing a full-fledged Octopress extension, based on nbconvert. This would have involved a fair bit of effort to learn Ruby and figure out how to best interface it with the Python nbconvert code. Second, Ruby has so many strange and difficult pieces: GemFiles, RVM, rake... and I never took the time to really understand the real purpose of all of them (self-reflection: what parts of Python would seem strange and difficult if I hadn't been using them for so many years?). The black-box nature of the process, at least in my own case, was starting to bother me.

But the kicker was this: In January I got a new computer, and after a reasonable amount of effort was unable to successfully build my blog. I've been writing my posts exclusively on my old laptop which I somehow managed to successfully set up last August. But that laptop now has a sorely outdated Ubuntu distro that I couldn't upgrade for fear of losing the ability to update my blog. Needless to say, this was not the most effective setup.

It was time to switch my blog engine to Python.

Choosing a Python Static Generator

I started asking around, and found that there were three solid contenders for a Python-based platform to replace Octopress: Hyde, Nikola, and Pelican. I gave Hyde a test-run a few weeks ago in re-making my website, and I really like it: it's clean, straightforward, powerful, and easy to use. The documentation is a bit lacking, though, and I think it would take a fair bit more effort at this point to build a more complicated site with Hyde.

Nikola and Pelican both seem to be well-loved by their users, but I had to choose one. I went with Pelican for one simple reason: it has more GitHub forks. I'm sure this is entirely unfair to Nikola and all the contributors who have poured their energy into the project, but I had to choose one way or another. I'm pleased to say that Pelican has not been a disappointment: I've found it to be flexible and powerful. It has an active developer-base, and makes available a wide array of themes and plugins. For the few extra pieces I needed, I found the plugin and theming API to be well-documented and straightforward to use.

Migrating to Pelican from Octopress

I won't attempt to write a one-size-fits-all guide to migrating to Pelican from Octopress: there are too many possibile combinations of formats, plugins, themes, etc. But I will walk through my own process in some detail, in hopes that it might help others who find themselves in a similar predicament.

I had several goals when doing this migration:

  • I wanted, as much as possible, to maintain the look and feel of the blog. I like the default Octopress theme: it's simple, clean, compact, and includes all the aspects I need for a good blog.
  • I wanted, as much as possible, to leave the source of my posts unmodified: luckily Pelican supports writing posts in markdown and allows easy insertion of custom plugins, so this was relatively easy to accommodate.
  • I wanted to maintain the history of Disqus comments for each page, as well as the Twitter and Google Pages tools.
  • I wanted, as much as possible, to maintain the same URLs for all content, including posts, notebooks, images, and videos.
  • I wanted a clean way to insert html-formatted IPython notebooks into blog posts. Nearly half my posts are written as notebooks, and the old way of including them was becoming much too cumbersome.

I was able to suitably address all these goals with Pelican in a few evenings' effort. Some of it was already built-in to the Pelican settings architecture, some required customization of themes and extensions, and some required writing some brand new plugins. I'll summarize these aspects below:

Blog theme

As I mentioned, I wanted to keep the look and feel of the blog consistent. Luckily, someone had gone before me and created an octopress Pelican theme which did most of the heavy lifting. I contributed a few additional features, including the ability to specify Disqus tags and maintain comment history, to add Twitter, Google Plus, and Facebook links in the sidebar and footer, to add a custom site search box which appears in the upper right of the navigation panel, as well as a few other tweaks. The result is what you see here: nearly identical to the old layout, with all the bells and whistles included.

Octopress Markdown to Pelican Markdown

Octopress has a few plugins which add some syntactic sugar to the markdown language. These are tags specified in Liquid-style syntax:

{% tag arg1 arg2 ... %}

I have made extensive use of these in my octopress posts, primarily to insert videos, images, and code blocks from file. In order to accommodate this in Pelican, I wrote a Pelican plugin which wraps a custom Markdown preprocessor written via the Markdown extension API which can correctly interpret these types of tags. The tags ported from octopress thus far are:

The Image Tag

The image tag allows insertion of an image into the post with a specified size and position:

{% img [position] /url/to/img.png [width] [height] [title] [alt] %}

Here is an example of the result of the image tag:

A Galaxy

The Video Tag

The video tag allows embedding of an HTML5/Flash-compatible video into the post:

{% video /url/to/video.mp4 [width] [height] [/url/to/poster.png] %}

Here is an example of the output of the video tag:

(see this post for a description of this video).

The Code Include Tag

The include_code tag allows the insertion of formatted lines from a file into the post, with a title and a link to the source file:

{% include_code filename.py [lang:python] [title] %}

Here is an example of the output of the code include tag:

Hello World hello_world.py download
import sys
import os
print("hello_world")

For more information on using these tags, refer to the module doc-strings.

Maintaining Disqus Comment Threads

Static blogs are fast, lightweight, and easy to deploy. A disadvantage, though, is the inability to natively include dynamic elements such as comment threads. Disqus is a third-party service that skirts this disadvantage very seamlessly. All it takes is to add a small javascript snippet with some identifiers in the appropriate place on your page, and Disqus takes care of the rest. To keep the comment history on each page required assuring that the site identifier and page identifiers remained the same between blog versions. This part happens within the theme, and my Disqus PR to the Pelican Octopress theme made this work correctly.

Maintaining the URL structure

By default, Octopress stores posts with a structure looking like blog/YYYY/MM/DD/article-slug/. The Pelican default is different, but easy enough to change. In the pelicanconf.py settings file, this corresponds to the following:

ARTICLE_URL = 'blog/{date:%Y}/{date:%m}/{date:%d}/{slug}/'
ARTICLE_SAVE_AS = 'blog/{date:%Y}/{date:%m}/{date:%d}/{slug}/index.html'

Next, at the top of the markdown file for each article, the metadata needs to be slightly modified from the form used by Octopress -- here is the actual metadata used in the document that generates this page:

Title: Migrating from Octopress to Pelican
date: 2013-05-07 17:00
slug: migrating-from-octopress-to-pelican

Additionally, the static elements of the blog (images, videos, IPython notebooks, code snippets, etc.) must be put within the correct directory structure. These static files should be put in paths which are specified via the STATIC_PATHS setting:

STATIC_PATHS = ['images', 'figures', 'downloads']

Pelican presented a challenge here: as of the time of this writing, Pelican has a hard-coded 'static' subdirectory where these static paths are saved. I submitted a pull request to Pelican that replaces this hard-coded setting with a configurable path: because the change conflicts with a bigger refactoring of the code which is ongoing, the PR will not be merged. But until this new refactoring is finished, I'll be using my own branch of Pelican to make this blog, and specify the correct static paths.

Inserting Notebooks

The ability to seamlessly insert IPython notebooks into posts was one of the biggest drivers of my switch to Pelican. Pelican has an ipython notebook plugin available, but I wasn't completely happy with it. The plugin implements a reader which treats the notebooks themselves as the source of a post, leading to the requirement to insert blog metadata into the notebook itself. This is a suitable solution, but for my own purposes I much prefer a solution in which the content of a notebook could be inserted into a stand-alone post, such that the notebook and the blog metadata are completely separate.

To accomplish this, I added a submodule to my liquid_tags Pelican plugin which allows the insertion of notebooks using the following syntax:

{% notebook path/to/notebook.ipynb [cells[i:j]] %}

This inserts the notebook at the given location in the post, optionally selecting a given range of notebook cells with standard Python list slicing syntax.

The formatting of notebooks requires some special CSS styles which must be inserted into the header of each page where notebooks are shown. Rather than requiring the theme to be customized to support notebooks, I decided on a solution where an EXTRA_HEADER setting is used to specify html and CSS which should be added to the header of the main page. The notebook plugin saves the required header to a file called _nb_header.html within the main source directory. To insert the appropriate formatting, we add the following lines to the settings file, pelicanconf.py:

EXTRA_HEADER = open('_nb_header.html').read().decode('utf-8')

In the theme file, within the header tag, we add the following:

 {% if EXTRA_HEADER %}
 {{ EXTRA_HEADER }}
 {% endif %}

Here is the result: a short notebook inserted directly into the post:

This Is An IPython Notebook

Here is some code:

In [1]:
import numpy
print numpy.random.random(10)
[ 0.25463203  0.55637185  0.02743774  0.57380221  0.52378531  0.95099357
  0.70975568  0.19575853  0.589227    0.06959599]

Here is some math:

$$e^{i\pi} + 1 = 0$$

With all those things in place, the blog was ready to be built. The result is right in front of you: you're reading it. If you'd like to see the source from which this blog built, it's available at http://www.github.com/jakevdp/PythonicPerambulations. Feel free to adapt the configurations and theme to suit your own needs.

I'm glad to be working purely in Python from now on!

by Jake Vanderplas at May 08, 2013 12:00 AM

May 07, 2013

Continuum Analytics

Accelerating Python Libraries with Numba (Part 1)

Welcome. This post is part of a series of Continuum Analytics Open Notebooks showcasing our projects, products, and services.

In this Continuum Open Notebook, you’ll see how Numba accelerates the performance of the GrowCut image segmentation window function by three orders of magnitude in two lines of Python.

by Aron Ahmadia at May 07, 2013 11:07 AM

May 02, 2013

Matthieu Brucher

Comparison of optimization algorithms

In the next version of scikits.optimization, I’ve added some Quasi-Newton steps. Before this version is released, I thought I would compare several methods of optimizing the Rosenbrock function.

Optimizers

What is great with the Rosenbrock cost function can be summed up in a few points:

  1. It is hard to optimize
  2. Gradient can be easily computed

I’ve decided to compare the number of function and gradient calls as well as the cost behavior for several usual optimization algorithms. So the contestants will be:

  • A Nelder-Mead/Polytope/simplex optimizer
  • SSA, a simplex with simulated annealing (think of amotsa from Numerical Recipes)
  • GA, a genetic algorithm
  • a gradient-based optimizer
  • CG, a conjugate-gradient optimizer
  • BFGS, a quasi-Newton optimizer

The first 4 are from scikits.optimization, the last 2 are based on proprietary code that cannot be published, but it’s interesting to compare with other tools that are used to compare gradient-free complex cost functions.

Results

I’ve made a small slideshow with the derivative-free algorithms. First you have for each of the three algorithms the number of function calls versus iteration, the cost versus iteration and finally the location of testing parameters.
Click to view slideshow.

This slideshow is for the derivative-based algorithms.
Click to view slideshow.

Conclusion

I was quite surprised by some algorithm behaviors. Clearly, the conjugate gradient algorithm behaves far better than the simple gradient, but the BFGS followed the Rosenbrock valley far better. A good quasi-Newton can be really efficient (not a brent because it needs to solve a linear equation), although a conjugate gradient can be enough in some cases.

For the gradient-free algorithms, SSA really behaved badly. This is mainly due because the hyperparameters that must be adequately tuned. This function is quite simple, but my first trial at setting these parameters was far more efficient for GA or the simplex than for SSA. So I would go for GA for gradient-free optimization: few and easy hyper parameters and a good browse of the search space.

The code for the 4 first tests and display plots

by Matt at May 02, 2013 07:03 AM

May 01, 2013

Fabian Pedregosa

Logistic Ordinal Regression

TL;DR: I've implemented a logistic ordinal regression or proportional odds model. Here is the Python code

The logistic ordinal regression model, also known as the proportional odds was introduced in the early 80s by McCullagh [1, 2] and is a generalized linear model specially tailored for the case of predicting ordinal variables, that is, variables that are discrete (as in classification) but which can be ordered (as in regression). It can be seen as an extension of the logistic regression model to the ordinal setting.

Given $X \in \mathbb{R}^{n \times p}$ input data and $y \in \mathbb{N}^n$ target values. For simplicity we assume $y$ is a non-decreasing vector, that is, $y_1 \leq y_2 \leq ...$. Just as the logistic regression models posterior probability $P(y=j|X_i)$ as the logistic function, in the logistic ordinal regression we model the cummulative probability as the logistic function. That is,

$$ P(y \leq j|X_i) = \phi(\theta_j - w^T X_i) = \frac{1}{1 + \exp(w^T X_i - \theta_j)} $$

where $w, \theta$ are vectors to be estimated from the data and $\phi$ is the logistic function defined as $\phi(t) = 1 / (1 + \exp(-t))$.

Toy example with three classes denoted in different colors. Also shown the vector of coefficients $w$ and the thresholds $\theta_0$ and $\theta_1$

Compared to multiclass logistic regression, we have added the constrain that the hyperplanes that separate the different classes are parallel for all classes, that is, the vector $w$ is common across classes. To decide to which class will $X_i$ be predicted we make use of the vector of thresholds $\theta$. If there are $K$ different classes, $\theta$ is a non-decreasing vector (that is, $\theta_1 \leq \theta_2 \leq ... \leq \theta_{K-1}$) of size $K-1$. We will then assign the class $j$ if the prediction $w^T X$ (recall that it's a linear model) lies in the interval $[\theta_{j-1}, \theta_{j}[$. In order to keep the same definition for extremal classes, we define $\theta_{0} = - \infty$ and $\theta_K = + \infty$.

The intuition is that we are seeking a vector $w$ such that $X w$ produces a set of values that are well separated into the different classes by the different thresholds $\theta$. We choose a logistic function to model the probability $P(y \leq j|X_i)$ but other choices are possible. In the proportional hazards model 1 the probability is modeled as $-\log(1 - P(y \leq j | X_i)) = \exp(\theta_j - w^T X_i)$. Other link functions are possible, where the link function satisfies $\text{link}(P(y \leq j | X_i)) = \theta_j - w^T X_i$. Under this framework, the logistic ordinal regression model has a logistic link function and the proportional hazards model has a log-log link function.

The logistic ordinal regression model is also known as the proportional odds model, because the ratio of corresponding odds for two different samples $X_1$ and $X_2$ is $\exp(w^T(X_1 - X_2))$ and so does not depend on the class $j$ but only on the difference between the samples $X_1$ and $X_2$.

Optimization

Model estimation can be posed as an optimization problem. Here, we minimize the loss function for the model, defined as minus the log-likelihood:

$\begin{align} \mathcal{L}(w, \theta) &= - \sum_{i=1}^n \log(\phi(\theta_{y_i} - w^T X_i) - \phi(\theta_{y_i -1} - w^T X_i)) \\ &= \sum_{i=1}^n w^T X_i - \log(\exp(\theta_{y_i}) - \exp(\theta_{y_i-1})) + \log(\phi(\theta_{y_i -1} - w^T X_i)) + \log(\phi(\theta_{y_i} - w^T X_i)) \\ \end{align}$

In this sum all terms are convex on $w$, thus the loss function is convex over $w$. It might be also jointly convex over $w$ and $\theta$, although I haven't checked. I use the function fmin_slsqp in scipy.optimize to optimize $\mathcal{L}$ under the constraint that $\theta$ is a non-decreasing vector. There might be better options, I don't know. If you do know, please leave a comment!.

Using the formula $\log(\phi(t))^\prime = (1 - \phi(t))$, we can compute the gradient of the loss function as

$\begin{align} \nabla_w \mathcal{L}(w, \theta) &= \sum_{i=1}^n X_i (1 - \phi(\theta_{y_i} - w^T X_i)) - \phi(\theta_{y_i-1} - w^T X_i)) \\ % \nabla_\theta \mathcal{L}(w, \theta) &= \sum_{i=1}^n - \frac{e_{y_i} \exp(\theta_{y_i}) - e_{y_i -1} \exp(\theta_{y_i -1})}{\exp(\theta_{y_i}) - \exp(\theta_{y_i-1})} \\ \nabla_\theta \mathcal{L}(w, \theta) &= \sum_{i=1}^n e_{y_i} \left(1 - \phi(\theta_{y_i} - w^T X_i) - \frac{1}{1 - \exp(\theta_{y_i -1} - \theta_{y_i})}\right) \\ & \qquad + e_{y_i -1}\left(1 - \phi(\theta_{y_i -1} - w^T X_i) - \frac{1}{1 - \exp(- (\theta_{y_i-1} - \theta_{y_i}))}\right) \end{align}$

where $e_i$ is the $i$th canonical vector.

Code

I've implemented a Python version of this algorithm using Scipy's optimize.fmin_slsqp function. This takes as arguments the loss function, the gradient denoted before and a function that is > 0 when the inequalities on $\theta$ are satisfied.

Code can be found here as part of the minirank package, which is my sandbox for code related to ranking and ordinal regression. At some point I would like to submit it to scikit-learn but right now the I don't know how the code will scale to medium-scale problems, but I suspect not great. On top of that I'm not sure if there is a real demand of these models for scikit-learn and I don't want to bloat the package with unused features.

Performance

I compared the prediction accuracy of this model in the sense of mean absolute error (IPython notebook) on the boston house-prices dataset. To have an ordinal variable, I rounded the values to the closest integer, which gave me a problem of size 506 $\times$ 13 with 46 different target values. Although not a huge increase in accuracy, this model did give me better results on this particular dataset:

Here, ordinal logistic regression is the best-performing model, followed by a Linear Regression model and a One-versus-All Logistic regression model as implemented in scikit-learn.


  1. "Regression models for ordinal data", P. McCullagh, Journal of the royal statistical society. Series B (Methodological), 1980 

  2. "Generalized Linear Models", P. McCullagh and J. A. Nelder (Book) 

  3. "Loss Functions for Preference Levels : Regression with Discrete Ordered Labels", Jason D. M. Rennie, Nathan Srebro 

by Fabian Pedregosa at May 01, 2013 10:00 PM

Josef Perkoltd

Power Plots in statsmodels

I just want to show another two plots for the statistical power of a test, since I didn't have time for this earlier

The code to produce them is just calling the methods of the power classes, for example for the one sample t-test.

Read more »

by Josef Perktold (noreply@blogger.com) at May 01, 2013 10:21 PM

April 30, 2013

Continuum Analytics

MKL Optimizations for Anaconda

We are happy to announce a new Anaconda add-on product called MKL Optimizations”, which allows packages in Anaconda to take advantage of the Math Kernel Library (MKL) by Intel.

by Ilan Schnell at April 30, 2013 03:15 PM

April 29, 2013

Jake Vanderplas

Benchmarking Nearest Neighbor Searches in Python

I recently submitted a scikit-learn pull request containing a brand new ball tree and kd-tree for fast nearest neighbor searches in python. In this post I want to highlight-ipynb some of the features of the new ball tree and kd-tree code that's part of this pull request, compare it to what's available in the scipy.spatial.cKDTree implementation, and run a few benchmarks showing the performance of these methods on various data sets.

My first-ever open source contribution was a C++ Ball Tree code, with a SWIG python wrapper, that I submitted to scikit-learn. A Ball Tree is a data structure that can be used for fast high-dimensional nearest-neighbor searches: I'd written it for some work I was doing on nonlinear dimensionality reduction of astronomical data (work that eventually led to these two papers), and thought that it might find a good home in the scikit-learn project, which Gael and others had just begun to bring out of hibernation.

After a short time, it became clear that the C++ code was not performing as well as it could be. I spent a bit of time writing a Cython adaptation of the Ball Tree, which is what currently resides in the sklearn.neighbors module. Though this implementation is fairly fast, it still has several weaknesses:

  • It only works with a Minkowski distance metric (of which Euclidean is a special case). In general, a ball tree can be written to handle any true metric (i.e. one which obeys the triangle inequality).
  • It implements only the single-tree approach, not the potentially faster dual-tree approach in which a ball tree is constructed for both the training and query sets.
  • It implements only nearest-neighbors queries, and not any of the other tasks that a ball tree can help optimize: e.g. kernel density estimation, N-point correlation function calculations, and other so-called Generalized N-body Problems.

I had started running into these limits when creating astronomical data analysis examples for astroML, the Python library for Astronomy and Machine Learning Python that I released last fall. I'd been thinking about it for a while, and finally decided it was time to invest the effort into updating and enhancing the Ball Tree. It took me longer than I planned (in fact, some of my first posts on this blog last August came out of the benchmarking experiments aimed at this task), but just a couple weeks ago I finally got things working and submitted a pull request to scikit-learn with the new code.

Features of the New Ball Tree and KD Tree

The new code is actually more than simply a new ball tree: it's written as a generic N dimensional binary search tree, with specific methods added to implement a ball tree and a kd-tree on top of the same core functionality. The new trees have a lot of very interesting and powerful features:

  • The ball tree works with any of the following distance metrics, which match those found in the module scipy.spatial.distance:

    ['euclidean', 'minkowski', 'manhattan', 'chebyshev', 'seuclidean', 'mahalanobis', 'wminkowski', 'hamming', 'canberra', 'braycurtis', 'matching', 'jaccard', 'dice', 'kulsinski', 'rogerstanimoto', 'russellrao', 'sokalmichener', 'sokalsneath', 'haversine']

Alternatively, the user can specify a callable Python function to act as the distance metric. While this will be quite a bit slower than using one of the optimized metrics above, it adds nice flexibility.

  • The kd-tree works with only the first four of the above metrics. This limitation is primarily because the distance bounds are less efficiently calculated for metrics which are not axis-aligned.

  • Both the ball tree and kd-tree implement k-neighbor and bounded neighbor searches, and can use either a single tree or dual tree approach, with either a breadth-first or depth-first tree traversal. Naive nearest neighbor searches scale as $\mathcal{O}[N^2]$; the tree-based methods here scale as $\mathcal{O}[N \log N]$.

  • Both the ball tree and kd-tree have their memory pre-allocated entirely by numpy: this not only leads to code that's easier to debug and maintain (no memory errors!), but means that either data structure can be serialized using Python's pickle module. This is a very important feature in some contexts, most notably when estimators are being sent between multiple machines in a parallel computing framework.

  • Both the ball tree and kd-tree implement fast kernel density estimation (KDE), which can be used within any of the valid distance metrics. The supported kernels are

    ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine']

the combination of these kernel options with the distance metric options above leads to an extremely large number of effective kernel forms. Naive KDE scales as $\mathcal{O}[N^2]$; the tree-based methods here scale as $\mathcal{O}[N \log N]$.

  • Both the ball tree and kd-tree implement fast 2-point correlation functions. A correlation function is a statistical measure of the distribution of data (related to the Fourier power spectrum of the density distribution). Naive 2-point correlation calculations scale as $\mathcal{O}[N^2]$; the tree-based methods here scale as $\mathcal{O}[N \log N]$.

Comparison with cKDTree

As mentioned above, there is another nearest neighbor tree available in the SciPy: scipy.spatial.cKDTree. There are a number of things which distinguish the cKDTree from the new kd-tree described here:

  • like the new kd-tree, cKDTree implements only the first four of the metrics listed above.

  • Unlike the new ball tree and kd-tree, cKDTree uses explicit dynamic memory allocation at the construction phase. This means that the trained tree object cannot be pickled, and must be re-constructed in place of being serialized.

  • Because of the flexibility gained through the use of dynamic node allocation, cKDTree can implement a more sophisticated building methods: it uses the "sliding midpoint rule" to ensure that nodes do not become too long and thin. One side-effect of this, however, is that for certain distributions of points, you can end up with a large proliferation of the number of nodes, which may lead to a huge memory footprint (even memory errors in some cases) and potentially inefficient searches.

  • The cKDTree builds its nodes covering the entire $N$-dimensional data space. this leads to relatively efficient build times because node bounds do not need to be recomputed at each level. However, the resulting tree is not as compact as it could be, which potentially leads to slower query times. The new ball tree and kd tree code shrinks nodes to only cover the part of the volume which contains points.

With these distinctions, I thought it would be interesting to do some benchmarks and get a detailed comparison of the performance of the three trees. Note that the cKDTree has just recently been re-written and extended, and is much faster than its previous incarnation. For that reason, I've run these benchmarks with the current bleeding-edge scipy.

Preparing the Benchmarks

But enough words. Here we'll create some scripts to run these benchmarks. There are several variables that will affect the computation time for a neighbors query:

  • The number of points $N$: for a brute-force search, the query will scale as $\mathcal{O}[N^2]$ . Tree methods usually bring this down to $\mathcal{O}[N \log N]$ .
  • The dimension of the data, $D$ : both brute-force and tree-based methods will scale approximately as $\mathcal{O}[D]$ . For high dimensions, however, the curse of dimensionality can make this scaling much worse.
  • The desired number of neighbors, $k$ : $k$ does not affect build time, but affects query time in a way that is difficult to quantify
  • The tree leaf size, leaf_size: The leaf size of a tree roughly specifies the number of points at which the tree switches to brute-force, and encodes the tradeoff between the cost of accessing a node, and the cost of computing the distance function.
  • The structure of the data: though data structure and distribution do not affect brute-force queries, they can have a large effect on the query times of tree-based methods.
  • Single/Dual tree query: A single-tree query searches for neighbors of one point at a time. A dual tree query builds a tree on both sets of points, and traverses both trees at the same time. This can lead to significant speedups in some cases.
  • Breadth-first vs Depth-first search: This determines how the nodes are traversed. In practice, it seems not to make a significant difference, so it won't be explored here.
  • The chosen metric: some metrics are slower to compute than others. The metric may also affect the structure of the data, the geometry of the tree, and thus the query and build times.

In reality, query times depend on all seven of these variables in a fairly complicated way. For that reason, I'm going to show several rounds of benchmarks where these variables are modified while holding the others constant. We'll do all our tests here with the most common Euclidean distance metric, though others could be substituted if desired.

We'll start by doing some imports to get our IPython notebook ready for the benchmarks. Note that at present, you'll have to install scikit-learn off my development branch for this to work. In the future, the new KDTree and BallTree will be part of a scikit-learn release.

In [1]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type &aposhelp(pylab)&apos.
In [2]:
import numpy as np
from scipy.spatial import cKDTree
from sklearn.neighbors import KDTree, BallTree

Data Sets

For spatial tree benchmarks, it's important to use various realistic data sets. In practice, data rarely looks like a uniform distribution, so running benchmarks on such a distribution will not lead to accurate expectations of the algorithm performance.

For this reason, we'll test three datasets side-by-side: a uniform distribution of points, a set of pixel values from images of hand-written digits, and a set of flux observations from astronomical spectra.

In [3]:
# Uniform random distribution
uniform_N = np.random.random((10000, 4))
uniform_D = np.random.random((1797, 128))
In [4]:
# Digits distribution
from sklearn.datasets import load_digits
digits = load_digits()

print digits.images.shape
(1797, 8, 8)
In [5]:
# We need more than 1797 digits, so let's stack the central
# regions of the images to inflate the dataset.
digits_N = np.vstack([digits.images[:, 2:4, 2:4],
                      digits.images[:, 2:4, 4:6],
                      digits.images[:, 4:6, 2:4],
                      digits.images[:, 4:6, 4:6],
                      digits.images[:, 4:6, 5:7],
                      digits.images[:, 5:7, 4:6]])
digits_N = digits_N.reshape((-1, 4))[:10000]

# For the dimensionality test, we need up to 128 dimesnions, so
# we'll combine some of the images.
digits_D = np.hstack((digits.data,
                      np.vstack((digits.data[:1000], digits.data[1000:]))))
# The edge pixels are all basically zero.  For the dimensionality tests
# to be reasonable, we want the low-dimension case to probe interir pixels
digits_D = np.hstack([digits_D[:, 28:], digits_D[:, :28]])
In [6]:
# The spectra can be downloaded with astroML: see http://www.astroML.org
from astroML.datasets import fetch_sdss_corrected_spectra
spectra = fetch_sdss_corrected_spectra()['spectra']
spectra.shape
Out[6]:
(4000, 1000)
In [7]:
# Take sections of spectra and stack them to reach N=10000 samples
spectra_N = np.vstack([spectra[:, 500:504],
                       spectra[:, 504:508],
                       spectra[:2000, 508:512]])
# Take a central region of the spectra for the dimensionality study
spectra_D = spectra[:1797, 400:528]
In [8]:
print uniform_N.shape, uniform_D.shape
print digits_N.shape, digits_D.shape
print spectra_N.shape, spectra_D.shape
(10000, 4) (1797, 128)
(10000, 4) (1797, 128)
(10000, 4) (1797, 128)

We now have three datasets with similar sizes. Just for the sake of visualization, let's visualize two dimensions from each as a scatter-plot:

In [9]:
titles = ['Uniform', 'Digits', 'Spectra']
datasets_D = [uniform_D, digits_D, spectra_D]
datasets_N = [uniform_N, digits_N, spectra_N]

fig, ax = plt.subplots(1, 3, figsize=(12, 3.5))

for axi, title, dataset in zip(ax, titles, datasets_D):
    axi.plot(dataset[:, 1], dataset[:, 2], '.k')
    axi.set_title(title, size=14)

We can see how different the structure is between these three sets. The uniform data is randomly and densely distributed throughout the space. The digits data actually comprise discrete values between 0 and 16, and more-or-less fill certain regions of the parameter space. The spectra display strongly-correlated values, such that they occupy a very small fraction of the total parameter volume.

Benchmarking Scripts

Now we'll create some scripts that will help us to run the benchmarks. Don't worry about these details for now -- you can simply scroll down past these and get to the plots.

In [10]:
from time import time

def average_time(executable, *args, **kwargs):
    """Compute the average time over N runs"""
    N = 5
    t = 0
    for i in range(N):
        t0 = time()
        res = executable(*args, **kwargs)
        t1 = time()
        t += (t1 - t0)
    return res, t * 1. / N
In [11]:
TREE_DICT = dict(cKDTree=cKDTree, KDTree=KDTree, BallTree=BallTree)
colors = dict(cKDTree='black', KDTree='red', BallTree='blue', brute='gray', gaussian_kde='black')

def bench_knn_query(tree_name, X, N, D, leaf_size, k,
                    build_args=None, query_args=None):
    """Run benchmarks for the k-nearest neighbors query"""
    Tree = TREE_DICT[tree_name]
    
    if build_args is None:
        build_args = {}
        
    if query_args is None:
        query_args = {}
        
    NDLk = np.broadcast(N, D, leaf_size, k)
        
    t_build = np.zeros(NDLk.size)
    t_query = np.zeros(NDLk.size)
    
    for i, (N, D, leaf_size, k) in enumerate(NDLk):
        XND = X[:N, :D]
        
        if tree_name == 'cKDTree':
            build_args['leafsize'] = leaf_size
        else:
            build_args['leaf_size'] = leaf_size
        
        tree, t_build[i] = average_time(Tree, XND, **build_args)
        res, t_query[i] = average_time(tree.query, XND, k, **query_args)
        
    return t_build, t_query
In [12]:
def plot_scaling(data, estimate_brute=False, suptitle='', **kwargs):
    """Plot the scaling comparisons for different tree types"""
    # Find the iterable key
    iterables = [key for (key, val) in kwargs.iteritems() if hasattr(val, '__len__')]
    if len(iterables) != 1:
        raise ValueError("A single iterable argument must be specified")
    x_key = iterables[0]
    x = kwargs[x_key]
    
    # Set some defaults
    if 'N' not in kwargs:
        kwargs['N'] = data.shape[0]
    if 'D' not in kwargs:
        kwargs['D'] = data.shape[1]
    if 'leaf_size' not in kwargs:
        kwargs['leaf_size'] = 15
    if 'k' not in kwargs:
        kwargs['k'] = 5
    
    fig, ax = plt.subplots(1, 2, figsize=(10, 4),
                           subplot_kw=dict(yscale='log', xscale='log'))
        
    for tree_name in ['cKDTree', 'KDTree', 'BallTree']:
        t_build, t_query = bench_knn_query(tree_name, data, **kwargs)
        ax[0].plot(x, t_build, color=colors[tree_name], label=tree_name)
        ax[1].plot(x, t_query, color=colors[tree_name], label=tree_name)
        
        if tree_name != 'cKDTree':
            t_build, t_query = bench_knn_query(tree_name, data,
                                               query_args=dict(breadth_first=True, dualtree=True),
                                               **kwargs)
            ax[0].plot(x, t_build, color=colors[tree_name], linestyle='--')
            ax[1].plot(x, t_query, color=colors[tree_name], linestyle='--')
            
    if estimate_brute:
        Nmin = np.min(kwargs['N'])
        Dmin = np.min(kwargs['D'])
        kmin = np.min(kwargs['k'])
        
        # get a baseline brute force time by setting the leaf size large,
        # ensuring a brute force calculation over the data
        _, t0 = bench_knn_query('KDTree', data, N=Nmin, D=Dmin, leaf_size=2 * Nmin, k=kmin)
        
        # use the theoretical scaling: O[N^2 D]
        if x_key == 'N':
            exponent = 2
        elif x_key == 'D':
            exponent = 1
        else:
            exponent = 0
            
        t_brute = t0 * (np.array(x, dtype=float) / np.min(x)) ** exponent
        ax[1].plot(x, t_brute, color=colors['brute'], label='brute force (est.)')
            
    for axi in ax:
        axi.grid(True)
        axi.set_xlabel(x_key)
        axi.set_ylabel('time (s)')
        axi.legend(loc='upper left')
        axi.set_xlim(np.min(x), np.max(x))
        
    info_str = ', '.join([key + '={' + key + '}' for key in ['N', 'D', 'k'] if key != x_key])
    ax[0].set_title('Tree Build Time ({0})'.format(info_str.format(**kwargs)))
    ax[1].set_title('Tree Query Time ({0})'.format(info_str.format(**kwargs)))
    
    if suptitle:
        fig.suptitle(suptitle, size=16)
    
    return fig, ax

Benchmark Plots

Now that all the code is in place, we can run the benchmarks. For all the plots, we'll show the build time and query time side-by-side. Note the scales on the graphs below: overall, the build times are usually a factor of 10-100 faster than the query times, so the differences in build times are rarely worth worrying about.

A note about legends: we'll show single-tree approaches as a solid line, and we'll show dual-tree approaches as dashed lines. In addition, where it's relevant, we'll estimate the brute force scaling for ease of comparison.

Scaling with Leaf Size

We will start by exploring the scaling with the leaf_size parameter: recall that the leaf size controls the minimum number of points in a given node, and effectively adjusts the tradeoff between the cost of node traversal and the cost of a brute-force distance estimate.

In [13]:
leaf_size = 2 ** np.arange(10)
for title, dataset in zip(titles, datasets_N):
    fig, ax = plot_scaling(dataset, N=2000, leaf_size=leaf_size, suptitle=title)

Note that with larger leaf size, the build time decreases: this is because fewer nodes need to be built. For the query times, we see a distinct minimum. For very small leaf sizes, the query slows down because the algorithm must access many nodes to complete the query. For very large leaf sizes, the query slows down because there are too many pairwise distance computations. If we were to use a less efficient metric function, the balance between these would change and a larger leaf size would be warranted. This benchmark motivates our setting the leaf size to 15 for the remaining tests.

Scaling with Number of Neighbors

Here we'll plot the scaling with the number of neighbors $k$. This should not effect the build time, because $k$ does not enter there. It will, however, affect the query time:

In [14]:
k = 2 ** np.arange(1, 10)
for title, dataset in zip(titles, datasets_N):
    fig, ax = plot_scaling(dataset, N=4000, k=k, suptitle=title,
                           estimate_brute=True)

Naively you might expect linear scaling with $k$, but for large $k$ that is not the case. Because a priority queue of the nearest neighbors must be maintained, the scaling is super-linear for large $k$.

We also see that brute force has no dependence on $k$ (all distances must be computed in any case). This means that if $k$ is very large, a brute force approach will win out (though the exact value for which this is true depends on $N$, $D$, the structure of the data, and all the other factors mentioned above).

Note that although the cKDTree build time is a factor of ~3 faster than the others, the absolute time difference is less than two milliseconds: a difference which is orders of magnitude smaller than the query time. This is due to the shortcut mentioned above: the cKDTree doesn't take the time to shrink the bounds of each node.

Scaling with the Number of Points

This is where things get interesting: the scaling with the number of points $N$ :

In [15]:
N = (10 ** np.linspace(2, 4, 10)).astype(int)
for title, dataset in zip(titles, datasets_N):
    plot_scaling(dataset, N=N, estimate_brute=True, suptitle=title)

We have set d = 4 and k = 5 in each case for ease of comparison. Examining the graphs, we see some common traits: all the tree algorithms seem to be scaling as approximately $\mathcal{O}[N\log N]$, and both kd-trees are beating the ball tree. Somewhat surprisingly, the dual tree approaches are slower than the single-tree approaches. For 10,000 points, the speedup over brute force is around a factor of 50, and this speedup will get larger as $N$ further increases.

Additionally, the comparison of datasets is interesting. Even for this low dimensionality, the tree methods tend to be slightly faster for structured data than for uniform data. Surprisingly, the cKDTree performance gets worse for highly structured data. I believe this is due to the use of the sliding midpoint rule: it works well for evenly distributed data, but for highly structured data can lead to situations where there are many very sparsely-populated nodes.

Scaling with the Dimension

As a final benchmark, we'll plot the scaling with dimension.

In [16]:
D = 2 ** np.arange(8)
for title, dataset in zip(titles, datasets_D):
    plot_scaling(dataset, D=D, estimate_brute=True, suptitle=title)

As we increase the dimension, we see something interesting. For more broadly-distributed data (uniform and digits), the dual-tree approach begins to out-perform the single-tree approach, by as much as a factor of 2. In bottom-right panel, we again see a strong effect of the cKDTree's shortcut in construction: because it builds nodes which span the entire volume of parameter space, most of these nodes are quite empty, especially as the dimension is increased. This leads to queries which are quite a bit slower for sparse data in high dimensions, and overwhelms by a factor of 100 any computational savings at construction.

Conclusion

In a lot of ways, the plots here are their own conclusion. But in general, this exercise convinces me that the new Ball Tree and KD Tree in scikit-learn are at the very least equal to the scipy implementation, and in some cases much better:

  • All three trees scale in the expected way with the number and dimension of the data
  • All three trees beat brute force by orders of magnitude in all but the most extreme circumstances.
  • The cKDTree seems to be less optimal for highly-structured data, which is the kind of data that is generally of interest.
  • The cKDTree has the further disadvantage of using dynamically allocated nodes, which cannot be serialized. The pre-allocation of memory for the new ball tree and kd tree solves this problem.

On top of this, the new ball tree and kd tree have several other advantages, including more flexibility in traversal methods, more available metrics, and more availale query types (e.g. KDE and 2-point correlation).

One thing that still puzzles me is the fact that the dual tree approaches don't offer much of an improvement over single tree. The literature on the subject would make me expect otherwise (FastLab, for example, quotes near-linear-time queries for dual tree approaches), so perhaps there's some efficiency I've missed.

In a later post, I plan to go into more detail and explore and benchmark some of the new functionalities added: the kernel density estimation and 2-point correlation function methods. Until then, I hope you've found this post interesting, and I hope you find this new code useful!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

by Jake Vanderplas at April 29, 2013 03:56 PM

April 24, 2013

Josef Perkoltd

Help: I caught a bug

I think I must be turning too much into a statistician and econometrician lately, I must have caught a virus or something. Maybe it started already a while ago

The theme of the scipy conference this year is "Machine Learning & Tools for Reproducible Science". However, I'm not doing any sexy twitter analysis, I just spent some days coding tests for proportion, boring stuff like pairwise comparisons of proportions.

Anyway, I decided to submit a tutorial proposal for econometrics with statsmodels to the scipy conference, see (lightly edited) proposal below. Since my proposal didn't get accepted, my first response was: Wrong topic, Too much statistic, We just want numbers, not check whether the model is correct, and find out how to fix it.

That leaves me with more time to go back to figuring out which other basic statistical tests are still missing in Python.

Read more »

by Josef Perktold (noreply@blogger.com) at April 24, 2013 01:32 PM

Statistics in Python: Reproducing Research

This is just a short comment on a blog post.

Ferando Perez wrote a nice article about "Literate computing" and computational reproducibility: IPython in the age of data-driven journalism

In the second part, he explains that Vincent Arel-Bundock came up with an ipython notebook within three hours to replicate some criticism of an economics journal article. Vincent's notebook can be seen here.

What I found most striking was not the presentation as a notebook, although that makes it easy to read, instead it was: pandas, patsy and statsmodels, and no R in sight. We have come a long way with Statistics in Python since I started to get involved in it five years ago.

Vincent has made many improvements and contributions to statsmodels in the last year.

Aside

I'm not following much of the economics debates these days, so I only know what I read in the two references that Fernando gave.

My impression is that it's just the usual (mis)use of economics research results. Politicians like the numbers that give them ammunition for their position. As economist, you are either very careful about how to present the results, or you join the political game (I worked for several years in an agricultural department of a small country). (An example for the use of economics results in another area, blaming the financial crisis on the work on copulas.)

"Believable" research: If your results sound too good or too interesting to be true, maybe they are not, and you better check your calculations. Although mistakes are not uncommon, the business as usual part is that the results are often very sensitive to assumptions, and it takes time to figure out what results are robust. I have seen enough economic debates where there never was a clear answer that convinced more than half of all economists. A long time ago, when the Asian Tigers where still tigers, one question was: Did they grow because of or in spite of government intervention?

by Josef Perktold (noreply@blogger.com) at April 24, 2013 12:23 PM

April 22, 2013

Vlad Niculae

BibTeX-powered publications list for Pelican with pelican-bibtex

Hook

Wouldn’t you like to manage your academic publications list easily within the context of your static website? Without resorting to external services, or to software like bibtex2html, which is very nice but will then require restyling to fit your templates?

Look no more, with the help of pelican-bibtex you can now manage your papers from within Pelican!

Backstory

At Fabian‘s advice, I started playing around with Pelican, a static website/blog generator for Python. I like it better than the other generators I used before, so I chose it the next time I had to set up a website. I still didn’t make the courage to migrate my current website and blog to it, but I promise I will.

Pelican has a public plugins repository, but they have a license constraint for all contributions. My plugin isn’t complicated, but I had to “reverse engineer” undocumented parts of the pybtex API. I think that maybe that code that I used to render citations programatically can be useful to others, so I don’t want to release it under a restrictive license. For this reason, I publish pelican-bibtex in my personal GitHub account.

You can see it in action in the source code for the website I am working on at the moment, the home page of my research group. Example output generated using pelican-bibtex can be seen here.

Possible extensions

I have not dug in too deeply but I believe this plugin can be extended, with not much difficulty, to support referencing in Pelican blogs, and render BibTeX references at the end of every post. This idea was suggested by Avaris on #pelican, and I find it very cool. Since I don’t need this feature at the moment, it’s not a priority, but it’s something that I would like to see at some point.

by vene at April 22, 2013 08:45 AM

April 19, 2013

Fernando Perez

"Literate computing" and computational reproducibility: IPython in the age of data-driven journalism

As "software eats the world" and we become awash in the flood of quantitative information denoted by the "Big Data" buzzword, it's clear that informed debate in society will increasingly depend on our ability to communicate information that is based on data. And for this communication to be a truly effective dialog, it is necessary that the arguments made based on data can be deconstructed, analyzed, rebutted or expanded by others. Since these arguments in practice often rely critically on the execution of code (whether an Excel spreadsheet or a proper program), it means that we really need tools to effectively communicate narratives that combine code, data and the interpretation of the results.

I will point out here two recent examples, taken from events in the news this week, where IPython has helped this kind of discussion, in the hopes that it can motivate a more informed style of debate where all the moving parts of a quantitative argument are available to all participants.

Insight, not numbers: from literate programming to literate computing

The computing community has for decades known about the "literate programming" paradigm introduced by Don Knuth in the 70's and fully formalized in his famous 1992 book. Briefly, Knuth's approach proposes writing computer programs in a format that mixes the code and a textual narrative together, and from this format generating separate files that will contain either an actual code that can be compiled/executed by the computer, or a narrative document that explains the program and is meant for human consumption. The idea is that by allowing the authors to maintain a close connection between code and narrative, a number of benefits will ensue (clearer code, less programming errors, more meaningful descriptions than mere comments embedded in the code, etc).

I don't take any issue with this approach per se, but I don't personally use it because it's not very well suited to the kinds of workflows that I need in practice. These require the frequent execution of small fragments of code, in an iterative cycle where code is run to obtain partial results that inform the next bit of code to be written. Such is the nature of interactive exploratory computing, which is the bread and butter of many practicing scientists. This is the kind of workflow that led me to creating IPython over a decade ago, and it continues to inform basically every decision we make in the project today.

As Hamming famously said in 1962, "The purpose of computing is insight, not numbers.". IPython tries to help precisely in this kind of usage pattern of the computer, in contexts where there is no clear notion in advance of what needs to be done, so the user is the one driving the computation. However, IPython also tries to provide a way to capture this process, and this is where we join back with the discussion above: while LP focuses on providing a narrative description of the structure of an algorithm, our working paradigm is one where the act of computing occupies the center stage.

From this perspective, we therefore refer to the worfklow exposed by these kinds of computational notebooks (not just IPython, but also Sage, Mathematica and others), as "literate computing": it is the weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece that relies equally on the textual explanations and the computational components. For the goals of communicating results in scientific computing and data analysis, I think this model is a better fit than the literate programming one, which is rather aimed at developing software in tight concert with its design and explanatory documentation. I should note that we have some ideas on how to make IPython stronger as a tool for "traditional" literate programming, but it's a bit early for us to focus on that, as we first want to solidify the computational workflows possible with IPython.

As I mentioned in a previous blog post about the history of the IPython notebook, the idea of a computational notebook is not new nor ours. Several IPython developers used extensively other similar systems from a long time and we took lots of inspiration from them. What we have tried to do, however, is to take a fresh look at these ideas, so that we can build a computational notebook that provides the best possible experience for computational work today. That means taking the existence of the Internet as a given in terms of using web technologies, an architecture based on well-specified protocols and reusable low-level formats (JSON), a language-agnostic view of the problem and a concern about the entire cycle of computing from the beginning. We want to build a tool that is just as good for individual experimentation as it is for collaboration, communication, publication and education.

Government debt, economic growth and a buggy Excel spreadsheet: the code behind the politics of fiscal austerity

In the last few years, extraordinarily contentious debates have raged in the circles of political power and fiscal decision making around the world, regarding the relation between government debt and economic growth. One of the center pieces of this debate was a paper form Harvard economists C. Reinhart and K. Rogoff, later turned into a best-selling book, that argued that beyond 90% debt ratios, economic growth would plummet precipitously.

This argument was used (amongst others) by politicians to justify some of the extreme austerity policies that have been foisted upon many countries in the last few years. On April 15, a team of researchers from U. Massachusetts published a re-analysis of the original data where they showed how Rienhart and Rogoff had made both fairly obvious coding errors in their orignal Excel spreadsheets as well as some statistically questionable manipulations of the data. Herndon, Ash and Pollin (the U. Mass authors) published all their scripts in R so that others could inspect their calculations.

Two posts from the Economist and the Roosevelt Institute nicely summarize the story with a more informed policy and economics discussion than I can make. James Kwak has a series of posts that dive into technical detail and question the horrible choice of using Excel, a tool that should for all intents and purposes be banned from serious research as it entangles code and data in ways that more or less guarantee serious errors in anything but trivial scenarios. Victoria Stodden just wrote an excellent new post with specific guidance on practices for better reproducibility; here I want to take a narrow view of these same questions focusing strictly on the tools.

As reported in Mike Konczal's piece at the Roosevelt Institute, Herndon et al. had to reach out to Reinhart and Rogoff for the original code, which hadn't been made available before (apparently causing much frustration in economics circles). It's absolutely unacceptable that major policy decisions that impact millions worldwide had until now hinged effectively on the unverified word of two scientists: no matter how competent or honorable they may be, we know everybody makes mistakes, and in this case there were both egregious errors and debatable assumptions. As Konczal says, "all I can hope is that future historians note that one of the core empirical points providing the intellectual foundation for the global move to austerity in the early 2010s was based on someone accidentally not updating a row formula in Excel." To that I would add the obvious: this should never have happened in the first place, as we should have been able to inspect that code and data from the start.

Now, moving over to IPython, something interesting happened: when I saw the report about the Herndon et al. paper and realized they had published their R scripts for all to see, I posted this request on Twitter:

It seemed to me that the obvious thing to do would be to create a document that explained together the analysis and a bit of narrative using IPython, hopefully more easily used as a starting point for further discussion. What I didn't really expect is that it would take less than three hours for Vincent Arel-Bundock, a PhD Student in Political Science at U. Michigan, to come through with a solution:

I suggested that he turn this example into a proper repository on github with the code and data, which he quickly did:

So now we have a full IPython notebook, kept in a proper github repository. This repository can enable an informed debate about the statistical methodologies used for the analysis, and now anyone who simply installs the SciPy stack can not only run the code as-is, but explore new directions and contribute to the debate in a properly informed way.

On to the heavens: the New York Times' infographic on NASA's Kepler mission

As I was discussing the above with Vincent on Twitter, I came across this post by Jonathan Corum, an information designer who works as NY Times science graphics editor:

The post links to a gorgeous, animated infographic that summarizes the results that NASA's Kepler spacecraft has obtained so far, and which accompanies a full article at the NYT on Kepler's most recent results: a pair of planets that seem to have just the right features to possibly support life, a quick 1200 light-years hop from us.

Jonathan indicated that he converted his notebook to a Python script later on for version control and automation, though I explained to him that he could have continued using the notebook, since the --script flag would give him a .py file if needed, and it's also possible to execute a notebook just like a script, with a bit of additional support code:

In this case Jonathan's code isn't publicly available, but I am still very happy to see this kind of usage: it's a step in the right direction already and as more of this analysis is done with open-source tools, we move further towards the possibility of an informed discussion around data-driven journalism.

I also hope he'll release perhaps some of the code later on, so that others can build upon it for similar analyses. I'm sure lots of people would be interested and it wouldn't detract in any way from the interest in his own work which is strongly tied to the rest of the NYT editorial resources and strengths.

Looking ahead from IPython's perspective

Our job with IPython is to think deeply about questions regarding the intersection of computing, data and science, but it's clear to me at this point that we can contribute in contexts beyond pure scientific research. I hope we'll be able to provide folks who have a direct intersection with the public, such as journalists, with tools that help a more informed and productive debate.

Coincidentally, UC Berkeley will be hosting on May 4 a symposium on data and journalism, and in recent days I've had very productive interactions with folks in this space on campus. Cathryn Carson currently directs the newly formed D-Lab, whose focus is precisely the use of quantitative and datamethods in the social sciences, and her team has recently been teaching workshops on using Python and R for social scientists. And just last week I lectured in Raymond Yee's course (from the School of Information) where they are using the notebook extensively, following Wes McKinney's excellent Python for Data Analysis as the class textbook. Given all this, I'm fairly optimistic about the future of a productive dialog and collaborations on campus, given that we have a lot of the IPython team working full-time here.

Note: as usual, this post is available as an IPython notebook in my blog repo.

by Fernando Perez (noreply@blogger.com) at April 19, 2013 09:10 PM

Philip Herron

Attempting to be professional

So I’ve been extremely busy this year, I have something really exciting in the works but i don’t want to announce this until its 100% complete. But in the mean time I’ve been busy professionalizing my projects and moving them to github as the stable repos then i keep my own git repos on this server private. I’ve been getting paranoid that my server might die as i was playing around with it too much the other day.

I will announce some releases soon… Developing on a python front-end to gcc for close to 4 years i think and there have been no releases yet ;) i think its time for one.

by redbrain at April 19, 2013 11:47 AM

Josef Perkoltd

Binomial Proportions, Equivalence and Power - part 0

Just a pre-announcement because I have a nice graph.

I am looking into tests for binomial proportions, especially equivalence (TOST) and non-inferiority tests.

SAS provides a good overview over the available methods and power for it

Power and significance levels in testing for proportions have a saw tooth pattern because the observed proportions are discrete, see for example this SAS page

Unfortunately for my unit testing, I have not found any equivalence tests for proportions in R. Currently, I'm just trying to match some examples that I found on the internet.

And here is the plot for my power function. It shows the power as a function of the sample size, for either the normal approximation or the binomial distribution, of the test for equivalence, TOST two one-sided tests. The TOST test itself is based on the normal approximation.

by Josef Perktold (noreply@blogger.com) at April 19, 2013 11:01 AM

April 17, 2013

Josef Perkoltd

Debugging: Multiple Testing P-Value Corrections in Statsmodels

subtitle: "The earth is round after all"

series : "Adventures with Statistics in Python"

If you run an experiment and it shows that the earth is not round, then you better check your experiment, your instrument, and don't forget to look up the definition of "round"

Statsmodels has 11 methods for correcting p-values to take account of multiple testing (or it will have after I merge my latest changes).

The following mainly describes how it took me some time to figure out what the interpretation of the results of a Monte Carlo run is. I wrote the Monte Carlo to verify that the multiple testing p-value corrections make sense. I will provide some additional explanations of the multiple testing function in statsmodels in a follow-up post.

Let's start with the earth that doesn't look round.

experiment:
Monte Carlo with 5000 or 10000 replications, to see how well the p-value corrections are doing. We have 30 p-values from hypothesis tests, for 10 of those the null hypothesis is false.
instrument:
statsmodels.stats.multipletests to make the p-value correction

The first results

==========================================================================================
         b      s      sh     hs     h    hommel  fdr_i  fdr_n  fdr_tsbky fdr_tsbh fdr_gbs
------------------------------------------------------------------------------------------
reject 9.6118 9.619  9.7178 9.7274 9.7178  9.72  10.3128 9.8724  10.5152  10.5474  10.5328
r_>k   0.0236 0.0246 0.0374 0.0384 0.0374 0.0376  0.2908 0.0736   0.3962   0.4118   0.4022
------------------------------------------------------------------------------------------

The headers are shortcuts for the p-value correction method. In the first line, reject, are the average number of rejections across Monte Carlo iterations. The second line, r_>k, are the fraction of cases where we reject more than 10 hypothesis. The average number of rejections is large because the alternative in the simulation is far away from the null hypothesis, and the corresponding p-values are small. So all methods are able to reject most of the false hypotheses.

The last three methods estimate, as part of the algorithm, the number of null hypotheses that are correct. All three of those methods reject a true null hypothesis in roughly 40% of all cases. All methods are supposed to limit the false discovery rate (FDR) to alpha which is 5% in this simulations. I expected the fraction in the last line to be below 0.05. So what's wrong?

It looks obvious, after the fact, but it had me puzzled for 3 days.

Changing the experiment: The above data are based on p-values that are the outcome of 30 independent t-tests, which is already my second version for generating random p-values. For my third version, I changed to a data generating process similar to Benjamini, Krieger and Yekutieli 2006, which is the article on which fdr_tsbky is based. None of the changes makes a qualitative difference to the results.

Checking the instrument: All p-values corrections except fdr_tsbky and fdr_gbs are tested against R. For the case at hand, the p-values for fdr_tsbh are tested against R's multtest package. However, the first step is a discrete estimate (number of true null hypothesis) and since it is discrete, the tests will not find differences that show up only in borderline cases. I checked a few more cases which also verify against R. Also, most methods have a double implementation, separately for the p-value correction and for the rejection boolean. Since they all give identical or similar answers, I start to doubt that there is a problem with the instrument.

Is the earth really round? I try to read through the proof that these adaptive methods limit the FDR to alpha, to see if I missed some assumptions, but give up quickly. These are famous authors, and papers that have long been accepted and been widely used. I also don't find any assumption besides independence of the p-values, which I have in my Monte Carlo. However, looking a bit more closely at the proofs shows that I don't really understand FDR. When I implemented these functions, I focused on the algorithms and only skimmed the interpretation.

What is the False Discovery Rate? Got it. I should not rely on vague memories of definitions that I read two years ago. What I was looking at, is not the FDR.

One of my new results (with a different data generating process in the Monte Carlo, but still 10 out of 30 hypotheses are false)

==============================================================================================
              b      s      sh     hs     h    hommel fdr_i  fdr_n  fdr_tsbky fdr_tsbh fdr_gbs
----------------------------------------------------------------------------------------------
reject      5.2924 5.3264 5.5316 5.5576 5.5272 5.5818 8.1904 5.8318   8.5982   8.692    8.633
rejecta     5.2596 5.2926 5.492  5.5176 5.488  5.5408 7.876  5.7744   8.162     8.23    8.1804
reject0     0.0328 0.0338 0.0396  0.04  0.0392 0.041  0.3144 0.0574   0.4362   0.462    0.4526
r_>k        0.0002 0.0002 0.0006 0.0006 0.0006 0.0006 0.0636 0.0016   0.1224   0.1344   0.1308
fdr         0.0057 0.0058 0.0065 0.0065 0.0064 0.0067 0.0336 0.0081   0.0438   0.046    0.0451
----------------------------------------------------------------------------------------------

reject : average number of rejections

rejecta : average number of rejections for cases where null hypotheses is false (10)

rejecta : average number of rejections for cases where null hypotheses is true (20)

r_>k : fraction of Monte Carlo iterations where we reject more than 10 hypotheses

fdr : average of the fraction of rejections when null is true out of all rejections

The last numbers look much better, the numbers are below alpha=0.05 as required, including the fdr for the last three methods.

"Consider the problem of testing m null hypotheses h1, ..., hm simultaneously, of which m0 are true nulls. The proportion of true null hypotheses is denoted by mu0 = m0/m. Benjamini and Hochberg(1995) used R and V to denote, respectively, the total number of rejections and the number of false rejections, and this notation has persisted in the literature. <...> The FDR was loosely defined by Benjamini and Hochberg(1995) as E(V/R) where V/R is interpreted as zero if R = 0." Benjamini, Krieger and Yekutieli 2006, page 2127

Some additional explanations are in this Wikipedia page

What I had in mind when I wrote the code for my Monte Carlo results, was the family wise error rate, FWER,

"The FWER is the probability of making even one type I error in the family, FWER = Pr(V >= 1)" Wikipedia

Although, I did not look up that definition either. What I actually used, is Pr(R > k) where k is the number of false hypothesis in the data generating process. Although, I had chosen my initial cases so Pr(R > k) is close to Pr(V > 0).

In the follow-up post I will go over the new Monte Carlo results, which now look all pretty good.

Reference

Benjamini, Yoav, Abba M. Krieger, and Daniel Yekutieli. 2006. “Adaptive Linear Step-up Procedures That Control the False Discovery Rate.” Biometrika 93 (3) (September 1): 491–507. doi:10.1093/biomet/93.3.491.

by Josef Perktold (noreply@blogger.com) at April 17, 2013 12:27 PM

Multiple Testing P-Value Corrections in Statsmodels

series : "Statistics in Python"

This is a follow-up to my previous posts, here and this post, which are on software development, and multiple comparisons which looks at a specific case of pairwise mean comparisons.

In the following, I provide a brief introduction to the statsmodels functions for p-value asjustements to correct for multiple testing problems and then illustrate some properties using several Monte Carlo experiments.

Read more »

by Josef Perktold (noreply@blogger.com) at April 17, 2013 12:26 PM

April 15, 2013

Jake Vanderplas

Code Golf in Python: Sudoku

Edit: based on suggestions from readers, the best solution is down to 162 characters! Read to the end to see how

A highlight-ipynb of PyCon each year for me is working on the little coding challenges offered by companies in the expo center. I love testing my Python prowess against the problems they pose (and being rewarded with a branded mug or T-shirt!) This year, several of the challenges involved what's become known as code golf: writing a solution with minimal keystrokes.

By way of example, take a look at this function definition:

In [1]:
def S(p):i=p.find('0');return[(s for v in
set(`5**18`)-{(i-j)%9*(i/9^j/9)*(i/27^j/27|i%9/3^j%9/3)or
p[j]for j in range(81)}for s in S(p[:i]+v+p[i+1:])),[p]][i<0]

This is a valid function definition (in Python 2.7) which executes a particular task. I'll give more information on the workings of this script later on, but for now I'll leave it to the reader to ponder over what it might do.

Given the level of obfuscation involved, you might wonder what the point is: you'd never want to write "real" code in this style, so why spend the time doing it? I'd argue that it's useful for more than just upping your geek cred: good Python code golf must utilize many quirks of the Python language in seeking brevity above all else. Learning to utilize these quirks can lead to a much deeper understanding of the Python language.

I thought about putting together a list of tricks that can help lead to short programs, but the problem is there are so many of them (and there are other pages out there which do this adequately enough). Instead, I decided to simply work through a step-by-step example of creating a code golf solution to a fun little problem: solving Sudoku.

You've probably seen Sudoku: it's a puzzle consisting of a 9x9 grid of numbers, with some spaces left blank. The grid must be filled so that each row, column, and 3x3 box contains the numbers 1-9. It's a generalization of the Latin Squares first studied by Leonhard Euler nearly 300 years ago.

The reason I chose to use Sudoku here is simple: not only is today Euler's birthday, but Sudoku is how I first learned Python. My first year of graduate school, my research advisor recommended that I learn Python for the project I was working on. Sudoku had just become popular in the US at the time, and I decided to learn Python by writing a Sudoku solver. I did it over my winter break, and the rest (so it's said) is history.

Note that this is by no means a new subject: you can read about Sudoku in Python in several places, and there are even a few code golf solutions floating around out there. In particular, you should take a look at this solution, which is the shortest solver I've seen, and from which I borrowed a few of the tricks used below.

Here we'll pose the problem in a slightly different way, which will give us the chance to develop a brand new short algorithm.

The Problem

Every code golf challenge must start with a well-defined problem. Here is ours:

  • Write a function S(p) which, given a valid Sudoku puzzle, returns an iterator over all solutions of the puzzle.

The puzzle will be in the form of a length-81 string of digits, with '0' denoting an empty grid space. The solved puzzles should also be length-81 strings, with the zeros replaced by solved values.

For example, a valid S(p) may produce the following results:

puz="027800061000030008910005420500016030000970200070000096700000080006027000030480007"
for s in S(puz):
    print(s)

327894561645132978918765423589216734463978215172543896794651382856327149231489657
327894561645132978918765423589216734463978215271543896794651382856327149132489657
327894561465132978918765423589216734643978215172543896794651382856327149231489657
327894561465132978918765423589216734643978215271543896794651382856327149132489657

puz = 81*'0'  # empty puzzle
print(next(S(puz)))

132598476598476132476132985319825764825764319764913258981257643647389521253641897

Notice that the function S() cannot simply return a list of valid solutions: if it did, then the empty puzzle example would need to produce all ~$10^{22}$ valid sudoku grids before the first solution could be accessed! Instead, it must make use of Python's extremely useful generator syntax. If you've never used generators and generator expressions in your Python code, stop reading this right now and go learn about them: they're one of the most unique and powerful features of the Python language.

As you'll see below, my best solution is 176 162 characters, and is the code snippet I showed above:

In [2]:
def S(p):i=p.find('0');return[(s for v in
set(`5**18`)-{(i-j)%9*(i/9^j/9)*(i/27^j/27|i%9/3^j%9/3)or
p[j]for j in range(81)}for s in S(p[:i]+v+p[i+1:])),[p]][i<0]

It's rather unenlightening in itself, so below I'll explain the steps I took to arrive at it, in hopes that you can learn from my thought process. Though this is the best solution I was able to come up with, I don't know whether or not a better one might be out there. If you can beat it, please post your solution in the blog comment thread!

Step 1: Focus on Correct Code

A code golf script must be more than simply short: it must be correct. For this reason, I generally start by simply writing correct code, and not for the moment worrying about brevity.

In the case of Sudoku, there are many rules and rubriks that can be used to create an efficient solver (read about some of them here). Using these, it is possible to solve most (all?) Sudoku puzzles without resorting to guess-and-check approaches. To implement this strategy, one approach might be to enumerate the sets of possible values for each grid space, and apply these rules to eliminate values until only a single possibility remains within each space.

Unfortunately, this is not a very suitable approach for code golf: the number of rules required to accomplish this is very large. Instead, we'll make use of the minimal amount of rules, and write a guess-and-check based solver.

Here's a first attempt, focusing on the algorithm rather than on brevity. We'll start by defining a test puzzle with four solutions, and write a small function that can test our solver:

In [3]:
puz = "027800061000030008910005420500016030000970200070000096700000080006027000030480007"

def test(S):
    # solve an empty puzzle
    print(next(S(81*'0')))
    print('')

    # find all four solutions of puz
    for s in S(puz):
        print(s)
In [4]:
# Write functions that, given an index 0 <= i < 81,
# return the indices of grid spaces in the same row,
# column, and box as entry i
def row_indices(i):
    start = i - i % 9
    return range(start, start + 9)

def col_indices(i):
    start = i % 9
    return range(start, start + 81, 9)

def box_indices(i):
    start = 27 * (i // 27) + 3 * ((i % 9) // 3)
    return [i for j in range(3) for i in range(start + 9 * j, start + 9 * j + 3)]

# compute and store the full set of connected indices for each i
connected = [(set.union(set(box_indices(i)),
                       set(row_indices(i)),
                       set(col_indices(i)))
              - set([i]))
             for i in range(81)]

# S(p) will recursively find solutions and "yield" them
def S(p):
    # First, find the number of empty squares and the number of
    # possible values within each square
    L = []
    for i in range(81):
        if p[i] == '0':
            vals = set('123456789') - set(p[n] for n in connected[i])
            if len(vals) == 0:
                return
            else:
                L.append((len(vals), i, vals))
    
    # if all squares are solved, then yield the current solution
    if len(L) == 0 and '0' not in p:
        yield p
        
    # otherwise, take the index with the smallest number of possibilities,
    # and recursively call S() for each possible value.
    else:
        N, i, vals = min(L)
        for val in vals:
            for s in S(p[:i] + val + p[i + 1:]):
                yield s

test(S)
132598476598476132476132985351249768789361254624857319943785621817624593265913847

327894561465132978918765423589216734643978215172543896794651382856327149231489657
327894561465132978918765423589216734643978215271543896794651382856327149132489657
327894561645132978918765423589216734463978215172543896794651382856327149231489657
327894561645132978918765423589216734463978215271543896794651382856327149132489657

This is the test output we expect: it quickly finds not only the four solutions of the test puzzle, but a solution derived from a completely empty puzzle. This is by no means a complete test suite, but it gives us good reason to believe that the code is correct.

Step 2: Simplify the Algorithm

For me, the biggest hurdle to writing concise programs was letting go of the compulsion to write clear and efficient code. In my research, the two most important aspects of code are its scalability and its readibility. I need my code to work on extremely large datasets, and I need a collaborator to be able to use my code to reproduce or extend my results. Code that doesn't meet these requirements is hardly worth writing. Code golf, though, is different: it's often an exercise in sacrificing efficiency and readability at the altar of brevity.

For the Sudoku problem, we can start in two obvious places.

  • We can condense the computation of the connected indices by using a nested list comprehension. List comprehensions are a way of shortening a loop to a single statement. In this case, the resulting algorithm is slightly less efficient, a bit less readable, but saves a lot of typing.

  • Rather than finding the grid space with the fewest possibilities to recursively guess at a solution, we simply choose any unknown grid space. This can be much less efficient, but saves a lot of typing.

Applying these two ideas leads to the following:

In [5]:
# store the full set of connected indices for each i
connected = [set([j for j in range(81)
              if (i%9==j%9) or (i//9==j//9)
                 or (i//27==j//27 and i%9//3==j%9//3)])
             for i in range(81)]
def S(p):
    # find any grid space without a known value
    i = p.find('0')
    
    # if no entry is zero, then yield the current solution
    if i < 0:
        yield p
        
    # otherwise, take this index and recursively call S()
    # for each possible value.
    else:
        for val in set('123456789') - set(p[n] for n in connected[i]):
            for s in S(p[:i] + val + p[i + 1:]):
                yield s
test(S)
132598476598476132476132985319825764825764319764913258981257643253641897647389521

327894561465132978918765423589216734643978215172543896794651382856327149231489657
327894561465132978918765423589216734643978215271543896794651382856327149132489657
327894561645132978918765423589216734463978215172543896794651382856327149231489657
327894561645132978918765423589216734463978215271543896794651382856327149132489657

This is good, but we can go further by moving the connected list definition into the S() function. Again, this is less efficient than computing the sets once beforehand, but it saves some typing:

In [6]:
def S(p):
    i = p.find('0')
    if i < 0:
        yield p
    else:
        for v in set('123456789')-set(p[j] for j in range(81)
                                      if (i%9==j%9) or (i//9==j//9)
                                      or (i//27==j//27 and i%9//3==j%9//3)):
            for s in S(p[:i]+v+p[i+1:]):
                yield s
test(S)
132598476598476132476132985319825764825764319764913258981257643253641897647389521

327894561465132978918765423589216734643978215172543896794651382856327149231489657
327894561465132978918765423589216734643978215271543896794651382856327149132489657
327894561645132978918765423589216734463978215172543896794651382856327149231489657
327894561645132978918765423589216734463978215271543896794651382856327149132489657

We can go a little further by using a set comprehension for the loop over possible values. Set comprehensions are like list comprehensions or generator expressions, but are denoted with curly brackets: {}.

We'll also use a trick here based on the way Python implements boolean logic. When you execute something like

(A or B)

you might expect the result to be either True or False. Instead, Python does something a bit clever. If the result is False, it returns A (which, naturally, evaluates to False). If the result is True, it returns A if A evaluates to True, and B otherwise. We can use this fact to remove the if statement completely from the set comprehension. We'll end up with some extra values within the second set, but the set difference conveniently removes these.

In [7]:
def S(p):
    i = p.find('0')
    if i < 0:
        yield p
    else:
        for v in set('123456789')-{(i%9!=j%9)and(i//9!=j//9)
                                   and(i//27!=j//27or i%9//3!=j%9//3)
                                   or p[j]for j in range(81)}:
            for s in S(p[:i]+v+p[i+1:]):
                yield s
test(S)
132598476598476132476132985319825764825764319764913258981257643253641897647389521

327894561465132978918765423589216734643978215172543896794651382856327149231489657
327894561465132978918765423589216734643978215271543896794651382856327149132489657
327894561645132978918765423589216734463978215172543896794651382856327149231489657
327894561645132978918765423589216734463978215271543896794651382856327149132489657

Step 3: Combining Expressions

Now we have the basics of the algorithm. We can keep shrinking the implementation by combining the two loops into a single generator expression. It's important that we use a generator expression (surrounded by ()) rather than a list comprehension (surrounded by []), because otherwise all possible solutions would need to be computed in order to return a single one!

For clarity, we'll create a temporary explicit container for the generator, which we can remove later. The result of combining the loops looks like this:

In [8]:
def S(p):
    i = p.find('0')
    if i < 0:
        yield p
    else:
        g = (s for v in set('123456789')
             - {(i%9!=j%9)and(i//9!=j//9)
                and(i//27!=j//27or i%9//3!=j%9//3)
                or p[j]for j in range(81)}
             for s in S(p[:i]+v+p[i+1:]))
        for s in g:
            yield s
test(S)
132598476598476132476132985319825764825764319764913258981257643253641897647389521

327894561465132978918765423589216734643978215172543896794651382856327149231489657
327894561465132978918765423589216734643978215271543896794651382856327149132489657
327894561645132978918765423589216734463978215172543896794651382856327149231489657
327894561645132978918765423589216734463978215271543896794651382856327149132489657

We can further combine the if-else statement into the generator expression to save some more room: if there are no zeros in p, we'll just loop over [p] instead of looping over the generator.

In [9]:
def S(p):
    i = p.find('0')
    g = (s for v in set('123456789')
         -{(i%9!=j%9)and(i//9!=j//9)and(i//27!=j//27or i%9//3!=j%9//3)
         or p[j]for j in range(81)}for s in S(p[:i]+v+p[i+1:]))
    for s in (g if i>=0 else[p]):  # parentheses here for clarity
        yield s
test(S)
132598476598476132476132985319825764825764319764913258981257643253641897647389521

327894561465132978918765423589216734643978215172543896794651382856327149231489657
327894561465132978918765423589216734643978215271543896794651382856327149132489657
327894561645132978918765423589216734463978215172543896794651382856327149231489657
327894561645132978918765423589216734463978215271543896794651382856327149132489657

Step 4: Sweating the Details

We've condensed the script about as much as we can now, but there are still some tiny changes we can make that will save a few characters here or there. This step is the difference between a code golf amateur and a true code golf pro. Some of the tricks I apply here would not have been obvious to me had I not come across this solution, so I don't think I can call myself a pro just yet!

First of all, we can shorten the definition of the full set of nine digits. Observe:

In [10]:
print(set('123456789'))
print(set(str(5**18)))
set([&apos1&apos, &apos3&apos, &apos2&apos, &apos5&apos, &apos4&apos, &apos7&apos, &apos6&apos, &apos9&apos, &apos8&apos])
set([&apos1&apos, &apos3&apos, &apos2&apos, &apos5&apos, &apos4&apos, &apos7&apos, &apos6&apos, &apos9&apos, &apos8&apos])

One character shorter! We're making progress.

Next, we can use compact bitwise operators to test whether square i and square j are related. Our previous expression was

(i%9!=j%9)and(i//9!=j//9)and(i//27!=j//27or i%9//3!=j%9//3)

we can equivalently write

(i-j)%9*(i//9^j//9)*(i//27^j//27|i%9//3^j%9//3)

which saves about 12 more characters.

Further, observe that the variable i, which denotes the index of the first zero in the puzzle string, will be -1 if the string has no zeros. The bitwise inverse of -1 is zero, so ~i will evaluate to False only if there are no zeros in the puzzle. This saves a couple more characters. The result is:

In [11]:
def S(p):
    i = p.find('0')
    g = (s for v in set(str(5**18))
         -{(i-j)%9*(i//9^j//9)*(i//27^j//27|i%9//3^j%9//3)
         or p[j]for j in range(81)}for s in S(p[:i]+v+p[i+1:]))
    for s in g if~i else[p]:
        yield s
test(S)
132598476598476132476132985319825764825764319764913258981257643253641897647389521

327894561465132978918765423589216734643978215172543896794651382856327149231489657
327894561465132978918765423589216734643978215271543896794651382856327149132489657
327894561645132978918765423589216734463978215172543896794651382856327149231489657
327894561645132978918765423589216734463978215271543896794651382856327149132489657

Finally, though it's standard to use four spaces for an indentation, Python will also recognize one-space indentations, which save white space characters. At the same time, we'll remove other unnecessary spaces, and move the definition of g into the statement where it's used. To make things easier to parse, we'll replace a required white-space with a line break (between or and p). Because this break falls between two parentheses, the lack of indentation is still parseable.

In [12]:
def S(p):
 i=p.find('0')
 for s in(s for v in set(str(5**18))-{(i-j)%9*(i//9^j//9)*(i//27^j//27|i%9//3^j%9//3)or
p[j]for j in range(81)}for s in S(p[:i]+v+p[i+1:]))if~i else[p]:
  yield s
test(S)
132598476598476132476132985319825764825764319764913258981257643253641897647389521

327894561465132978918765423589216734643978215172543896794651382856327149231489657
327894561465132978918765423589216734643978215271543896794651382856327149132489657
327894561645132978918765423589216734463978215172543896794651382856327149231489657
327894561645132978918765423589216734463978215271543896794651382856327149132489657

We've gotten our solution down to 182 characters! As far as I can tell, this is the best we can do in Python versions less than 3.2. Python 3.3, however, added the "yield from" statement, which can help us further shorten this. In a generator definition, writing

yield from G

is (for our purposes, anyway) essentially equivalent to writing

for g in G:
    yield g

so it fits the bill exactly. As a bonus, the removal of nested indentation allows us to write things on a single line, using the ; character in place of a new line:

In [ ]:
def S(p):i=p.find('0');yield from(s
for v in set(str(5**18))-{(i-j)%9*(i//9^j//9)*(i//27^j//27|i%9//3^j%9//3)or
p[j]for j in range(81)}for s in S(p[:i]+v+p[i+1:]))if~i else[p]

Using this new syntactic sugar buys us another twelve characters. We're down to 176 characters: not yet tweetable, but I think it's pretty good! Once again, if you see any further abbreviations that can be made, please let me know in the blog comments.

Another Approach

The other shortest sudoku script I've seen is this one, dating back eight years or so and coming in at 185 characters (see the source, and note that due to the change in integer division syntax, the python 3 version, here, is six characters longer than the python 2 version):

In [ ]:
def r(a):i=a.find('0');~i or exit(a);[m
in[(i-j)%9*(i//9^j//9)*(i//27^j//27|i%9//3^j%9//3)or a[j]for
j in range(81)]or r(a[:i]+m+a[i+1:])for m in'%d'%5**18]
from sys import*;r(argv[1])

This script has a slightly different purpose: it's meant to take an argument in the command line and output one answer. For this reason, a direct comparison of the two solutions is somewhat misleading. Taking away the command-line call brings the count down to 174 characters (note the from sys import* is still required for the exit() call). On the other hand, this script only finds a single solution, and does it in a clever but unorthodox way: in order to break out of the recursion efficiently, it returns the solution as an exit code. This works in the sense that the answer prints to the screen, but means that the script is only useful as a stand-alone application.

Regardless of judgments about which solution "won" this round of code golf, I hope you agree with me that this is a valuable exercise. To me, the end goal of code golf is not simply a concise program: it's the pursuit of a deeper knowledge of the ins and outs of the Python language itself.

Update

Several commenters on the blog and on reddit have suggested improvements to the algorithm. First of all, the conditional of the form

(genexp if~i else[p])

can be made one character shorter by using the fact that boolean variables are interpreted as either 1 or zero:

([p],genexp)[i<0]

Also, it was pointed out that the yield from can be replaced by a simple return in this case, because yield is not used anywhere in the function. So the shortest version of the function becomes this:

In [13]:
def S(p):i=p.find('0');return[(s for v in
set(str(5**18))-{(i-j)%9*(i//9^j//9)*(i//27^j//27|i%9//3^j%9//3)or
p[j]for j in range(81)}for s in S(p[:i]+v+p[i+1:])),[p]][i<0]

This is 171 characters!

But there's more. Now that the yield from is unnecessary, we can move to python 2.x and change all the Python 3-style integer division operators (//) to Python 2-style (/). This saves six more characters:

In [14]:
def S(p):i=p.find('0');return[(s for v in
set(str(5**18))-{(i-j)%9*(i/9^j/9)*(i/27^j/27|i%9/3^j%9/3)or
p[j]for j in range(81)}for s in S(p[:i]+v+p[i+1:])),[p]][i<0]

165 characters, but note that this requires Python 2.7.

There's one more thing we can add, as noted by a commenter below. In Python 2.x, back-ticks can be used as a shorthand for string representation (this is a feature removed in Python 3.x). Thus:

In [15]:
print(str(5**18))
print(`5**18`)
3814697265625
3814697265625

A problem, though, is that in 32-bit architectures, 5**18 is a long integer, so that the string representation is '3814697265625L' (note the L appended at the end). This would lead to incorrect solutions. But as long as we're assured that we're on a 64-bit platform, we can use this to save three more characters:

In [16]:
def S(p):i=p.find('0');return[(s for v in
set(`5**18`)-{(i-j)%9*(i/9^j/9)*(i/27^j/27|i%9/3^j%9/3)or
p[j]for j in range(81)}for s in S(p[:i]+v+p[i+1:])),[p]][i<0]

That brings our best to 162 characters, though it requires Python 2.7 and a 64-bit system. Thanks to all commenters who suggested these improvements!

This post was written in the IPython notebook. The raw notebook can be downloaded here. See also nbviewer for an online static view.

by Jake Vanderplas at April 15, 2013 11:00 PM

Fabian Pedregosa

Isotonic Regression

My latest contribution for scikit-learn is an implementation of the isotonic regression model that I coded with Nelle Varoquaux and Alexandre Gramfort. This model finds the best least squares fit to a set of points, given the constraint that the fit must be a non-decreasing function. The example on the scikit-learn website gives an intuition on this model.

isotonic regression

The original points are in red, and the estimated ones are in green. As you can see, there is one estimation (green point) for each data sample (red point). Calling $y \in \mathbb{R}^n$ the input data, the model can be written concisely as an optimization problem over $x$

$$ \text{argmin}_x \|y - x \|^2 \\ \text{subject to } x_0 \leq x_1 \leq \cdots \leq x_n $$

The algorithm implemented in scikit-learn 3 is the pool adjacent violators algorithm 1, which is an efficient linear time $\mathcal{O}(n)$ algorithm. The algorithm sweeps through the data looking for violations of the monotonicity constraint. When it finds one, it adjusts the estimate to the best possible fit with constraints. Sometimes it also needs to modify previous points to make sure the new estimate does not violate the constraints. The following picture shows how it proceeds at each iteration

isotonic regression


  1. "Active set algorithms for isotonic regression; A unifying framework", Michael J. Best, Nilotpal Chakravarti 

  2. Python notebook to generate the figures: ipynb and web version 

  3. The algorithm is used through the sklearn.isotonic.IsotonicRegression object (doc) or the function sklearn.isotonic.isotonic_regression (doc

by Fabian Pedregosa at April 15, 2013 10:00 PM

April 11, 2013

Enthought

Introducing Enthought Canopy

-Eric Jones, Enthought CEO Yesterday we launched Enthought Canopy, our next-generation, Python-based analysis environment and our follow-on to EPD. Since 2003, the Enthought Python Distribution (EPD) has helped hundreds of thousands of scientists, engineers and analysts develop and deploy with Python. 2013 is its 10th anniversary! It’s hard to believe it’s been that long. Time [...]

by mtranby at April 11, 2013 10:03 PM

April 09, 2013

Continuum Analytics

NumbaPro Monte Carlo Option Pricer

The GPU revolution of the past few years provides inexpensive access to hundreds of specialized computational units in a single silicon die. The challenge is efficiently accessing these and developing or adapting algorithms that can harness their power. Here, I’ll show how the NumbaPro module from Anaconda Accelerate can be used to parallelize a standard option pricing algorithm onto a GPU, giving a 14x speedup, using only a few extra lines of code.

by Siu Kwan Lam at April 09, 2013 10:30 AM

April 08, 2013

Official SymPy blog

Google Summer of Code 2013

SymPy was accepted by Google once again to participate in Google Summer of Code for 2013. Please go to http://www.google-melange.com/gsoc/org/google/gsoc2013/sympy for more information about how to apply and get started.

by Aaron Meurer (noreply@blogger.com) at April 08, 2013 05:04 PM

April 05, 2013

Matthew Rocklin

SymPy and Theano -- Matrix Expressions

Introduction

This post uses some LaTeX. You may want to read it on the original site.

This is the last of a three part series connecting SymPy and Theano to transform mathematical expressions into efficient numeric code (see part 1 and part 2). We have seen that it is simple and computationally profitable to combine the best parts of both projects.

In this post we’ll switch from computing scalar expressionss to computing matrix expressions. We’ll define the Kalman filter in SymPy and send it to Theano for code generation. We’ll then use SymPy to define a more performant blocked version of the same algorithm.

Kalman Filter

The Kalman filter is an algorithm to compute the Bayesian update of a normal random variable given a linear observation with normal noise. It is commonly used when an uncertain quantity is updated with the results of noisy observations. For example it is used in weather forecasting after weather stations report in with new measurements, in aircraft/car control to automatically adjust for external conditions real-time, or even on your smartphone’s GPS navigation as you update your position based on fuzzy GPS signals. It’s everywhere, it’s important, and it needs to be computed quickly and continuously. It suits our needs today because it can be completely defined with a pair of matrix expressions.

from sympy import MatrixSymbol, latex
n       = 1000                          # Number of variables in our system/current state
k       = 500                           # Number of variables in the observation
mu      = MatrixSymbol('mu', n, 1)      # Mean of current state
Sigma   = MatrixSymbol('Sigma', n, n)   # Covariance of current state
H       = MatrixSymbol('H', k, n)       # A measurement operator on current state
R       = MatrixSymbol('R', k, k)       # Covariance of measurement noise
data    = MatrixSymbol('data', k, 1)    # Observed measurement data

newmu   = mu + Sigma*H.T * (R + H*Sigma*H.T).I * (H*mu - data)      # Updated mean
newSigma= Sigma - Sigma*H.T * (R + H*Sigma*H.T).I * H * Sigma       # Updated covariance

print latex(newmu)
print latex(newSigma)

$$ \Sigma H^T \left(H \Sigma H^T + R\right)^{-1} \left(-data + H \mu\right) + \mu $$ $$ - \Sigma H^T \left(H \Sigma H^T + R\right)^{-1} H \Sigma + \Sigma $$

Theano Execution

The objects above are for symbolic mathematics, not for numeric computation. If we want to compute this expression we pass our expressions to Theano.

inputs = [mu, Sigma, H, R, data]
outputs = [newmu, newSigma]
dtypes = {inp: 'float64' for inp in inputs}

from sympy.printing.theanocode import theano_function
f = theano_function(inputs, outputs, dtypes=dtypes)

Theano builds a Python function that calls down to a combination of low-level C code, scipy functions, and calls to the highly optimized DGEMM routine for matrix multiplication. As input this function takes five numpy arrays corresponding to our five symbolic inputs and produces two numpy arrays corresponding to our two symbolic outputs. Recent work allows any SymPy matrix expression to be translated to and run by Theano.

import numpy
ninputs = [numpy.random.rand(*i.shape).astype('float64') for i in inputs]
nmu, nSigma = f(*ninputs)

Blocked Execution

These arrays are too large to fit comfortably in the fastest parts of the memory hierarchy. As a result each sequential C, scipy, or DGEMM call needs to move big chunks of memory around while it computes. After one operation completes the next operation moves around the same memory while it performs its task. This repeated memory shuffling hurts performance.

A common approach to reduce memory shuffling is to cut the computation into smaller blocks. We then perform as many computations as possible on a single block before moving on. This is a standard technique in matrix multiplication.

from sympy import BlockMatrix, block_collapse
A, B, C, D, E, F, G, K = [MatrixSymbol(a, n, n) for a in 'ABCDEFGK']
X = BlockMatrix([[A, B],
                 [C, D]])
Y = BlockMatrix([[E, F],
                 [G, K]])
print latex(X*Y)

$$ \begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} E & F \\ G & K \end{bmatrix}$$

print latex(block_collapse(X*Y))

$$ \begin{bmatrix} A E + B G & A F + B K \\ C E + D G & C F + D K\end{bmatrix} $$

We are now able to focus on substantially smaller chunks of the array. For example we can choose to keep A in local memory and perform all computations that involve A. We will still need to shuffle some memory around (this is inevitable) but by organizing with blocks we’re able to shuffle less.

This idea extends beyond matrix multiplication. For example, SymPy knows how to block a matrix inverse

print latex(block_collapse(X.I))

$$ \begin{bmatrix} \left(- B D^{-1} C + A\right)^{-1} & - A^{-1} B \left(- C A^{-1} B + D\right)^{-1} \\ - \left(- C A^{-1} B + D\right)^{-1} C A^{-1} & \left(- C A^{-1} B + D\right)^{-1} \end{bmatrix} $$

High performance dense linear algebra libraries hard-code all of these tricks into each individual routine. The call to the general matrix multiply routine DGEMM performs blocked matrix multiply within the call. The call to the general matrix solve routine DGESV can perform blocked matrix solve. Unfortunately these routines are unable to coordinate blocked computation between calls.

Fortunately, SymPy and Theano can.

SymPy can define and reduce the blocked matrix expressions using relations like what are shown above.

from sympy import blockcut, block_collapse
blocksizes = {
        Sigma: [(n/2, n/2), (n/2, n/2)],
        H:     [(k/2, k/2), (n/2, n/2)],
        R:     [(k/2, k/2), (k/2, k/2)],
        mu:    [(n/2, n/2), (1,)],
        data:  [(k/2, k/2), (1,)]
        }
blockinputs = [blockcut(i, *blocksizes[i]) for i in inputs]
blockoutputs = [o.subs(dict(zip(inputs, blockinputs))) for o in outputs]
collapsed_outputs = map(block_collapse, blockoutputs)

fblocked = theano_function(inputs, collapsed_outputs, dtypes=dtypes)

Theano is then able to coordinate this computation and compile it to low-level code. At this stage the expresssions/computations are fairly complex and difficult to present. Here is an image of the computation (click for zoomable PDF) as a directed acyclic graph.

Results

Lets time each function on the same inputs and see which is faster

>>> timeit f(*ninputs)
1 loops, best of 3: 2.69 s per loop

>>> timeit fblocked(*ninputs)
1 loops, best of 3: 2.12 s per loop

That’s a 20% performance increase from just a few lines of high-level code.

Blocked matrix multiply and blocked solve routines have long been established as a good idea. High level mathematical and array programming libraries like SymPy and Theano allow us to extend this good idea to arbitrary array computations.

Analysis

Good Things

First, lets note that we’re not introducing a new library for dense linear algebra. Instead we’re noting that pre-existing general purpose high-level tools can be composed to that effect.

Second, lets acknoledge that we could take this further. For example Theano seemlessly handles GPU interactions. We could take this same code to a GPU accelerated machine and it would just run faster without any action on our part.

Bad Things

However, there are some drawbacks.

Frequent readers of my blog might recall a previous post about the Kalman filter. In it I showed how we could use SymPy’s inference engine to select appropriate BLAS/LAPACK calls. For example we could infer that because $ H \Sigma H^T + R $ was symmetric positive definite we could use the substantially more efficient POSV routine for matrix solve rather than GESV (POSV uses the Cholesky algorithm for decomposition rather than straight LU). Theano doesn’t support the specialized BLAS/LAPACK routines though, so we are unable to take advantage of this benefit. The lower-level interface (Theano) is not sufficiently rich to use all information captured in the higher-level (SymPy) representation.

Also, I’ve noticed that the blocked version of this computation experiences some significant roundoff errors (on the order of 1e-3). I’m in the process of tracking this down. The problem must occur somewhere in the following tool-chain

SymPy -> Blocking -> Theano -> SciPy -> C routines -> BLAS

Debugging in this context can be wonderful if all elements are well unit-tested. If they’re not (they’re not) then tracking down errors like this requires an unfortunate breadth of expertise.

References

Scripts

April 05, 2013 07:00 AM

April 02, 2013

Continuum Analytics

SAT based Sudoku solver in Python

The Boolean satisfiability problem can be solved extremely efficiently using SAT solvers. Over the past 20 years, SAT solvers have drastically improved. In the early 90’s SAT solvers could only handle a few hundred variables and clauses, whereas today problems with millions of variables and clauses can be solved quickly. Most of these advances are due to better algorithms, not due to better hardware.

by Ilan Schnell at April 02, 2013 11:23 AM

March 31, 2013

Enthought

Fun with QtWebKit HTML5 Video

Solving the QtWebKit HTML5 Video DirectShow Problem A while back I was given the task of fixing the problems that our development team was having with playing H.264 or WebM video on Windows in a QWebView widget using the HTML5 <video> tag. The application in question is a hybrid of a traditional desktop application and [...]

by Robin Dunn at March 31, 2013 09:34 PM

March 29, 2013

Fabian Pedregosa

Householder matrices

Householder matrices are square matrices of the form

$$ P = I - \beta v v^T$$

where $\beta$ is a scalar and $v$ is a vector. It has the useful property that for suitable chosen $v$ and $\beta$ it makes the product $P x$ to zero out all of the coordinates but one, that is, $P x = \|x\| e_1$. The following code, given $x$, finds the values of $\beta, v$ that verify that property. The algorithm can be found in several textbooks 1

def house(x):
    """
    Given a vetor x, computes vectors v with v[0] = 1
    and scalar beta such that P = I - beta v v^T
    is orthogonal and P x = ||x|| e_1

    Parameters
    ----------
    x : array, shape (n,) or (n, 1)

    Returns
    -------
    beta : scalar
    v : array, shape (n, 1)
    """
    x = np.asarray(x)
    if x.ndim == 1:
        x = x[:, np.newaxis]
    sigma = linalg.norm(x[1:, 0]) ** 2
    v = np.vstack((1, x[1:]))
    if sigma == 0:
        beta = 0
    else:
        mu = np.sqrt(x[0, 0] ** 2 + sigma)
        if x[0, 0] <= 0:
            v[0, 0] = x[0, 0] - mu
        else:
            v[0, 0] = - sigma / (x[0, 0] + mu)
        beta = 2 * (v[0, 0] ** 2) / (sigma + v[0, 0] ** 2)
        v /= v[0, 0]
    return beta, v

As promised, this computes the parameters of $P$ such that $P x = \|x\| e_1$, exact to 15 decimals:

>>> n = 5
>>> x = np.random.randn(n)
>>> beta, v = house(x)
>>> P = np.eye(n) - beta * np.dot(v, v.T)
>>> print np.round(P.dot(x) / linalg.norm(x), decimals=15) 
[ 1. -0. -0.  0. -0.]

This property is what it makes Householder matrices useful in the context of numerical analysis. It can be used for example to compute the QR decomposition of a given matrix. The idea is to succesively zero out the sub-diagonal elements, thus leaving a triangular matrix at the end. In the first iteration we compute a Householder matrix $P_0$ such that $P_0 X$ has only zero below the diagonal of the first column, then compute a Householder matrix $P_1$ such that $P_1 X$ zeroes out the subdiagonal elements of the second column and so on. At the end we will have that $P_0 P_1 ... P_n X$ is an upper triangular matrix. Since all $P_i$ are orthogonal, the product $P_0 P_1 ... P_n$ is again an orthogonal matrix, namely the $Q$ matrix in the QR decomposition.

If we choose X as 20-by-20 random matrix, with colors representing different values

QR decomposition

we can see the process of the Householder matrices being applied one by one to obtain an upper triangular matrix

QR decomposition

A similar application of Householder matrices is to reduce a given symmetric matrix to tridiagonal form, which proceeds in a similar way as in the QR algorithm, only that now we multiply by the matrix $X$ by the left and right with the Householder matrices. Also, in this case we seek for Householder matrices that zero out the elements of the subdiagonal plus one, instead of just subdiagonal elements. This algorithm is used for example as a preprocessing step for most dense eigensolvers

Tridiagonalization


  1. "Matrix Computations" third edition, Golub & VanLoan (Algorithm 5.1.1). 

  2. Code to reproduce the figures can be found here, source for the IPython notebook can be found here 

by Fabian Pedregosa at March 29, 2013 11:00 PM

March 28, 2013

Josef Perkoltd

Inter-rater Reliability, Cohen's Kappa and Surprises with R

Introduction

This is part three in adventures with statsmodels.stats, after power and multicomparison.

This time it is about Cohen's Kappa, a measure of inter-rater agreement or reliability. Suppose we have two raters that each assigns the same subjects or objects to one of a fixed number of categories. The question then is: How well do the raters agree in their assignments? Kappa provides a measure of association, the largest possible value is one, the smallest is minus 1, and it has a corresponding statistical test for the hypothesis that agreement is only by chance. Cohen's kappa and similar measures have widespread use, among other fields, in medical or biostatistic. In one class of applications, raters are doctors, subjects are patients and the categories are medical diagnosis. Cohen's Kappa provides a measure how well the two sets of diagnoses agree, and a hypothesis test whether the agreement is purely random.

For more background see this Wikipedia page which was the starting point for my code.

Most of the following focuses on weighted kappa, and the interpretation of different weighting schemes. In the last part, I add some comments about R, which provided me with several hours of debugging, since I'm essentially an R newbie and have not yet figured out some of it's "funny" behavior.

Read more »

by Josef Perktold (noreply@blogger.com) at March 28, 2013 09:54 PM

Matthew Rocklin

SymPy and Theano -- Scalar Simplification

Introduction

This post uses some LaTeX. You may want to read it on the original site.

In my last post I showed how SymPy can benefit from Theano. In particular Theano provided a mature platform for code generation that outperformed SymPy’s attempt at the same problem. I argued that projects should stick to one specialty and depend on others for secondary concerns. Interfaces are better than add-ons.

In this post I’ll show how Theano can benefit from SymPy. In particular I’ll demonstrate the practicality of SymPy’s impressive scalar simplification routines for generating efficient programs.

After re-reading over this post I realize that it’s somewhat long. I’ve decided to put the results first in hopes that it’ll motivate you to keep reading.

Projectoperation count
SymPy27
Theano24
SymPy+Theano17

Now, lets find out what those numbers mean.

Example problem

We use a larger version of our problem from last time; a radial wavefunction corresponding to n = 6 and l = 2 for Carbon (Z = 6)

from sympy.physics.hydrogen import R_nl
from sympy.abc import x
n, l, Z = 6, 2, 6
expr = R_nl(n, l, x, Z)
print latex(expr)

$$\frac{1}{210} \sqrt{70} x^{2} \left(- \frac{4}{3} x^{3} + 16 x^{2} - 56 x + 56\right) e^{- x}$$

We want to generate code to compute both this expression and its derivative. Both SymPy and Theano can compute and simplify derivatives. In this post we’ll measure the complexity of a computation that simultaneously computes both the above expression and its derivative. We’ll arrive at this computation through a couple of different routes that use overlapping parts of SymPy and Theano. This will supply a couple of direct comparisons.

Disclaimer: I’ve chosen a larger expression here to exaggerate results. Simpler expressions yield less impressive results.

Simplification

We show the expression, it’s derivative, and SymPy’s simplification of that derivative. In each case we quantify the complexity of the expression by the number of algebraic operations

The target expression:

print latex(expr)

$$\frac{1}{210} \sqrt{70} x^{2} \left(- \frac{4}{3} x^{3} + 16 x^{2} - 56 x + 56\right) e^{- x}$$

print "Operations: ", count_ops(expr)
Operations:  17

It’s derivative

print latex(expr.diff(x))

$$ \frac{1}{210} \sqrt{70} x^{2} \left(- 4 x^{2} + 32 x - 56\right) e^{- x} - \frac{1}{210} \sqrt{70} x^{2} \left(- \frac{4}{3} x^{3} + 16 x^{2} - 56 x + 56\right) e^{- x} + \frac{1}{105} \sqrt{70} x \left(- \frac{4}{3} x^{3} + 16 x^{2} - 56 x + 56\right) e^{- x} $$

print "Operations: ", count_ops(expr.diff(x))
Operations:  48

The result of simplify on the derivative. Note the significant cancellation of the above expression.

print latex(simplify(expr.diff(x)))

$$ \frac{2}{315} \sqrt{70} x \left(x^{4} - 17 x^{3} + 90 x^{2} - 168 x + 84\right) e^{- x} $$

print "Operations: ", count_ops(simplify(expr.diff(x)))
Operations:  18

An unevaluated derivative object. We’ll end up passing this to Theano so that it computes the derivative on its own.

print latex(Derivative(expr, x))

$$ \frac{\partial}{\partial x}\left(\frac{1}{210} \sqrt{70} x^{2} \left(- \frac{4}{3} x^{3} + 16 x^{2} - 56 x + 56\right) e^{- x}\right) $$

Bounds on the cost of Differentiation

Scalar differentiation is actually a very simple transformation.

You need to know how to transform all of the elementary functions (exp, log, sin, cos, polynomials, etc...), the chain rule, and that’s it. Theorems behind automatic differentiation state that the cost of a derivative will be at most five times the cost of the original. In this case we’re guaranteed to have at most 17*5 == 85 operations in the derivative computation; this holds in our case because 48 < 85

However derivatives are often far simpler than this upper bound. We see that after simplification the operation count of the derivative is 18, only one more than the original. This is common in practice.

Theano Simplification

Like SymPy, Theano transforms graphs to mathematically equivalent but computationally more efficient representations. It provides standard compiler optimizations like constant folding, and common sub-expressions as well as array specific optimizations like the element-wise operation fusion.

Because users regularly handle mathematical terms Theano also provides a set of optimizations to simplify some common scalar expressions. For example Theano will convert expressions like x*y/x to y. In this sense it overlaps with SymPy’s simplify functions. This post is largely a demonstration that SymPy’s scalar simplifications are far more powerful than Theano’s and that their use can result in significant improvements. This shouldn’t be surprising. Sympians are devoted to scalar simplification to a degree that far exceeds the Theano community’s devotion to this topic.

Experiment

We’ll compute the derivative of our radial wavefunction and then simplify the result. We’ll do this using both SymPy’s derivative and simplify routines and using Theano’s derivative and simplify routines. We’ll then compare the two results by counting the number of required operations.

Here is some setup code that you can safely ignore:

def fgraph_of(*exprs):
    """ Transform SymPy expressions into Theano Computation """
    outs = map(theano_code, exprs)
    ins = theano.gof.graph.inputs(outs)
    ins, outs = theano.gof.graph.clone(ins, outs)
    return theano.gof.FunctionGraph(ins, outs)

def theano_simplify(fgraph):
    """ Simplify a Theano Computation """
    mode = theano.compile.get_default_mode().excluding("fusion")
    fgraph = fgraph.clone()
    mode.optimizer.optimize(fgraph)
    return fgraph

def theano_count_ops(fgraph):
    """ Count the number of Scalar operations in a Theano Computation """
    return len(filter(lambda n: isinstance(n.op, theano.tensor.Elemwise),
                      fgraph.apply_nodes))

In SymPy we create both an unevaluated derivative and a fully evaluated and sympy-simplified version. We translate each to Theano, simplify within Theano, and then count the number of operations both before and after simplification. In this way we can see the value added by both SymPy’s and Theano’s optimizations.

exprs = [Derivative(expr, x),    # derivative computed in Theano
         simplify(expr.diff(x))] # derivative computed in SymPy, also sympy-simplified

for expr in exprs:
    fgraph = fgraph_of(expr)
    simp_fgraph = theano_simplify(fgraph)
    print latex(expr)
    print "Operations:                             ", theano_count_ops(fgraph)
    print "Operations after Theano Simplification: ", theano_count_ops(simp_fgraph)

Theano Only

$$ \frac{\partial}{\partial x}\left(\frac{1}{210} \sqrt{70} x^{2} \left(- \frac{4}{3} x^{3} + 16 x^{2} - 56 x + 56\right) e^{- x}\right) $$

Operations:                              40
Operations after Theano Simplification:  21

SymPy + Theano

$$ \frac{2}{315} \sqrt{70} x \left(x^{4} - 17 x^{3} + 90 x^{2} - 168 x + 84\right) e^{- x} $$

Operations:                              13
Operations after Theano Simplification:  10

Analysis

On its own Theano produces a derivative expression that is about as complex as the unsimplified SymPy version. Theano simplification then does a surprisingly good job, roughly halving the amount of work needed (40 -> 21) to compute the result. If you dig deeper however you find that this isn’t because it was able to algebraically simplify the computation (it wasn’t) but rather because the computation contained several common sub-expressions. The Theano version looks a lot like the unsimplified SymPy version. Note the common sub-expressions like 56*x below.

$$ \frac{1}{210} \sqrt{70} x^{2} \left(- 4 x^{2} + 32 x - 56\right) e^{- x} - \frac{1}{210} \sqrt{70} x^{2} \left(- \frac{4}{3} x^{3} + 16 x^{2} - 56 x + 56\right) e^{- x} + \frac{1}{105} \sqrt{70} x \left(- \frac{4}{3} x^{3} + 16 x^{2} - 56 x + 56\right) e^{- x} $$

The pure-SymPy simplified result is again substantially more efficient (13 operations). Interestingly Theano is still able to improve on this, again not because of additional algebraic simplification but rather due to constant folding. The two projects simplify in orthogonal ways.

Simultaneous Computation

When we compute both the expression and its derivative simultaneously we find substantial benefits from using the two projects together.

orig_expr = R_nl(n, l, x, Z)
for expr in exprs:
    fgraph = fgraph_of(expr, orig_expr)
    simp_fgraph = theano_simplify(fgraph)
    print latex((expr, orig_expr))
    print "Operations:                             ", len(fgraph.apply_nodes)
    print "Operations after Theano Simplification: ", len(simp_fgraph.apply_nodes)

$$ \begin{pmatrix}\frac{\partial}{\partial x}\left(\frac{1}{210} \sqrt{70} x^{2} \left(- \frac{4}{3} x^{3} + 16 x^{2} - 56 x + 56\right) e^{- x}\right), & \frac{1}{210} \sqrt{70} x^{2} \left(- \frac{4}{3} x^{3} + 16 x^{2} - 56 x + 56\right) e^{- x}\end{pmatrix} $$

Operations:                              57
Operations after Theano Simplification:  24

$$ \begin{pmatrix}\frac{2}{315} \sqrt{70} x \left(x^{4} - 17 x^{3} + 90 x^{2} - 168 x + 84\right) e^{- x}, & \frac{1}{210} \sqrt{70} x^{2} \left(- \frac{4}{3} x^{3} + 16 x^{2} - 56 x + 56\right) e^{- x}\end{pmatrix} $$

Operations:                              27
Operations after Theano Simplification:  17

The combination of SymPy’s scalar simplification and Theano’s common sub-expression optimization yields a significantly simpler computation than either project could do independently.

To summarize

Projectoperation count
SymPy27
Theano24
SymPy+Theano17

References

March 28, 2013 07:00 AM

March 26, 2013

Josef Perkoltd

Multiple Comparison and Tukey HSD or why statsmodels is awful

Introduction

Statistical tests are often grouped into one-sample, two-sample and k-sample tests, depending on how many samples are involved in the test. In k-sample tests the usual Null hypothesis is that a statistic, for example the mean as in a one-way ANOVA, or the distribution in goodness-of-fit tests, is the same in all groups or samples. The common test is the joint test that all samples have the same value, against the alternative that at least one sample or group has a different value.

However, often we are not just interested in the joint hypothesis if all samples are the same, but we would also like to know for which pairs of samples the hypothesis of equal values is rejected. In this case we conduct several tests at the same time, one test for each pair of samples.

This results, as a consequence, in a multiple testing problem and we should correct our test distribution or p-values to account for this.

I mentioned some of the one- and two sample test in statsmodels before. Today, I just want to look at pairwise comparison of means. We have k samples and we want to test for each pair whether the mean is the same, or not.

Instead of adding more explanations here, I just want to point to R tutorial and also the brief description on Wikipedia. A search for "Tukey HSD" or multiple comparison on the internet will find many tutorials and explanations.

The following are examples in statsmodels and R interspersed with a few explanatory comments.

Read more »

by Josef Perktold (noreply@blogger.com) at March 26, 2013 04:59 PM

March 25, 2013

Michael Droettboom

matplotlib lessons learned

The future

Jake Vanderplass has a very thought-provoking essay about the future of visualization in Python. It’s an exciting time for visualization in Python with so many new options exploding onto the scene, and Jake has provided a nice summary. However, I don’t think it presents a very current view of matplotlib, which is still alive and well with funding sources, and moving to "modern" things like web frontends and web services, and has nascent ongoing project related to hardware acceleration. Importantly, it has thousands of person hours of investment in all of the large to tiny problems that have been found along the way.

In the browser

One of the directions that new plotting projects are taking is to be more integrated in the web browser. This has all of the advantages of cloud computing (zero install, distributed data), and integrates well with great projects like the IPython notebook.

matplotlib is already most of the way there. matplotlib’s git master has completely interactive and full featured plotting in the browser – meaning it can do everything any of the other matplotlib backends can do – by basically running something very similar to a VNC protocol between the server and the client. You can try it out today by building from git and using the WebAgg backend. Shortly, it will also be available as part of Google App Engine – so we’ll get some real experience running these things remotely in a real "Internet-enabled" mode. The integration work with IPython still needs to be done – and I hope this can be a major focus of discussion at SciPy when I’m there.

The VNC-like approach was ultimately arrived at after many months of experimenting with approaches more based on JS and HTML5 and/or SVG. The main problem one runs into with those approaches is working with large datasets – matplotlib has some very sophisticated designs to make working working with large data sets zippy and interactive (specifically path simplification, blitting of markers, dynamic down sampling of images) all of which are just really hard to implement efficiently in a browser. D3’s Javascript demos feel very zippy and efficient, until you realize how canned they are, or how much they rely on very specific means to shuttle reduced data and from the browser. There’s a place for interactive canned graphics as part of web publishing, but it’s much less useful for doing science on data for the first time. But in general from these experiments, I’ve become rather skeptical of approaches that try to do too much in the browser. Even though matplotlib is on the "old" paradigm of running on a server (or a local desktop), the advantage of that approach is that we control the whole stack and can optimize the heck out of the places that need to be optimized. Browsers are much more of a black box in that regard.

I don’t know if WebGL will offer some improvements here. It’s enough of a moving target that it should probably be re-examined on a regular basis.

On the GPU

And in the diametrically opposite direction, we have projects moving plotting onto the GPU. Particularly interesting to me here is the glagg project by Nicolas Rougier and others. It’s important to note for those not in the trenches that for high-quality 2D plotting on the GPU, things are much less straightforward than for 3D. Graphics cards don’t "just do" high-quality 2D rendering out of the box. It requires the use of custom vertex shaders (which are frankly works of extreme brilliance and also an exercise somewhat in putting round pegs in square holes and living to tell about it). Unfortunately, these things require rather recent graphics hardware and drivers and a fair bit of patience to get up and running. Things will get easier over time, but at the moment, a 100% software implementation still can’t be beat for portability and maximum accessibility for less technically-inclined users. But I look forward to where all of this is going.

Real benchmarking on real data needs to be performed to determine just how much faster these approaches will be for 2D plotting. (For 3D, which I discuss below, I think the advantages of hardware are more apparent).

Note

As a public service announcement, anyone looking for performance out of matplotlib should be using the Agg backend – it’s the only one with all optimizations available. The Mac OS-X Quartz backend is built on a closed source rendering library with some puzzling and surprising performance characteristics. Many of the attempts to speed up that backend involve workarounds for a black box that is not well understood. For the Agg-based backends, however, we control the stack from top-to-bottom and are able to optimize for real-world scientific plotting scenarios.

In 3-dimensions

matplotlib’s original focus has always been on 2D. Despite this, notably Benjamin Root and others continue to put a lot of effort into matplotlib’s 3D extensions, and they fill a niche for many users who want clean and crisp vector 3D for print, and it’s improving all the time. There are, of course, fundamental architectural problems with 3D in matplotlib (most importantly the lack of a proper z-ordering) that limit what can be done. That should be fixable with a few well-placed C/C++ extensions – I’m not certain that we need go whole hog to the GPU to fix that, but that’s certainly the obvious and well-trodden solution. I am concerned that too many of the new 3D projects seem to prioritize interactivity and hardware tricks at the expense of static quality. For this reason, matplotlib may continue to serve for some time as a high-quality print "backend" for some of these other 3-D based projects.

Interfaces

Another interesting direction of experimentation is in the area of user interface and API.

I think matplotlib owes a lot of its success to its shameless "embracing and extending" of the Matlab graphing API. Having taught matplotlib a few times to new users, I’m always impressed by how quickly new users pick it up.

The thing that’s a but cruftier and full of inconsistencies is matplotlib’s Object-Oriented interface. Things there often follow the pattern that was most easy to implement rather than what would be the most clean from the outside. It’s probably due time to start re-evaluating some of those API’s and breaking backward compatibility for the sake of more consistency going forward.

The Grammar of Graphics syntax from the R world is really interesting, and I think fills a middle ground. It’s more powerful (and I think a little more complex to learn at first) than the Matlab interface, but it has the nice property of self-consistency that once you learn a few things you can easily guess at how to do many others.

Peter Wang’s Bokeh project aims to bring Grammar of Graphics to Python, which I think is very cool. Note however, that even there, Bokeh includes a matlab-like interface, as does another plotting project Galry, so mlab is by no means dead.

Doomed to repeat

There are a lot of ways in which matplotlib can be improved. I encourage everyone to look at our MEPS to see some ongoing projects that are being discussed or worked on. There are some large, heroic and important projects there to bring matplotlib forward.

But I think more interestingly for this blog post is to focus on some of the really low-level early architectural decisions that have limited or made it difficult to move matplotlib forward. Importantly, these aren’t issues that I’m seeing discussed very often, but they are things I would try to tackle up front if I ever get a case of "2.0-itis" and were starting fresh today. Hopefully these injuries of experience can inform new projects – or they may inspire someone with loads of time to take on refactoring matplotlib. It would not be impossible to make these changes to matplotlib, but it would take a concerted long-term effort and the ability to break some backward compatibility for the common good.

Generic tree manipulations

matplotlib plots are more-or-less tree structures of objects that are "run" to draw things on the screen. (It isn’t strictly a tree, as some cross-referencing is required for things like referring to clip paths etc.) For example, the figure has any number of axes, each of which have lines plotted on them. Drawing involves starting at the figure and visiting each of its axes and each of its lines. All very straightforward. But there is no way to traverse that tree in a generic way to perform manipulations on it.

For example, you might want to have a plot with a number of different-colored lines that you want to make ready for black-and-white publication by changing the lines to have different dash patterns. Or, you might want to go through all of the text and change the font. Or, as much as it personally wouldn’t fit my workflow, many people would like a graphical editor that would allow one to traverse the tree of objects in the plot and change properties on them. There’s currently no way to do this in a generic way that would work on any plot with any kind of object in it.

I’m thinking what is needed is something like the much-maligned "Document Object Model (DOM)" is needed (if I say "ElementTree" instead, is that more appealing to Pythonistas?) That way, one could traverse this tree in a generic fashion and do all kinds of things without needing to be aware of what specifically is in the plot.

Type-checking, styles, properties, traits

matplotlib predates properties and traits and other things of that ilk, so it, not unreasonably, uses get_ and set_ methods for most things. Beyond the syntactic implications of this (which don’t bother me as much as some), they are rather inconsistent. Some are available as keyword arguments to constructors and plotting methods, but it’s inconsistent because some must be manually handled by the code while others are handled automatically. Some type-check their arguments immediately, whereas others will blow up on invalid input much later in some deeply nested backtrace. Some are mutable and cause an update of the plot when changed. Some seem mutable, but changing them has no effect. Traits (such as Enthought Traits or something else in that space) would be a great fit for this. It’s been examined a few times, and while the idea seems to be a good fit, the implementation was always the stumbling block. But it’s high time to look at this again.

Combining this with the tree manipulation suggestion above, we’d be able to do really powerful things like CSS-style styling of plots.

(Didn’t I just say above that web browsers weren’t well suited, yet I’m stealing some fundamentals of their design here…?)

Related to the above, matplotlib’s handling of colors and alpha-blending is all over the map. There needs to be a cleanup effort to make handling consistent throughout. Once that’s done, the way forward should be clear to manage CMYK colors internally for formats that support them (e.g. PDF). Ditto on other properties like line styles and marker styles.

Projections and ticking

Ticking is the process by which the positions of the grid lines, ticks and labels are determined. There are a number of third-party projects that build new projections on top of matplotlib (basemap, pywcsgrid2, cartopy to name a few). Unfortunately, they can’t really take advantage of the many (and subtly difficult) ticking algorithms in matplotlib because matplotlib’s tickers are so firmly grounded in the rectilinear world. matplotlib needs to improve its tickers to be more generic and more usable when the grid is rotated or warped in a myriad of ways so that all of this duplication of effort can be reduced.

Related to this, matplotlib have transformations (which determine how the data is mapped to the Cartesian space on screen), tickers (which determine the positions of grid lines), formatters (which determine how the tick’s text label is rendered) and locators (which choose pleasant-looking bounds for the data), all of which work in tandem to produce the labels, ticks and gridlines, but which have no relationship to each other. It should be easier to relate these things together, because you usually want a set that works well together. Phil Elson has done some work in this direction, but there’s still more that could be done.

Higher dimensionality

matplotlib’s 3D support feels tacked on structurally. It would be better if more parts were agnostic to the dimensionality of the data.

May you live in interesting times

It’s really exciting to watch all that’s going on, and thanks to Jake Vanderplass for getting this discussion rolling.

March 25, 2013 01:47 PM

March 23, 2013

Jake Vanderplas

Matplotlib and the Future of Visualization in Python

Last week, I had the privilege of attending and speaking at the PyCon and PyData conferences in Santa Clara, CA. As usual, there were some amazing and inspiring talks throughout: I would highly recommend browsing through the videos as they are put up on pyvideo.

One thing I spent a lot of time thinking, talking, and learning about during these two conferences was the topic of data visualization in Python. Data visualization seemed to be everywhere: PyData had two tutorials on matplotlib (the second given by yours truly), as well as a talk about NodeBox OpenGL and a keynote by Fernando Perez about IPython, including the notebook and the nice interactive data-visualization it allows. Pycon had a tutorial on network visualization, a talk on generating art in Python, and a talk on visualizing Github.

The latter of these I found to be a bit abstract -- more discussion of visual design principles than particular tools or results -- but contained one interesting nugget: though D3 is the current state-of-the-art for publication-quality online, interactive visualization, practitioners often use simpler tools like matplotlib or ggplot for their initial data exploration. This was a thought echoed by Lynn Cherny in her presentation about NodeBox OpenGL: it has some really nice features which allow the creation of beautiful, flexible, and interactive graphics within Python. But if you just want to scatter-plot x versus y to see what your data looks like, matplotlib might be a better option.

I found Lynn’s talk incredibly interesting, and not just because of the good-natured heckling I received from the stage! Her main point was that in the case of interactive & animated visualization, matplotlib is relatively limited. She introduced us to a project called NodeBox OpenGL, which is a cross-platform graphics creation package that has some nice Python bindings, and can create some absolutely beautiful graphics. This is one example of a physics flow visualization, taken from the above website:

'NodeBox OpenGL visualization'

Despite Lynn’s admission that, with regards to the limitations of matplotlib, I had taken a bit of the wind out of her sails with my interactive matplotlib tutorial the day before (in which many of the examples I used will be familiar to readers of this blog), I think her point on the limitations of matplotlib is very well-put. Though it remains my tool of choice for data visualization and the creation of publication-quality scientific graphics, matplotlib is also a decade-old platform, and sometimes shows its age.

Matplotlib’s History

Matplotlib is a multi-platform data visualization tool built upon the Numpy and Scipy framework. It was conceived by John Hunter in 2002, originally as a patch to IPython to enable interactive MatLab-style plotting via gnuplot from the IPython command-line. Fernando Perez was, at the time, scrambling to finish his PhD, and let John know he wouldn’t have time to review the patch for several months. John took this as a cue to set out on his own, and the matplotlib package was born, with version 0.1 released in 2003. It received an early boost when it was adopted as the plotting package of choice of the Space Telescope Science Institute, which financially supported matplotlib’s development and led to greatly expanded capabilities.

One of matplotlib’s most important features is its ability to play well with many operating systems and graphics backends. John Hunter highlighted this fact in a keynote talk he gave last summer, shortly before his sudden and tragic passing (video link): “We didn’t try to be the best in the beginning, we just tried to be there...” and fill-in the features as needed. In this talk, John outlined the reasons he thinks matplotlib succeeded in outlasting the dozens of competing packages:

  • it could be used on any operating system via its array of backends
  • it had a familiar interface: one similar to MatLab
  • it had a coherent vision: to do 2D graphics, and do them well
  • it found early institutional support, from astronomers at STScI and JPL
  • it had an outspoken advocate in Hunter himself, who enthusiastically promoted the project within the Python world
  • </