Boom!

Michael Droettboom’s Blog: Python, matplotlib, astropy and other tidbits

VirtualBox on Mac OS X


I recently got a Mac Mini so I can start working on Macintosh-specific issues with matplotlib. Thanks again to Hans Petter Langtangen, the Director of the Center for Biomedical Computing at Simula for his gracious donation that supported the purchase.

One of the things I’d like to use this machine for is to test installers in various environments – a fresh install, or with MacPorts, or Fink etc. – to make sure the installers include everything they need. So I want to run Mac OS X in a virtual machine that I can reset on a regular basis to a known state. This is now allowed by the licensing of OS X, as long as it’s running on top of genuine Apple hardware, and you create no more than 5 instances.

It’s surprisingly hard to find information on installing Mac OS X in a virtual machine. Most of what Google finds is various hacks to run on non-Apple hardware. I, instead, want to do this legally.

Unfortunately, the Mac Mini came with no installation media, so you can’t just plug it in and install it in a virtual machine. I thought, ok, no problem – I’ll just pop over to the App Store and download it. Unfortunately, to do that, Apple wanted to charge me $29.99 for something I already own.

So next I looked at the recovery partition. Parallels is reportedly able to use the recovery partition directly to install in the virtual machine. However, I want to use VirtualBox, since it is open source, it's what I'm familiar with on Linux, and, most importantly, it can be automated by vagrant. After trying in vain to point VirtualBox at the magic stuff in the recovery partition, I came upon a working solution. The following steps were done with Mountain Lion, and I have no idea whether they are applicable to other releases.

  • Grab an external USB drive and plug it in. I think it needs to be at least the size of a double-layer DVD, or 8.5 GB. This process will erase everything on it.

  • Reboot into recovery mode by holding down Command+R during boot. (A word of advice to those new to Macs: wait until you hear the startup sound before pressing the keys, and hold them until the Apple logo disappears. Timing seems to be important here.)

  • A list of utilities will appear. Open "Disk Utility" and repartition the external disk to have a "GUID" partition table with a single "HFS+" partition.

  • Exit "Disk Utility" and then go to "Install OS X". Install it to the external drive.

  • When the installation is done, the system will reboot back into your "host" installation on the internal drive. (I was surprised by this – you may need to remove the USB drive to boot into the internal drive, but I didn’t need to).

  • The result is a folder on the external drive called "OS X Install Data". Inside that folder is a disk image of the installer, InstallESD.dmg. Copy this to your internal hard disk. You can then remove the external USB drive; we're done with it.

  • Unfortunately, there is still a small incompatibility with power management inside of VirtualBox that will cause the installer to hang during boot. The kernel extension that handles CPU power management needs to be replaced. I found the instructions for that here; I'm paraphrasing them below, and only including the steps for Mountain Lion.

    • Download InstallESD.dmg.tool.

    • Download NullCPUPowerManagement.kext

    • Run the following command:

      ./InstallESD.dmg.tool -i InstallESD.dmg -o Output.dmg -- NullCPUPowerManagement.kext
      
  • The Output.dmg is now a patched installer image that can be used to install OS X in VirtualBox.

  • In VirtualBox, create a new virtual machine and use its default settings for an OS X guest. Then, open the settings for the machine and go to the Storage tab. Add a new CD/DVD device to the IDE bus (it must be IDE: SATA did not work for me), and select the Output.dmg from the file dialog. Check the "Live CD/DVD" box.

  • You now should be able to boot into the installer and install OS X within VirtualBox. When the installer is ready to reboot, go back to VirtualBox settings and "eject" the virtual DVD before restarting.

Hopefully this will help others out.

Matplotlib Lessons Learned


The future

Jake VanderPlas has a very thought-provoking essay about the future of visualization in Python. It's an exciting time for visualization in Python, with so many new options exploding onto the scene, and Jake has provided a nice summary. However, I don't think it presents a very current view of matplotlib, which is still alive and well, with funding sources, moving to "modern" things like web frontends and web services, and with a nascent project related to hardware acceleration. Importantly, it has thousands of person-hours of investment in all of the large-to-tiny problems that have been found along the way.

In the browser

One of the directions that new plotting projects are taking is to be more integrated in the web browser. This has all of the advantages of cloud computing (zero install, distributed data), and integrates well with great projects like the IPython notebook.

matplotlib is already most of the way there. matplotlib’s git master has completely interactive and full featured plotting in the browser – meaning it can do everything any of the other matplotlib backends can do – by basically running something very similar to a VNC protocol between the server and the client. You can try it out today by building from git and using the WebAgg backend. Shortly, it will also be available as part of Google App Engine – so we’ll get some real experience running these things remotely in a real "Internet-enabled" mode. The integration work with IPython still needs to be done – and I hope this can be a major focus of discussion at SciPy when I’m there.
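
If you want to kick the tires, selecting the backend is the usual one-liner. A minimal sketch, assuming a matplotlib built from git master with the WebAgg backend enabled:

import matplotlib
matplotlib.use('WebAgg')          # must be called before pyplot is imported
import matplotlib.pyplot as plt
import numpy as np

t = np.linspace(0, 2 * np.pi, 1000)
plt.plot(t, np.sin(t))
plt.show()   # starts a Tornado web server and prints the address to open in a browser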

The VNC-like approach was ultimately arrived at after many months of experimenting with approaches based more on JS and HTML5 and/or SVG. The main problem one runs into with those approaches is working with large datasets – matplotlib has some very sophisticated designs to make working with large data sets zippy and interactive (specifically path simplification, blitting of markers, dynamic downsampling of images), all of which are just really hard to implement efficiently in a browser. D3's Javascript demos feel very zippy and efficient, until you realize how canned they are, or how much they rely on very specific means to shuttle reduced data to and from the browser. There's a place for interactive canned graphics as part of web publishing, but it's much less useful for doing science on data for the first time. In general, though, these experiments have made me rather skeptical of approaches that try to do too much in the browser. Even though matplotlib is on the "old" paradigm of running on a server (or a local desktop), the advantage of that approach is that we control the whole stack and can optimize the heck out of the places that need to be optimized. Browsers are much more of a black box in that regard.
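
For the curious, a couple of the knobs behind that path handling are exposed as rcParams. The names are taken from recent matplotlib; treat the values as illustrative, not recommendations:

import matplotlib

matplotlib.rcParams['path.simplify'] = True            # drop visually redundant vertices
matplotlib.rcParams['path.simplify_threshold'] = 0.1   # how aggressively to simplify
matplotlib.rcParams['agg.path.chunksize'] = 10000      # split very long paths for the Agg renderer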

I don’t know if WebGL will offer some improvements here. It’s enough of a moving target that it should probably be re-examined on a regular basis.

On the GPU

And in the diametrically opposite direction, we have projects moving plotting onto the GPU. Particularly interesting to me here is the glagg project by Nicolas Rougier and others. It's important to note, for those not in the trenches, that for high-quality 2D plotting on the GPU, things are much less straightforward than for 3D. Graphics cards don't "just do" high-quality 2D rendering out of the box. It requires the use of custom vertex shaders (which are frankly works of extreme brilliance, and also something of an exercise in putting round pegs in square holes and living to tell about it). Unfortunately, these things require rather recent graphics hardware and drivers and a fair bit of patience to get up and running. Things will get easier over time, but at the moment, a 100% software implementation still can't be beat for portability and maximum accessibility for less technically-inclined users. But I look forward to where all of this is going.

Real benchmarking on real data needs to be performed to determine just how much faster these approaches will be for 2D plotting. (For 3D, which I discuss below, I think the advantages of hardware are more apparent).

Note

As a public service announcement, anyone looking for performance out of matplotlib should be using the Agg backend – it's the only one with all optimizations available. The Mac OS X Quartz backend is built on a closed-source rendering library with some puzzling and surprising performance characteristics. Many of the attempts to speed up that backend involve workarounds for a black box that is not well understood. For the Agg-based backends, however, we control the stack from top to bottom and are able to optimize for real-world scientific plotting scenarios.
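
Selecting Agg explicitly is again a one-liner. A minimal sketch for non-interactive (script or server) use:

import matplotlib
matplotlib.use('Agg')             # must be set before importing pyplot
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(range(10))
fig.savefig('out.png')            # rendered entirely by Agg, no GUI required

For interactive use, the Agg-based GUI backends (TkAgg, Qt4Agg, etc.) go through the same rendering path.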

In 3-dimensions

matplotlib's original focus has always been on 2D. Despite this, Benjamin Root and others notably continue to put a lot of effort into matplotlib's 3D extensions, which fill a niche for many users who want clean and crisp vector 3D for print, and it's improving all the time. There are, of course, fundamental architectural problems with 3D in matplotlib (most importantly the lack of proper z-ordering) that limit what can be done. That should be fixable with a few well-placed C/C++ extensions – I'm not certain that we need to go whole hog to the GPU to fix that, but that's certainly the obvious and well-trodden solution. I am concerned that too many of the new 3D projects seem to prioritize interactivity and hardware tricks at the expense of static quality. For this reason, matplotlib may continue to serve for some time as a high-quality print "backend" for some of these other 3D-based projects.

Interfaces

Another interesting direction of experimentation is in the area of user interface and API.

I think matplotlib owes a lot of its success to its shameless "embracing and extending" of the Matlab graphing API. Having taught matplotlib a few times, I'm always impressed by how quickly new users pick it up.

The thing that's a bit cruftier and full of inconsistencies is matplotlib's Object-Oriented interface. Things there often follow the pattern that was easiest to implement rather than what would be cleanest from the outside. It's probably time to start re-evaluating some of those APIs and breaking backward compatibility for the sake of more consistency going forward.

The Grammar of Graphics syntax from the R world is really interesting, and I think fills a middle ground. It’s more powerful (and I think a little more complex to learn at first) than the Matlab interface, but it has the nice property of self-consistency that once you learn a few things you can easily guess at how to do many others.

Peter Wang’s Bokeh project aims to bring Grammar of Graphics to Python, which I think is very cool. Note however, that even there, Bokeh includes a matlab-like interface, as does another plotting project Galry, so mlab is by no means dead.

Doomed to repeat

There are a lot of ways in which matplotlib can be improved. I encourage everyone to look at our MEPs (matplotlib Enhancement Proposals) to see some ongoing projects that are being discussed or worked on. There are some large, heroic and important projects there to bring matplotlib forward.

But I think more interesting for this blog post is to focus on some of the really low-level, early architectural decisions that have limited or made it difficult to move matplotlib forward. Importantly, these aren't issues that I see discussed very often, but they are things I would try to tackle up front if I ever got a case of "2.0-itis" and were starting fresh today. Hopefully these injuries of experience can inform new projects – or they may inspire someone with loads of time to take on refactoring matplotlib. It would not be impossible to make these changes to matplotlib, but it would take a concerted long-term effort and the ability to break some backward compatibility for the common good.

Generic tree manipulations

matplotlib plots are more-or-less tree structures of objects that are "run" to draw things on the screen. (It isn’t strictly a tree, as some cross-referencing is required for things like referring to clip paths etc.) For example, the figure has any number of axes, each of which have lines plotted on them. Drawing involves starting at the figure and visiting each of its axes and each of its lines. All very straightforward. But there is no way to traverse that tree in a generic way to perform manipulations on it.

For example, you might want to have a plot with a number of different-colored lines that you want to make ready for black-and-white publication by changing the lines to have different dash patterns. Or, you might want to go through all of the text and change the font. Or, as much as it personally wouldn’t fit my workflow, many people would like a graphical editor that would allow one to traverse the tree of objects in the plot and change properties on them. There’s currently no way to do this in a generic way that would work on any plot with any kind of object in it.

I'm thinking that something like the much-maligned "Document Object Model (DOM)" is needed (if I say "ElementTree" instead, is that more appealing to Pythonistas?). That way, one could traverse this tree in a generic fashion and do all kinds of things without needing to be aware of what specifically is in the plot.
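
To make the black-and-white example above concrete, here is a rough sketch of what such a generic walk could look like, leaning on Artist.get_children(); the walk and dashify helpers are hypothetical, not part of matplotlib:

from itertools import cycle
from matplotlib.lines import Line2D

def walk(artist):
    # depth-first traversal of the artist tree rooted at `artist`
    yield artist
    for child in artist.get_children():
        for descendant in walk(child):
            yield descendant

def dashify(fig):
    # give every line a distinct dash pattern for black-and-white print
    dashes = cycle(['-', '--', '-.', ':'])
    for artist in walk(fig):
        if isinstance(artist, Line2D):
            artist.set_linestyle(next(dashes))

A DOM- or ElementTree-like layer would essentially formalize this kind of traversal and add querying and mutation on top of it.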

Type-checking, styles, properties, traits

matplotlib predates properties and traits and other things of that ilk, so it, not unreasonably, uses get_ and set_ methods for most things. Beyond the syntactic implications of this (which don’t bother me as much as some), they are rather inconsistent. Some are available as keyword arguments to constructors and plotting methods, but it’s inconsistent because some must be manually handled by the code while others are handled automatically. Some type-check their arguments immediately, whereas others will blow up on invalid input much later in some deeply nested backtrace. Some are mutable and cause an update of the plot when changed. Some seem mutable, but changing them has no effect. Traits (such as Enthought Traits or something else in that space) would be a great fit for this. It’s been examined a few times, and while the idea seems to be a good fit, the implementation was always the stumbling block. But it’s high time to look at this again.
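
As a flavor of what a traits-like attribute might buy us, here is a tiny, purely illustrative sketch using plain Python descriptors (not Enthought Traits, and not matplotlib API; TypedProperty and FakeLine are made up for this example): the value is type-checked on assignment, and the owner is flagged as stale so the plot knows to redraw.

class TypedProperty(object):
    # illustrative only: validate on set and flag the owner as needing a redraw
    def __init__(self, name, kind, default):
        self.name, self.kind, self.default = name, kind, default

    def __get__(self, obj, owner):
        if obj is None:
            return self
        return obj.__dict__.get(self.name, self.default)

    def __set__(self, obj, value):
        if not isinstance(value, self.kind):
            raise TypeError('%s must be a %s' % (self.name, self.kind.__name__))
        obj.__dict__[self.name] = value
        obj.stale = True


class FakeLine(object):
    linewidth = TypedProperty('linewidth', float, 1.0)

    def __init__(self):
        self.stale = False

With this, line.linewidth = 2.0 succeeds and marks the object stale, while line.linewidth = 'fat' fails immediately rather than blowing up somewhere deep inside a later draw.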

Combining this with the tree manipulation suggestion above, we’d be able to do really powerful things like CSS-style styling of plots.

(Didn’t I just say above that web browsers weren’t well suited, yet I’m stealing some fundamentals of their design here…?)

Related to the above, matplotlib’s handling of colors and alpha-blending is all over the map. There needs to be a cleanup effort to make handling consistent throughout. Once that’s done, the way forward should be clear to manage CMYK colors internally for formats that support them (e.g. PDF). Ditto on other properties like line styles and marker styles.

Projections and ticking

Ticking is the process by which the positions of the grid lines, ticks and labels are determined. There are a number of third-party projects that build new projections on top of matplotlib (basemap, pywcsgrid2, cartopy to name a few). Unfortunately, they can’t really take advantage of the many (and subtly difficult) ticking algorithms in matplotlib because matplotlib’s tickers are so firmly grounded in the rectilinear world. matplotlib needs to improve its tickers to be more generic and more usable when the grid is rotated or warped in a myriad of ways so that all of this duplication of effort can be reduced.

Related to this, matplotlib has transformations (which determine how the data is mapped to the Cartesian space on screen), tickers (which determine the positions of grid lines), formatters (which determine how the tick's text label is rendered) and locators (which choose pleasant-looking bounds for the data), all of which work in tandem to produce the labels, ticks and gridlines, but which have no relationship to each other. It should be easier to relate these things together, because you usually want a set that works well together. Phil Elson has done some work in this direction, but there's still more that could be done.
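
As a small illustration of how tightly these pieces need to agree, consider a log-scaled axis: the scale, locator and formatter all have to be log-aware, and you select them independently (set_yscale happens to install matching defaults for you, which is exactly the kind of implicit coupling that could be made more explicit). A minimal sketch using existing matplotlib pieces:

import matplotlib.pyplot as plt
from matplotlib.ticker import LogLocator, LogFormatterMathtext

fig, ax = plt.subplots()
ax.plot([1, 10, 100, 1000])

ax.set_yscale('log')                                    # the transformation
ax.yaxis.set_major_locator(LogLocator(base=10.0))       # where the ticks go
ax.yaxis.set_major_formatter(LogFormatterMathtext())    # how the labels read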

Higher dimensionality

matplotlib’s 3D support feels tacked on structurally. It would be better if more parts were agnostic to the dimensionality of the data.

May you live in interesting times

It's really exciting to watch all that's going on, and thanks to Jake VanderPlas for getting this discussion rolling.

Matplotlib in the Browser: It’s Coming


UPDATE: I am now sharing my code in the mdboom/mpl_browser_experiments github repository, rather than in a gist.

It’s occurred to me recently that in my previous blog posts Part II and Part I, I was going about the problem all wrong. Getting the plotting logic running in the browser, while perhaps ideal, was full of pitfalls. The browsers all render things in different ways and have different performance characteristics. Large data structures in Javascript just start to fall over after a certain point.

One effective way to deal with the large data structure problem is to just not send the data to the browser at all, but instead send rendered images. While the size of the data scientists will want to process with Numpy and matplotlib is growing all the time, the size of the images being rendered has natural limits based on the resolution of our displays. (Retina displays have recently bumped this up, but even then, display resolution increases slowly relative to RAM and disk space.) So while for simple plots, sending and using the data wins, for anything beyond a reasonable amount of complexity, sending rendered images beats it (in terms of bandwidth) every time.

The other advantage of this approach is that it will work exactly like regular matplotlib. All of the effort and work that has already gone into matplotlib to make it as feature-rich and pixel-perfect as it is will apply immediately.

So I started to look at how we could just pipe what we already have – a high-quality, fast, and extensive rendering framework based on Agg – into the browser.

Experiments with VNC

An obvious pre-existing hammer on the shelf was VNC (www.realvnc.com). There are free servers available for all of the big platforms, and there is, believe it or not, a client written entirely in Javascript that runs in the browser: noVNC. After a little tinkering (I'll spare you the details), it's possible to share a single GUI window in the matplotlib kernel with a browser. And it works, more or less, although a little slowly.

There are a few problems with using VNC (and this mostly applies to its competitor NX as well):

  1. VNC servers hook directly into the GUI technology of your platform, so each platform handles setting up a server rather differently. I'm always loath to reach for solutions that involve a lot of platform-specific differences – it becomes a nightmare to support.
  2. There are a lot of unnecessary moving pieces. On X11, for example, the VNC server wants to be an entire X server, with a window manager etc. The window being shared, of course, has to be implemented in some GUI framework or other. That's a lot of extra stuff to install on a headless server that we don't really need.
  3. The noVNC client has to interpret the binary VNC protocol in Javascript. Joel Martin and the rest of the team are total rockstars and they’ve pulled off something very impressive. But at the end of the day, it’s not a great fit, and it wastes a lot of cycles.

So VNC almost gets us there, and the fact that it works "almost well enough" gave me confidence that a more "conduit"-based approach would work. So I got to thinking about what the bare minimum thing is that could work.

The fact is that VNC, at least as it was being used in the above context, is just sending events from the keyboard and mouse from the client, and getting delta images from the server. It has a rather sophisticated way of compressing those delta images, but at the end of the day, that’s all it really does for us, and all we really need.

It turns out that creating delta images in PNG doesn’t work too badly. The empty pixels compress away rather well, and the compression and decompression can be handled in C at both ends of the pipe. That is, browsers know how to decompress PNGs inherently – they don’t need to run slow and complex Javascript to do so, so while it may not be the optimal protocol, it’s a good choice in a practical sense.
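
A back-of-the-envelope sketch of the delta idea, using numpy and Pillow rather than anything from the actual matplotlib code (png_delta is a made-up helper): make every unchanged pixel transparent so the PNG compressor can squash it, and let the client composite the delta over the previous frame.

import io
import numpy as np
from PIL import Image

def png_delta(prev, curr):
    # prev, curr: (H, W, 4) uint8 RGBA buffers of two consecutive frames
    delta = curr.copy()
    unchanged = np.all(prev == curr, axis=-1)
    delta[unchanged] = 0                      # fully transparent where nothing moved
    buf = io.BytesIO()
    Image.fromarray(delta, 'RGBA').save(buf, format='PNG')
    return buf.getvalue()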

A proof of concept

So in this repository, I present a proof-of-concept for this approach. I have some hideously rough and undocumented code that, given a matplotlib figure, serves it interactively to a web browser. It requires only matplotlib and Tornado (which you probably already have if you already have a recent IPython). It’s obviously a long way from here to something that’s a true matplotlib backend and well-integrated with IPython notebook. This code in no way represents the final API. I also don’t do a lot of network programming, so I may be handling the AJAXy things suboptimally. However, I’d appreciate testing of this approach on different platforms and browsers to just prove its feasibility before putting in too much of that follow-on work.

To use it, just create a matplotlib figure, with whatever you want, and pass it to serve_figure.serve_figure(). For example, take the quadmesh example (something that would be really hard to implement in HTML5 canvas) and serve it:

import serve_figure

import numpy as np
from numpy import ma
from matplotlib import pyplot as plt

n = 12
x = np.linspace(-1.5,1.5,n)
y = np.linspace(-1.5,1.5,n*2)
X,Y = np.meshgrid(x,y)
Qx = np.cos(Y) - np.cos(X)
Qz = np.sin(Y) + np.sin(X)
Qx = (Qx + 1.1)
Z = np.sqrt(X**2 + Y**2)/5
Z = (Z - Z.min()) / (Z.max() - Z.min())

# The color array can include masked values:
Zm = ma.masked_where(np.fabs(Qz) < 0.5*np.amax(Qz), Z)

fig = plt.figure()
ax = fig.add_subplot(121)
ax.set_axis_bgcolor("#bdb76b")
ax.pcolormesh(Qx,Qz,Z, shading='gouraud')
ax.set_title('Without masked values')

ax = fig.add_subplot(122)
ax.set_axis_bgcolor("#bdb76b")
col = ax.pcolormesh(Qx,Qz,Zm,shading='gouraud')
ax.set_title('With masked values')

serve_figure.serve_figure(fig, port=8888)

Open up your web browser to http://127.0.0.1:8888 and you should (hopefully) be in business. Open up a second browser window (whether locally or on another machine) and note that the two plots are automatically synchronized. The "data cursor" (that displays the current location of the mouse cursor in data coordinates) also works.

http://mdboom.github.com/images/firefox.png

Matplotlib running in Firefox

Some back-of-the-napkin thoughts about performance: the average size of each frame at the default resolution is around 16 kbytes. On a standard 1 Mb/s DSL connection, we should be able to pipe seven or eight of those per second, which should be workable in terms of bandwidth. Of course, there are other factors, such as the latency of the network and the CPU time necessary to decompress the PNG files, etc., that are harder to account for. This will require some real-world testing to really get a sense of how well it works.

There’s a lot of finesse to follow. For example, we should be able to shrink the bandwidth by another 20% by using a 1-bit alpha channel. The cursor shape doesn’t ever change like it does in a regular matplotlib window. It should be possible (though not yet) to support the interactive callbacks in matplotlib to handle the mouse events in arbitrary ways inside of the server. In principle, there are very few limitations to this approach, and it has the potential to be a true peer to the existing backends.

Watch the matplotlib and IPython projects – pull requests will be coming soon.

Amazon MP3 Ends Their Support for Linux


I’ve bought digital music from Amazon for almost as long as they’ve been in that business. Their prices are very competitive, the MP3’s are dog-standard and DRM-free so they work on every device imaginable, and, until recently, they have supported Linux.

Admittedly, they haven’t supported Linux well for some time. Their own downloading app (when it was available) had fallen into disrepair and didn’t work on most modern Linux distributions. However, the simple, yet effective, clamz filled that void quite nicely.

Now, however, when you try to download more than one file at a time, you’re greeted with this message:

On Linux systems, Cloud Player only supports downloading songs one
at a time. To download your music, deselect all checkboxes, select
the checkbox for the song you want to download, then click the
"Download" button.

Changing the User Agent string and/or deleting cookies doesn’t seem to get over the wall. Other clamz users don’t appear to have any solutions.

When I contacted Amazon customer support to confirm that there was truly no workaround, the reply was:

I am sorry to know that you are not able to download multiple
songs from the cloud player. To download multiple songs you will
need the Amazon MP3 downloader. However since you are using a
Linux operating system, you will need to download one song at a
time from the cloud player.

Indeed, we have always been at war with Eastasia. And you’ve never been able to purchase and download MP3’s from Amazon on a Linux box. The hundreds of dollars I’ve spent doing just that certainly don’t mean anything. Thankfully, since it’s all DRM-free they can’t take away what I already have.

Sorry, Amazon, but I’m not a clicking monkey. I’ll need to go elsewhere. Can anyone else recommend another DRM-free service that doesn’t require jumping through so many hoops to download purchases?

John Hunter


We have lost one of the greats of our community.

I first met John Hunter when he came to give a presentation about matplotlib at the Space Telescope Science Institute in 2004 (or thereabouts). I remember being blown away by how capable matplotlib was (even then), and stunned by John’s drive to build it completely outside of what was paying the bills.

Years later, when I started working at STScI in 2007, one of my first tasks was to add some new features to matplotlib. Little did I know, it would become my passion as well. I was a little intimidated to be working on something so entrenched and widely used. But John somehow instilled confidence in me with his support of what I was doing, even as I was ripping apart old assumptions in the code and turning it over anew. Through all the e-mails, conference calls and chance conference meetings with John since, he has been the most encouraging mentor and a prime example of open source stewardship. I should be lucky to find even a fraction of the enthusiasm, skill and leadership that John had.

Equally impressive has been watching so many of us Scientific Python types, separated by geography and institutional affiliations, come together in supporting each other at a time like this. Sometimes it really is more than just code.

If matplotlib has been useful to you in any way, please give generously to the John Hunter Memorial Fund.

Client-side Rendering in Matplotlib, Part II: The Language Blender


Note

EDIT 2012-08-08: Added benchmarks in Firefox in addition to Chrome

In the last post, I outlined some of the architectural difficulties bringing matplotlib’s interactivity to the browser. In short, there’s a big chunk of code that lies between building the tree of artists and rendering them to the screen that needs to run either in the Python interpreter, as it does now, or inside of the web browser to support interactive web applications. It would be great to avoid having two code bases, one in Python and one in Javascript, that would need to be kept in sync. Writing code for both contexts from a single code base may turn out to be a pipe dream, but bear with me as I explore tools that might help.

Also, when trying to grok the discussion here and understand the architectural challenges of matplotlib, it may be helpful to read the matplotlib chapter by John Hunter and yours truly from Architecture of Open Source Applications, Volume II.

Tools

There are a few interesting projects that help bridge the gap between Python and Javascript.

PyJs

PyJs (formerly called Pyjamas) is a Python-to-Javascript converter. It also includes an environment much like Google Web Toolkit for developing rich client-side applications in Python, but those features are probably not useful here.

Skulpt

Skulpt is a Python interpreter written in Javascript. It can compile and run Python code entirely within the web browser. In terms of language features, it doesn’t seem as mature as PyJs, but the fact that it has no dependencies other than a couple of Javascript files may be an advantage in terms of deployment. An obvious shortcoming of both Skulpt and PyJs is the lack of support for Numpy – none of the existing matplotlib Python code, which depends so heavily on Numpy, would work in that context.

PyV8

Unlike the other two, which allow Python to run in the browser, PyV8 allows Javascript to run inside of the Python interpreter. It is a wrapper around Google’s open source V8 Javascript engine and allows sharing objects between Python and Javascript somewhat transparently. If the drawing code were to be rewritten in Javascript, it could then, theoretically, be used both from Python on the desktop and inside the web browser.

Playing around

As a first pass playing with these tools, I've created a new project on github, py-js-blending-experiments. I've started by writing a very simple benchmark that does a simple 2D affine transformation, in pure Python, Python with Numpy, Javascript and C. This test, while based on a real-world bottleneck in matplotlib, is probably too simple to read too much into the results. A real test would involve classes with complex interactions between them, to show how the same flexible system of transformations, tickers, formatters etc. would work, and would take into account the penalty of stepping over the gap between Python and Javascript. But all that will have to wait for a future blog post.
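
To give a flavor of what is being measured (this is a paraphrase, not the literal transform.py / transform_numpy.py from the repository), the benchmark applies a 2D affine transform to a large array of points:

import numpy as np

def transform_python(points, a, b, c, d, e, f):
    # pure Python: one (x, y) pair at a time
    out = []
    for x, y in points:
        out.append((a * x + c * y + e, b * x + d * y + f))
    return out

def transform_numpy(points, a, b, c, d, e, f):
    # vectorized: `points` is an (N, 2) float array
    x, y = points[:, 0], points[:, 1]
    return np.column_stack((a * x + c * y + e, b * x + d * y + f))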

The benchmarks

The benchmarks compare a number of different possible approaches.

  • Pure Python: This is just a simple pure Python implementation. transform.py
  • Pure Javascript: A hand-written JavaScript implementation. transform.js
  • Numpy: An implementation using vectorized Numpy operations. transform_numpy.py
  • C extension: A hand-written C extension. transform.c
  • Skulpt: Taking the pure Python implementation above, but running it through Skulpt so that it runs inside the browser.
  • PyJs: Compiling the pure Python implementation above to Javascript using PyJs, and then running the result in the browser.
  • PyV8: Running the hand-written Javascript implementation above inside of PyV8.

Results

The following results are on a quad-core Intel Core i5-2520M CPU @ 2.50GHz running Fedora Linux 17. Python 2.7.3, Numpy 1.6.1, Google Chrome 21.0.1180.57 beta and Firefox 14.0.1 were used. The benchmark is performing a 2D affine transformation on 128,000 points. Note that the timings in the web browser are quite variable. I’ve included the average and independent results of 5 runs.

=========================== ================== ==================================
Benchmark                   avg. time (in ms)  times
=========================== ================== ==================================
Pure Python                 94                 95, 96, 93, 96, 92
Pure Javascript Chrome      40                 41, 29, 59, 33, 42
Pure Javascript Firefox     15                 8, 7, 25, 20, 16
Numpy                       6                  7, 6, 6, 6, 6
C                           2                  2, 2, 2, 2, 2
Skulpt Chrome               686                700, 691, 676, 691, 676
Skulpt Firefox              1052               1027, 1052, 1062, 1060, 1061
PyJs Chrome                 2197               2156, 2218, 2176, 2187, 2251
PyJs Firefox                658                644, 687, 630, 680, 674
PyV8                        38                 38, 38, 38, 37, 37
=========================== ================== ==================================

Conclusion

So what can we conclude? Remember what I said about not reading too much into these results? Well, I’m going to do it anyway with an enormous caveat.

There is considerable overhead when trying to shoehorn Python into the browser (comparing Skulpt and PyJs to Pure Javascript). I personally am surprised by how much more successful the Python interpreter approach is vs. the Python to Javascript conversion approach, though that may be due to the relative incompleteness of Skulpt. (EDIT: Though the Firefox results tell the opposite story). It’s pretty clear where the overhead of PyJs comes from. The line:

X = a*x + c*y + e

converts to:

X = $p['__op_add']($add3=$p['__op_add']($add1=(typeof ($mul1=a)==typeof ($mul2=x) && typeof $mul1=='number'?
    $mul1*$mul2:
    $p['op_mul']($mul1,$mul2)),$add2=(typeof ($mul3=c)==typeof ($mul4=y) && typeof $mul3=='number'?
    $mul3*$mul4:
    $p['op_mul']($mul3,$mul4))),$add4=e);

You can see how basic numeric operators in Python don’t translate directly to those in Javascript, so it’s forced to do something a whole lot more dynamic, including typechecking within every loop iteration. I pity the fool Javascript engine that tries to optimize that.

Not surprisingly, the PyV8 engine performs comparably to the V8 engine embedded in Google Chrome, which also beats pure Python by at least a factor of 2. We could do rather well implementing this core in Javascript.

Numpy and C extensions, of course, beat everything handily for this very numerically-biased benchmark.

Where does that leave us? Who knows… Interesting ride, though. Stay tuned and leave comments… There’s more to hack away at.

Client-side Rendering in Matplotlib, Part I: Defining the Problem

| Comments

One of the big challenges that matplotlib faces as it enters its second decade is moving from a desktop app to the web browser client/server paradigm. This need has been known for a few years at least: SAGE and the IPython notebook are rich web clients and powerful ways to interact with Python, however, their plotting is still necessarily limited by matplotlib’s design to rendering a static image. John Hunter concluded his keynote at SciPy 2012 arguing that this was the single biggest challenge to matplotlib’s relevance today.

The performance of JavaScript and graphics in web browsers is no longer an issue. At least on a gut level, it appears to compete well with anything matplotlib is able to do with its Python bindings to traditional desktop GUI toolkits. See d3 and webgl-surface-plot for some examples of great, high-performance plotting coming out of the JavaScript community. The problem with those libraries is they don’t integrate well with Python data processing, they are harder to modify and extend and, let’s be honest, just generally lack the Pythonic features that have made matplotlib so popular.

When trying to determine how to pull matplotlib kicking and screaming into this Brave New World, let’s assume that the network bottleneck between the server (e.g. an IPython kernel) and the client (i.e. the web browser) is too high to simply send images over repeatedly. It would be awfully nice, if we’re going to do all this work anyway, to allow for interacting with a server that may be over a slow and high-latency internet connection on the other side of the globe. The only way to make interactivity bearable in that scenario is to put some actual plotting smarts into the client.

For the purposes of this discussion, we should define what interactivity means. I think it basically amounts to:

  • data cursor (i.e. getting the current position of the mouse in data coordinates)
  • panning and zooming
  • adjusting the edges of the axes within the figure

Other interactive features, such as a "back" button or "apply tight layout" button have an "activate and return" interaction, rather than real time interaction, so can probably be handled with a round-trip to the server and thus aren’t really considered here.

It’s well known that matplotlib has a number of backends that handle drawing to specific GUI frameworks or file formats. The matplotlib "core" understands how to build and generate plots, and then sends low-level drawing commands to the currently selected backend. In order to reduce code duplication, there is a solid wall between the core and the backends, and we’re constantly trying to minimize the amount of code required to write a backend. The advantage of this is not just to reduce the number of lines of code, but to ensure consistency between the backends, so that when you render a streamplot with hatching and custom markers to a PDF file, it looks the same as when you render it to an SVG.

So why can't we just add a new "webbrowser" backend? The problem is that the backends are too low-level. They know where the shapes and the text are, but they know nothing about how they relate to one another, how the data scales from its native data coordinates to the coordinates of the screen, or how to best add ticks and other annotations to the graph. All of that information would be required for any sort of interactivity.

To even begin to tackle this, we need to move from the current two-way split of the plotting core and rendering backends to a three-way split into the phases of outputting a plot:

  1. Build: This phase is where the various Artist objects that make up the plot are created and related to one another. This is where most of the domain-specific code about particular types of plots lives.
  2. Drawing: Given those Artists and view limits for the axes, figures out how to scale them, and where to place the ticks, labels and other pieces of text. This phase also includes decimating or downscaling the data for display, since how to do so is dependent on the limits. Newer features such as "tight limits" also need to happen during this phase.
  3. Rendering: Converts a series of simple commands from the drawing phase into the native commands understood by a particular GUI framework or file format.

In normal interactive use, the Build phase happens once, but the Drawing and Rendering phases happen in a continuous loop as the figure is panned, zoomed and resized.

The Drawing phase comprises a great deal of Python and C++ code [1], much of it at the heart of what matplotlib is. The big pieces are:

  • Ticking (i.e. deciding where the numeric values and gridlines should go) is a surprisingly involved task, and matplotlib’s ticking is very flexible, supporting many different scales (such as log scale) and formats (controlling the precision of the values, for example). Because of this, the Drawing phase is dependent on matplotlib’s transformation infrastructure.
  • Simplification and downsampling is performed on-the-fly as the data is zoomed to reduce unnecessary drawing and make the interactivity much snappier. Of course, when it comes to large data there are other issues about the network bandwidth and the memory efficiency of the data representation within the browser that may be limiting relative to what matplotlib can do now.
  • Text layout, including math text layout, is done at this stage, because the size of the text relative to other items can not be known until draw time.

It’s obvious that the Build phase can remain on the server. And the Rendering phase can remain on the server for the native GUI backends and the file formats. We may need to write a "web browser" backend, but that could be written in pure JavaScript if necessary. It’s that big Drawing piece in the middle that has to exist sometimes on the server and sometimes on the client. Ideally, this code would not be duplicated, run both in CPython and in the web browser (depending on usage) and remain as flexible and easy to read as the Python code we already have. Are you starting to understand why this is a hard problem yet?

I hope to follow this blog post up with some experiments into various possible solutions over the upcoming days and weeks. In the meantime, I encourage all the comments and help on this I can get.


[1]It’s easy enough to see what code is required for the drawing phase by using coverage.py and turning it on at the start of Figure.draw and turning it off again at the end.
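
A hypothetical sketch of that measurement (assuming the coverage.py API; the wrapper below is illustrative and not part of matplotlib):

import coverage
from matplotlib.figure import Figure

_cov = coverage.Coverage()        # older coverage.py releases spell this coverage.coverage()
_orig_draw = Figure.draw

def _draw_with_coverage(self, renderer):
    # measure only the code executed during the drawing phase
    _cov.start()
    try:
        return _orig_draw(self, renderer)
    finally:
        _cov.stop()
        _cov.save()               # inspect later with `coverage report` or `coverage html`

Figure.draw = _draw_with_coverage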