pandas lightning talk SciPy 2011

Video
Event SciPy 2011 Lightning Talk
Location Austin, TX
Date July 14, 2011

This transcript and summary were AI-generated and may contain errors.

Summary

In this talk, I clarify that “there’s a kind of a misconception that pandas is only for time series data. It’s completely not true.” My goal is to make it “one of the best, if not the best, tools in any language for working with relational data, labeled data.”

Core Architecture

pandas provides labeled arrays that handle heterogeneous, size-mutable data—bringing R’s data frame concept to Python. The library’s foundation: automatic data alignment, indexing and reshaping capabilities, and missing data handling. It originated from my work at a hedge fund but was designed for broad applicability.
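
A minimal sketch of the automatic alignment and missing-data behavior described above, using current pandas (the labels and values are invented for illustration):

    import pandas as pd

    # Two Series with partially overlapping index labels (made-up data).
    s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
    s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

    # Arithmetic aligns on labels automatically; labels present in only
    # one operand ("a" and "d") come back as NaN, i.e. missing data.
    print(s1 + s2)
    # a     NaN
    # b    12.0
    # c    23.0
    # d     NaN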

My vision: “trying to build a platform that could be used to essentially replace R, building the fundamental building blocks for statistics, for data manipulation, stuff that people in other scientific computing environments wish they had in Python.”

Technical Improvements

I consolidated the internal structure from two different classes for tabular data into a single implementation that “brings the best of both worlds,” including better handling of missing data in non-floating point types.

After years of resistance, I admitted to “giving in” on two frequently requested features: support for n-dimensional structures beyond the original three-dimension limit, and label-based indexing for slicing and complex get/set operations.
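
In current pandas, the label-based get/set behavior described here looks roughly like the following; the 2011 API was spelled differently, and the frame below is invented:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.arange(12).reshape(4, 3),
                      index=["a", "b", "c", "d"],
                      columns=["x", "y", "z"])

    # Slicing with labels; .loc slices include both endpoints.
    print(df.loc["b":"d", ["x", "z"]])

    # Setting by label ("fancy" get/set on the same object).
    df.loc["a", "y"] = 99

    # Integer-position indexing is still available via .iloc.
    print(df.iloc[0, 1])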

Data Operations and I/O

The talk covered improvements in pivoting and reshaping operations, showing how pandas could transform SQL-table-like structures into more analytically useful formats.
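
As a rough illustration of the pivot operation demonstrated in the talk (the foo and bar column names follow the slide described in the transcript; the values are invented):

    import pandas as pd

    # Long, SQL-table-like data keyed by two columns.
    long_df = pd.DataFrame({
        "foo":   ["one", "one", "two", "two"],
        "bar":   ["A", "B", "A", "B"],
        "value": [1.0, 2.0, 3.0, 4.0],
    })

    # Pivot into a wide layout: rows labeled by "foo", columns by "bar".
    wide = long_df.pivot(index="foo", columns="bar", values="value")
    print(wide)
    # bar    A    B
    # foo
    # one  1.0  2.0
    # two  3.0  4.0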

I/O also received attention, with more robust CSV reading and a complete overhaul of the PyTables-based HDF5 storage. I described the original HDF5 implementation as something I “hacked out in an afternoon” as a prototype, which I subsequently rebuilt.
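
A sketch of the I/O paths mentioned here, using current pandas; the file names and store key are placeholders, and HDFStore requires the PyTables (“tables”) package:

    import pandas as pd

    # Read a CSV file into a DataFrame ("data.csv" is a placeholder path).
    df = pd.read_csv("data.csv", index_col=0, parse_dates=True)

    # Round-trip through the PyTables-based HDFStore.
    with pd.HDFStore("store.h5") as store:
        store["df"] = df          # write
        df_back = store["df"]     # read back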

Future Directions

Planned developments: sparse data structures optimized for mostly-NA data, time zone support, generic moving window functions, and enhanced group-by operations. I also mentioned collaborative work on statsmodels integration.
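
A brief sketch of group-by and moving-window operations as they look in current pandas (the 2011 API differed, and the data below is invented):

    import numpy as np
    import pandas as pd

    rng = pd.date_range("2011-01-01", periods=10, freq="D")
    df = pd.DataFrame({"group": ["a", "b"] * 5,
                       "value": np.arange(10.0)}, index=rng)

    # Group-by aggregation: mean of "value" within each group.
    print(df.groupby("group")["value"].mean())

    # A moving-window computation over a 3-observation window.
    print(df["value"].rolling(window=3).mean())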

Key Quotes

“There’s a kind of a misconception that pandas is only for time series data. It’s completely not true.”

“I would like to make it one of the best, if not the best, tools in any language for working with relational data, labeled data.”

“Trying to build a platform that could be used to essentially replace R, building the fundamental building blocks for statistics, for data manipulation, stuff that people in other scientific computing environments wish they had in Python and haven’t had for a long time.”

“If you use it, you find that it’s not, send me an email. Either you can hack on it or let me know how it could be improved.”

“There used to be two classes with different internal implementations for tabular data. I fixed that. There’s now just one. I think it brings the best of both worlds.”

“Usually in the past, pandas only handled three dimensions and less, which is really all you need for finance and econometrics, but people keep asking for greater than three, so I’m giving in.”

“Here’s another thing that I gave into recently, adding fancier indexing, something I stonewalled on for years, it feels like, and I finally added.”

“Somebody emailed me and said, how do you store pandas objects in HDF5, and so I hacked out something in an afternoon, but it was just a prototype, so I went through and actually built something real.”


Transcript

All right. All right. I’ll try to talk at least as fast as Peter here.

So I wanted to tell you about what I’ve been working on in pandas and some things that I think are exciting. I don’t know if many of you have heard of this library, but basically we’ve got labeled arrays that handle heterogeneous data that are also size mutable, so kind of like an R data frame if you’ve ever used R. But basically I think there’s a kind of a misconception that pandas is only for time series data. It’s completely not true. It happens to be very good for that, but also a lot of other things.

Key features are automatic data alignment with lots of indexing and reshaping. It does missing data really well, both implicitly and explicitly. It’s got great time series stuff. I’ve used scikits.timeseries, but I find it hard to use. And I work a lot with multiple time series, which isn’t really very well supported in scikits.timeseries. So if you’ve got time series data and find yourself struggling, take a look. And there’s a lot of stuff for doing SQL-like operations, merging, joining, that kind of thing.

Of course, you know, I used to work for a hedge fund. I built it inside a hedge fund. It’s extremely good for financial data. People on the Internet have very good things to say about it. I won’t really go there.

I would like to make it one of the best, if not the best, tools in any language for working with relational data, labeled data. So if you use it, you find that it’s not, send me an email. Either you can hack on it or let me know how it could be improved. Because this is really my main goal, is making it one of the best tools that’s available anywhere. And, of course, trying to build a platform that could be used to essentially replace R, building the fundamental building blocks for statistics, for data manipulation, stuff that people in other scientific computing environments wish they had in Python and haven’t had for a long time.

So some of the new things. Oh, wow. That’s amazing. There’s only two and a half minutes left.

I’ve heavily reworked the internals. There used to be two classes with different internal implementations for tabular data. I fixed that. There’s now just one. I think it brings the best of both worlds. There’s a cool internal data structure. It’s kind of a prototype, but I’d like to do more with it. Getting more eyes on it would be fantastic. Handling of missing data and non-floating-point dtypes is better, and I’m working currently on essentially laying the groundwork for an N-dimensional structure. Usually in the past, pandas only handled three dimensions and less, which is really all you need for finance and econometrics, but people keep asking for greater than three, so I’m giving in.

Here’s another thing that I gave into recently, adding fancier indexing, something I stonewalled on for years, it feels like, and I finally added. Integers, labels, you can slice with labels. This is sort of inspired by larry and datarray and all that, so having that all there. You can also set, which is cool, so you can do really fancy getting and setting on pandas objects.

I worked a lot on robustifying the I/O, so you can read CSV files really easily, read tabular structures very easily. I really retooled the HDF5 PyTables-based storage, which was kind of a prototype. Somebody emailed me and said, how do you store pandas objects in HDF5, and so I hacked out something in an afternoon, but it was just a prototype, so I went through and actually built something real.

Pivoting and reshaping data has gotten a lot better. So if you ever have data that looks like this stored in a SQL table, it’s indexed based on a couple of columns, and you want to reshape that into something a little more useful, you can pivot it with the pivot function and just specify here the bar and the foo column, and so you get reshaped data, which you can then do computations on. But you can also, you know, forget about the syntax, this is the panel data structure, but you can also slice along the other axes, so if you wanted everything labeled one in the bar in the foo column, then you can get all that data labeled like that. I had an extra slide there.

Some other things that are exciting: sparse data structures, mostly NA, not mostly zero; time zone support; generic moving window functions. I’m working a lot on enhancing group by and frequency conversions, and we’re going to hack on it in statsmodels, getting more integration there. So thanks a lot, and yeah, let me know if you find it useful.