Keynote: My Data Journey with Python | SciPy 2015 | Wes McKinney

Event SciPy 2015
Location Austin, TX
Date July 9, 2015

This transcript and summary were AI-generated and may contain errors.

Summary

This SciPy keynote traces my path from mathematics graduate to pandas creator.

From Finance to Frustration

In 2007 at AQR Capital Management, a hedge fund then managing $40 billion, I ran into the realities of day-to-day data analysis. Projects that could be discussed in a half-hour meeting took a great deal of work to execute, most of it data munging in Excel and SQL. This gave me “an allergy to anything that harms productivity.”

The Python ecosystem in 2008 was different. NumPy, SciPy, and Matplotlib existed but were “bleeding edge stuff” focused on replacing MATLAB in academic research. Critical gaps existed for business analytics: no robust statistical tools, poor missing data handling, no SQL-heavy workflow support. “All of the things that were most relevant at a core…were not core concerns of the community.”

The Proto-pandas Moment

I discovered Jonathan Taylor’s experimental port of R’s statistical models to Python. Though unreleased, it provided the statistical foundation I needed. But to use it effectively, I required better data structures for time series and relational operations—what became pandas, initially “the AQR time series library.”

At AQR, introducing open source tools was risky. I described it as being “short a put option”—if the software failed, blaming open source developers wouldn’t save my job.

The Pivotal Year

After pandas was open sourced in 2009, it remained niche until two events in 2011. Enthought convened developers to address fragmentation in Python’s statistical tools. I argued for integrated solutions rather than “a federation of loosely connected components.” Consulting work revealed data analysis problems weren’t unique to AQR.

Recognizing that “someone has to be the chicken” in the chicken-and-egg problem of tool maturity, I took leave from graduate school and spent my finance savings to work full-time on pandas for a year, willing to “go broke over it.” Working with companies like AppNexus, I came home with lists of missing functionality, then coded from 6 PM until 2 AM.

Community

John Hunter encouraged me to “forge your own path” when others pressured me to work on NumPy instead. The IPython team and the statsmodels developers provided collaboration and support.

A train conversation from Seattle to Portland about Python and the IPython notebook turned out to be with Titus Brown, later instrumental in securing IPython’s first funding.

Looking Forward

Now at Cloudera, I focus on scaling Python for big data while fostering collaboration across languages. I argue that “R versus Python versus Julia…language wars” are counterproductive, advocating shared toolkits for common problems.

My technical focus: LLVM and JIT compilation as the future of data processing. I am “kind of hoping for a no-JVM future,” though “LLVM and JVM living in perfect harmony would be really exciting.”

Key Quotes

“This whole formative experience kind of imbued me with this sort of allergy to anything that harms productivity, or that prevents you from expressing your ideas in a concise way.”

“You’re short a put option. So if you use a piece of software, and it ends up causing bugs that are not your fault… blaming the open source developers is not a valid excuse.”

“All of the things that were most relevant at a core, like the things that were going to solve the problems I had, were not core concerns of the community.”

“Someone has to be the chicken, and it might as well be me.”

“I was willing to go broke over it. Like, it was that important to me.”

“I would come home from a day at AppNexus with this long list of, these are all the things that need to get built in pandas. And then it would be like 5 or 6 PM. And then I would work from 6 PM until 1 or 2 in the morning.”

“Forge your own path. If you need to do your own thing, don’t worry about it. It’ll all sort itself out, and as long as the software that you’re building is useful.”

“The R versus Python versus Julia, like the data science language wars is very counterproductive and I think has impeded progress in certain ways.”

“I’m kind of hoping for a no-JVM future. I don’t know if we’ll ever see it, but LLVM and JVM living in perfect harmony would be really exciting.”


Transcript

Thanks Eric for the introduction and thanks to Enthought and all the organisers and sponsors

of the conference.

This is my fifth SciPy, I missed one in the middle.

My first was in 2010 and I have been to all but one since then, so it’s great to see the

community grow.

There are a lot more Python programmers and also folks tackling a lot more diverse use

cases than we were five, six years ago, so that’s really exciting to see how the language

and the community and the software has grown to be useful in a lot more areas of work.

Thanks also for getting up, I’m on Pacific time so this is early for me, so at this time

yesterday I was sleeping so I promise I’ll watch Chris Wiggins’s keynote, I think the

video is probably already up but you know how that goes.

So this is a little bit different than my normal talks which are more like focused on

pandas and specific details of pandas versus the rest of the data tools, other programming

languages.

I work at Cloudera now so my life is more about big data than it used to be, but this

is a bit more of a retrospective on my own experiences and maybe some thoughts on where

we’ve been, where we’re going and we can talk more about that, maybe have some questions

at the end.

So I’ve been up to, well, a lot of things, I guess these things just kind of happen as

time passes that you move from one thing to another, at least I did.

So I’ve been involved in a number of different Python projects, I’ve worked in a number of

different places, studied a couple of different places, so we’ll talk some more about those

things.

This is really covering 2007 to the present, things that I was up to, talk about how things

have gone well, which is really a testament to the community and talk a little bit in

brief about what I’m personally focused on right now and what I see as some of the opportunities

for the community to continue to grow and flourish.

So whenever you come to one of these conferences, it is useful to look around at the people

at the conference and say, well, what brought you here, how did you end up here?

And I know I often ask myself the same question, the odd sequence of events, kind of the butterfly

effect, if you could trace it back to how could I end up on the stage at a scientific

Python conference back when I knew very little about Python eight years ago.

I think the common thread that brings us all here is Python is helping us solve our problems.

Some folks here are core developers on many of the projects that you use every day, you

know, the pandas core team, a lot of them are here, Jeff Reback, the project leader,

Phillip Cloud, Stephan Hoyer, who is building not exactly an offshoot project of pandas

but a reimagining of pandas for climate data and sort of a different set of use cases,

but definitely sort of stemming off of the experience offered by pandas.

There’s a lot of the NumPy core team is here, people who built NumPy, people who built SciPy,

so it’s really exciting to bring together people from industry who are using Python,

learning more about Python, how to use Python more to solve their problems, along with the

people who are building the software.

It’s incredibly important that we bring those two groups of people together so that if you’re

just using the Python data stack, the scientific Python stack on a day-to-day basis, that you

can tell the core developers about, you know, what’s working well, what’s not working well,

what if you ran into a problem where you ended up down a rabbit hole and had to spend three

months building some custom solution to fit your problem, that can often speak to a limitation

in the tool set and telling those stories to the folks who are at least part or full-time

working on the open source software is incredibly useful and none of the software that we have

here would exist if not for those use cases and those real world applications.

So before I started programming in Python in 2007, 2008, I really wasn’t much of a programmer.

I got a math degree and I didn’t do a lot of programming in college and so people talk

to me now and they’re like, Wes, what happened, you were writing proofs in 2006 and now you’re

mostly writing Python code.

It was mostly, I think in retrospect, a matter of exposure that I’d never really seen Python

programming, didn’t know any Python, didn’t really know any Python programmers.

The one anecdote that I had about Python is I was taking the intro algorithms course

at MIT and we had a dynamic programming problem on one of the problem sets and because I’d

taken a Java course and so I wrote the solution in Java and it was like a couple hundred lines

of Java at the end of the day, it wasn’t a very complex dynamic programming problem and

a friend of mine said to me, she said, well, I’m going to write the solution in Python

and I bet you it doesn’t exceed 30 lines and I just said that’s crazy, clearly there’s

no way that the solution to the problem can be that short and one of the TAs published

a solution to the problem in Python and I said, wow, this is crazy, it’s so short.

This was in 2005, I think and it just didn’t really click with me at that point and I continued

onward writing proofs and getting a math degree.

I also had no exposure to any data analysis or analytics really of any kind, so I’d never

written a SQL query, I didn’t know what R was, I actually never heard of R until I was

working at AQR.

I was kind of within this very sheltered bubble of technology and science when I was at MIT,

so let’s just say that joining industry was a bit of a rude awakening.

I got a job at AQR, which stands for Applied Quantitative Research, so if you follow the

hedge fund industry, AQR was started by Cliff Asness, who was one of Eugene Fama’s grad

students at the University of Chicago, so Eugene Fama, I don’t know that he’s won the

Nobel Prize yet, but he’s always on the short list for the Economics Nobel Prize for the

Efficient Market Hypothesis, but I think that the Efficient Market Hypothesis is a bit on

the down these days, since a lot of, I think it’s more for political reasons than anything,

but they were at Goldman Sachs, they were running the hedge fund operation at Goldman,

they spun out in 1998, started AQR, eight or nine years later I got a job there because

there were a lot of math folks who had gotten jobs there and I was looking for applications

of math and so I got a job at a quant fund.

But it was an interesting environment, while there was a lot of math and a lot of research

involved with the asset management process, the whole company really at its core ran on

SQL and Excel, and I’d hardly used Excel at that point, so that was also a bit of a trial.

The systems that built the models and ran all of the, computed the buy sell orders were

all written in C++ and Java, there was development in all of the standard compiled programming

languages that we all know.

There were a number of folks who’d come from economics departments who had PhDs, and in

PhD, economics and finance departments, even today, are still very much about MATLAB and

R, and depending on the department, it’ll either be more MATLAB or more R. I think that’s

beginning to change and you’re seeing more Python, but at least at that time, the message

was if you’re going to do some research not in C++, then you should use MATLAB.

I had a bunch of projects my first year that involved some of them were more analytics

that were more about summarizing data that was found in, you know, all of our data was

in a Microsoft SQL database.

So you would pull data out of the, you know, your workflow would either be do all of your

data analysis in the SQL database, which you would end up with, you know, 100, 300 line

SQL queries, or you would pull some of the data out, you would write a simpler SQL query,

you would pull the data out and then analyze it, usually in Excel.

But if you were using MATLAB, then you could, you know, analyze the data in MATLAB.

The group that I was in had someone with R experience, and so that

team, the credit derivatives team, was doing a lot of R programming.

And so I was introduced at that time to this very strange programming language R, which

at that time was quite a lot different than it is now.

Kind of surrounding all this was huge amounts of Excel.

So you know, if you didn’t sort of mechanically end up with all of the Excel shortcuts

in your hands, you know, after a year or so, you just weren’t getting very much done, because

most of your life was moving data around in Excel, you know, copy and pasting special

values and all that sort of thing.

But the thing that really jumped out at me is that I felt like I was spending

a lot of my time dealing with just data munging, moving data around, cleaning data, normalizing

it, you know, lining up data in spreadsheets.

And, you know, these were data analysis projects that could be discussed, you know,

in a half hour meeting; they weren’t really very complex, and the details

of you know, what data is coming from where and what do we need to do with it?

And what are the deliverables for the project?

They were very easy to talk about, but actually getting them done was a whole lot of work.

And so that was, you know, I think this whole, you know, formative experience kind of imbued

me with this sort of allergy to anything that harms productivity, or that

prevents you from expressing your ideas in a concise way, so that you can, you know,

hopefully your code will keep up with your thought process about the problem as much

as possible.

And that rarely occurs, but certainly it should be better than 95/5.

I mean, maybe 80/20 would be a better situation.

And so I had gotten exposed to Python, one of my colleagues had written a couple

scripts for, you know, odd jobs, you know, nothing really, no scientific

Python, nothing heavy duty, but I had gotten introduced to the language

and it was very alluring, because it’s like, wow, it’s like readable pseudocode, you know,

the same thing that I’m sure brought a lot of people in this room to Python in the first

place.

And I discovered at the beginning of 2008, like, wow, there’s this whole numeric and

scientific computing community who are, you know, solving research problems and doing

fast numerics in this interpreted programming language that’s really easy to write.

But it was, it was a very different time.

So you know, you look at today, we have very mature tool sets, you know, projects that

have been around for, you know, more than, more than 10 years, NumPy is going to have

its 10th birthday next year, you know, SciPy is getting on, you know, might be about 15

years old or so at this point, you know, Matplotlib’s been around since 2002, I think, IPython around

the same time.

You know, all these projects have, you know, they’ve been around for a long time.

But if you go back in time to 2008, you know, this was

bleeding edge stuff.

And you know, you have to remember, I was in a financial firm, where, you know, as we

used to say, you know, you’re short a put option.

So if you use a piece of software, and it ends up like causing bugs that are not your

fault, it’s like, oh, well, there were bugs in this open source software that I was using,

you know, it was pretty clear, like, it wasn’t, you know, explicitly said, you’re like, I’m

pretty sure I’m gonna get fired.

Because you know, blaming the open source developers is like, not a

valid excuse.

And that was part of what drove the use of, you know, tools that came from Microsoft

or that came from, you know, MathWorks, you know, with MATLAB, is that there was this,

you know, and you can understand, like, we were managing, you know, billions of

dollars, you know, at that point, you know, AQR had $40 billion under management.

Nowadays, it has well over $100 billion under management.

So there’s this, you can understand the risk aversion that if you lose your clients’ money,

like, you know, your head’s on the chopping block, quite literally.

And so you can imagine kind of the apprehension that was experienced when I started telling

my colleagues, like, hey, let’s start doing all of our work

in Python, it will make us all a lot more productive.

I think people worry first about, you know, what’s the downside risk here.

The community was also a lot different at that point.

And, you know, this is not really true anymore.

But at that point, the battle was quite a lot different.

The battle was being waged in scientific research: you were doing your PhD, and you’re

telling your advisor, I’m going to do this work with, you know, Python and NumPy and

SciPy instead of MATLAB.

And so a lot of the work in the community was about, and I think, you know, maybe part

of it was that a lot of the core developers were PhD students procrastinating on finishing

their PhDs.

That’s, you know, that’s, you know, another story.

But you know, but that was really where the focus was.

And when I, you know, came upon the community, you know, I saw that and I’m like, oh, this is

amazing, like, clearly, because I had MATLAB users at AQR.

So I was like, well, it’d be great.

I didn’t want to program in MATLAB.

And so, you know, the prospect of, you know, having an alternative to MATLAB, at least

was really exciting.

But in my case, you know, very different, very different use cases.

And so pretty much all of the things that were most relevant at a core, like the

things that were going to solve the problems I had, were not core concerns of the community.

I hadn’t written any SQL, you know, before I got to AQR, suddenly I’m, you know, my whole

world is writing SQL queries, and I’m spending 40% of my day in SQL Server Management Studio.

But you know, if you look at the proceedings of SciPy, you know, the conference talks and

the papers, there really weren’t a lot of relational databases, not a lot of

SQL was getting written.

And that kind of trickled down to all of the other related problems around,

you know, how do you deal with nulls and missing data?

As for using Python for statistical use cases, if you looked at statistics departments at that

time, they were fighting to use R instead of

commercial statistics packages, you know, really SAS and Stata, using R as an open source alternative

to those tools.

And you had this chicken and egg problem where there were no statistical tools, or very few

statistical tools in Python, and as a result, not very many people

were using it for statistical applications.

And then kind of, you know, your downstream use cases, making graphics that are statistical

in nature, you know, machine learning.

So a lot of the things that we kind of take for granted now didn’t really exist.

And the work that I was doing, I fell into this last category of more, you know, unfortunately

analytics has this, you know, buzzwordy connotation and means all sorts of things, but, you know,

analytics in a lot of business settings means things that you can do with SQL, but you would

like to write Python code instead of SQL.

But that was not really a thing that you could do well at that time.
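
To make the contrast concrete, here is a small sketch (not code from the talk) of the same aggregation done once as a SQL query and once in plain Python. It uses only the standard library's sqlite3 module; the trades table, its columns, and its values are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Made-up trades data, used only for illustration.
rows = [("AAA", 100, 10.0), ("BBB", 50, 20.0), ("AAA", 200, 11.0)]

# The SQL way: load the rows into a database and aggregate there.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (ticker TEXT, qty INTEGER, price REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", rows)
sql_result = dict(conn.execute(
    "SELECT ticker, SUM(qty * price) FROM trades GROUP BY ticker"
))
conn.close()

# The same group-by-and-sum written directly in Python.
py_result = defaultdict(float)
for ticker, qty, price in rows:
    py_result[ticker] += qty * price

print(sql_result)
print(dict(py_result))
```

Both routes produce the same per-ticker totals. The hand-written loop works, but every new question needs a new loop, which hints at why a higher-level tool for this kind of work was appealing.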

So and, you know, also taking in context that I was, you know, let’s see, beginning of 2008,

so I was almost 23 years old and incredibly stubborn, and I guess I didn’t care that much

about getting fired.

I’m like, well, if I get fired over this, it will make a really good story.

And I was doing an R project, and I’d seen that a professor named Jonathan Taylor at

Stanford had ported a select subset of an important R package to Python, and that’s

the MASS package.

And so in particular, I needed an algorithm called iteratively reweighted least

squares, say that three times fast on a podium, also known as robust linear model, for doing

the project that I was doing, and the thing about scipy.stats.models is that it wasn’t

even in mainline SciPy, so I found it in the SVN repo, I think it was in a branch, it might

have been in trunk, maybe it was in a branch but it hadn’t been shipped in SciPy so this

was like seriously bleeding edge stuff but I said it implements my model, I fitted the

model and it matches the R results, I can maybe use this, I just won’t tell anyone that

it’s bleeding edge stuff that hasn’t been released.
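
For readers unfamiliar with the algorithm named here, the following is a minimal sketch of iteratively reweighted least squares for a robust straight-line fit. It is not the scipy.stats.models or R code discussed in the talk. The Huber weighting, the tuning constant 1.345, the MAD-based scale estimate, the function names, and the sample data are all assumptions chosen for illustration.

```python
# Illustrative sketch only: robust line fit via iteratively reweighted
# least squares (IRLS) with Huber weights. Function names and constants
# are assumptions, not taken from the talk or from any library.
from statistics import median


def weighted_line_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def irls_huber(x, y, c=1.345, iterations=50):
    """Refit repeatedly, down-weighting points with large residuals."""
    w = [1.0] * len(x)
    for _ in range(iterations):
        a, b = weighted_line_fit(x, y, w)
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale from the median absolute residual; the fallback
        # avoids a zero scale when most points are fitted exactly.
        scale = median(abs(r) for r in resid) / 0.6745 or 1e-12
        # Huber weights: 1 inside the threshold, shrinking outside it.
        w = [1.0 if abs(r) <= c * scale else c * scale / abs(r) for r in resid]
    return a, b


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 60.0]  # last point is an outlier
ols_a, ols_b = weighted_line_fit(x, y, [1.0] * len(x))
rob_a, rob_b = irls_huber(x, y)
print(f"OLS slope: {ols_b:.2f}, robust slope: {rob_b:.2f}")
```

Seven of the eight points lie on the line y = 1 + 2x, so the ordinary least squares slope is pulled upward by the single outlier, while the reweighted fit should land much closer to 2.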

And so that was, if you go back to what was the hook that gave me a reason to really try

this out seriously, it was really that.

And the funny thing is Jonathan Taylor is a statistics professor and he was embedded

in Stanford which was very much MATLAB and R at that time and still is in a lot of ways.

So I think he was drawn to Python as were we all and so if not for that, I don’t, it’s

like the butterfly, I don’t know how things might be different, I think I might still

be programming in R although quite a lot less happy.

So in order to really use that library well, I needed some kind of data structure or data

wrangling toolkit, I had no idea what was going to be the scope of the project, I had

just the problem that was sitting in front of me and I had a whole big bag of frustrations

from using R and working with time series data, dealing with data from many different

tables, doing all that munging, joining, aligning, normalisation, missing value handling, forward

filling, all that stuff was very difficult to do in R at that time, it’s gotten easier

in the meantime, but those were the problems I had, it was financial data and so I tried

to port my R code to Python.
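
As a rough picture of two of the operations listed here, aligning series that have different dates and forward filling the gaps, the following plain-Python sketch may help. It is not pandas code and not the AQR library; the helper names `align` and `forward_fill` and the sample data are hypothetical.

```python
# Illustrative sketch in plain Python (not pandas code): align two
# date-keyed series on the union of their dates, then forward-fill gaps.
# The series names and values below are made up for the example.

def align(left, right):
    """Return (dates, left_values, right_values) on the union of dates.

    A date missing from one side gets None, i.e. a missing value.
    """
    dates = sorted(set(left) | set(right))
    return dates, [left.get(d) for d in dates], [right.get(d) for d in dates]


def forward_fill(values):
    """Replace each None with the most recent non-missing value, if any."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


prices = {"2008-01-02": 10.0, "2008-01-03": 10.5, "2008-01-07": 11.0}
ratings = {"2008-01-02": "A", "2008-01-04": "B"}

dates, price_col, rating_col = align(prices, ratings)
for row in zip(dates, forward_fill(price_col), forward_fill(rating_col)):
    print(row)
```

Writing helpers like these by hand for every project is the kind of repetitive munging described above; a data structure that carries its own labels can do the alignment automatically.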

So about a month later, the proto-pandas, thousand lines of code, as Fernando Perez

would say, just an afternoon hack, had turned into a useful tool for the particular problems

I was working on and the interesting thing that happened at AQR is we had a case where

we needed to implement a statistical model in production and doing it in C++ was going

to take a really long time, because C++, and so we found with a little bit of research

that we could actually embed the Python interpreter and enable users to write analytics or statistical

computations in Python and extend the legacy system with Python instead of with C++, it’s

funny to think back on that project and how hackish and horrible it was in some ways,

we just had to get something running as a proof of concept, but it was this amazing

thing where it was like a Trojan horse that we quite literally opened up inside of this

legacy system and then all of a sudden people realised that they can write Python instead

of C++, well you can imagine what happened, everyone wants to write a lot of Python.

Getting Python adopted in a $40 billion asset management business overnight is certainly

not a thing that happens, so a lot of the work that occurred over the next year was

building consensus around can we use Python in a serious way, can we trust it, can we

trust the open source community, who are these people that are building this software, those

are all the discussions that we had and at a certain point and there was a lot of skunk

works porting existing systems to Python as proofs of concept, getting folks to use

proto-pandas, it wasn’t called pandas at this point, it had no name actually, it was just

the AQR time series library.

So there was a lot of evangelism around like don’t do Ruby, don’t do MATLAB, consider Python,

look at this cool time series tool that I’m building.

So long story short, by the beginning of 2009 we decided to make a serious commitment

to Python and see how it goes.

I’m sure there was always an escape hatch, we’ll escape to Java if Python doesn’t work

out but it did and it was certainly a lot of work and the things that made it easier

and made it possible, one of the big ones is that we didn’t have to re-implement or

implement any core system components, so we were able to reuse essentially pick and choose

components of the scientific Python ecosystem, things like PyTables, HDF5, NumPy, SciPy,

Matplotlib, IPython, those formed the foundation of an interactive research environment and

the thing that was missing was the domain specific functionality, the data wrangling,

the time series support, the things that you needed to build quantitative models for trading.

The other big thing was the interoperability and of course that’s a big reason why we’re

all here is because it would have been a fool’s errand in the late 90s to go and re-implement

all of the legacy code bases that were written in C, C++ and Fortran, you can just bring

them with you and script them from Python, build a nice wrapper layer and the fact that

we were both able to embed Python in a legacy system, that was kind of the step one and

also take legacy components as we would sort of cut away parts that had been re-implemented

in Python, we could wrap those C++ components in Python C extensions so that systems that

just didn’t make sense to go and re-implement or there was no bandwidth to port those, either

no bandwidth or it might have been a legacy system that was maybe we’ll decommission this

in a few years, we could just wrap those in Python just, I mean it was a lot of work but

it was certainly easier than the alternative, having to more or less throw out a trusted

system that had been developed over the course of ten years.

The other big thing was the fact that the user interface was so much more pleasant and

palatable for the users and I think that when new people come to the Python ecosystem that’s

often one of the big revelations, you learn the libraries, you learn IPython, the basic

tools, IPython notebook didn’t exist back then but I think now when people discover

the IPython notebook they are like wow how did I get on in life without this, it’s like

life was so horrible before, it’s like you realise you have been sort of walking uphill

both ways and eating dog food and all sorts of things and now you’ve got a motorbike and you’ve

got filet mignon and life is just really great, you can spend more time reading the internet.

And so I think from that point onward, the approach to problems would be you encounter

a new problem and be like well can we use Python to solve the problem and I think that

our community is very much the same way and you hope in your companies and any software

project that you should have a really good reason to not be using Python and it’s great

that this is now considered a popular belief that Python is the first thing that you reach

for, try to solve problems with it first and have a really good reason to be programming

in Java or C++.

So after a period of time and a lot of convincing, so I think Eric alluded to that, financial

institutions don’t like to open source code as

a matter of principle, as any IP is considered to be precious.

And I made the argument that open sourcing would

be a really positive thing for the company,

for recruiting, for getting community development

and involvement in the project.

It’s interesting how that played out.

And if I have time, maybe I’ll talk a little bit about that.

So I looked, I downloaded, it’s on PyPI,

and you can download pandas 0.1.

And it’s really small.

You think about pandas 0.16.2 now,

which weighs in at a little under 200,000 lines of code.

And the first version of pandas was a really small project.

There was a little bit of Cython for accelerating

certain algorithms.

But it’s a really small library.

And it certainly wasn’t the package that it is today.

And it was only useful for a certain small set of use cases.

And I was busy with other things.

And so at the time when

we open sourced pandas,

developing pandas was not our main priority.

It was suitable for our needs.

It had pretty much everything that we needed.

And so spending a lot of time moving pandas forward

wasn’t really a priority.

And in 2010 and 2011, there were just

starting to develop the first kind of whispers

of a statistical community in Python.

So Skipper Seabold, who’s here, one of the core developers

of the statsmodels project, the first major version

of statsmodels came out of his, he

was a Google Summer of Code student in 2011.

And that was the first version of statsmodels.

But you go back to 2010, and there was no consensus

about if you need to do data wrangling or analytics,

what should you use?

And pandas was this little known thing

that sat there that wasn’t, it was useful, but not

as useful as it is today.

I went to grad school.

I consulted with AQR while I was in grad school,

mostly to fix bugs and add a few new features to pandas.

But the real story of pandas didn’t really

get started until the summer of 2011.

And there were two discrete events

that occurred that led me to get really worked up,

for the lack of a better term.

So the first is that I think that among the literati

and the core scientific Python developers,

the perception that there was some lack of consensus

around data structures and data toolkits

for statistical computing in Python,

that was acknowledged that it was an issue

and that we should talk about it and build some consensus

and figure out a way forward.

Enthought flew us all in.

We met for a couple of days to talk about this problem

and figure out how we can solve the architectural issues,

low-level issues with NumPy that were preventing

NumPy from being used more for statistical computing.

What’s up with the data structures?

And my big argument at this time was

that rather than having this federation of loosely connected

components, that we needed to build something that

was more integrated and delivered a better user

experience to the end statistical user who’s not

a computer scientist that just wants

to be able to import a library and read a CSV file

and compute some statistics about it.

And you have to remember that at that point in time,

even simple things like reading CSV files were still very hard.
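
For a sense of what “import a library and read a CSV file and compute some statistics” involves without a dedicated data library, here is a sketch using only the Python standard library. It is not code from the talk, and the column names and values are made up for illustration.

```python
import csv
import io
import statistics

# Stand-in for a file on disk; the column names and values are made up.
raw = io.StringIO("ticker,ret\nAAA,0.01\nBBB,\nCCC,0.03\nDDD,-0.02\n")

returns = []
for row in csv.DictReader(raw):
    if row["ret"] != "":  # missing values have to be skipped by hand
        returns.append(float(row["ret"]))

summary = {
    "count": len(returns),
    "mean": statistics.mean(returns),
    "stdev": statistics.stdev(returns),
}
print(summary)
```

Type conversion, missing values, and the choice of summary statistics are all handled manually here, which is the sort of friction for a statistical user that the argument for an integrated tool was about.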

And so I’d gotten more interested in the problem.

I was a grad student, didn’t have a lot of time

for open source development, but that was a big thing.

And then I went and spent a couple

of weeks working with a hedge fund that was not

AQR in a consulting engagement.

And I spent a couple of weeks with them.

And I realized that the problems that we had at AQR

were not isolated to AQR.

I think at that time, I had invented this fantasy

that every company had built better software than we had.

And it turned out that it was quite the opposite,

that a lot of companies were still

kind of back in the dark ages in a lot of ways.

And that wasn’t necessarily true of the particular company

I was working with, but I think I

heard stories about the general financial industry

and hearing about other companies.

And it felt like it was the time for Python data tools,

and we need to do something about this.

But the issue was maturity, and again, the chicken and egg

problem.

And so it was like the end of May 2011,

or I guess I decided that someone has to be the chicken,

and it might as well be me.

And I think when you’re building community

and you’re building software, the use cases

and the real world applications are the most important thing.

And you can’t build a useful piece of software

unless you’re being faced at least in a nearly direct way,

like you have a consulting project

or you’re working in some direct way

with somebody who’s suffering from a problem

so that you can ship a piece of software

and then say, well, did that solve your problem?

Did that solve your problem?

And if it didn’t, then you have this sort

of virtuous cycle of making the software better

until you’re really completely solving the use case.

During the latter half of 2011, I

worked with AppNexus in New York.

I don’t even know if I’m allowed to say that

in these consulting agreements.

They’re like, you may not talk about AppNexus, big Python

shop.

And that was a very enlightening time for me

because it was a set of analytics and data challenges

that was pretty far afield of what I’d encountered at AQR.

And I very quickly realized, wow,

we need to make pandas more or less a SQL replacement as much

as possible.

And there was just a heck of a lot of stuff

that was missing from the library.
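
As a rough sketch of what "a SQL replacement" can mean in practice (an illustration with current pandas, not code from the talk or from AppNexus), a SQL `GROUP BY` aggregation maps onto `groupby`. The table and its column names below are hypothetical.

```python
import pandas as pd

# Hypothetical event log; the column names are illustrative only.
events = pd.DataFrame({
    "campaign": ["a", "a", "b", "b", "b"],
    "clicks": [1, 0, 1, 1, 0],
    "cost": [0.5, 0.25, 1.0, 0.75, 0.25],
})

# Roughly equivalent to:
#   SELECT campaign, SUM(clicks) AS clicks, SUM(cost) AS cost
#   FROM events GROUP BY campaign
totals = events.groupby("campaign", as_index=False)[["clicks", "cost"]].sum()

print(totals)
```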

And I would come home from a day at AppNexus

with this long list of, these are

all the things that need to get built in pandas.

And then it would be like 5 or 6 PM.

And then I would work from 6 PM until 1 or 2 in the morning.

And then I would wake up and do it all over again.

And so the next year and a half pretty much worked like that.

Making pandas better, evangelizing,

building consensus in the community,

building functionality that you really

couldn’t find anywhere else that would also attract

users from other communities.

So the time series capabilities in pandas

still are among the very best.

And that was a draw for R folks,

for really any programming language

to say, wow, this is this amazing time series

tool that also can be used for general relational data

operations.

Like, we don’t have to put the data in a database

to do an outer join.
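
To make the outer-join remark concrete, here is a minimal sketch using current pandas (illustrative only, not code from the talk). The two small tables and the `key` column are made up; `pd.merge` with `how="outer"` keeps rows from both sides and fills the gaps with missing values.

```python
import pandas as pd

# Two hypothetical in-memory tables sharing a "key" column.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# Outer join: keep every key that appears on either side.
# Keys found on only one side get NaN in the other side's columns.
joined = pd.merge(left, right, on="key", how="outer")

print(joined)
```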

So something I will mention is that part of the reason

that this was all possible was that I found myself in,

you know, it was May 2011.

And I got really fired up.

And I went to my advisor at Duke.

And I said, I have to take leave so I can work on software.

And so I took a year off from grad school.

And I moved back to New York.

And the reason that I made, in retrospect,

it seemed it was a little crazy.

Part of the reason I was able to do all this

was that I had been very frugal when I worked in finance.

And I’d saved up just about enough money

where I was like, well, I can work

on open source for about a year.

But then I’m going to have to figure my life out

on how to support this in a sustainable way.

But I was willing to go broke over it.

Like, it was that important to me.

And I was like, well, I saved this money.

And like, what better, you know, I

might as well spend it on this because it

feels like the thing that is needed right now.

I got a book deal with O’Reilly, November 2011.

I hope to talk a little bit more about the back story.

But there’s other things I wanted to talk about.

Needless to say, it’s been very successful.

I think a lot of you here probably own copies of the book,

and I certainly never expected it to be so successful.

Fernando, Brian, and John had been putting together

different book drafts in the past,

and we talked about working together on a book,

but they were very busy, I had the time,

and I wanted to make a more pandas-focused book,

and I honestly used writing the book as a way to,

it was like the carrot to make me finish parts of pandas.

So I was like, well, here’s an empty section of the book.

I wanna put something there,

so I’d better finish the software.

You know, with any project,

it’s not just about sitting down

and hacking out a bunch of code.

You really have to have clarity

about what it is that you’re building,

what are the problems that you’re solving,

and being faced with those real use cases

is the thing that will help you kind of get there

the most quickly.

And I had a lot of support along the way.

I wasn’t working in a bubble.

There were constant conversations

with folks in the community on the mailing lists,

at SciPy, it was a lot of discussion and collaboration,

and honestly, the folks on this list,

and this is not the complete list,

but the folks here at Enthought,

Eric and Travis and Peter,

and the IPython folks and the statsmodels folks,

they were all cheering me on,

and so without their support

and without their use cases and their support,

it would have been really hard for me to do all this,

and certainly being that lone wolf

that’s kind of in the cave writing the code,

the real world’s not like that.

You have to engage with others

and learn about the problems that other people are facing.

So just to relay a little anecdote,

just to show you how small the world is,

and how, I guess, serendipity does, in fact, occur,

I was on a train from Seattle to Portland in November 2011,

and I had just been in Seattle

and had spent at the supercomputing conference

and had spent a few days hanging out with

and talking about data with Peter Wang,

and the guy sitting next to me sees that I’m in Emacs

and says, well, are you a programmer?

I said, well, yes, I’m a Python programmer,

and he says, oh, I do a little bit of Python, too,

and I said, oh, well, have you seen the IPython notebook,

which at this point was very beta software.

So it turns out that I was sitting next to Titus Brown,

who, if you don’t know, is a huge Pythonista

and has been at SciPy many times,

is in the PSF, and a huge proponent of Python

and uses Python in his research lab for,

I believe he’s at UC Irvine now,

and as fate would have it,

I’m informed that he was very helpful

in helping the IPython team get their first

dedicated funding for IPython in 2012,

and Fernando claims that a part of that was sort of the,

that I think after that conversation with Titus,

he became a huge convert and proponent

of the IPython notebook.

So John Hunter, who’s unfortunately not with us,

was also a big part of helping me along the way.

Part of the reason why I’ll single out John Hunter

is he was, after Eric Jones,

John was one of the first people

from the Python community that I met,

and the reason was that in January 2010,

I wasn’t sure what I was doing with myself.

I was like, maybe I’ll go to grad school,

maybe I’ll get another job, I don’t really know.

I wanted to sort of move on to some new project,

and I saw that TradeLink in Chicago, where John worked,

was looking for Python developers.

I said, oh, a finance company that, you know,

has been using Python for years,

and so I flew to Chicago and I interviewed at TradeLink

and I spent quite a long time with John,

and he and I bonded right away,

and I think that maybe John saw me

as like a 17-year-younger sort of version of himself

in a certain way, and when you go back

to the origin story of Matplotlib,

John was very much kind of the lone wolf programmer,

so I’m going to build a plotting replacement

so that my research lab doesn’t have to use MATLAB anymore,

and I certainly identified with that,

and John was often a person that I would email

or get on the phone with whenever I sort of needed

to discuss community issues or, you know,

like, oh, this is really hard,

how to make the open-source lifestyle sustainable,

and one of the things that I faced in the Python community

was the pressure to not work on pandas

but to instead work on other components like NumPy

rather than build kind of pandas,

which was viewed at that time

to be kind of working around limitations in NumPy

to just work on NumPy and fix the kind of, you know,

limitations of NumPy rather than building

this sort of, you know, other project,

and so John, I think, was kind of the loudest voice.

It was like, well, forge your own path.

If you need to do your own thing,

like, don’t worry about it.

Like, it’ll all sort itself out,

and as long as the software that you’re building is useful,

and so it was good that I had somebody telling me that,

and, you know, couldn’t have been, you know,

a more empathetic person than John,

so for everyone who knows him.

I started a couple of business ventures.

During 2012, while I was writing my book,

we explored building a commercial Python financial toolkit.

It’s one of the things I realized during that time

is that selling Python software is hard,

and maybe as a result, I’ve developed a very strong feeling

that the software needs to be free

and it needs to be open source,

and so, you know, I haven’t been as interested

in building commercial scientific software, you know,

as a result, and that was a large reason

that Chang and I started DataPad and said,

well, we’re not building software for Python users,

but we’re gonna move upstack into the business user,

build a business intelligence tool that solves problems

that exist out there and still exist,

but use the Python stack as our core technology

for building the product.

So starting companies is hard.

I don’t need to tell many of you.

We were venture-backed, and we built a product,

built a company, and we had kind of, you know,

I knew the Cloudera founders,

and we’d been talking about the product

for a very long time, and selling to Cloudera customers

would have been one of our go-to-market strategies,

but the back end, like the systems engineering problems

that we were tackling were very similar

to the problems at Cloudera, and last summer,

they said, hey, why don’t we join forces

and solve these problems together,

and that is indeed what occurred.

So where I am now and what I’m doing,

Cloudera, I guess you can think of as an open-source company

for big data that builds and supports Hadoop

and projects that are related to Hadoop

and the Hadoop distributed file system,

and for me, kind of thinking about the arc of my life

and wanting to have a way to sustainably work

full-time on open-source, I really couldn’t ask

for a better situation and to be working

on data problems at very large scale,

so I’m happy to talk more offline with other folks

about my work there and Cloudera itself,

and if you’re looking for a job in big data,

well, I’m sure you can figure out how to contact me.

So it’s been seven years of Python development,

and the things that I’m really interested in right now

are much the same problems.

I’m still very much a Python person at my core,

but I’m very interested in fostering collaboration

amongst all of the data science programming languages.

I think the R versus Python versus Julia,

like the data science language wars, is very counterproductive

and I think has impeded progress in certain ways.

And I would like to see collaboration in…

I would like to see collaboration among those groups of people

because we are solving the same problems,

the same machine learning problems,

the same analytics, you know, data serialization,

you know, network database problems.

And if we could build toolkits that we could all use,

particularly for the same kinds of problems that pandas solves,

that would be, you know, a great benefit for the future.

As long as there’s a first-class Python interface,

that’s the most important thing.

And the other things related to that

is generally making Python, using Python for big data a lot better

because I work at a big data company,

so I’m working on that right now.

And, you know, I think the whole LLVM,

you know, JIT compiling, code generation stuff,

I believe that is the future of a lot of what we are doing

and, you know, I’m putting my chips in that pile.

So I’m kind of, you know, hoping for like a no-JVM future.

I don’t know if we’ll ever see it,

but, you know, I think, you know, LLVM and JVM

living in perfect harmony would be really exciting.

And so the other kind of broad theme of what I’m working on

is that, you know, as the data gets bigger,

invariably the storage and execution,

you know, how the data gets processed and how it gets stored

are going to be, you know, for really big data,

are going to be, you know, standalone systems

for storage and data processing.

And the question then becomes,

how can you build the best user interface?

And I think, and I think a lot of you should think,

that Python is, you know, the best

or one of the best user interfaces

for doing scientific computing, data processing,

you know, all these problems that we’re solving.

So making sure that we can continue to program in Python

and that, you know, Python is viewed

as one of the best environments for getting work done,

that’s, you know, that’s what I’m working on

and I think I will continue to for quite some time.

So I ran slightly long and I don’t think I have any time,

I don’t know if I have time for questions,

but thank you very much for having me

and looking forward to what the future holds.

APPLAUSE