PyCon SG 2013 Conference Keynote from Wes McKinney

Video

Event PyCon Singapore

Location Singapore

Date June 14, 2013

Watch Video Slides Video

This transcript and summary were AI-generated and may contain errors.

Summary

This 2013 talk offers a state-of-the-union assessment of Python’s data science ecosystem.

Scientific Python Maturation

I note how recently the Python data stack came together. NumPy reached its current form in 2005 when Travis Oliphant unified the array library landscape. The IPython notebook, which I call “really game-changing,” had emerged just two years prior.

Key libraries driving adoption: Scikit-learn brought machine learning, with its French and German developers being “a model for how open source should work.” Stats models provided R-like statistics, and PyCuda made high-performance computing accessible: “if you’ve ever written Cuda code in C and then gone to PyCuda, it’s just really a transformative process.”

Data Preparation

Data cleaning represents 80-90% of analysis time. “Making the data cleaning and data preparation tools easier to use and more productive really overall speeds up the analysis process a great deal.”

A key principle: “if your tools are hard to work with, you’re less creative.” Reducing friction enables faster iteration and more questions.

Stack Overflow Analysis

A live demo analyzed Stack Overflow data. Django declined from 15% of Python questions in 2010 to lower percentages by 2013, while Flask gained ground. pandas had grown to over 2% of Python questions. Emerging trends: Raspberry Pi, Python 3 adoption, and IPython itself.

The Web Challenge

“The big challenge for Python over the next four or five years is how to keep Python as an important and relevant language that people want to program in and that they can use to build systems in the way that they need to be built for the future.”

Obstacles: JavaScript isn’t designed for serious data processing, browsers handle only data subsets, and network latency complicates interactive applications. The solution: build bridges to web technologies, not abandon Python.

Python-Web Integration

The IPython notebook runs as a JavaScript application executing Python on the backend. RStudio’s architecture creates applications running locally or in the cloud seamlessly.

Emerging tools like Vincent (Vega.js integration) and Folium (Leaflet.js mapping) power web visualizations from pandas data.

JIT Compilation

Numba translates Python code into machine-code performance. Related: PyPy, Julia’s LLVM architecture, and Cloudera’s Impala using JIT for SQL processing.

Big Data and SQL

The “SQL everywhere” trend: Facebook’s Hive, Berkeley’s Shark, Amazon’s Redshift. Python has limited success in big data compared to “medium data,” but emerging interfaces to Spark looked promising.

Requirements for Data Tools

Built-in missing data handling, intuitive APIs that don’t impede thinking, and core operations for grouping, sorting, and filtering that are both fast and easy to use.

I urged users to document pain points: “The more that the open source development community understands problems that people are experiencing in the real world, that really helps everyone make good design decisions.”

Key Quotes

“If your tools are hard to work with, you’re less creative.”

“Making the data cleaning and data preparation tools easier to use and more productive really overall speeds up the analysis process a great deal.”

“The big challenge for Python over the next four or five years is how to keep Python as an important and relevant language that people want to program in and that they can use to build systems in the way that they need to be built for the future.”

“If you go on GitHub and look at the Scikit-learn project, it’s kind of a model for how open source should work.”

“Hopefully, we don’t have to give up on Python and just program in JavaScript, because that would make us all very sad.”

“I think in order for Python to really stay relevant, rather than saying, no, I’m not going to program in JavaScript, I think we have to build interfaces to JavaScript and to the web technology so that you can keep your serious work in Python.”

“The more that the open source development community understands problems that people are experiencing in the real world, that really helps everyone make good design decisions.”

“When I started using Python in 2007, people thought I was crazy. They were like, this is not a serious language, you can’t build production systems with this.”

Transcript

Thanks a lot. Thanks a lot, Calvin, for the introduction. Thanks for having me here. This

is my first time in Singapore, second time in Asia, but it’s been a long time since,

I think the last time I was in Asia was in 2005, so it’s been a very long time. I’m excited

to, already excited to come back. So, I guess this talk is a little bit high level. I will

have some examples and some things about pandas, but also I wanted to comment a bit generally

about the state of Python for working with data, and if you’re not very familiar with

some of the development that’s been going on the last five or six years, just to tell

you a bit about that, and at least to tell you from my point of view, what are the interesting

trends right now and where I see things going. So, I’ve been doing a lot of things the last

few years. Most recently, I just moved to San Francisco from New York. I’m working

on a new analytics company there, which hasn’t launched yet, and there will be more information

about that later this year. Went to MIT a long time ago, started the pandas project

not long after I got out of college, mainly because I was very excited about programming

in Python, but I found that there was kind of a missing piece, which was data manipulation

tools. You had great general purpose programming libraries, tools for building systems, unit

testing libraries, array computing with NumPy, very rich scientific Python ecosystem, but

the statistical data analysis tools and data manipulation, data cleaning tools, there really

wasn’t very much of that in Python, in large part because Python didn’t have a very big

statistics community, so I started pandas essentially in response to that. So, my book

came out in October, and the goal of the book was to have a nice introduction for people

who’ve never used Python for working with data before. So, it’s not a book about data

analysis methods so much as it is a book that shows you how to use the important Python

tools for working with data so that you can do data analysis. So, I wanted to give a nice

introduction to the main scientific Python tools, NumPy for array computing, IPython,

which is the programming environment we’ve all come, you know, I’ve been a big fan of,

Matplotlib for data visualization, and then pandas is about, you know, maybe more than

half the book, showing you how to use the library to solve, you know, all of the various

problems you may encounter. So, the last six years have been very interesting for

Python. Well, you know, you have to remember that NumPy, which is the array computing library

for Python, essentially what gives you, you know, matrices and linear algebra and, you

know, MATLAB-like functionality in Python, really only existed in its current form from

about 2005 onward. There were multiple array libraries starting in about 1995, and there

was a bit of a community fragmentation that was going on, and Travis Oliphant really wanted

to unite the Python community under a single array library and merge those two libraries

together, Numeric and NumArray, and created NumPy, and that was only in 2005. And the

NumPy project and SciPy, which is built on top of NumPy, has matured a lot in the last

five years, and that’s been very important for businesses who want to adopt Python and

want to feel like they’re building on top of a mature and stable tool, that the code

that they write today is going to be robust and not have weird crashes and problems running

in production, but that that code will also be maintainable and will still have a good

chance of working a few years down the road. One of the other really game-changing things

in the Python ecosystem in the last, really just last two years, has been the IPython

notebook. How many of you have used the IPython notebook? So I’ll show it to you in the talk,

but it’s been a really great tool for creating and sharing data analysis and really any kind

of Python work. It’s very easy to take a notebook that you’ve created on your computer and upload

it on the Internet and share it with people. It’s a really great way to collaborate on

Python work. Other libraries that have been really important for bringing more people

into the Python ecosystem outside of pandas have been things like Scikit-learn for machine

learning. So it used to be if you wanted to do machine learning, you had to use R or you

had to use Java or some other library. So a group of primarily, they seem to be mostly

French, but mostly French and German developers have built the Scikit-learn library and now

has a very large and very active community of developers. So if you go on GitHub and

look at the Scikit-learn project, it’s kind of a model for how open source should work.

The progress they’ve made is really incredible. I’ve been involved in stats models, which

is the statistics and econometrics library for Python. It gives you a lot of the kinds

of regression and statistics tools that you have in R, but in Python. There’s now, we

now even have formulas. So if you’re, it used to be pretty hard to write down complex regression

models, so now you can write formula strings that describe your linear model and there’s

a formula parser that will, that connects with stats models that you can use. Other

things, well the next talk up is about PyCuda, which has certainly brought a lot of people

to Python for doing high performance computing. Tools like PyCuda and PyOpenCL make, you know,

programming on GPUs for very high performance computing applications a lot more accessible.

So if you’ve ever written Cuda code in C and gone through that, that painful process and

then gone to PyCuda, it’s, you know, it’s just really a transformative process. So I

wrote some Cuda C before I used PyCuda and I was really upset about all the time I spent

writing C code before I used PyCuda, so really great. Another missing piece of the puzzle,

as I was saying, is pandas, which made Python a good language for working with any kind

of raw data and wrangling it into shape so that you can do analysis on it. So data preparation,

I love this, I love this quote, this happened during the Strata conference last February

in the Bay Area, that in doing data analysis often the data cleaning and data preparation

makes up a lot of the time that you spend doing the analysis. You know, people quote

80%, sometimes it could be 90%, and you spend a lot of time, you know, you prepare your

data, you do a bit of analysis and you might find some problem with your data and you have

to return to the sort of data cleaning, data preparation to sort out issues and bad, you

know, problems in your data. So making the data cleaning and data preparation tools easier

to use and more, you know, faster and more productive really overall speeds up the analysis

process a great deal. And that’s what pandas is for. It’s primarily a tabular, you know,

spreadsheet style data manipulation tool for Python. It’s very fast, so it’s much faster

than you would get writing something in pure Python. But one of the main features is that

it has a really nice API, so the functions, all of the methods and functions are intended

to fit really nicely together so you don’t have to spend, hopefully you don’t have to

spend a lot of time digging around documentation and remembering how functions work. So some

people like to describe it as like our data frames in Python and that’s not a completely

accurate characterization, but, you know, you can see, you can think of it, think of

it that way. And it’s been growing very, very fast and now has a quite large development

community. As an aside, another project I’ve been involved with is a little library called

VBench. So if you ever care about performance testing your code and are interested in, you

know, keeping track of how fast your code is, it’s a tool. I had to build it in order,

you know, to keep myself sane while developing pandas, monitoring whether things get slow

or not. So you can see in this example, you know, in the middle of 2011, there was an

operation that got slower for a while and then it got faster and faster over time.

time.

So this is a benchmark of a simple panda statement benchmarked at each version of the code base

over time.

So you can see how, keep track of whether the library’s getting slower or faster.

So one of the big things that I’ve learned in working on data tools is if your tools

are hard to work with, you’re less creative.

And I found this myself that, because I’m very curious, like I like to ask a lot of

questions and I like to be able to get answers to those questions very quickly.

So each time you ask a question, if you feel like you have to go through a lot of pain

and suffering to be able to get to the point where you can answer that question, sometimes

you may not even do it at all.

So the less impedance and the less friction you have in your tools, the faster you can

sort of iterate and ask different questions and analyze your data.

And this will hopefully lead to more creative work and better insights into your data.

So one interesting analysis, how many of you use the Stack Overflow website?

Most people, yeah.

So Stack Overflow has gotten much more popular over the last few years, and I think primarily

because people find that they can get answers to their questions really quickly, sometimes

in minutes.

The number of people who are actively trolling Stack Overflow and answering questions is

pretty remarkable.

People have an incredible amount of time on their hands.

So I was interested in looking at, you can query all the data from Stack Overflow about

any particular topic.

So I downloaded all of the Python Stack Overflow data for the last, since the beginning of

And I was interested in looking at, just based on Stack Overflow, what can we see about what’s

going on in Python and the whole world.

So let’s do that.

So I’m inside the IPython notebook.

So if no one has ever seen this thing before, I’ll give you a very brief teaser about it.

So the idea with the, if any of you have ever used Mathematica, this will be very familiar.

I think it was modeled on the Mathematica notebook.

And the idea is to be able to have code cells that you can put Python code in.

So if I put print hello world, I’m using Python 2, so I don’t need parentheses.

If I put print hello world, what’s going on is I’m inside Google Chrome.

So I’m in a web browser.

And the notebook application is running on my local machine.

And I can put a Python statement here, press Shift-Enter.

It sends that code to a running IPython process, executes it.

If there’s any output or plots or anything, it captures those and inserts them in the browser.

So if I had, so if I generate some random data with NumPy, so I just generated some

1,000 normally distributed random numbers, took the cumulative sum to turn it into a

random walk, and then I’m using the matplotlib plot function to make a plot.

And so to have this notebook interface is really nice because you can essentially tell

a story with your Python code.

You can also have cells that have markdown.

So I can put stack overflow analysis here.

And so I can put any markdown statements here.

So you can mix text and writing with your code and analysis and build.

People are starting to use the notebook to write books and papers and blog and do all

kinds of things.

So it’s a very, very nice tool ecosystem.

So what I did, I’m not going to take you deeply through the code, but I, so I downloaded all

of the stack overflow data into one big, one big pandas data frame.

So if we look at the, look at the tags column in the data frame, you can see that each,

each post, I’ll show you what a post looks like, if I can, okay.

So a post looks like this.

So you have a post ID, the date it was created, the person who, the person who asked the question,

and it’s, it’s title and also it’s tags.

So we’re really interested in these tags and that’s kind of a proxy for what the, what

the question is about.

So this guy asked about calculating harmonic series.

So it’s tagged with math and, and Python.

So let’s look at.

So what we want to do is to, to take all of these tags and to do some, some aggregate

the data to get some kind of idea about which tags appear the most over time.

And then maybe we can look at some trends and see how things are, what people are talking

about on stack overflow is changing over time.

So I won’t, I won’t belabor you, but I, I wrote a little regular expression to split

tags into a list of, list of strings.

So made a, a big table called tag table that now has for each post ID a sub tag associated

with it.

Then I can merge that with the, the post table.

So now if we look at this merged table.

So now we have for each post ID and each sub tag, in each sub tag, we have one row

in the table.

So now this is something that we can, that we can aggregate.

So now I’m going to take this merge table group by sub tag, call the size method and

that gives us the count of number of posts by number of posts by tag, order and descending

order and then take the top 500.

So make this a little bigger, gosh, okay.

So this is since the beginning of 2009, these are the top, the top occurring tags on stack

overflow for the whole, for the whole time span.

So as, as you might expect, well everything’s tagged with Python, so that’s everything.

But then, you know, Django being definitely the most popular Python library is next up.

You know, Google App Engine, NumPy is also up there, kind of shows you, you know, relative

to Django, you know, how, roughly how popular, you know, scientific Python is relative to

web development, regular expressions, Matplotlib, various things on down here.

If we look at the top, top 25, I think I have to go a bit further down the list to

find pandas.

I think pandas is in the top 50.

Yeah, so, so pandas is down here toward the end of the, end of the top 40.

And so this isn’t that interesting because this is just, well it is interesting but it’s,

it’s over the entire, the entire sample of the dataset.

So if we want to look at trends over time, what we can do, first I need to convert the,

the creation date column, which is strings, into timestamps.

So I do that.

And then I’m going to filter down the dataset to just the top 500 tags, just to make things

a little faster.

And then I’m going to group by tag.

And for each, for each subtag, I want to get a count by month.

So for, you know, September 2009, for each tag we would get a, get the number of posts

that occurred in, in that month.

So this is pretty, pretty standard pandas stuff.

I’ll show you what this thing looks like.

Okay.

So we have a, we have a big data frame now.

One column per tag, the index, which are the row labels of the data frame.

are time stamps, so going from January 2009

through June 2013.

If we look at…

the Python tag

and then plot that…

so this shows us, um,

the number of posts per month

over the full sample, so you can see, um,

well, Python is getting more popular, but

Stack Overflow is also getting more popular, um,

and then there’s this drop at the end, and that’s because

we’re in a partial month, so

I’m gonna drop off…

just go up to 2013-5-31

to drop off the June data,

so now we don’t have that partial month at the end.

And now there’s a few interesting trends in the dataset,

um, particularly, you know, between libraries

that do similar things, so you can see how, you know,

the popularity of things has changed over time, um,

and one thing that we want to do is, because you see that

Python questions are getting more popular in Stack Overflow,

so we might be interested in, to see, you know,

what percentage of, um, Python posts are about Django

over time rather than looking at the absolute number, um,

because we’re gonna see an uptrend basically in everything,

depending on this overall trend.

So I’m gonna divide everything by the number of Python posts,

and that gives us a percentage.

And so here you can see this is just the percentage

for pandas, so you can see the library, you know,

taking off, and now, you know,

a little over 2% of Stack Overflow questions, um,

in recent times are about Python

or also about pandas.

You can also look at Django and see that,

you know, back in 2010, you know,

almost 15% of questions were about Django,

and then you see Django becoming, well,

at least less popular based on this metric over time.

Something like Flask, which has been gaining in popularity,

um, you can see, well, it’s really only about, you know,

percent and a half, but, you know,

getting quite a lot more popular.

Python 3.

If anyone’s using Python 3,

you can see generally people not being as excited

about Python 3 a few years ago,

and more and more people are adopting it.

Matplotlib.

Matplotlib also getting more popular over time.

Regular expressions staying about the same.

I think people generally have problems

with regular expressions all the time.

I know I do.

Other things, maybe like IronPython.

So I think a few years ago people were more excited

about doing Python on .NET,

and there was a lot more activity around IronPython,

but it’s sort of declining in popularity.

And then there’s other, like, things like Twisted.

I don’t know if any of you use Twisted

web stuff, so you can look at, you know,

the popularity of Twisted over time.

You know, it had a very more popular period

and it’s become less popular.

And Tornado, which is a new library,

you can see appeared in, you know,

late 2009 and has gotten a bit more popular over time,

but still less than a percent of questions.

So this is fairly interesting.

But we might want to look at, you know,

trending tags,

like things that have experienced the most growth

over the last year,

either growth or decline,

to see kind of what’s moving and shaking in Python land.

So I’m going to go back to my data aggregation,

and rather than resampling by month,

I’m going to go by year.

Of course, that doesn’t want to work.

Oh, great.

Let’s try A.

Okay.

So now we have data

annually.

So now you can see this is…

So we’ll just look at this

graphically, make a bar plot of

Python posts.

So here we’re looking at…

This is a partial…

This is a partial year.

But we can…

Let’s just look at Django.

And now I need to do…

I need to do that normalization thing again

so we can look at percentages.

And feel free to have at the data yourself

so you can find.

So if we look at norms Django…

So here is the Django plot

over time.

This is 2013, 2012.

And if we take just this time series,

there is a method percent change

that gives us the year over year change

on a relative basis for each year.

And so what we want to do is do this

for every tag

and then find the top 10

and the bottom 10 to see what’s changing.

So I’m just going to do that.

So we’re going to call percent change

on the entire data frame.

I’m selecting the percent changes for 2013.

So let’s call this

what’s happening 2013.

Order by value.

This doesn’t look good.

This might be a bit distorted

because we don’t have the counts here

so we might want to…

Well, these are the top 500

so it should be fairly accurate data.

So…

Let’s get the downtrends

and then the uptrends.

Okay.

Yeah.

So here’s the uptrends.

So this is maybe some things you might expect.

In particular, you know, Raspberry Pi

has gotten super popular.

How many of you own a Raspberry Pi?

Very cool.

You know, Python 3, getting more popular.

pandas getting more popular.

Python, Scikit-learn.

We can see, you know, Flask and Node.js

also getting a lot more popular.

People using Python in a web system.

IPython getting quite a lot more popular.

And in the downtrends…

I don’t know if there’s anything.

It would also be interesting to look at the counts here

to see how many posts overall there were.

There was a real…

So metaclasses and metaprogramming,

I don’t know if you noticed, there was a big…

People just couldn’t stop talking about metaclasses

for a couple of years there.

So when I started to look at this data,

somebody said, oh, you should look at, you know,

how much people are talking about metaclasses.

And so you can see that, you know,

the number of Emacs questions down 40%.

So people are not big Emacs fans anymore.

Let’s see.

I mean…

Down 20%.

So just people aren’t asking about text editors, you know.

Maybe, like, sublime text is in here.

Sublime text 2.

All right, plus 37%.

How many of you use Sublime Text 2?

I don’t, but it’s a very great text editor.

All right, so as you can see, there are some things

changing, happening in the Python community.

There are, I guess, more general, broader trends in

technology that we can talk more about.

And I just wanted to comment on a few of them.

So generally, with the growth of web technologies getting

better, web browsers getting better, there’s more and more

of a push to putting things in browser-based interfaces,

things like the IPython Notebook, putting the

computation in the cloud rather than on

your local desktop.

Essentially being able to do computing anywhere.

Now that there’s the Chromebook, being able to

essentially have a computer just be a view on some remote

computation in the cloud and to not have to do anything

locally anymore.

And you can imagine how this would be beneficial.

I’m sure you’ve all experienced headaches with

deploying, installing software on your computer.

And you could solve that problem by having everything

be handled by somebody else in the cloud and not have to set

up your own machines.

I’ve suffered a lot from that myself.

One of the things that’s also pushed people to do more data

stuff on the web and in the cloud is that web graphics has

gotten a lot better.

So I don’t know how many of you do JavaScript.

Anyone use D3 in here?

So I’ll show you a couple D3 examples.

So generally, three or four years ago, it was much harder

to do a lot of interactive visualization on the web just

because there’s Internet Explorer.

And so you’d write JavaScript code, and it would work in one

browser and not at another.

So with more and more, the web technology becoming more

standardized, you can have more confidence that you can

build something once, and then it will run everywhere.

There’s been a lot of talk about, like Calvin was saying,

about big data.

People are also starting to talk about whether big data is

overhyped, which I think is probably true.

But big data is still an important trend and one worth

paying attention to.

I guess the other thing I’ll talk about very briefly is

just-in-time compilers, making Python faster, making

languages faster, generally speeding up computation.

And I would say that the big challenge for Python over the

next four or five years, especially with the big push

toward the web and toward the cloud, is how to keep Python

as an important and relevant language that people want to

program in and that they can use to build systems in the

way that they need to be built for the future.

I don’t know that I know the best answer to this.

I have some ideas, but I think this is going to be something

that we all have to pay a lot of attention to.

Hopefully, we don’t have to give up on Python and just

program in JavaScript, because that would

make us all very sad.

But there is something very alluring about being able to

do all of your data analysis on the web, to be able to

build something in one place, share that with anybody,

access your data and your analysis, and to be able to

ask and answer questions very quickly from any computer in

the sense that sites like Bitbucket and GitHub and nice

sites for software development have really revolutionized

collaboration on software projects.

I’m very hopeful that the same thing will happen

for data analysis.

There’s really not that kind of thing.

It’s still very much, here’s my code on GitHub and maybe a

link to the data, and you can reproduce the

analysis in that way.

So I’m very interested in this problem.

One of the big issues is that implementing all of the data

processing in JavaScript is a pretty tall order.

I’ve written quite a bit of analytics code in JavaScript,

and it’s really just JavaScript is not meant for that.

The other problem is that you’re not going to process

all of the data in the browser.

So whenever you’re working inside a web browser, you have

to only be looking at small subsets of the data.

And so that really complicates your system.

You have to do data processing on the server side and be

interacting with only a small subset for

visualization.

But on the other hand, your brain can only look at so much

data at once.

So even if you had a whole giant pile of data in the

browser, you wouldn’t need all of it in there at once.

The other problem, of course, is building

interactive applications.

You get very used to having a desktop application.

You press Enter, the code runs, the results happen

immediately or as soon as they run on your

laptop or your desktop.

But when you’re on the web, you have to think about server

round-trip time and cloud storage and how data moves

between machines.

And it becomes a lot more complicated.

So of course, web graphics has just really had a renaissance.

Well, there really never was a birth before.

So it’s really had a fantastic growth of really

nice and compelling visualizations on the web.

The main technologies that are driving that

are SVG and Canvas.

The big SVG library that everyone uses, it’s a vector

graphics library for JavaScript, is called D3.

It stands for Data Driven Documents by the venerable

Mike Bostock.

Let’s see if I have a couple of examples here that I’ll

jump to in a second.

So with regard to JavaScript, I think in order for Python to

really stay relevant, rather than saying, no, I’m not going

to program in JavaScript, I think we have to build

interfaces to JavaScript and to the web technology so that

you can keep your serious work in Python.

But then whenever you need to build a visualization or you

need to send those results of something that’s happening in

Python to the web, that you have a nice way to do that.

And that you can enable other people to get their data and

analysis on the web.

And there’s been some really great examples of this.

So what I’ve already showed you, the IPython Notebook,

essentially it runs in the web browser.

It’s a big JavaScript application.

And now when you look at the IPython project, it’s become

not 50%, but a lot of the code in IPython now is in

JavaScript, which is not something I think the IPython

guys would have predicted a few years ago.

Another great example is the RStudio IDE.

Any RStudio or R users in the room?

All right.

So RStudio, it’s been around for about three years and

maybe four years now.

But it’s an IDE for R programming.

And so it’s got graphics and a shell down here.

So that’s R code, so graphics and all this.

But the really amazing thing about this application is that

even though it’s running on my Mac and looks just like a

normal application, the folks who build it, JJ Allaire and

Joe Chang and others, they build it entirely using web

technologies.

So this exact same application could be run inside the web

browser, but users of it would never know that they’re

looking at a web application.

So it’s pretty brilliant.

And I think it’s a model for the future of being able to

have applications that can run either locally on the desktop

or can run inside the web browser in the cloud and to

just not know the difference.

So you don’t have to make a choice of whether you’re going

to be in the browser, in the cloud, or

local on the desktop.

Another thing that can really help is building integrations

between libraries like pandas, like data and analysis tools,

and charting libraries like, well, I’ll show you a couple

of examples here.

There’s a fellow in Portland named Rob Story who’s got some

libraries for integrating pandas with Java.

JavaScript libraries, so here’s one example using the

library vega.js, which is a pretty new package.

And making a bar chart, I think, I don’t know if this

uses pandas data, but no, this is just using vanilla, sort of

raw Python data.

But this chart is not generated by Matplotlib.

It’s rendered in the browser, and this is vector graphics.

And here’s an area plot, and here’s a line plot.

And I think this is generally a good idea, especially when

you have things like the IPython notebook, which you

run in the browser, makes it a lot easier to incorporate the

cutting edge developments in JavaScript land.

Here’s another example, also by Rob Story.

He has another project called Folium, which is as soon as

this loads.

And it integrates pandas with a library called leaflet.js,

which is a mapping library.

So this is some data about the US that’s being overlaid on a

map, and the legend up here.

And so I can zoom in on this, and it doesn’t

give me tool tips.

So this is interactive JavaScript visualization that’s

being fed directly by data from pandas.

There’s other libraries like ChartKick.

So it’s actually a Ruby library.

There’s a Python one now.

And the idea is to have very simple Ruby or Python code

that makes it easy to take data in those languages and

build interactive JavaScript visualizations.

So each of these charts, one line of Ruby.

There’s also the equivalent Python interface that gives

you the same thing with one line of Python.

So if you need to get graphics in the browser and you want to

use Python, these are good options to consider.

So I think a lot about generally what makes the

perfect data language.

And as much as I love Python, I’m not completely convinced

that Python is it, but it’s pretty good.

And I think it’s gone quite far.

And I’ve been very happy using pandas, and people so seem to

be a lot of other people.

But I think there’s still work to do.

And some of the problems have to do with low-level concerns,

things like missing data, so designing all of the data

structures with missing data in mind.

You want to spend most of your brain cycles thinking about

what you’re doing with the data rather than how you’re

going to do it.

And that often boils down to nice APIs and clean syntax.

So one of the big problems with R is that the syntax and

the language itself often gets in your way.

And that’s one of the reasons why I wanted to do Python way

back when, but there’s even things that are a bit hard to

express in Python.

And using SQL is nice for certain kinds

of data operations.

But yet there’s a lot of things where writing SQL

queries is not a particularly good way to

express what you want.

So I think we haven’t quite got there yet.

But I think a lot of what you need, the core built-in

operations of grouping and sorting, doing set logic and

filtering, and all of that, all of that needs to be there

and needs to be fast and really easy to use.

So another important trend, especially on where the

computation is happening, is more and more projects are

being built on top of just-in-time compiler

technology.

So if you’re not familiar with what a just-in-time compiler

is, the idea is that you have a framework that can take a

high-level description of an algorithm and generate fast

machine code, assembly code, on the fly.

So that sort of steps around the traditional, if you wrote

C code, you would write that C code, compile it, and then run

it, but if you need to generate a custom function on

the fly that’s optimized for a particular specific

application, you wouldn’t want to generate all of those.

You could end up with millions of combinations of

specialized functions that you might want to compile.

So having a tool chain like LLVM enables you to build

custom functions and then generate very, very fast code

that’s every bit as fast as if you wrote hand-coded C code.

And one of the really great things about these tools is

that you can write, there’s a tool called Numba, which came

out in the last two years.

I think last year, really.

And it enables you to write Python code, and then the

Numba tool chain uses LLVM to translate that Python code

into machine code.

Essentially, it’s about as fast as if you wrote C code.

Sometimes it’s faster.

There’s another language called Julia, which is a new

scientific programming language, and it’s built

entirely on top of everything you write in the language gets

passed through the LLVM compiler tool chain.

So all of the code is compiled on the fly and is very, very

fast as a result.

Some of you are probably familiar with PyPy, which has

its own just-in-time compiler tool chain.

And that team has really done some amazing work, squeezing

performance out of Python, doing optimizations that

people never thought were possible.

There’s other projects.

I put up here a project from Cloudera, which is a big data

company in the Bay Area.

And they built a project called Impala, which is

basically a database engine or a SQL engine

on top of Hadoop.

And part of where it gets its speed is by using LLVM to

take SQL queries and generate really fast, specialized

functions that process the data.

And I think that sort of thing is going to be the model for

the future.

Now, in big data space, the big trend in the last two

years is to do SQL everywhere.

So I brought up SQL a couple of times.

I was going to write up here, on here, SQL all the things.

But this is actually a pretty interesting set of benchmarks

that came out.

I don’t know if any of you can read this, but there’s a lot

of competing SQL interfaces to big data.

So this was a set of benchmarks that was done

comparing Hive, which is Hive was developed at Facebook.

And it basically takes a SQL query and converts it to a

MapReduce job and runs it on Hadoop.

It works, but it wasn’t built for performance.

So it’s pretty slow.

So a lot of people have been building

replacements for Hive.

So this was done by the Spark team from Berkeley.

And they have a project called Shark, which is Hive on Spark.

And they were comparing that with Impala and also Redshift,

which is a big data SQL engine that came out about six

months ago, well, within the last year, from Amazon that

runs on Amazon Web Services.

So basically in big data land, there’s a big shootout going

on to who can build the fastest SQL engine for big data.

And so it’s very interesting to pay attention to.

And a lot of it’s just doing pretty simple

group by aggregations.

But it turns out that that represents a lot of business

use cases.

So basic conclusions that I have for data tools and where

things are going, I think we all have to embrace the web.

It’s where things are going, the web and the cloud.

The desktop model of computing is really on its way down.

In the future, it’s going to be tablets and your phone.

And well, I guess the tablets are shrinking in size.

The phones are increasing in size.

So I guess at some point, we’ll just have one device

that we occasionally hold up here.

And sometimes we type.

So device consolidation, the user interfaces, those need

to get worked on, of course.

How do you build a data analysis interface for touch?

I don’t know that I really know the answer to that.

with data tools there’s still a lot more work to do. I think tools like pandas are a good

step in the right direction and I think have made people a lot more capable and productive

in working with their data. But it’s definitely worth examining, especially when you’re working

with data yourself and you run into something that is tedious or takes you a long time or

you can’t figure out how to do something, write that down and tell people about it and

open an issue on GitHub or write an email to the mailing list to tell people the problem

that you ran into and tell them about your use case and what you see as roadblocks or

problems with the tools for your particular problem. And I think the more that the open

source development community understands problems that people are experiencing in the real world

that really helps everyone make good design decisions and new thinking about the tools

and how to make them better. And of course I think big data is here to stay. The Python

ecosystem is definitely good for medium data. Python really hasn’t had mainstream success

in big data. So if we go back to this benchmark page, the same project that the team that

made these benchmarks for Shark, for the Spark project, there’s now a Python interface to

Spark so you can write functions in Python that process big data. So to the extent that

we can build bridges to the big data ecosystem so that we can get more people programming

Python and processing big data with it, I think that will also help Python stay relevant

because the data is not shrinking in size. Maybe in two or three years we’ll all decide

that all this data, petabytes of data that we’re warehousing isn’t really worth much

and we should just start throwing it all away. So yeah, that’s my talk. I really appreciate

you having me here and I’ll have some time for some questions which, you know, happy

to comment further on any of these topics. Thanks. Yeah. Oh, could I have the mic? Yeah.

Thanks for the talk, guys. Once again, you did the Python conference in New York for Finance last month. Can you tell me what is the update in the Finance area?

Sorry, could you repeat the question?

You did the Python conference in New York for Finance. So can you give an update on the

use of Python in Finance? Thanks.

Yeah, so the question is how the use of Python in Finance is changing or doing generally.

So Python has become really, really popular in Finance in short. When I started using

Python in 2007, people thought I was crazy. They were like, this is not a serious language,

you can’t build production systems with this, the only real choices are Java and C++.

So now everyone is hiring Python developers as fast as they can for doing financial work

and I think that organizations have seen the productivity benefits both in their system

maintainability, developer happiness, and being able to get more done with fewer people.

And also, because the libraries have gotten a lot better, people are, big financial firms,

even like big investment banks, JP Morgan has, I don’t know if any of you work for JP Morgan,

but they’re big Python users. Bank of America, I know folks at UBS and Goldman Sachs,

and really all the banks, people have really taken to Python.

Scala, which is a Java JVM language, has gotten a lot more popular in Finance as well,

so I think the two big trends there are in Python and Scala in Finance.

Well, I’m not in Finance anymore, but I’m excited to see Python doing well there.

I was wondering whether you could go into more detail about the collaborative effort.

Right now, these tools like Python, Node.js, they’re very cool as a single user kind of thing,

and of course everybody has their GitHub repository, but is there anything on the horizon

that you can see where you can actually make collaborative work with other organizations here?

Yeah, so my friends over at Continuum Analytics,

they’ve built a hosted IPython notebook environment that,

I think there’s a free plan on here, so let’s see,

there’s a free plan and then there’s paid plans that hosts IPython notebooks

and enables you to share notebooks among a group of collaborators.

There’s also, and they have a lot of other tools there for like environment management

if you want to build like custom environments and all that kind of thing.

There’s the NBViewer site, which is a place to,

you can upload an IPython notebook someplace, like on Gist on GitHub,

and then generate a link on NBViewer and then send that out

and so you can get a static view of an IPython notebook.

So that also helps with collaboration.

There’s no multi-user version of the IPython notebook.

I know that the IPython team, they got a big grant,

like a million dollar grant to work on IPython for the next two years,

and so they’re pouring a lot of resources into improving the IPython infrastructure

to be able to more easily integrate with JavaScript libraries

and building interactive widgets and things within IPython.

And I think they’re also hashing a plan for like a multi-user IPython notebook

because what you really want is something that’s kind of like Etherpad

or any of those kind of collaborative text editors on the web

and to be kind of both looking at an IPython notebook

and working, collaborating in real time on a project.

I think within two years we’ll have it, but I’m not sure exactly when.

man in audience speaks

Why is Python getting more popular?

Well, I don’t have complete information

about why Python is getting more popular,

but I would say that the libraries have gotten better,

both in the analytics, like Scikit-learn for machine learning,

stats models, pandas for just general data work,

and that’s brought a lot of people to the Python ecosystem.

Sort of the web development libraries have gotten better,

so more and more people are…

Well, really, I think maybe Ruby has won in web development.

Like, there’s a lot more Ruby developers than Python developers,

but generally there’s a lot more people doing web development

and building websites with Python.

There’s also a lot of refugees from Java and C++

that have maybe some of them in the room.

I programmed in Java before I did Python,

so I think more companies are becoming more comfortable

building software in interpreted languages than they used to be.

Like, people have embraced unit testing and dynamic types,

and people used to be a lot more uncomfortable

with not having static types,

and so more mainstream acceptance of dynamic languages

has definitely made Python more popular.

I’m doing scientific research,

and in my area, most of the people, they use pandas.

And if I introduce Python to all these people,

they will wonder what Python can do,

and I tell them Python can do nearly everything.

But it’s not very specific.

It’s like with pandas, you can clearly see

any linear things you can solve.

All you need is R.

You can think about statistics.

Don’t you think there are too many entities

and too many functions?

So, yeah, I think one of the main things that’s holding back a lot more MATLAB users from

moving to Python is mainly the development environment.

People really love, like, the MATLAB IDE, the profiler, the debugger.

You know, the R community has been essentially solved that problem in a lot of ways with

RStudio, and so I would be very interested to see, you know, the equivalent of RStudio

in Python.

I think that would help kind of really, you know, drive the nail in the coffin on MATLAB

a bit more.

But yeah, I think, you know, you can do pretty much all the things you can do in MATLAB and

in Python.

You know, some of the library is a bit rougher around the edges, but I think the benefits

that you get from, you know, better data structures, you know, object-oriented programming, tools

like pandas, which help with data preparation, you know, I think they do bring a lot of benefits.

So maybe if you want to sell more people on switching away from MATLAB, like, look at

their code and find the places where they’re kind of having to, like, really hack their

way around limitations in MATLAB and show them a better way.

And I think once people realize that they can be, you know, by switching their technology

they can make themselves, like, more productive and, you know, be able to get more done.

So I think they see, you know, okay, well, if I program in Python, then, you know, I

can get more research done, you know, kind of the benefits become, yeah.

Well, my real question is, like, why there are so many pieces in Python?

I mean, Ruby is quite relevant.

You think of Ruby on the real line, the only free one.

But in Python there are too many choices.

Yeah, there’s a lot of fragmentation.

Well, at least it’s easier to install now with things like, you know, Anaconda, which

is, like, a free, you know, scientific Python distribution.

So, you know, with a few clicks you can have everything installed.

That used to be the big barrier.

It’s like, oh, I’ve got to install, like, 50 packages before I can, like, make a plot.

So that definitely is easier.

But the library, you know, the multiple library problem, I guess I don’t really have a good

solution for that.

I think, you know, with pandas what I’ve tried to do is to integrate multiple libraries and,

you know, be able to access.

If you want to make plots, you can make plots without knowing anything about Matplotlib.

You can do array computing without knowing deep information about NumPy.

So that, you know, you essentially have to learn one set of integrated tools.

And, you know, I don’t know what problem domain you’re in.

But I think generally, you know, to build a set of, you know, domain-specific sort of

tools and helper functions and utilities that help kind of, you know, reduce the amount

of, you know, having to use, you know, eight different libraries to solve a problem.

Yeah, I don’t know.

It’s a problem with open source.

You know, you’ve got a very highly distributed development community rather than, you know,

kind of the math works in Needham, Massachusetts.

You know, building a tightly integrated development environment.

I guess rather than asking for fewer libraries, I think that’s for one more or actually about

one more.

You had some good examples of the web technologies that are sort of in the space of G3 for JavaScript.

And some of those things that basically create SVG Canvas as alternatives to Matplotlib.

So is there, I mean, when I was looking at it, I was thinking, well, not Python, but

wouldn’t you want to just, wouldn’t the idea be to just overload everything in Matplotlib

with something that generates SVG?

I mean, is there any movement towards that?

So that effectively, once you’re in a web framework, you would be generating web graphics

rather than the graphics that are intended for the more traditional web style.

Yeah, so the question is, you know, when you’re in, you know, the IPython notebook and in

the browser, like, why not do everything, you know, with web graphics and SVG?

That’s definitely the direction that we’re going in.

I guess we’re just not, we’re not there yet.

It’s like we need more developers and more people working on that problem.

So, you know, you saw here in, you know, I’m in R and this is made with ggplot2 and doing

these kinds of, you know, faceted statistical graphics, you know, it’s very easy to do in R,

but doing this, you know, even with D3 is pretty difficult.

But I think, you know, five years from now, we’ll have a set of JavaScript libraries

for statistical graphics that everyone uses.

And we have interfaces to them from R, from Python, from JavaScript.

And, you know, all the graphics is going, is clearly going to the web.

And I think, you know, generating static desktop graphics is going to be

a thing of the past in the long term.

Just it’s, yeah, it’s just a lot of development effort.

I’m, you know, my company, we’re working on a lot of things in that direction

and we’ll definitely open source some packages that help with that at some point

once they’re ready for people to use.

I have one more question.

There’s a problem with Python and the CPython implementation

in GIL. So the question to me is, how does that affect data processing?

Yeah, so the concurrency issues in Python,

if you’re doing data processing, that can all be done, you know,

in a multi-threaded way, you know, that releases the GIL and

you just have to kind of design carefully for that.

Like if you’re using, I’m a big fan of Cython for writing

numerical codes. And Cython has, you know,

[1:00:00] a version of range, PRange, that will parallelize a for loop.

[1:00:04] And so as long as you’re not accessing any Python data structures inside the loop,

[1:00:08] just doing numerical computing, then you can have truly multi-threaded code.

[1:00:12] If you’re doing, you know, well, if you’re doing

[1:00:16] a lot of IO stuff, then, you know, the GIL is not really an issue

[1:00:20] because the GIL gets released. But there are occasionally things

[1:00:24] where the GIL does get in the way. And it is a long-term liability.

[1:00:28] So it’s something I do worry about.

[1:00:32] Yeah, thank you very much.

[1:00:46] Thank you very much.