2012 PyData Workshop: Data Analysis in Python with pandas

Video
Event PyData
Location Mountain View, CA
Date March 2, 2012

This transcript and summary were AI-generated and may contain errors.

Summary

In this PyData workshop, I present pandas about four years after starting the project in 2008.

The Genesis Story

Working at a quantitative hedge fund in 2008, I found R’s data manipulation capabilities frustrating and the language itself “sort of horrible to program in.” This led me to create what I called “the ultimate data tool for Python”—a library that would allow analysts to work entirely in Python. As I note, “I get a lot of people who say, you saved me.”

Core Architecture

pandas has two fundamental data structures. The Series is “a NumPy array plus an array of labels,” while the DataFrame is “an indexed collection of series objects”—a table with both row and column labels where columns can contain different data types. Unlike NumPy’s rigid structured arrays, users can add and delete columns easily.

The library automatically handles data alignment through joins. When performing operations between differently indexed datasets, pandas inserts missing values where data doesn’t align: “I see this problem occurring over and over again where you have outputs of analysis where things are messy, things are not homogeneously indexed, and you don’t want the tool to get in your way.”

Indexing and GroupBy

I demonstrate hierarchical indexing, which enables efficient operations on large datasets without excessive copying. GroupBy operations implement SQL-style aggregations that split data based on multiple criteria, apply functions to each group, and recombine results.

The demo uses baby names data (258,000 records from 1880-2008), showing how American naming patterns have diversified. In 1880, just 15 names accounted for 50% of boys’ names; by 2008, this number had grown to 127 names.

Ecosystem Integration

The presentation shows pandas’ integration with IPython notebooks, matplotlib for visualization, and NumPy for computation—what I describe as a “very integrated and powerful tool chain.”

Looking Forward

My company Lambda Foundry aimed to address what I saw as the underserved needs of quantitative professionals who “spend 50% to 80% of [their] time just doing really basic development” with inadequate tools. I observed that while machine learning attracts computer scientists who “understand the value of building reusable software,” fields like econometrics lack this culture.


Key Quotes

“The R language is sort of horrible to program in.”

“I’ve been building this tool to be the ultimate data tool for Python.”

“I get a lot of people who say, you saved me.”

“It turns out that you can get a lot of work done when you focus on one thing and do nothing else for about eight months.”

“The code base… it’s tripled in size in nine months, which amazes even me.”

“You spend 50% to 80% of your time just doing really basic development. I find that really backwards.”

“Computer scientists understand the value of building reusable software, which is kind of a sense that’s not as prevalent in economics and econometrics.”

“I see this problem occurring over and over again where you have outputs of analysis where things are messy, things are not homogeneously indexed, and you don’t want the tool to get in your way.”

“Pandas offers this sort of very integrated and powerful tool chain for expressing operations on indexed data.”

“Cultural revolution, everyone wants to, it’s like, oh, my child is special, my child is unique. You’ve got to be more creative about names.”

Transcript

So my name is Wes McKinney.

I’m going to be talking about a library that I’ve been working

on in Python for about four years called pandas, which is

sort of a mashup of lots of things.

Panel data, Python data analysis, or take your pick.

If you like thinking about it as being inspired by the

animal, then certainly be my guest.

Very happy to be here.

So the story with pandas.

So I guess most of you have used NumPy.

Am I correct?

Has anyone not used NumPy?

Well, that’s fantastic news.

So anyway, so pandas is, so the story is I started out

working in Python at the beginning of 2008.

And I was working inside of a quant hedge fund.

And I was using R. And long story short, I became very frustrated with the data manipulation capabilities that R provided.

Particularly with regard to time series data, which is

really just a special case of a labeled array data structure.

And so I started looking at, well, OK, is

there anything in Python?

First, is there anything in R that has the features that I

needed to really have a rich tool for working with

collections of time series data?

I basically decided that, no, there wasn’t.

And plus, the R language is sort of

horrible to program in.

I was already not satisfied with building real production

systems in it and wanted to do things in Python.

But there was sort of a missing layer there between

NumPy and the tools needed to implement the

domain-specific logic.

So I started building this library.

And basically now, four years later, it’s this very rich

data manipulation tool, which is built on top of NumPy.

So it sits on top of NumPy and provides a lot of features,

like Travis was saying, that are not built into NumPy,

which eventually some of them will be.

And that’s part of some of the work that’s going to be

happening over the next couple of years.

So there are data structures.

The library is primarily data structures.

But there are also a lot of auxiliary tools for working

with those data structures, getting data in and out of the

pandas data frame object, writing things to disk,

different kinds of persistence, tools for working

with databases, just all of the things that you need to

solve real-world problems.

And I’ve worked in many different problem domains in addition to finance.

So I’ve been building this tool to be the ultimate data

tool for Python.

A lot of people who’ve used R have started using pandas and now can do all their work in Python, which I think is really great.

So I get a lot of people who say, you saved me.

It’s like, I was having to switch between doing my data munging in Python and then feeding it into R and doing the work in R. And now they can do all that work in Python.

So a lot of very happy people out there.

Certainly has made a big difference in my work.

So development had kind of stagnated a little bit.

But I picked it up and have been working on it basically

full time since about last May.

And so it turns out that you can get a lot of work done

when you focus on one thing and do nothing else for about

eight months.

And so there’s been a huge amount of growth.

And that upward trajectory is going to continue throughout this year.

So I actually looked, and the code base, it was already a pretty big code base, but it’s tripled in size in nine months, which amazes even me.

And so I’ve started a company.

It’s called Lambda Foundry.

So if you’re a knight of the lambda calculus, then perhaps

you like the name.

So we are supporting pandas development and also starting

to build a product ecosystem for quant finance.

So I find that quants out there, a lot of them are

programming in R and MATLAB.

They’re underserved by the tools that they have.

And there’s not a lot of infrastructural support for

that problem domain.

A lot of homebrew systems, a lot of in-house development.

And the result is that those kinds of users just aren’t as

productive as they could be.

You’ve got these incredibly smart people, many of them

with backgrounds in PhDs in physics or

economics or finance.

They’re doing modeling.

They’re doing research.

But they struggle with the tools, and you spend 50% to 80%

of your time just doing really basic development.

I find that really backwards.

Some other projects that I’m involved in: I’m also involved in the statsmodels project, which is statistics in Python.

If you’re interested in doing more statistics in Python, we need more developers on statsmodels.

It’s not the sexiest development out there.

Scikit-learn, which is a machine learning library for

Python, has a huge development team.

It’s extremely active.

Of course, machine learning is kind of a sexy topic.

But also, computer scientists understand the value of

building reusable software, which is kind of a sense

that’s not as prevalent in economics and econometrics,

which is what statsmodels is more about.

I’m also writing a book.

I just literally got the cover today, and they’re going to

change it from my legal name to being just

Wes at the bottom.

So I’m writing this book with O’Reilly, Python for Data

Analysis, and the idea of the book is basically to present a

sort of a nice introduction to all these tools that we’re

talking about today, but tailored for a user who is working with data, working with structured data, maybe has some R experience and wants to make the switch.

I got an email from somebody who said, I’m

drowning in Python.

I’ve got some R experience.

I’ve got some SAS experience.

Where do I start?

How do I learn about these tools and become productive?

I perceive that really there’s a need for that, and so on top

of all the other things I’m trying to do, I am writing a

book, and it will be out later this year.

So it’s going to be about the fundamental tools in the tool

chain, IPython, which you’ll see more about, NumPy,

Matplotlib, SciPy, a few other libraries that

are in common use.

But a lot of the book is going to be about pandas and

focusing on very practical tools for real world problems

and looking at actual data and going through case studies and

showing how to synthesize all these tools

to solve your problems.

So for the next hour and 15 minutes or so, I’m going to

give you an overview of pandas, explain the data

model a little bit, how it relates to NumPy, and then

basically jump into the IPython notebook.

And any of you who can get it running on your laptop, I hope

that’s many of you, I will upload the notebooks.

I’ll create a little archive and put it in the Dropbox

for you to download.

And we can fire that up and follow along with the

examples.

I think it’s going to be mostly just me showing you

things because I have a lot to cover.

So I’ll cover some of the basics of getting data in and out: from flat files, from JSON objects.

So if you work a lot with web data, basically if you’ve got a bunch of JSON objects and you want to load them into tables so that you can then crunch that data, analyze it, and create summary statistics, how you can do that; and examples showing how you can take data out of a database and put it into a data frame.

Explain how label-based indexing works, which is

something that’s not really present in NumPy.

Showing you how that can integrate into data alignment,

which is a big part of pandas.

Talk about descriptive statistics and missing data

and how that’s implemented, how you can apply functions

to your data in an intuitive way.

Talk about database-like join operations and concatenating

data together.

Talk about GroupBy.

So if you’re familiar with SQL, you’ve ever written a

SQL expression with something with GroupBy where you’re

aggregating or transforming data sets.

How that looks in pandas; it’s a big missing feature from NumPy.

I think I could take the GroupBy engine that I built inside pandas and use it for NumPy arrays.

But as you’ll see, when you do GroupBy without labeling, and especially without hierarchical indexing, which I’ll explain more about, interpreting the result of a GroupBy expression is hard without that richer data structure.

I’ll talk about how the indexing and group

by relates to reshape operations and forming pivot tables.

So how many of you have ever created a pivot table in Excel or another spreadsheet like OpenOffice?

So I’ll explain a little bit what that’s about

and what that looks like.

Time series I’m not going to talk about.

If you’re interested in time series,

I’ve got advanced topics later today,

and that will be all about time series.

So I’m going to omit that from the discussion here.

I’m going to be using the IPython HTML notebook during the latter portion here.

So the pandas data model.

So there are two major objects.

The first is the series, which is just

a one-dimensional labeled array.

So it’s a NumPy array plus an array of labels,

and I’ll explain what the labels can be.

Really, the short story is that they can be anything, but they just can’t be mutable, because they’re used for lookups.

It can be anything that you could use as a key in a Python dict.
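
As a rough sketch of that (using an explicit pd. import rather than the from-pandas-import-star style of the demo, and made-up values):

```python
import pandas as pd

# A Series: a NumPy array of values plus an array of labels (the index).
s = pd.Series([4.0, 7.2, -1.5, 3.1, 0.8], index=['a', 'b', 'c', 'd', 'e'])

s['c']    # label-based lookup -> -1.5
s[s > 0]  # boolean selection; the labels come along for the ride
```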

The second major object is the data frame.

So if any of you have used R before,

I took the name, lifted the name directly from R.

A data frame is an indexed collection of series objects.

So you have a table of columns.

You have row and column labels.

The columns can be different types.

So if you’ve used structured arrays in NumPy,

it’s very similar to a structured array,

except it has a number of additional features.

So you can add and delete columns

very easily, which is not possible in a structured array

without creating new arrays.
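
A minimal sketch of that dict-like behavior; the column names are just placeholders echoing the slide:

```python
import numpy as np
import pandas as pd

# A DataFrame: an indexed collection of Series; columns may differ in dtype.
df = pd.DataFrame({'foo': np.random.randn(4),
                   'bar': ['w', 'x', 'y', 'z'],
                   'baz': [True, False, True, False]})

df['qux'] = df['foo'] * 2  # adding a column is dict-style assignment
del df['bar']              # deleting one is just as easy
```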

So the series object.

Now, the reason it’s called series, I guess,

for lack of a better name, you could call it a labeled array.

Or I guess you have series and time series.

A time series is a series that has dates as the index.

So you have an array of values, and you have

an associated array of labels.

So here, it’s just the strings a through e.

The data in the array could be any data

that you can store in a NumPy array.

So if you use NumPy, it fits very naturally

in a mental model of data.

The index labels, they don’t need to be ordered.

You can have duplicates, but there

are many operations which will refuse

to work if you have duplicates.

So if you try to look up the value at c,

and it turns out that the index has duplicates,

then it will throw an exception, because it’s not

a one-to-one mapping.

So a little bit more on that later.

But most of the time, when people

use index data in pandas, it’s always a unique labeling.

That’s how it’s designed to work.

So the data frame is adding a dimension.

So now, you have column labels and row labels.

And so here, I have columns foo, bar, baz, qux.

And here, I’ve got four columns.

They can be floating point.

They can be strings.

They can be Booleans.

But the important thing about data frame

is it’s the primary container for working with data in pandas.

So if you are carrying out an analysis,

and you compute additional columns,

you can add them into it like a dictionary.

So it’s intended to work just like a Python dict,

where the keys are the column labels,

and the values are the columns as one-dimensional arrays.

And it works well if you have only homogeneous data.

So if you just have a collection of floating point data,

it is very fast with that, and supports very fast row-oriented and column-oriented operations.

But you can also have something that’s

completely heterogeneous, where each column

is a different type.

So the story with the axis indexes, so here,

data frame, we’ve got a row index, which

is called the index, and the column index, which

is the columns.

So I’ll show you in the notebook.

So basically, these contain or are used to do the lookups.

So if you want to select out a row or select out a column,

it all goes through the index.

But they’re the core data structure,

which is used to do data alignment

and to implement join operations.

So one nice thing about them is because this index is

its own little data structure, we

can do something more sophisticated.

We can have hierarchical indexing,

which means that the semantics are, you can imagine having an array of labels, but each of the elements is a tuple.

But suppose that you wanted to select out a group based on the first element of the tuple.

So the naive way to implement that

would be to make a linear scan through the tuples

and then match on the first element of each tuple

and then select out those rows based on the first element.

But if you had a very large data set,

suppose you had a million rows or five million rows,

that would be, you’d have to make a full sweep

through the labels in order to just select out that data.

So you can imagine having a big performance bottleneck there.

So having a proper data structure

that represents the multiple levels of indexing

allows you to do a very fast log n lookup of a group

and select out a view on the data without copying any data.

So it’s designed in order to enable

you to work with very large data sets to reduce copying.

There are lots of situations where you can be working

with gigabytes of data in memory,

and you want to be able to work with groups of the data

without running into memory problems.

And so this is one mechanism that

really helps a lot with that.

And I’ll explain how this relates to group by operations

and reshape operations.
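
A small illustration of that idea, assuming a made-up two-level index:

```python
import numpy as np
import pandas as pd

# Labels that are conceptually (group, subkey) tuples become a MultiIndex.
idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1), ('b', 2), ('c', 1)])
s = pd.Series(np.arange(5.0), index=idx)

s.loc['a']  # grabs the whole 'a' group without a linear scan over the labels
```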

So data alignment, so the reason the indexes are there

in the first place is that you work with a lot of data

where you’ve got labels for it.

So in a database table, these might be your primary key,

which might be, you might have a couple of columns which

uniquely identify each observation.

And that information is intended

to be expressed in the index.

But you might have multiple collections of data

that are indexed differently.

And it creates a lot of problems if you’ve got two different data sets that are indexed slightly differently.

You want to be able to combine them together

without having to write a lot of code.

So that’s really what data alignment is all about.

And when you add, if you’re doing arithmetic

between data sets, so you can imagine if you’re working

with a time series, so here we have two series objects,

one which is labeled B through E and another which

is labeled A through D.

So when you add these two together,

you say I want to add the values, which

I want to add each value labeled B together,

each value labeled C together.

But then you’ve got some labels which are in one,

but not in the other.

So when you add them together, you’re

not really doing an array add.

You’re really doing a join between the objects.

So behind the scenes, when you add these two differently indexed series together, it’s really doing a join, so that in the result object, you get the union of the labels.

And then any place where there was a mismatch, you get an NA.

So you have a missing value in the result.

And then once you’ve done that, the missing values

are meaningful.

You want to know that this label was

missing in one of the objects.

And then you can choose after this point,

I want to just drop out all of the labels

where there was an NA.
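
In code, the example from the slide looks roughly like this (made-up values):

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0], index=['b', 'c', 'd', 'e'])
s2 = pd.Series([5.0, 6.0, 7.0, 8.0], index=['a', 'b', 'c', 'd'])

total = s1 + s2   # really a join: union of labels, NaN where either is absent
total.dropna()    # afterwards you can discard the mismatched labels
```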

So in the case of a data frame, so here I have one data frame

here, another data frame here.

You can see that the C column is missing from this one.

The row labels are different.

And when I add them together, it does an outer join

both on the row index and on the column index.

So you do get a bunch of NAs in the result.

And a deep dive into how missing data is handled is sort of outside the scope of this talk, for anyone who’s really interested in talking a lot about that.

But you get an outer join in the result.

And then you can decide if you want

to fill the NAs with some value or you want to drop them out.

So you have tools to then clean up the missing data

in whichever way that you want.
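
A sketch of the two-dimensional case, again with stand-in values:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 3)), index=['x', 'y', 'z'],
                   columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.ones((3, 2)), index=['w', 'x', 'y'],  # different rows
                   columns=['a', 'b'])                      # no 'c' column

result = df1 + df2   # outer join on both axes, so NAs appear in the result
result.fillna(0)     # fill them with a value...
result.dropna()      # ...or drop the incomplete rows
```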

In practice, it’s very common to get data that looks like this.

So it’s kind of a little mock-up of where you have a bunch of data that’s indexed differently, and you just want to throw it all into a table.

And you want it to take the union of all the indexes

represented by each of the columns

and create this uniformly indexed structure

that you can then clean up

and start to perform computations on

and actually carry out your analysis.

I’ve used R, I’ve used MATLAB,

worked with a lot of things,

and I see this problem occurring over and over again

where you have outputs of analysis

where things are messy,

things are not homogeneously indexed,

and you don’t want the tool to get in your way.

You just want to say, throw this all into a table,

union all of the labels together,

and then let me decide what to do with it once you’re done.

So it turns out that even R makes this difficult.

It shocks me that it’s difficult.

So as I was saying, the axes themselves,

so this is a case of where we’ve got a hierarchical index.

So this is a data frame with five columns.

We’ve got years in the columns,

and the rows are now hierarchical index

where here I’ve got a country name and an age,

and everything in the first column is Australia,

and then we’ve got a bunch of ages here.

But the important thing is that, you know,

that everything is sort of contained in the index,

and the index is separate from the data,

and we can use this labeling information

to perform transformations and to reshape the data

into the form that we need it.

So you can see that I stole this slide

from another talk that I gave.

So when we’re doing reshaping,

so let’s go back and look at this object.

So here, this data frame,

really each value here is identified by three things.

So there’s a country name, and there’s an age,

and there’s a year.

Suppose we wanted to rearrange this data in a different way.

We might want to, say, put the ages up here

and put the years back down here.

And so given that each value is identified by those three things,

that’s a very well-defined operation.

You can just, you know, you can see, say, like,

oh, just move age up here, year down here,

and then put all the values in the right place.

So how, suppose you had data like this,

I mean, how would that look?

So the two operations for doing that,

I wanted to kind of explain them in a high-level way

before I start, you know, firing code away at you.

So here’s a slightly bigger data set.

Here I’ve got multiple groups.

I’ve got two countries, Australia and Austria,

and I’ve only got five ages here.

And so you can imagine if we wanted to switch age with year.

So the two operations for doing that: pivoting up is called unstack.

So you can imagine if we unstacked age, that would pivot it up, and then we could take year and pivot that down, and that’s the stack operation.

And so I’ll show you how this works.

So if we said data frame, unstack age, stack year, you can see it switches the places of year and age.

And so that’s how, you know,

a reshaping operation with indexing would work in pandas.
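
A hedged sketch of that swap, assuming a small made-up frame shaped like the slide’s country/age/year data:

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['Australia', 'Austria'], [0, 1, 2]],
                                  names=['country', 'age'])
cols = pd.Index([2000, 2001], name='year')
df = pd.DataFrame(np.random.randn(6, 2), index=rows, columns=cols)

# age pivots up into the columns, year pivots down into the rows
swapped = df.unstack('age').stack('year')
```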

So we’ll see some more of that.

So as far as missing data,

I guess I’ll mainly show this through examples.

So I’ve augmented NumPy a little.

NumPy has a couple of functions which help for detecting special values in floating point data.

I use the not a number value as a missing data marker,

which occasionally makes people upset,

but it’s really the best solution that I’ve got.

And we can, you know,

if you’re really interested in missing data issues

and null handling, we can talk a lot more about it

because it’s a problem that I’ve been dealing with for a long time.

So I’ve got API functions in pandas, isnull and notnull,

and they’re the basic tools for detecting missing data

and for producing a Boolean array that you can use to select out

or to exclude missing data from pandas objects.
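
For example:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, np.nan, 3.0, np.nan])

pd.isnull(s)      # Boolean mask marking the NaN positions
s[pd.notnull(s)]  # keep only the non-missing values
```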

One gotcha of the missing data implementation in pandas

is you can’t have NAs in integer arrays.

And this is sort of an unfortunate implementation detail

that occasionally jumps up and, you know,

I get e-mails about it maybe a couple times a month,

and it’s peculiar to NumPy

and sort of, you know, the various options

for implementing missing data.

So it’s a big topic, and I won’t go into it here.

So talk a little bit about GroupBy at a high level,

again, before we start, you know, jumping into code.

So how many of you, I guess, have done some form of GroupBy

either in SQL or some other tool?

Okay.

So the general idea is that you’ve got some data

and you’ve got some labeling information for it,

and you want to, so here’s an example

where we had an array with labels ABC.

And so behind the scenes, you know, I mean,

the implementation is, you know,

you’re not really very concerned about that.

But so it splits the data into groups based on the labels,

and then you just want to apply some function

to the data in each group.

So here in this case, apply the sum function

to each of these groups producing the summed value

for the A group and the B group and the C group.

So you can imagine, you know, having an array of labels,

you know, like this, an array of data,

and you just say, I want to group this data by the labels

and apply the sum function, and you get out a new object

which has the values aggregated,

and then the labeling for the result

is then the unique labels in the original label array.

And so you can imagine if you instead had two arrays of labels,

then, you know, suppose that I don’t have them here,

but you can imagine I could have an array of ones and zeros,

and then there are actually six possible combinations

in the result.

So the result of the group by operation

could be up to six elements long,

depending on how many combinations

you observed in the data.
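
A compact sketch of both cases, with made-up labels:

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = ['a', 'b', 'c', 'a', 'b', 'c']

data.groupby(labels).sum()   # one aggregated value per unique label

flags = [0, 1, 0, 1, 0, 1]   # a second array of keys
data.groupby([labels, flags]).sum()  # up to 3 x 2 = 6 observed combinations
```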

So I guess you can begin to see

how hierarchical indexing kind of falls out of group by.

So if you have multiple layers of identifiers,

you do a group by expression,

then the result is going to be labeled

by the unique combinations of values in the labels.

So there are many ways that you can split data into groups.

So a very common, you know, if you’ve used SQL before,

the most common case is that you have a table

and some of the columns of the table

serve as the keys that you want to aggregate on.

So you might say group by column A, column B,

aggregate value C.

So that grouping information might already be columns in the data frame.

You might have external arrays.

You can say, I have this object, and I have this external array of labels, and I want to split it up into groups and then apply some function to each group.

And of course, you know, the question is,

where do the groups come from?

So, you know, we’re Python. We’ve got functions.

We can pass functions into,

we can use functions as objects.

So you can define custom functions

which can produce the group label

kind of in a very flexible way.

So I’ve seen a number of different, you know,

group by implementations.

If you’ve used R, if you’ve ever used the plyr package,

that’s got some, has some ways to define custom groupings,

but it’s a very flexible way of, you know,

you’ve already got this object with labeling information,

and you might want to take the labels

and then produce a group label out of that,

and that’s a very flexible way to group data.

So we’ll see much more of that.

So like I was saying about hierarchical indexing,

and I’ve got this example in the notebook

that we can take a look at.

So here was a data set where I had a data frame

with the country, once again,

the same data set that I showed you before.

So I figure why not go ahead

and just show you this in the live demo.

Yeah, so I’ll explain what we’re looking at here.

So are you familiar with IPython?

Well, who is not familiar with IPython?

Okay, so IPython is an enhanced interactive Python shell.

So it has tab completion.

It has good integration with the system shell.

It has an interactive debugger.

It’s basically the sort of code-running, iterative development environment that I guess most people in scientific Python use.

I use it every day.

You know, it’s where I develop Python code.

Now the IPython notebook is, you can see it’s running in a web browser.

So what’s going on here is that you are connected to an IPython shell running

in the system and you have editable code cells.

So here in this, you see mine looks a little different from yours

because it’s the bleeding edge version,

but the functionality is virtually identical.

So the idea is you have editable code cells

where see here I have from pandas import star.

And so in that cell, basically you press shift enter.

And what it does is it sends that code to the kernel, runs it.

If there’s any output, see if I had done print five,

then it captures that output and it displays it in the browser.

If I had done, if I put a plot in here,

so suppose I did plot of arange 10.

So that’s a matplotlib command.

It captures the plot and then it renders it and outputs it in the notebook.

So it could be arbitrarily complicated

and we’ll do some slightly more complicated plots.

But if you’ve used Mathematica before, it’s just like the Mathematica notebook

and it’s designed to be very Pythonic and, you know,

give you a nice way to sort of go through like an analysis workflow.

So I’ve got a data set here.

It’s been used in a number of other, you know, statistical graphics,

you know, talks that I’ve seen.

It’s the baby names data.

So it’s the top 1,000 baby names since 1880 in the United States.

So I thought it was a kind of interesting data set to look at.

So to get that data into the notebook here, I’ve got a,

so first of all, you need to run the first cell which has the import pandas.

So we see here, so shift enter runs that.

Don’t worry about the plot.

So I just shift enter runs that cell.

Sets things up.

And then I’m going to load the baby names .csv file, which I hope to God I put in the archive.

And so you see that ran that.

It will say busy someplace on your notebook.

And then if you press, well, there’s a command key; I think it’s Ctrl-M, B, which creates a new cell.

So that’s how I did that.

We’ll just edit this cell and type names.

And you can see what’s happened here is it’s loaded the data into a pandas

data frame object, which tells you data columns.

So year, name, prop, sex, soundex.

It tells me the index contains the integers from 0 up to about 258,000.

So it’s 258,000 elements long.

And so now the way that this data structure works is it’s like a dictionary.

So if we want the year column, we just do name sub year.

And it gives me an array.

This is now a series object.

And you can see that on the left we’ve got the row number,

which is just the observation number in the data set.

And then the values go from 1880 through 2008.

So it looks like the data is already in order.

So we look at the full data set.

So now what I’m going to do here, so you can get the year column,

name sub year, name sub sex.

You get the sex column.

So you can look at the data like that.

And if you look at the proportion column, it has floating point values.

So this is actually a really big data set.

I mean, a quarter of a million observations is pretty big.

So if I do sub colon 100, so slice off the first 100 elements, whatever those are, you can see that that gives me the first 100 rows of the data set.

So now you can see you’ve got something that’s much more pleasing to the eye.

You can look at the data like that.

So we’ll play around with this data for the next 20 minutes or so.

So when I do name sub sex,

so suppose I just wanted to select out the boy names.

So I could say name sub sex equals equals boy.

And so that gives me now a Boolean array.

And if I pass that to, if I do names sub,

and then name sub sex equals equals boy,

that selects out the section of the data set which has the boy names.

So I’m going to go ahead and assign that to the boys variable

and going to do the same thing and assign it to the girls variable.

So names sub sex equals equals girl.

And so now we split the data set into two pieces.
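
In code, with a tiny stand-in for the baby names frame (the real file has year, name, prop, sex, and soundex columns):

```python
import pandas as pd

names = pd.DataFrame({'year': [1880, 1880, 2008, 2008],
                      'name': ['John', 'Mary', 'Jacob', 'Emma'],
                      'prop': [0.081, 0.072, 0.011, 0.009],
                      'sex':  ['boy', 'girl', 'boy', 'girl']})

boys = names[names['sex'] == 'boy']    # Boolean mask selects matching rows
girls = names[names['sex'] == 'girl']
```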

And so the idea with,

I’m just going to shut down TweetDeck,

otherwise we’re going to keep getting PyData retweets.

There we go.

Yes, shift enter to execute cells.

So now from here,

suppose that we wanted to actually index this data set

by the unique identifier.

So if we look at the boys data set,

so let’s do colon 50, so get the first rows.

So you can see that each observation here is uniquely labeled by year and name.

So in some years, some names aren’t going to appear.

Like later, the names that people are giving their children in 2008

are different from the names they were giving their children in 1880,

as we’re about to see.

But if we wanted to, let’s say, iterate through the years,

if we do boys sub year equals equals, let’s say 1956,

and we select out all the boy names for 1956,

well, we get 1,000 observations.

I’m going to print this guy, and then I’m going to do .to_string, so that will just force it to output everything to the console.

And so you can see here are all the boy names from 1956.

Suppose we wanted to order.

It looks like they’re already ordered by proportion in descending order.

So you can see that Michael was the top boy name in 1956.

We might want to, I’ll show you how we can reorder these.

We can do sort operations in a second.

So one way we could do manipulation of this data set is we could add a level.

We could add indexing.

So I can say boys set index, pass the name and year columns, and that’s going to create a new object.

So I execute that cell.

You can see that it’s removed the name and year columns, and now the data set is indexed by name and year.

So if we do colon 100 now, you can see that the output is a little different.

So now the name and year have been moved down into the row index,

and we’ve just got the prop, sex, and soundex columns remaining,

so that if I grab the proportion column from that,

that I’ve got now a series,

but it’s got the indexing information connected to it.
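
Roughly, continuing with stand-in data:

```python
import pandas as pd

boys = pd.DataFrame({'year': [1880, 1880, 1881],
                     'name': ['John', 'William', 'John'],
                     'prop': [0.081, 0.080, 0.081]})

idf = boys.set_index(['name', 'year'])  # columns move into the row index
idf['prop']  # a Series that carries the (name, year) MultiIndex with it
```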

So from there, I mean, the reason

that the indexing is useful, suppose

that I wanted to select out all of the data,

basically all of the proportion of children

that were named a particular name.

So you can see here, I have indexed it by name and year.

And suppose I want to select out all the Wes data.

So to do that, well, there are a number of ways

that you can select out data based on the row index.

So the most useful is the special ix field.

So if we do idf.ix, that gives us a little object that enables us to index the rows and columns like an ndarray.

So here, I can do .ix square bracket Wes.

And that selects out the slice of data that corresponds to the Wes group.

And that’s a very fast operation.
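
One caveat for anyone following along with a current pandas: the .ix field shown in this 2012 demo was later removed, and .loc is the modern label-based equivalent. A sketch with stand-in data:

```python
import pandas as pd

boys = pd.DataFrame({'year': [1880, 1881],
                     'name': ['John', 'John'],
                     'prop': [0.081, 0.080]})
idf = boys.set_index(['name', 'year'])

idf.loc['John']          # the whole 'John' group; the name level is dropped
idf.loc['John', 'prop']  # just its proportions, indexed by year
```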

So it looks like the Wes name only made, let’s see, I’m actually not printing it all out.

So it looks like the Wes name only made it into the top 1,000 between, I guess, the last year was 1972, which makes me a little bit sad.

So let’s do a slightly more popular name, like Peter.

You can see that the data goes all the way through to 2008.

And you can see what’s happened here

is when I selected out the Peter group

that before we had something that

was indexed by both name and year.

And it’s gone and dropped the name level from the index.

So if I just selected out the, my keyboard

has stopped responding to me.

What’s going on here?

So if we select out the proportion column,

that now gives us the proportions for Peter

through all the years.

And then we could plot that.

So if we do dot plot, that gives you a matplotlib plot,

which it gives you the incidence of the proportion of babies

named Peter over time.

So you can see it peaked around 1960.

And it’s been a steady decline.

So people aren’t, well, actually the story

is more complicated than that.

So it’s not just that people aren’t naming their children

Peter as often as they were.

So here’s another cell where I selected out the Travis

group and the Peter group.

So you can see who I’m interested in learning

about their names.

So I select out Travis.

The proportion, I plot that.

Peter, the proportion, I plot that.

And so then you can see here’s a plot of, well,

I need a legend here.

So I’ll do plt.legend.

That’s matplotlib pyplot.

Oh, that’s not so great.

Label equals Travis.

I thought this worked yesterday.

Label equals Peter.

And so you can see that people were

really enthusiastic about naming their children Travis

around the 1970s and 80s.

And then it’s been a steady decline since then.

So this kind of raises a question.

Are people really just not so happy about these names?

Or is something different happening with people’s baby

naming behavior?

So I thought I would ask, OK, if you take the top names that people are giving their children, how many unique names does it take until you reach 50% of all the babies born?

So I looked at this data set.

I said, oh, that would be interesting to get that number.

And so I can walk you through the code

necessary to achieve that.

So if we look at the data set, so let’s

take the boy names, let’s see.

So the plot function is assuming that it stays the same

when it’s not listed in the year, right?

That’s right.

That’s right, yeah.

Yeah, I think it just connects the dots.

Yeah, really, I guess what you should really do

is, if there were none in a year,

I guess really it doesn’t appear in the top 1,000, yeah.

But what you could do, actually, so if we had the Wes group and the proportion, so what you could do here is you could say reindex.

And we could do range 1880, 2009.

And what that does is that reconforms the series

to the range 1880 through 2008.

But then you could, on that object, then call fillna with 0.

So that adds 0s where there are NAs.

And then you can see if you plotted that.

I mean, the reality is it’s probably not 0,

but it’s some small number.

So you have that ability to do that.

So the reindex function enables you to reconform the data to a new index.

So that’s exactly what has happened here.

So I’ve taken a bunch of data that was relatively sparse.

I want to make it exactly match the range 1880 through 2009,

fill the NAs with 0, and then plot it.
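
Sketched out, assuming a hypothetical sparse series of proportions for a name:

```python
import pandas as pd

# Hypothetical: a name that cracked the top 1,000 in only two years.
wes = pd.Series([0.0010, 0.0008], index=[1970, 1972])

full = wes.reindex(range(1880, 2009)).fillna(0)  # conform, then zero the NAs
full.plot()  # needs matplotlib installed
```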

Can we do dot prop instead of brackets?

It’s like you’re using column labels in some cases.

Like, not only in some cases, but like dot.

Can I?

Sorry, could you say that again?

Yeah, when you’re saying prop in brackets.

Yeah.

Instead of that, can we say dot prop?

Yes.

Yes, you can.

So assuming that the column name forms a valid Python variable

name, you can do dot to access it as an attribute

instead of having to pass a string to get item.

So to kind of walk you through how,

so I had this question of how do people’s baby naming

habits change through time?

So I kind of wanted to walk you through the machinery necessary

to answer a question like that.

So the first question is, how do we group this data set by year?

So I’ve got a year column, first of all.

So it gives me the years, so the year for each row.

And I want to group the data by year.

So I’m going to call the group by function and pass year.

And now the idea here is that it’s taking the string year,

it’s looking up that column, and it says, OK,

I want to group this data set by year.

And so what actually happens here

is that you create a group by object.

So I’m going to call that grouped.

And now from here, there are lots of things

that you can do with the data set.

So you have this object.

You could, say, compute the, well,

an interesting thing might be, suppose

we want to select the prop field.

And what’s the median proportion of, well,

what’s the median value in the prop field for each year?

So if I do sub prop, what that’s doing

is telling the group by object, select the prop field.

And then for each group, apply the median function

to that group, and then assemble that

into a series, which is now labeled by year.

And so you could see, you could then plot that,

and you can see how the median proportion is

changing over time.
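
That pipeline, sketched with the stand-in frame from earlier:

```python
import pandas as pd

names = pd.DataFrame({'year': [1880, 1880, 2008, 2008],
                      'prop': [0.081, 0.072, 0.011, 0.009]})

grouped = names.groupby('year')  # a GroupBy object; nothing computed yet
grouped['prop'].median()         # per-year medians, indexed by year
```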

So this is kind of interesting that it increases.

So one thing you might say is, well, as I will show you, the diversity of names that people are picking has been increasing a lot since the 1980s.

But it certainly moved around over the course of the century: people got more creative during the 1910s, and then things got not so great, they weren’t thinking very hard about the names they were giving their kids, and now things are great again and people are getting very diverse.

But a more precise question would be: how many names does it take to get to 50% of all the children?

So that’s a more complicated question.

So now we’re going to go back to the grouped object.

Yeah.

So it’s just a view on the data.

Yeah.

So if we take this grouped object, the idea is that we’ve got a number of functions here.

So we have aggregate, which enables you to aggregate a field, which is what I showed

you.

So I could have said aggregate and passed a function, lambda x: x prop median.

So that is going to, for each subgroup of the data, take the prop column and call median on it.

I guess not.

I’ve got to think about what happened here, but the most general function you can do is apply.

So apply takes a function, applies it to each group, and then, given what the results are, it tries to assemble the results back into a data frame or a series.

So suppose that we wanted to take the top five names in each group.

So we can apply a function.

So on the chunk of data, I’m going to call that X.

I’m going to say x sort index, so basically reorder the data using sort index, but by the prop column.

So that was going to put the highest-occurring baby names at the end, because it orders in ascending order.

So I’m going to say ascending false.

So you want them at the beginning.

And then do colon five to take the top five names in each year.

So you can imagine in your head, you know, you’ve got these groups of a thousand baby

names per year.

I’m sorting them in descending order by the proportion and then I’m taking the top five.

I’m going to take the top two or the top, what’s going on here?

I’ll take the top one actually.

So just taking the top one so, you know, pandas kind of defends you from, you know, spitting

out a hundred thousand rows of data into your console.

So that’s what’s happening there is you can actually control that.

When I selected out the top five, you know, I guess it decided that 645 rows was too much

to display.

So I’m just going to do colon one to select the top one, call that function.

You can see that from 1880 onward, people really liked John for a lot of years, and then it became Robert, and then James at some point.

So that index number is actually the observation number in the dataset.

So what happened here is that it took the result of each group and added on the year.

So now, in the result, we’ve got an index with two levels, where the first level is the year and the second level is the observation number from the dataset.

So you could get rid of that if you wanted.

This is the top name per year.

The full dataset is the top one thousand, yeah.

So if I’d done the top five, let’s say, and I’m just going to select off the last one

hundred rows of the data of the result, you can see that in 2008, the top five names were

Jacob, Michael, Ethan, Joshua, and Daniel.

I actually didn’t know this, that Jacob is the top name for a lot of years.
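
A hedged version of that apply, using sort_values (the modern spelling of the sort index by prop call in the demo) and stand-in rows:

```python
import pandas as pd

names = pd.DataFrame({'year': [1880, 1880, 1881, 1881],
                      'name': ['John', 'William', 'John', 'Charles'],
                      'prop': [0.081, 0.080, 0.081, 0.079]})

def top(group, n=1):
    # highest proportions first, then take the first n rows
    return group.sort_values('prop', ascending=False)[:n]

names.groupby('year').apply(top)  # result gains a (year, row number) index
```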

So anyway, so this gets us closer to answering the question, so we can split the data into

groups like this, but now suppose that we had the data for the boys, boys year equals,

let’s say, 2008, and we have the proportions, and these all happen to be sorted.

So now this is a NumPy array, so I can call the NumPy cumsum function on that, which gives me the cumulative sum.

So let’s call this cumsum; I’ll sort of walk you through how this looks.

Then I’m going to call the NumPy searchsorted function with .5, which gives me the index of the value in the cumulative sum where .5 is reached.

So in 2008, it took the top 127 baby names in order to name 50% of the children, but

if you go back to 1880, that number is only 15.

So the question is, how did this evolve over time?

So I’m going to take this little exercise with the cumsum and replace this with just a dummy variable, group; I can even make it generic and make this percent number generic.

And so I have this get num function: I take the group of data, I sort index by prop in descending order, I take that ordered group, select the proportion, cumulative sum, searchsorted by the percent that I’m interested in.
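
Put together, the function looks something like this; the names are illustrative, and the +1 converts searchsorted’s zero-based insertion index into a count of names:

```python
import pandas as pd

def get_num_at_percent(group, q=0.5):
    # sort descending by proportion, cumulate, find where q is crossed
    props = group.sort_values('prop', ascending=False)['prop']
    return props.cumsum().values.searchsorted(q) + 1

names = pd.DataFrame({'year': [1880] * 4,
                      'prop': [0.40, 0.30, 0.20, 0.10]})
names.groupby('year').apply(get_num_at_percent)  # -> 2 names reach 50%
```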

So a bit of a mouthful, but you could probably do it in SQL; it might be difficult, but you could do it.

Yes?

Yes?

Oh yeah, sure.

Thanks.

Sorry about that.

Yeah.

I didn’t know about that feature.

So.

I’m not.

Excellent.

Okay.

So then I made it a little simpler even: get num at percent. I pass the data, the percentage; I group by year, and then I apply this aggregation function and pass the percentage that I’m interested in.

So here, I want to get the number of boys it took to name 50% of the children through

time.

I’m going to plot that, call that boys, do the same thing with the girls, label that

girls.

And so you can see a rather alarming trend: people were always a bit more creative about girl names over time.

And then around 1980, you know, I’m not sure what happened.

Cultural revolution, everyone wants to, it’s like, oh, my child is special, my child is unique.

You’ve got to, you know, got to be more creative about names.

And so it’s just been a steady upward trend.

So, you know, if you extrapolate from here, I don’t know, it’s like we’re going to have a hard time saying each other’s names pretty soon.

We’ll have to find like random sequences of characters, you know.

So anyway, so, you know, this was actually kind of complicated in retrospect now that

I look at this.

But what I wanted to explain to you is that pandas offers this sort of very integrated and powerful tool chain for expressing operations on indexed data, enabling you to do very fine-grained control over operations on groups of data.

Being able to write those functions in Python in a nice way and then have that sort of integrate

with the rest of the tool chain.

So it makes plots in matplotlib really easy.

Basically, you just have all the functions that you need to answer the questions that you might have about a data set.

And secondly, you know, just having a view of the data like this, where I say, I’ve got this data, I know that it segments into years, and I just want to look at the top five observations within each group.

And once you understand each of the tools along the way here,

basically you’re just assembling these operations.

Say you need to sort, sort, select, apply, select,

and then you’ve got this.

So you concatenate these operations together

and it’s starting to look a little bit functional.

Then you can really crunch down a big data set

and get some meaningful answers out of it pretty quickly.

So I’m trying to think if there’s any fun stuff that I

had here.

So I also have a data set in here

which you can play around with.

Now that you have all these notebooks,

you can play around with the data.

So I have a data set with the number of births per year.

So I have year, sex, and then the number of births.

And so we can take that, merge that,

and so do a join operation between the names

data and the births data.

And then if you actually want to know

the number of people that were named that name in the year,

we can take.

So the merged data set now looks like this.

And so what’s happened behind the scenes

is it’s looked at these two data frames

and it said, OK, the common column names are year and sex.

So join on those two keys.

But I could have explicitly said join on sex and year.

And that would be a very explicit database join

operation between these two data frames.

And so you can see the result now has everything

we had before, but it also has the number of births.

So if you wanted to get the number of people in each year,

take the proportion times the number of births.

And I just called numpy floor on it to get an integer.

So I run that.

And see, I assigned a column to the data frame.

So I’ve got now a new column, persons.

Maybe the right word would be people.

But you can actually then sum things up.

You can then group by name, select the persons column, sum that up, and order.

And you could get the number of people born with each name

over the whole data set.
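
A sketch of that join and aggregation with stand-in numbers:

```python
import numpy as np
import pandas as pd

names = pd.DataFrame({'year': [1880, 1880], 'sex': ['boy', 'girl'],
                      'name': ['John', 'Mary'], 'prop': [0.081, 0.072]})
births = pd.DataFrame({'year': [1880, 1880], 'sex': ['boy', 'girl'],
                       'births': [118400, 97600]})

merged = pd.merge(names, births)  # joins on the shared year and sex columns
merged['persons'] = np.floor(merged['prop'] * merged['births'])

merged.groupby('name')['persons'].sum().sort_values(ascending=False)
```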

So you can see that there were 5 million people named James,

more than John, actually.

So James is still a pretty popular name, it turns out.

And so, yeah, it’s a pretty simple way to aggregate the data set up like that and then to look at it and work with it.

So anyway, thank you guys.