Building Rich, High Performance Tools for Practical Data Analysis
This transcript and summary were AI-generated and may contain errors.
Summary
In this Strata Conference talk, I share my approach to building data analysis tools.
Philosophy
Data preparation represents 80-95% of most analysis workflows. I argue the data science talent shortage can be addressed by making existing practitioners more productive through better tooling. As I put it, “big data is largely about counting things”—much of the infrastructure focuses on reducing enormous datasets into manageable forms.
Building data tools is about human-computer interaction and user interface design, not just algorithms. I advocate for APIs that “fit your brain,” citing Uncle Bob Martin’s principle that good code “looks like you expect it to.” I warn against “parameter soup”—overly general tools with too many configuration options.
Data Structures
Proper data type systems must handle messy data. Supporting missing values is critical: “most data sets that I encounter have missing values”—often overlooked in academic treatments but essential for real-world work.
Array programming builds on concepts from APL, Fortran, and MATLAB. Arrays enable efficient batch operations and memory-efficient views without copying.
Tables—collections of arrays with metadata—form the primary abstraction. Hierarchical row and column labels can represent multi-dimensional data in tabular format, supporting pivoting and reshaping.
Split-Apply-Combine
I highlight Hadley Wickham’s work on split-apply-combine: split data by grouping keys, apply functions to each group, combine results. The implementation relies on hash tables and sorting—“two of the most important things that I utilize on a day-to-day basis.”
Performance
Different implementations of identical algorithms can vary dramatically in speed. Investigating why the J language outperformed other tools revealed that performance often comes from choosing algorithms suited to specific data characteristics, not just optimizing general-purpose implementations.
The demo analyzes USDA nutrient data, showing fluid chaining of operations: grouping, selecting, computing statistics, and plotting.
Vision
Every programming language should eventually have robust structured data processing capabilities. While syntax differs, fundamental operations share common patterns.
Key Quotes
“Working with data and preparing data ends up being 80% to even 95% of the work that’s involved in a data analysis workflow.”
“We like to joke in the data community that big data is largely about counting things.”
“Data tools, I like to think about it being the intersection of computer science, which is the algorithms and the underlying data processing code which actually crunches the data and reshapes it and moves it around. But the bigger part of that is how you interact with that code… So it’s more about human-computer interaction and user interface design as much as just the underlying algorithms.”
“You know it’s a good API if it fits your brain… A sign of good, clean code is if it pretty much looks like you expect it to.”
“There’s very general things you’re doing with the data set and you might have 12 different knobs that you can turn and that can also be very overwhelming when you’re just trying to get some work done.”
“Most data sets that I encounter have missing values.”
“Hashing and sorting, as far as implementing data tools, are two of the most important things that I utilize on a day-to-day basis.”
“The main solution to the data scientist shortage is to make existing data scientists more productive.”
“I would hope that every language has a core set of structured and semi-structured data processing tools, so that you can be programming in any language and still express operations on the data in a very natural way.”
Transcript
The webcast today is sponsored by Accenture. And we’d like to say a very big thank you
to Accenture and let you all know that knowledge is key. But knowledge without insight
isn’t much of an advantage. Accenture has more than 257,000 people serving clients
in more than 120 countries. It’s a global management consulting, technology services,
and outsourcing company that helps its clients use analytics to turn data into insight,
insight into action, and action into tangible results. That’s high performance delivered.
You can learn more about Accenture, how it helps its clients navigate and harness the power
of big data by building big data infrastructures. Folks, today we have Wes McKinney
presenting Building Rich High Performance Tools for Practical Data Analysis.
Wes gave this talk back in the fall at our Strata Conference in New York.
And the talk was so well received and attended that we asked Wes if he could present the event
for you all today. And we’re really excited to have him with us.
It is my pleasure to turn the program over to Wes for his presentation. Hello, Wes.
Wes McKinney: Hi, Yasmina. Thanks for the introduction,
and thanks everyone for joining us for this webcast. As Yasmina said, I gave this talk
back at the Strata Conference back in October, which already feels like a long time ago.
And I would describe the talk a little, I guess, as more about my philosophy and process
for thinking about building data analysis tools and some of the things that I’ve learned
along the way in my work in building data tools. So I’ll be sharing that with you.
So thankful to Accenture for supporting the webcast.
So the structure of the talk, I’ll tell you a little bit about my background
and what I’ve been working on for the last few years, and how I got involved
with building data tools and the kinds of things that I think about.
I’ll talk about some of the motivations for why to, I guess, why are we interested in the philosophy
and what are the various ingredients that go into building data analysis tools,
and what kinds of problems are we trying to solve. And then I’m going to talk specifically
about some of the technical details that go into building data tools,
and I’ll show you some examples that come from my book.
So my background is, well, I graduated from MIT back in 2007, and I joined the finance world
and worked for AQR Capital Management for about three years. And while I was there,
I got involved with building data analysis tools for Python,
and started what would later become the pandas Project.
And over the last two or three years, pandas has grown into a very successful project
in large part because I’ve been putting a lot of development work into it.
Now there’s a very robust community around the project, and it’s being used very productively
in many areas of the industry. I started out in finance, but pandas is very popular
in essentially anywhere that’s working, any industry that’s working with data,
and almost anyone who’s using Python is using pandas, I find.
I published a book which also came out around the time of the last Strata.
It’s called Python for Data Analysis. And I intended for the book to be
a very accessible introduction to the modern scientific computing stack in Python.
So we do things like NumPy, and IPython, and Matplotlib.
So it’s the core scientific tools in Python. And about 60% of the book or more
is really a robust and practical introduction to pandas.
I’m looking at various important topics in data manipulation and data preparation,
and showing you how you can use Python to productively integrate
into your analysis workflow. It’s not a treatise on data analysis methods per se,
but more about the kinds of tools in Python that you can start using right away
to incorporate into your analysis.
So I spend a lot of time thinking about data tools.
And I guess working with data and preparing data ends up being 80% to even 95%
of the work that’s involved in a data analysis workflow.
So you see various estimates from studies people do on data analysis.
And I’m very interested both in the big data and the small data.
In the case of big data, a lot of the analytics that you’re doing
is just taking very, very large data and reducing it to a small or medium size
so that you can reshape it, slice and dice it, and create models.
So a lot of the modeling and really agile data work happens
at what I would call the small and medium data level, not typically in big data tools.
There is a lot of big data machine learning and big data analytics.
A lot of big data infrastructure is geared around reducing enormous data sets
into a manageable form that can be visualized, and sliced and diced, and modeled on.
So we like to joke in the data community that big data is largely about counting things.
I’ve already said that in a talk, and I always like to use that saying.
So really we’re talking about simplifying data wrangling.
So when I say data wrangling, I’m talking about all of the work that you need to do
to take a raw data set which comes in in a CSV file or comes out of a database
or a series of database tables or a series of CSV files,
and all of the work that you need to do to integrate that data together,
to clean up data that’s not in the form that you need
or contain data that’s unclean or needs some regularization or processing.
So kind of all of the tools along that pipeline I’ve been focused on,
identifying what are the key features and things that trip up users
and slow down your data analysis process.
So this really ends up coming down to a lot of user interface design.
So data tools, I like to think about it being the intersection of computer science,
which is the algorithms and the underlying data processing code
which actually crunches the data and reshapes it and moves it around.
But the bigger part of that is how you interact with that code
and what are the sequences, what’s the sequence of commands that you type into,
type into the terminal or you type into your text editor
to describe the operations that you’re doing with the data.
So it’s more about human-computer interaction and user interface design
as much as just the underlying algorithms and doing things very fast and efficiently
and in a scalable manner.
So often when you sit down and you’re working with a data set,
especially when you’re working with an unfamiliar tool set,
you spend a lot of time thinking about how all of the pieces fit together
and how to efficiently iterate from point A to point B in the analysis process.
So, of course, there are costs to making mistakes in user interface design.
So I like this sort of big data joke that in big data,
if you make a coding error, it can end up costing you a great deal.
I think that’s less important in small and medium data
where if you enter the wrong command,
you only have to wait a few seconds to see that it was the wrong command.
But I think the same kinds of design considerations are very important
in big data tools and small data tools.
So very concretely, what that comes down to is API design.
And so there are lots of things to consider in API design.
A couple of things you hear about is that, especially in the Python world,
we like to say that you know it’s a good API if it fits your brain.
So that, you know, the functions that you call, right,
and the parameters that you pass into them.
It’s a sign of, I think Uncle Bob Martin says that,
you know, a sign of good, clean code
is if it pretty much looks like you expect it to.
So I like to, you know, you know that you’ve designed
a good tool if people don’t complain about it
and it sort of seems like common sense
and it fits how you think about the problem.
Of course, there’s also risk with API design.
And especially in cases where you might have
a very general tool with many different options
and end up in what I would call a parameter soup
or parameter hell where, you know,
there’s very general things you’re doing with the data set
and you might have 12 different knobs that you can turn
and that can also be very overwhelming
when you’re just trying to get some work done.
So there’s another side of it that syntax matters
and there’s a cost to, sort of a mental cost
in looking at code and especially code
that other people have written and parsing,
sort of mentally parsing what the code does
and visualizing, especially in the course
of working with data, visualizing the data operation.
And sort of how everything works.
You can definitely recognize sort of bad syntax.
These are kind of two very extreme examples.
Each of these lines is an implementation of quicksort,
so this quicksort algorithm.
This is a bit, you know, cherry picking
and fans of APL languages might complain at me
but the top one is in the J language
and the bottom one is in the K language
which are both modern implementations
of the array programming language APL
from the 80s and 90s, or really the 70s, in fact.
There’s even more, you know, esoteric languages
where you can implement things where the code,
you know, just stops making any sense at all.
You can fill in, you know, the stars here.
This is a real programming language
but it’s largely invented
as a completely esoteric programming language.
So there’s definitely a lot of distance, you know,
you can go far off in the extreme
of very esoteric, difficult programming languages
or certainly difficult programming interfaces.
And, you know, I love this quote from Guido van Rossum
who is the inventor of Python,
that user interface can handle only so much complexity
or it becomes unusable.
So focusing on building simple tools
versus complex tools, I think, is generally
the common sense way to do it.
And of course there are, you know,
folks among us who like very complex, precise tools
but there is a place in between.
So in terms of actually implementing data tools
if you were starting on a blank slate
and building yourself a set of practical
data analysis tools, there’s the API
and sort of user interface side of things
of how you, the user, you know,
what’s the code going to look like
that you’ll write to do analysis
with the tool that you’ve built.
And then there’s, you know,
how you actually implement it
and that comes down to representing
the underlying data types,
whether you have, you know,
strings or booleans or dates
or other kinds of data.
Often, one of your primitive types is an array,
so you have an essentially multi-dimensional sequence of those data types.
Data structures also fit into that,
especially for writing algorithms
and the algorithms part of things
must be carefully thought about
to make everything fast.
So in terms of arrays, I guess nowadays
a lot of us are programming in R
and Python and MATLAB and we don’t think
too much about arrays.
We say, okay, I’ve got, you know,
a bunch of data in a data frame
or a table-like object.
We don’t think too much about
the underlying array representation.
So array programming started back in the 70s and 80s
with Fortran and APL,
and in the modern languages,
a lot of us are programming in R
and Python and MATLAB.
So when I say array, I mean
all of the elements in the array
have the same type, and the array can be
of arbitrary dimension.
So you could have, you know,
five or six dimensional arrays
if you really wanted to,
but more often than not,
we’re working with one or two dimensional arrays.
And you can think about scalar values
as being an array that doesn’t have any dimension.
It’s zero dimensional.
And arrays are important for expressing
batch operations on data.
So when you take a column in a spreadsheet
or a column in a table,
you’d like to be able to express
batch operations on a set of related data points
and that saves you from having to
write a lot of loops
and ending up with very verbose code
that involves unnecessarily iterating over the data set.
Arrays are also important when you are
slicing and dicing a very large data set
in that you can just look at a subset of a table
or a subset of an array as a view.
So proper array data structures
enable you to look at,
if you just want to look at, say,
the first hundred thousand rows in a table,
you can construct a view of an array
and express your data operations on that view
without having to copy any data.
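To make that concrete, here is a minimal NumPy sketch (not from the talk; the array contents are made up) of a batch operation and of slicing as a view:

```python
import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

# Batch (vectorized) operation: no explicit Python loop over the elements.
scaled = arr * 2.5 + 1.0

# Slicing gives a view that shares memory with the original, so "looking at
# the first hundred thousand rows" does not copy any data.
first_chunk = arr[:100_000]
print(first_chunk.base is arr)   # True: a view, not a copy
```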
So if you’re familiar with,
if you’re deeply familiar with R or Python or MATLAB,
there are various places where a lot of data gets copied
and that can be a source of slowness in your application.
Of course, most of you are probably familiar
with the data types you're working with.
There’s the low level,
numeric data types and strings and dates and times.
Depending on the language that you’re working with,
you can also have arrays of arbitrary objects.
You can have arrays of arrays
or arrays of pairs of dates and strings.
So you can have somewhat arbitrary complexity
and representation of the data.
There are a couple of other data types
that I would add to this list.
So one of them is a categorical variable.
So in R, the name for this is a factor.
For many of your basic data types,
you can compute a categorical variable,
which, I guess, another name for it
is an enumeration.
So if you know that you only have a fixed set
of distinct values that occurs in an array,
you can endow it with special properties
such as ordering
or where you can have a more compact representation
of that data.
And that will help you a lot more on down the road
when you’re slicing and dicing or aggregating the data.
You can take advantage of that special structure
that you know that you only have
a fixed set of categories in the data.
You can also compose arbitrary structures of these data types.
So things like tuples,
you might have a tuple of an integer
and a string and a date,
and that could be defined as a record
or structured data type.
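As a hedged illustration of these two ideas, here is a sketch in today's pandas and NumPy (the APIs have evolved since the talk; pandas originally called categoricals Factors):

```python
import numpy as np
import pandas as pd

# A categorical (what R calls a factor): a fixed set of distinct values
# stored as compact integer codes.
colors = pd.Series(["red", "green", "red", "blue", "red"], dtype="category")
print(colors.cat.categories)   # the fixed set of distinct values
print(colors.cat.codes)        # compact integer codes backing the data

# A record/struct dtype: each element is a tuple of (integer, string, date).
records = np.array(
    [(1, "AAPL", np.datetime64("2013-02-26"))],
    dtype=[("id", "i4"), ("ticker", "U8"), ("date", "M8[D]")],
)
print(records["ticker"], records["date"])
```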
On top of all of this,
if you've ever spent a lot of time
implementing data tools,
you know that in reality,
and I guess in real world data,
I find that most data sets that I encounter
have missing values.
And those might be you are parsing a CSV file
and you come to a row
where there’s just no values
for certain entries in a row.
And so you need to be able to encode that
and represent it in the data structure
and have a way to fill missing values,
to drop missing values,
to have a core set of operations
that can take into account the missingness of data.
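A minimal pandas sketch, with made-up values, of the core missing-data operations mentioned here:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, np.nan, 3.0, np.nan, 5.0])

print(s.isnull())      # boolean mask marking the missing entries
print(s.dropna())      # drop missing values
print(s.fillna(0.0))   # fill missing values with a constant
print(s.mean())        # reductions skip missing values by default
```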
Of course, there are problems with all of this
data type business, in that you can end up,
I guess I'd say, stuck in type soup,
where among integers and floating-point numbers,
there are different kinds of integers,
different kinds of floating-point numbers.
You can represent complex and real numbers.
There are different kinds of strings.
So things can get very complex very fast
if you aren’t careful.
And to make matters worse,
when you’re actually implementing the algorithms
underlying these things,
there’s many different ways
that you can represent the data at the machine level.
But you don't want the user,
the person who's writing code,
to have to think about all of this.
You don't want them to have to worry about
whether you have a 16-bit unsigned integer
or a signed 32-bit integer.
It represents a quantity,
and you can do numerical operations on it.
That’s really the only thing that matters.
But of course, for someone who’s implementing these things,
you do have to think about it
because there’s memory use considerations
and performance considerations.
So it is a necessary evil in some cases,
but I think that it should be hidden from the end user
to the extent possible.
So on top of arrays and our basic data types,
I find myself usually working with data in table form.
So here I've got a drawing of what is
sort of a more generalized table.
So you think of a table as just being a list of arrays,
and each array has a fixed data type.
So we can have, one of the arrays can be an integer,
another can be Boolean, another can be string.
A common simplification is that
the table only contains one-dimensional arrays.
So if you’re an R programmer and you use DataFrame,
DataFrame is a table that contains all one-dimensional arrays
but in some other areas,
in some other applications like econometrics,
it's very common to work with panels,
which are tables containing two-dimensional arrays.
You can also think about a panel
as being a three-dimensional array,
if all of the data have the same type.
You can add an additional layer of metadata
on top of the table by adding row and column labels.
So in this example, the column labels are just integers,
but we could also have information
about what each row in the table represents.
So here I put a simple two-level index,
which turns this table into what would look like
a pivot table in Excel,
or if you have another tool that builds pivot tables.
So you can have a data structure
which represents arbitrarily many layers of labeling.
So here we just have two levels,
where the first four rows belong to the A group
in the first level of the row labels,
and the second level just has the integers one, two, three,
four.
So here's a sort of a real example from pandas
of a table that has both hierarchical row labels
and hierarchical column labels.
So if we look at the raw data,
these are some aggregated tip data,
and you can see that in the columns we have days,
and then we have aggregate statistics,
and in the rows we have the time and smoker variables.
And if you define your data structure
with in this very general way as a sequence of arrays
with arbitrary column and row labels,
then having this more complex data structure,
which represents pretty high-dimensional data here,
we’ve got four layers of labeling,
you can represent it in a tabular format
that can be very nice to work with.
You could also have represented this data
in more of a traditional flat spreadsheet-like form
that could be loaded into a relational database,
and this is also a valid way of storing this data.
But this pivoted format can also be very nice
to work with depending on your application.
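A hedged sketch of this kind of pivoted, hierarchically labeled table, using current pandas and a small made-up tips-like frame rather than the actual table shown in the talk:

```python
import pandas as pd

tips = pd.DataFrame({
    "day":    ["Thur", "Thur", "Fri", "Sat", "Sat", "Sun"],
    "time":   ["Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
    "smoker": ["No", "Yes", "No", "Yes", "No", "No"],
    "tip":    [2.0, 3.5, 3.0, 4.0, 2.5, 3.0],
})

# Rows labeled by (time, smoker), columns labeled by (statistic, day).
pivoted = tips.pivot_table(values="tip", index=["time", "smoker"],
                           columns="day", aggfunc=["mean", "count"])
print(pivoted)
```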
Now, of course, you can have all of these
very complex data structures,
but you need to have very, very simple code
to express operations on them.
So there’s a big set of data operations
that you can use on tabular data sets,
and I’ll talk about a couple of them.
So this is kind of an exhaustive list of things you can do with tables.
So let’s just look at a simple example
of table concatenation.
So on the left side here,
I have a set of stock price data
over a set of dates,
and on the right, some stock volume data,
and something that you might want to do
is to join together these two data sets
into a single table,
and still be able to distinguish
the price data from the volume data
and refer to Apple’s price or Johnson & Johnson’s volume.
And we have the additional complexity
that the row labels of the tables are different.
So we also need to introduce some missing data
into the right table
because it doesn’t have any data
on the 13th or 14th of September.
So if you were using the pandas library,
you could express this operation
with the function concat,
and I pass a list of the left and the right tables,
and I’m concatenating along axis one,
which is the column, axis zero is the rows.
And then I have this keys argument here,
which says that I want to add
an additional level in the columns
to be able to refer to the two tables
that are being joined together here.
So we end up with this single table,
which now has column labels consisting of
price and volume as the first level,
and the stock tickers as the second level.
And so now we can refer in a simple way
to any individual data point
by either price and volume, stock ticker, and date.
And in the bottom right corner,
you can see that it's introduced NaN (not a number)
missing value indicators in the table.
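A sketch of that concatenation in pandas, with made-up prices and volumes standing in for the tables on the slide:

```python
import pandas as pd

dates = pd.date_range("2012-09-10", periods=5, freq="B")

price = pd.DataFrame({"AAPL": [670.0, 675.1, 669.8, 671.2, 683.0],
                      "JNJ":  [68.1, 68.4, 68.0, 68.2, 68.5]}, index=dates)
# The volume table is missing a couple of the dates.
volume = pd.DataFrame({"AAPL": [1.2e7, 1.1e7, 1.3e7],
                       "JNJ":  [7.5e6, 8.0e6, 7.2e6]}, index=dates[:3])

# axis=1 concatenates along the columns; keys adds an outer column level so
# price data and volume data stay distinguishable. Missing dates become NaN.
combined = pd.concat([price, volume], axis=1, keys=["price", "volume"])
print(combined["volume"]["AAPL"])
```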
Here’s another example of a primitive table operation.
So looking back at this joined table
of price and volume data,
something I might want to do is a reshape operation
where I pivot the price/volume level down into the rows.
So now I have one column for each stock,
and then two levels of row labeling.
So if you were using pandas, you could type,
I called this variable result,
and I call result.stack(0),
which says take the top level of the columns
and rotate it down into the rows.
And now there are two levels of row labeling on the table.
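And a sketch of that reshape, continuing the made-up price/volume example above:

```python
# Continuing the hypothetical `combined` table from the previous sketch.
combined = pd.concat([price, volume], axis=1, keys=["price", "volume"])

# stack(0) rotates the top column level (price/volume) down into the rows,
# leaving one column per ticker and two levels of row labels.
stacked = combined.stack(0)
print(stacked.head())
```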
I see a flashing Q&A tab, but nothing’s coming up.
So to actually implement these things
is not entirely straightforward,
but at the end of the day,
it comes down to array operations
and moving data around in an efficient way.
And in a lot of cases,
one of the most important sets of algorithms
for processing data has to do with set logic,
and that usually involves the use of hash tables and sets.
And the other class of extremely important algorithms
is sorting, and that, you know,
I would say that those two things,
hashing and sorting, as far as implementing data tools,
are two of the most important things
that I utilize on a day-to-day basis.
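A hedged illustration of those two workhorses, hash-based set logic and sorting, using generic pandas and NumPy calls rather than anything from pandas internals:

```python
import numpy as np
import pandas as pd

left = pd.Index(["a", "b", "c", "d"])
right = pd.Index(["c", "d", "e"])

# Hash-based set logic on labels.
print(left.intersection(right))
print(left.union(right))
print(pd.Series(["a", "e", "c", "f"]).isin(right))   # membership tests

# Sorting to bring equal keys together, e.g. before a grouped scan.
keys = np.array([2, 0, 1, 0, 2, 1])
order = np.argsort(keys, kind="mergesort")   # stable sort
print(keys[order])
```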
Another very important area of data analysis is doing group operations, and there’s this
fantastic paper by Hadley Wickham. So, if you don’t know Hadley, he’s an author of many widely
used R packages, in particular ggplot2 for visualization and the reshape2 and plyr packages.
And I love talking to Hadley because he… So, I developed pandas essentially without
looking at Hadley’s libraries. And it’s interesting to see how we’ve developed
some of the same design decisions and also some different design decisions.
And looking at each of those design choices, you can certainly… Decisions that Hadley made,
often I’ll say, okay, well, that’s a better way of doing things. And then we might disagree over
some other particular design choice that I made in pandas or that he made in plyr.
So, a lot of people are doing group operations by writing SQL. So, I put a simple SQL statement
here where we grouped by three keys and aggregated, computed the means of one column and the standard
deviation of another. And this is a very general class of data operations that goes far beyond
databases. It’s just a sort of general way of slicing and dicing tabular data.
So, in pandas land, sort of the way I think about grouping and aggregation is in two different
steps. So, first you have a table grouped by a set of keys. So, the keys could be any number
of things. So, often they’ll be just column names in the table that you want to use to group it.
You might have data that came from outside the table, or maybe it's being derived
from data that’s in the table. And so, you could have arbitrary arrays that you use to group the
table. Or you might have some complex logic, which is expressed in a function. And the function can
get applied to each row of the table to determine which group each row belongs to. So, once you
describe how you want to group the table, which in the case of an SQL statement will just be a
list of column names, then you apply a function or a set of functions to the columns. And
that might be something as simple as you want to compute the mean of one column and the standard
deviation of another column. But there’s more complex operations that you can also do.
And so, we would hope that depending on the form of the function or the form of the
aggregation that we get something useful back that we can work with nicely.
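A hedged pandas equivalent of that SQL-style aggregation; the key and column names here are made up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b"] * 3,
    "key2": ["x", "y"] * 6,
    "key3": [1, 1, 2, 2, 3, 3] * 2,
    "col1": rng.normal(size=12),
    "col2": rng.normal(size=12),
})

# GROUP BY key1, key2, key3 with AVG(col1) and STD(col2):
result = df.groupby(["key1", "key2", "key3"]).agg({"col1": "mean", "col2": "std"})
print(result)
```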
So, this is, I guess, a visual representation of how a group by works. So, you have an array of
data and an array of group indicators or group labels. So, we split the data into pieces and
we apply a function, which in this case, in this case, this picture is just a simple aggregation.
And then we combine that together into a final result table or labeled array.
So, here's a more complex operation (you probably can't read the table, it's very small on my screen)
where we group the table by a column. And then instead of aggregating, I want to
take each group of data, sort by the values in a particular column, and then select the highest
two values out of each group. So, if we have the raw table on the left,
I grouped by the size column, and then I select the top two. And then the result that we get
is a table that is now labeled by size. So, in this data set, there were only six
values of size. And then for each size, there are two rows that were retrieved from the table.
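A hedged reconstruction of this grouped top-two selection (and of the cut-based bucketing variant described just after this); the tips-style column names are assumptions, not the exact table from the talk:

```python
import pandas as pd

tips = pd.DataFrame({
    "size": [1, 2, 2, 3, 3, 4, 4, 5, 6, 2],
    "tip":  [1.0, 2.5, 3.0, 3.5, 2.0, 4.0, 5.0, 3.0, 6.5, 1.5],
})

def top(group, n=2, column="tip"):
    # sort each group by `column` and keep the n largest rows
    return group.sort_values(column).tail(n)

# Top two rows per distinct value of the size column.
print(tips.groupby("size").apply(top))

# The variant described next: discretize size into buckets with cut,
# group by the buckets, and take the top four rows per bucket.
buckets = pd.cut(tips["size"], [0, 3, 6])
print(tips.groupby(buckets).apply(top, n=4))
```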
So, here’s another example where instead of grouping by size, we want to first discretize
into two groups, into zero to three and three to six, and then group by those discrete buckets,
and then select out the top, well, here, four values in this case. So, rather than saying,
okay, I want to group by the size column in the table, I’m going to first discretize the
size column with the pandas cut function. And I say, okay, I want to divide into zero to three
and three to six. And then I'm going to group by my discretized size array, and then apply my top-n
function and select the top four values from each group. And in doing so, I get back a
table that now has two groups, zero to three and three to six, and the top four values
for each of those buckets. So, what's going on inside group by, I don't know if you've ever
thought about this, but the first thing that you have to do is identify which group each row in the
table belongs to. And that typically involves taking a column of data, converting it to a
categorical variable, which enumerates the groups and gives you a nice labeling.
If you have multiple key arrays or multiple key columns, the algorithm is a little more
complicated. You have to form tuples of those values and compute a categorical variable off
of those tuples so that you get all of the unique combinations of the keys. And then you can either
aggregate the data by doing a scan of the whole table, or, for more complicated
analysis, you may need to sort the data into contiguous groups and then go through each
chunk of the table and apply the function to it. So, kind of the gory details of what’s going on
inside these data algorithms if you’re morbidly curious. And you can get extra performance wins
by saving pre-computed values, but if you pre-compute the categorical variables,
that’s a lot of the work that needs to be done in a lot of cases.
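A rough sketch of that first step, computing integer group codes from the key values, using pandas' public factorize function; this mirrors the idea rather than the exact internals:

```python
import pandas as pd

keys = pd.Series(["b", "a", "b", "c", "a", "a"])
codes, uniques = pd.factorize(keys)
print(codes)     # which group each row belongs to, e.g. [0 1 0 2 1 1]
print(uniques)   # the distinct key values, one per group

# With multiple key columns, the same idea applies to the combinations of
# values; factorizing a MultiIndex gives one code per unique key tuple.
df = pd.DataFrame({"k1": ["a", "a", "b", "b"], "k2": [1, 2, 1, 1]})
multi_codes, multi_uniques = pd.MultiIndex.from_arrays([df["k1"], df["k2"]]).factorize()
print(multi_codes, multi_uniques)
```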
So, getting things to be fast comes from any number of things, and you have to
balance all of these considerations and evaluate, depending on
your use case, how to make things really fast. Choosing the best algorithm, I guess,
I would say is the most important thing, but there are other kinds of considerations like how your
data is laid out in memory, whether in row-oriented fashion or column-oriented fashion.
If you’re using an interpreted language like Python, you also have to worry about
the overhead associated with dynamic typing. So, you have little boxes, you can think of them
as wrapping your underlying data. So, if you’re doing a lot of data manipulation of box data,
there might be more overhead associated with boxing and unboxing the data
than actually doing the computation. So, in the case of libraries like pandas, the actual
heavy lifting is being done at a lower level and not in the interpreted language.
So, just to also give you a fun example to show that not all algorithm implementations are
created equal. You can have several implementations of the same algorithm,
and one might be a lot faster than the other just because of how the code is written. So,
I took a very, very simple function that gets used all over the place: computing unique values.
You have an array and you want to compute the unique values that are found in that array,
and this function is found in basically every data analysis language. I used the test case of
generating a big set of a million integers that only have 10,000 unique values.
So, I ran the unique function in R and in NumPy and in pandas and also in Julia.
I was playing around a lot with Julia back in October
and I was curious about this aspect of it.
And so everybody’s implementing the same algorithm
to compute the unique values
and you can see the performance is quite a bit different.
I also noted that the J array language
is able to do this almost an order of magnitude faster
than other languages.
After I gave this talk, I actually dug down
into the J source code to see what’s going on.
It turns out, I was a bit sad
when I found out why it’s a lot faster.
And it turns out that there’s an even better algorithm
for computing unique values of integer arrays
when the range is small.
So if you look at the values first
and see that the values all fall within,
let’s say zero to 10,000,
which I think they did in this case,
then you can use a simpler algorithm
that just does a sweep and count.
It does sort of a, I don't know what the word would be,
counting sort.
It doesn't require using a full-blown hash table.
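A hedged sketch of that comparison: a sort-based unique, a hash-based unique, and a counting-style approach that assumes the values fall in a small known range:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
arr = rng.integers(0, 10_000, size=1_000_000)

u_sorted = np.unique(arr)   # sort-based unique
u_hashed = pd.unique(arr)   # hash-table-based unique

# Counting-style approach when the values are known to fall in [0, 10_000):
counts = np.bincount(arr, minlength=10_000)
u_counted = np.nonzero(counts)[0]

assert set(u_sorted) == set(u_hashed) == set(u_counted)
```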
Okay, so I am going to go to a slightly interactive portion
of the webcast and see if I can share my screen.
So let’s try this.
So I’m just gonna run through some quick examples.
And this is a data set that I used in my book.
It’s from the USDA and it’s a data set
that contains nutrient information about 6,000 foods.
And I've seen a number of different analyses
and visualizations involving this data set
recently on the internet.
So basically the data is in JSON,
so I’m loading in the data.
So if you’re interested in web visualization,
you end up having to work in JSON most of the time.
So I’m loading the data into Python
with the JSON load function.
And that gives me a list of what represent
JavaScript objects.
And each object is a, you can see is a Python dictionary
that contains metadata about a single food.
So we have a food name, so we have a caraway cheese,
and then a food group, a food ID,
manufacturer if there is one known.
And for each food there’s a nested list of nutrients.
So there’s, it might be 40 or 50 different nutrients
for each food.
So the major nutrient groups like protein and fat,
but also down into the individual vitamins and minerals
that are found in the food.
So if you wanted to slice and dice
and do some analysis on this data set,
you have to, I mean, also consider that there are
6,600 foods in the data set.
So it's pretty big, not big data, but it's medium data.
So we need to do a little bit of work
to wrangle this into shape
and then be able to slice and dice it a little bit.
So if we were just interested in looking at the nutrients for a single food,
I can just take the nutrients for this caraway cheese.
And that gives me a list of Python dictionaries,
which are just little mappings.
And if I pass those, that list of nutrients
to the pandas data frame constructor,
that gives me a,
that gives me back a data frame, which is a table of data.
And here I call the head function,
which gives me the first 10 rows in the data frame.
So you can see there’s the food description,
the food group, the units and the number.
The value here
is the amount per 100 grams of food.
So if we were just looking at a,
just looking at a single food and we wanted to say,
okay, well, how many nutrients
from each nutrient group are there in the data set?
We could look at the group column from the data frame
and you call the value counts method.
And that returns back a labeled array,
which is the pandas series object.
It shows us that there are 54 amino acids and 42 vitamins.
And you can see all the other nutrient groups that are there.
We could also do group operations,
things like grouping by nutrient group.
And then let’s say we wanted to compute
the mean value by group.
We could then select the value column
and call the mean method.
And then that gives us back an array
containing the groups
and the mean values for each group.
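A minimal sketch of those two operations on a made-up stand-in for the per-food nutrient frame (not the real USDA data):

```python
import pandas as pd

nutrients = pd.DataFrame({
    "description": ["Protein", "Total lipid (fat)", "Glycine", "Zinc, Zn"],
    "group": ["Composition", "Composition", "Amino Acids", "Elements"],
    "units": ["g", "g", "g", "mg"],
    "value": [25.18, 29.20, 0.635, 2.94],
})

print(nutrients["group"].value_counts())            # nutrient counts per group
print(nutrients.groupby("group")["value"].mean())   # mean value per group
```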
So if we wanted to do some more analysis over all the foods,
we need to do a bit more work.
So just walk you through this very quickly.
I’m not hoping to teach you pandas for this demo,
just to kind of illustrate the tools a little bit.
So I write out the metadata that I’m interested
in extracting for each food from the database.
And so I pass the whole database
into the DataFrame constructor with those keys.
And that gives me back a DataFrame
with one row for each food,
and just has the food description group and ID.
So I'm going to rename description to food
and group to fgroup.
So I have a, I write down a mapping of renamings,
and then I use the rename function
to rename the columns with that mapping.
We get back a new table, which now has
food and fgroup instead of description and group,
which are more generic.
So here it's more specific.
And then I’m going to iterate through each food,
take the nutrients for each food,
add a column that is the ID for that food.
So we’ll have the same ID in each row
of that nutrient DataFrame.
And then I’m gonna concatenate together
all of those smaller nutrient tables
to create one big, one big nutrient table.
We look at it, it has about 400,000 rows.
And if we look at the ID column
and compute the unique values of the ID column,
we can see that we have 6,636 foods.
I also noted that in looking at this data set,
that there are some duplicate entries in it,
which is a pretty common kind of thing,
it turns out in practice.
And so we can call the duplicated method,
which examines all of the rows in the data set.
And when we call duplicated,
it gives us back a Boolean array that we can then sum.
And that shows us that there are 14,000 duplicate entries
in the data set.
So it’s a pretty unclean data set.
So what we can do is take this duplicated Boolean array,
put minus in front of it, and say nutrients.
So we’re taking nutrients
where the nutrient data is not duplicated.
So I’ll call this nutrients.
And so I run that.
Yeah, now we have a little smaller data frame.
We could also just use the drop_duplicates convenience method.
So I like to have convenience methods.
You could argue that, well, okay,
writing out nutrients, not duplicated,
is not that complicated.
But I like having a simple method,
especially one that can be tab completed,
for doing such a common operation as dropping duplicates.
So I see nothing particularly wrong with that.
So I’m also going to rename the description
and group columns in this nutrient data.
And now we can merge together the metadata
about the food and then the nutrients
because we added this ID column.
So now we can merge everything together
with the pandas merge function.
So we’re merging nutrients with nutrient metadata
on the ID columns.
We’re gonna do an outer join between the two tables.
So we get back now a single data frame that has all of the data.
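A hedged sketch of that merge step, with tiny made-up stand-ins for the two tables:

```python
import pandas as pd

# Made-up stand-ins for the per-food metadata and the long nutrients table.
info = pd.DataFrame({
    "id": [1001, 1002],
    "food": ["Cheese, caraway", "Cheese, cheddar"],
    "fgroup": ["Dairy and Egg Products", "Dairy and Egg Products"],
})
nutrients = pd.DataFrame({
    "id": [1001, 1001, 1002],
    "nutrient": ["Protein", "Zinc, Zn", "Protein"],
    "value": [25.18, 2.94, 24.90],
})

# Outer join of the nutrient rows with the per-food metadata on the id column.
ndata = pd.merge(nutrients, info, on="id", how="outer")
print(ndata)
```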
So we can do something like compute a cross tabulation of nutrient group and unit.
The cross tabulation is just a count,
so when you see a positive value in this table, it's just the number of observations in that
table that are of that pair.
So it turns out that each nutrient group, with the exception of, well, I guess there’s
several that have multiple units, but some of them, like amino acids, are only represented
in grams in the data set, for instance.
So if we look at, I just grabbed the row at position 30,000 from the table, you can see what
an individual row in the table looks like.
So we have nutrient, nutrient group, food group, ID, value, and units.
So then we can, potentially, go crazy from here once you know how to use pandas.
So suppose that we want to group by nutrient and food group.
Let’s compute the median value for each group in the data.
These are median nutrient densities by nutrient and food group.
And we could just look at, let’s say, zinc, and then order by value and make a bar plot.
So we chain together all of these operations.
You can think about, these might be totally foreign to you, but it’s nice to be able to
think about each of the data operations.
Here we group, we select a column, we compute median, we select the zinc values, we order
by value, and then we plot.
So it’s nice to be able to chain together these operations in a very natural way.
And then we get a plot, and we can see that the highest concentration of zinc is found
in beef products, and there's almost none to be found down in the fats and oils and beverages
categories.
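A hedged reconstruction of that chained pipeline, assuming a merged table named ndata like the sketch above; the method names are current pandas (sort_values rather than the older order):

```python
import matplotlib.pyplot as plt

# Group by nutrient and food group, take the median value, pull out the zinc
# rows, order them, and make a horizontal bar plot.
result = ndata.groupby(["nutrient", "fgroup"])["value"].median()
result["Zinc, Zn"].sort_values().plot(kind="barh")
plt.show()
```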
So a little bit of a vignette of some data tools that I’ve been involved with building,
but my general philosophy has been to try to minimize friction in writing code,
and the code should be as simple as possible for representing data operations.
So the summary I would say here is that, given that there's a shortage of data scientists that
we're hearing a lot about, the main solution to the data scientist shortage is to make existing
data scientists more productive.
And when you consider that a lot of the headache involved in data analysis has to do with just
cleaning and manipulating and preparing data, we really should be thinking hard about our
tools.
And whenever you run into a problem where you’re spending a lot of time doing some manual
cleanup in Excel, or you’re having to call somebody to help you out with your analysis,
think about how the tools could be better, and what that magical one line
of code that does the thing you need to do with that data would look like.
It's helpful to kind of write that down and think about it.
And if you can't find an R package or a Python package or something in pandas that does exactly
what you need, to reach out into the community and talk to people and to share your use cases
and problems that you’re encountering with your data.
I think today's tools can certainly be a lot better than they are now.
And I’m very excited to see that in the last four or five years, there’s been a lot of
progress to make data preparation, data manipulation, data analysis tools a lot more productive.
So, Hadley Wickham in the R community has done amazing things to make working with data
a lot easier with the reshape2 and plyr packages.
And I’ve sort of been Hadley’s counterpart in the Python community with pandas.
My hope would certainly be that in the future, whenever the future gets here, that in any
programming language, because the programming language that you’re using may not totally
be up to you, you may not be able to use R, may not be able to use Python.
But I would hope that every language has a core set of structured and semi-structured
data processing tools, so that you can be programming in any language and still express
operations on the data in a very natural way.
And I think, for the common structures and the operations that you need to
do on data, there's a lot of commonality that can be built across all languages in
how the API would look, whether you're programming in Haskell versus programming in C or Python
or R. I think the code may look a bit different, but the basic idea of what you’re doing to
the data is going to be very similar from place to place.
And I’m continuing to work on data tooling, and that’s kind of in my area of interest,
and I hate to see very smart people doing data analysis and finding themselves limited
by the tools.
I am working on a new project, which I hope to talk about more this spring, so I won’t
share too much about it in the webcast, but I’m actively working to build better data
tools and make data analysis more productive for more people.
So, thanks very much for listening in.
I share my thoughts occasionally on these topics on Twitter, and there’s a very active
community of data analysts and data scientists on Twitter, so if you’re not on Twitter,
I would encourage you to join in on the conversation and to find the thought leaders in our space,
Pete Skomoroch and Hilary Mason and Josh Wills and sort of people who are really active in
spreading knowledge and what people are doing in data science these days.
So, thanks everyone for listening, and thanks to Accenture, our sponsor.
I guess, Yaz, maybe I’ll let you take it over from here.
And as we close the program, we would like to say a very big thank you to Accenture for
sponsoring today’s webcast and let you all know that knowledge is key, but knowledge
without insight isn’t much of an advantage.
Accenture has more than 257,000 people serving clients in more than 120 countries.
I’m going to push out a little link to you all here so you can download their PDF.
It’s a global management consulting, technology services, and outsourcing company that helps
its clients use analytics to turn data into insight, insight into action, and action into
tangible results.
That’s high performance delivered.
You can learn more about how Accenture helps its clients navigate and harness the power
of big data by building big data infrastructures.
And here it comes right there on your site, folks.
All right, there’s a PDF that they’ve made available for you.
So, if you want information, please download that.
You can follow up with them there.
Thank you, Accenture.
Thank you, Wes.
Thank you to you all for attending.
This will conclude our webcast today.
Goodbye, everyone.