Python Data Ecosystem: Thoughts on Building for the Future

Keynote
Event PyData Berlin
Location Berlin, Germany
Date May 20, 2016

This transcript and summary were AI-generated and may contain errors.

Summary

In this keynote, I discuss both community-building principles and a vision for the future of data infrastructure. The PyData community has grown from 75 people in a room at Google in 2012 to conferences all over the world—a growth enabled by a supportive community and too many people to thank.

On community building, I draw heavily from Peter Hintjens (creator of AMQP and ZeroMQ), who wrote about engineering online communities to get the right behaviors. Key principles include: understanding what motivates people’s decision-making (especially when they have employers with business interests), reaching out to build collaborations, committing to consensus-driven development even though it’s harder than being a dictator, and valuing contributions beyond code—teaching, community organizing, and being helpful to newcomers. I also discuss the challenge of bad actors as communities grow, and why codes of conduct matter.

On Python packaging, I acknowledge it has gotten much better but remains hard—we still spend 20-30% of every conference talking about it. I highlight conda-forge as a game-changer: a community-governed system with automated CI/CD that makes installing a new C++ library nearly as easy as installing NumPy.

The technical portion introduces ideas that would become Apache Arrow. I discuss how RAM is becoming the new disk as solid-state storage approaches DRAM latency, envisioning a world of multi-terabyte working sets with multiple processes grabbing pointers to shared memory. The problem: every system (pandas, R, Julia, databases) has different memory representations, requiring expensive conversions. NumPy solved this for multi-dimensional arrays but not for tabular data with strings, categories, and missing values.

I introduce Apache Arrow as an emerging project to define a standard columnar memory format for tabular data. I demonstrate Feather, a simple file format Hadley Wickham and I designed using Arrow’s memory layout, showing a gigabyte of data moving between R and Python in seconds. I also discuss ongoing work on Parquet support with help from Blue Yonder.

The talk also covers building open source communities: the importance of making decisions on public mailing lists (an Apache principle), marking issues as ramp-up tasks for newcomers, and the observation that programming attracts some people with personality disorders—not that there are more than in the general population, but the internet provides bad actors direct access to the people they want to troll.

Key Quotes

“Consensus is a lot harder than… It’s kind of like democracy in a way. It’s kind of easier to be a dictator and to lay down the law, but consensus takes real work.”

“I think one of the worst things that can happen in open-source communities is project maintainers see somebody struggling with a problem, and they say, oh well, it works for me, so patches are welcome, but you’re kind of on your own.”

“Making the community successful is way more than just writing code.”

“If the discussion didn’t happen in public on the mailing list or in the view of the world on the public record, that it’s the same as if the discussion didn’t happen at all.”

“We spend probably 20% or 30% of every Python conference talking about packaging.”

“RAM is becoming the new disk.”

“It’s not just enough to build a piece of software that works for you… as soon as you publish that software to PyPI, people install it, they start using it, and they run into a bunch of problems.”

“It’s not that the higher percentage of programmers have personality disorders, but the nature of the online communities… it’s very easy for somebody to troll… the internet provides you a direct way to get in touch with somebody that you want to troll.”

Transcript

Wes McKinney: Thank you. All right, that thing’s on. Good morning. Thanks for waking up on a Saturday morning—I’m sure many of you would rather be sleeping in; this is early for a group of programmers. So I have several things that I’m going to talk about. I give a lot of talks that are just a summary of what I’m working on, and whenever I give something more like a keynote, I try to also talk about things that go beyond that and extend to the community—some of the things that we can do together as a growing community to work better toward common goals in the future. So I’ll talk a little bit more about that as the coffee kicks in. And thanks again to the organizers for having me here, and to NumFocus and the PyData organizers for putting this conference together. It’s really amazing to me, going from a small group of under 100 people at Google in 2012 at the first PyData meeting—I guess I can’t even call it a conference; it was just a meeting at Google with like 75 people in a room—to conferences all over the world and thousands and thousands of people. It’s really wonderful to see, and it’s been enabled by a supportive community and too many people to thank.

So I’ve been doing lots of things the last few years. Most of you know me from my work creating pandas and the book Python for Data Analysis, which, four years later, has become a standard text for learning to use the Python data tools. You work on problems for many years in a row, and all of a sudden you look back and say, well, I’ve spent almost a decade working on a certain set of problems. I didn’t expect in 2007-2008 to spend the next 8-10 years creating data tools, but that’s what ended up happening, and there’s nothing wrong with that. It’s allowed me to be involved in a bunch of Python projects, and now I’ve been getting involved in the Apache software ecosystem—I’ll say a few words about that. But not a whole lot has changed: I started out a Python programmer and ended up doing a lot more C and C++ programming, which has turned out to be quite nice. If you haven’t been doing any C++11 or C++14, there’s no better time than now to learn. I’m kind of glad that I skipped C++ until it suddenly turned into modern C++—it’s a much friendlier environment. You still have build tool chains to wrestle with, and that’s no fun, but as soon as you get your build system set up, programming in C++ is now quite nice.

I am working on a second edition of Python for Data Analysis. The first edition has held up surprisingly well—most of the code still works. But having spent a lot of time with the book, I know there’s some code that is broken, and with pandas some things have changed; for example, .ix-based axis indexing is no longer considered a best practice. There are some things that need to get considerably revamped in the second edition. I have my work cut out for me, but look out for that—if I can get my act together, probably early next year when it hits bookshelves. But hopefully we’ll have an early release from O’Reilly before that.

Let me figure out my water situation. I wanted to say a few words about building open source communities. I can’t claim to be an expert in building open source communities. There’s other people out there who are much smarter than me and have been doing it for a much longer time. But I think for all of us, it’s something to reflect on as the PyData community continues to grow and becomes larger and larger. I think by and large, we haven’t had to deal with that many problems. I think that overall, PyData is a pretty well-socialized community full of helpful people. We produce a lot of great content, blogs, tutorials, learning resources. I think all the good people in the community have been a major part of what’s helped it grow to this point. But as communities grow, inevitably you have to start thinking about how to engineer the community to continue to grow and flourish in a positive way.

Our friend Peter Hintjens gave us AMQP and ZeroMQ and a lot of technologies that have really shaped and influenced our ecosystem as well as many others. He’s a prolific programmer who’s designed over 30 protocols. He’s dying of cancer, and he can’t stop writing and committing on his projects and writing books. He’s really an amazing person. It’s interesting that somebody who’s been a prolific open source software developer ends up writing about, basically, how to build online communities. So if you haven’t read this book—and he has some other books you’ll want to check out—it’s about thinking through how to engineer the environment so that you get the right behavior out of the group of people in it.

I have a few thoughts about things that we can do, and things that we are already doing well and should continue to do well, to function well as a community. I’m also seeing some of this in the Apache software ecosystem as I’ve gotten involved in those projects. One of the biggest things, especially when we’re working on open source software: a lot of the software that I’m working on, and that many of the people in this room are working on, may be open source, but the time we spend building it may not be volunteer time. We’re working in salaried jobs, and we happen to be building open source software on the job, but at the end of the day, we have to collect a paycheck. So it’s very important, when you have a group of people who are all working on open source software as part of their jobs, to understand what is motivating their decision-making and why certain things are important to them, so that whenever you’re disagreeing about something, you’re able to see the other person’s perspective. People aren’t always upfront, when they’re arguing really passionately for something or pushing the software in a certain direction, about what is motivating that. So put in the effort to understand other people’s problems and what other people need, as well as communicating your own motivation so that other people can see your perspective.

I think in my work, I’ve always spent a lot of time trying to understand the needs of users and to put myself in the shoes of people who are using pandas or other Python libraries, because it’s not just enough to build a piece of software that works for you—you design a function or an API, and it solves the problem that’s right in front of you, but as soon as you publish that software to PyPI, people install it, they start using it, and they run into a bunch of problems. And I think one of the worst things that can happen in open-source communities is project maintainers see somebody struggling with a problem, and they say, oh well, it works for me, so patches are welcome, but you’re kind of on your own. That’s not spending much effort to be helpful, to build a community, and to get more developers and users involved.

Which brings me to my second point: reaching out to others. When you build a new open-source project, the feedback cycle—the time between writing a piece of code and actually getting feedback on it—is generally very long. My rule of thumb is about six months. You make a new release of the software, and it takes people a long time to even find the new features that you built. Maybe they’re stuck on an old version of the library, and it takes them a while to upgrade and make sure it didn’t break their applications before they can try out your new features. And this can be really difficult the first time you release software, when you’re not getting immediate feedback about the Python code or the C code that you just spent all night or all weekend working on. But the feedback does eventually come if you’ve built something that other people want to use. So come to conferences and engage with others to learn about their use cases—and if you’re building a new project, reach out to others to try to bring them into your community, to figure out whether there are ways you can collaborate, whether there’s anything you can mutually do to be more useful, to understand each other’s problems and say, oh, I can build this and you can build that, and eventually we can meet in the middle and have something where the whole is greater than the sum of its parts.

Another thing that is extremely difficult, especially in open source projects that grow to many developers, is having a commitment to consensus-driven development. And not all projects necessarily need to proceed on the basis of consensus. But it is very tempting, especially when you’re the person who created a project. So, for example, pandas. You spend a couple years of your life building a piece of software, and then all of a sudden you have a half-dozen additional developers who are spending their nights and weekends or maybe their day jobs working on the library. And so it’s natural, even though you have other core developers working on the project, it’s natural for you as a creator of a project or somebody who spent proportionately a larger amount of time than other developers, it’s natural to feel some sense of entitlement to be able to make decisions or to push forward changes that maybe not everyone agrees with. And it does take active effort to put in the energy if you want to make a decision or you feel strongly about something to spend the effort to explain your reasoning and to try to, basically, with a bunch of very different people in the room, arrive at some mutual understanding of the problem and a consensus about what is the way forward. And consensus is a lot harder than… It’s kind of like democracy in a way. It’s kind of easier to be a dictator and to lay down the law, but consensus takes real work. I think the Python community in general has done a good job of this, and even in core Python, with Guido being the benevolent dictator for life, it’s very rare that the special privilege of the BDFL is invoked.

I think another thing that we can do that is very helpful is to really put a lot of value on contributions to the community outside of the code itself. We all have a cognitive bias—I do as well—toward the code: the git commits and the pull requests and the bug reports. And it is a lot of work to build working software—maybe it’s 80-20 or 90-10 or 90-5-5—all of the work that goes into making dependable scientific data analysis software that we can build our professional lives on. It’s a lot of work, and a lot of it is very, very unglamorous programming. You’re chasing down some kind of numerical precision bug, you’re down a rabbit hole for three days, eventually you find some crazy bug and try to explain it to your collaborators, and the esoteric stuff that you run into in building these projects goes largely unappreciated by the users. But there are many contributions beyond that: being helpful to users, helping teach and educate new people coming to the community, being tolerant—I guess there are no dumb questions. You could, like many communities, just tell people to read the effing manual, but I’m always amazed when people go out of their way to be kind and helpful to new people in the community, rather than just dropping them a link to the documentation, and are patient and help people be successful. There’s also the community organizing, the conference running, the online community maintenance, web presence and all that. So it’s not just building the projects—I think that making the community successful is way more than just writing code. Early on in my career, I felt like, oh, well, as long as the code works—but there’s much more than that if you want to grow beyond a small group of scrappy hackers. You have to be able to embrace a much larger and more diverse community.

Another thing that I’ve been concerned about as we grow—luckily it hasn’t ended up being all that much of a problem in our community, but it’s something we’ll all collectively have to be on the lookout for—imagine if the PyData community grows 10 times as large as it is now. If you look at PyCon now, with 4 or 5 thousand people at the major North America PyCon—the first PyCon that I went to in 2010 was only about 1,500 people—the Python community itself has grown significantly. And as online communities grow, inevitably there will be that small subset of people who are making things worse for everybody else. One of the challenges in growing the community is engineering the rules and the social behaviors of the community so that things are made harder for those, what you could call, bad actors. If you read a lot of Peter Hintjens, he talks about these bad actors who make things worse for everyone else. Hence things like codes of conduct—that’s why sometimes you’ll see somebody making an online ruckus about a conference that doesn’t have a code of conduct. This is actually really important: if you’re at a conference and you see somebody who is a bad actor, you can say, hey, you’re violating the code of conduct and that is not acceptable behavior, and we hold ourselves to a high standard. But if you don’t have those social contracts and rules in place, it can be harder to deal with a problematic person, because you get into a situation where it’s like, I did this thing and I didn’t think it was so bad and I don’t see what you’re so upset about. You need to be able to say, here’s why this is offensive, or here’s why you’re making things worse for everyone else, so that you can weed out those bad actors.

And so I think it’s amazing that Peter Hintjens wrote a book called The Psychopath Code about bad actors in general, for a pretty broad definition of bad actors. A lot of people think of psychopaths as being the serial killers, the really bad people. But if you expand the definition to include people with antisocial personality disorders, it’s actually a shockingly high percentage of the population. And unfortunately, programming does attract some people with personality disorders, so there’s some stuff to watch out for there. It’s a great book if you’re interested, or if you feel like there’s been a psychopath in your life who has drained you of your energy, or you’re dealing with a difficult person and trying to understand: why can’t they see my perspective? Why are we having such a hard time communicating? Why does this person seem not to understand that they’re draining the life out of all of us? Well, they could be a psychopath—they may just be incapable of feeling empathy. It’s a real thing.

I will also say, on the subject of dealing with bad actors, that with the nature of the Internet and Twitter, it is very easy for things to escalate out of control. As tempted as I’ve been to call out bad behavior on Twitter and in public forums, I’m always a little careful before pressing the enter key, because it’s difficult to know when something is going to go viral and when you might be doing unintended damage to somebody’s reputation. There’s a book on this subject by Jon Ronson—not necessarily an endorsement, but it’s an interesting book about publicly shaming people online. There was the high-profile Donglegate incident from PyCon several years ago, which turned into just a giant mess. So the Internet is a double-edged sword in dealing with bad actors.

Luckily, we have organizations that are helping us grow the community. It’s amazing that we have NumFocus now providing a financial conduit for funds to reach open-source projects, so that businesses that depend on the software in this ecosystem can give money and support pandas or Scikit-learn or Jupyter or any of these projects. I also started working in the Apache Software Foundation, because I’m doing some big data stuff, and it serves some similar goals. One of the things I like about the Apache Software Foundation is that in addition to the fiscal-legal relationship with projects, Apache projects also have a social set of rules around how they operate: things like consensus-driven development, merit, and a focus on demonstrating—if you want to be an Apache project—that you can grow your developer base, be inclusive, and bring people into the community. Part of that inclusiveness is being open and transparent and carrying out all discussions on public mailing lists. Otherwise, projects can devolve into a secretive cabal of developers who get together, have whiteboard discussions, and make all the decisions about a project. If you’re a budding open-source developer and want to get involved in such a project, you might arrive and say, well, I see a lot of code getting written, but I don’t really understand who’s making the decisions or where the discussions are taking place. So one of the things I really like about the Apache ecosystem is that it is a very ingrained part of the culture that if the discussion didn’t happen in public on the mailing list or in the view of the world on the public record, that it’s the same as if the discussion didn’t happen at all. Make decisions in public. That is also part of rooting out those bad actors: if somebody is acting in a way that is harming the developers, they’re doing it in a place that is part of the public record, so you can point to it and say, hey, what you did there was not okay, and you’re making things worse for the rest of us.

So let me say a few words about everyone’s favorite topic, which is Python packaging. I swear we spend probably 20% or 30% of every Python conference talking about packaging. Luckily, it has gotten a lot better, but it’s a hard problem, and think about all of the stuff that you have to get right to do packaging well. Just having reliable and reproducible build infrastructure on all of the platforms where we work—it’s not even just Linux, OS X, and Windows; you also have people on Solaris and other kinds of exotic infrastructure, but we can at least settle for doing things well on the main platforms. That’s a hard enough problem. Now we have things like the manylinux project, which you can Google, for building binary wheels that work on many Linux distributions. But beyond being able to get consistent environments for building the packages, there’s also the issue of the build tool chains. You have to set up compilers, have the right compiler options, and have build scripts for all of the projects that work on all the platforms. Then there’s integration testing—and this is something that we don’t do a lot of—knowing, whenever you bump the version of one project’s dependency, what impact that has on all of the downstream projects. It is still possible to get your Python environment or your Conda environment into a broken state because of dependency issues. We’re still building for multiple Python versions; like Ian Ozsvald and others, I’m hoping that in a couple of years we can just build for Python rather than Python and legacy Python, aka Python 2. Having now built single-code-base Python 2 and 3 projects, I would rather just build for Python 3, and I’m trying to do that as consistently as I can, but there are still a lot of 2.7 users out there. You also have to be able to host and distribute the packages. On that front, in a lot of businesses you may be dealing with—I don’t know, do any of you work any place that has air-gapped machines that don’t have access to the outside Internet? We’ve got a few of you. Imagine if you can’t access pypi.python.org. There are situations where that is a requirement, and so being able to mirror package repositories and to pip install or conda install in the exact same way behind a firewall or in an air-gapped environment is a requirement that some research labs and companies have, and it’s still a very hard problem. And we also want to be able to manage multiple environments in a seamless way. I think we’ve come a long way with Conda and the current generation of packaging tools, which have made this quite a lot easier.

So I think part of the reason the community has grown so large, even for new people coming to it, is that the packaging tools have gotten a lot better. I think back 7 or 8 years ago when I started doing Python, I was sort of existentially afraid that my efforts to build pandas—or the library that would become pandas—and to do open-source software in general, or to continue to do Python, were going to fail because I couldn’t get all of the libraries working on the machine of the person sitting next to me. I was using Windows at the time—I had no choice but to use Windows—and I spent more time than I’d like to remember force-quitting python.exe, rerunning NumPy installers and Matplotlib installers, and then testing things out manually to make sure that I had the same environment on my machine as on my colleague’s machine. So it’s a world different now, and that’s something we should never take for granted: when something just works out of the box, and you have a Dockerfile or a build.sh that sets up your Python environment, the fact that it just works is something we should really feel good about.

But it takes a lot of work, and one thing that I’m really excited about is a new project, which if you haven’t looked at it is called conda-forge: a set of tools and a workflow for writing new Conda recipes and deploying them into a community-governed and managed environment, where the build, upload, and management of those packages is automated and all handled on public infrastructure on GitHub—not necessarily open-source infrastructure. Things like Travis CI and CircleCI, these various free-for-open-source continuous integration services, aren’t open-source, but they are free, which is very nice. I think one of the great things about conda-forge is that there are a lot of packages out there, especially new software, that haven’t become important enough to be maintained officially in one of the enterprise distributions like the Enthought distribution or Anaconda, but we still want to be able to conda install them and make them available to anyone who’s using Anaconda—and make installing a brand-new C++ library, or a Python and C++ package that you built, just as easy as installing NumPy. They can conda install that, and it works just the same. So it’s a really cool project, and there are a lot of tools that Phil Elson and Kyle Kelley and others have built. Folks from Continuum are also involved in conda-forge and have helped make it possible. Basically, there are some automated tools that set up what are called feedstocks on GitHub, and it uses a mix of CircleCI, Travis CI, and AppVeyor to build, respectively, for Linux, OS X, and Windows, and then upload the built packages to Anaconda.org. You can add conda-forge to your Conda setup right now and install packages from that channel.

So on the technical front—not to spend the whole keynote talking about stuff I’m hacking on right now—here are the technical things that I’m likely to spend the next couple of years on. The first is enabling the data science communities beyond Python to collaborate, or at least to have some points of technical collaboration. I think, unfortunately, in the past—and this is part of growing the community—people have encouraged language wars. (I got it. Yeah. Trying not to spill on my laptop.) And I think this is something we can all do something about. If you go back seven or eight years—and I’ve been guilty of this as well—it felt like it was okay to punch up sometimes. When we were first starting to do data analysis and statistics in Python, it really felt like we were the small people in the room and there wasn’t anyone paying attention to us, saying, hey, we should be able to run regressions too, we should be able to do statistics in Python. And so it was easy to pick on the R folks and say, well, R is weird, and the language has a lot of warts. And then, years later, it’s like, oh, we’re still having the same kind of R versus Python debate. Over time, the competition has been good, because the software has matured, and R folks have built things partly out of competitive spirit to make R better for data analysis. So some good things have come out of the language wars, but I think they’re also counterproductive in many ways. The fact is, we all share common infrastructure: with Julia, there’s LLVM; there’s C and C++. We can share code with Julia. We can share code with R. You can write C++ libraries that you can wrap in R or wrap in Python. There’s a lot that we can do to share code and ideas between these communities.

Now that I work at Cloudera and work on some big data problems, making sure that Python can talk to these systems efficiently has become really important. And also kind of looking at how people are using Python and how the data itself is changing. So we aren’t just dealing with multi-dimensional arrays anymore. We’re dealing with a lot more, like, web data, JSON data, making sure that the Python tools are growing to meet the needs of changing, kind of evolving data formats and data representations.

So, building on Olivier’s keynote yesterday: a lot of people have started to say that RAM is becoming the new disk. I don’t know that we ever expected to have terabytes of DRAM, although maybe it’s inevitable. But another thing that’s happening, outside of the growth in DRAM, is that the performance and latency of solid-state disks and DRAM are converging. They aren’t going to become equivalent, but the 100x or 1,000x performance difference—I don’t know the exact numbers—between spinning rust and DRAM has definitely been narrowing. And there’s a new generation of flash-class storage: I’ll let you read about 3D XPoint from Intel, a next-generation technology that’s 500 to 1,000 times faster than NAND flash. There’s also non-volatile memory, where you can make a change to a data structure in an application and that change is immediately persistent on disk, so that if you pull the plug on the machine, you aren’t losing application state. So there are very interesting implications there.

And I think what we’re going to see in the future is that you’ll have a multi-terabyte working set—all the data sets of interest for a particular data science problem—and many different consumer applications grabbing pointers to memory in that working set. Hopefully at some point there will be some kind of abstraction over manually managing where memory lives—is it on disk? is it in RAM?—the kind of out-of-core problems we’re dealing with today, so that there’s a more consistent model: I just address memory, and the operating system handles whether that’s in RAM or on disk. It’s just a pointer, and the virtual addressing is handled by the operating system kernel.
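
As a minimal sketch of that addressing model—this is just standard NumPy memory-mapping, with a hypothetical file name and illustrative sizes, not anything specific to the talk—you get exactly this “it’s just a pointer” behavior:

```python
import numpy as np

# Memory-map an on-disk array: to the program it is "just a pointer," and
# the OS kernel pages data between disk and RAM on demand.
# File name and size here are hypothetical.
arr = np.memmap("working_set.dat", dtype="float64", mode="w+",
                shape=(100000000,))  # ~800 MB backed by the file, not the heap

# Touching a slice faults in only the pages that back it; the rest of the
# file never needs to be resident in RAM.
arr[:1000000] = 1.0
print(arr[:10].sum())

arr.flush()  # push dirty pages back to stable storage
```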

But there are challenges with this model, because it’s easy to draw these simple diagrams and say, okay, great: we have a couple of terabytes, some of it’s in RAM, some of it’s on disk, and we’ve got multiple processes reading from it—let’s do that. One of the biggest complications is that you have many different processes that want to grab a pointer to this data and say, okay, I want to do a pandas group-by operation, or perform some linear algebra, or do some feature engineering and run scikit-learn on some memory in the working set. But any given system—and this might be an R process or a Python process or a Scala or C++ process—in general, these systems have different memory representations; they use different data structures. What happens is that when you want to pull data from the working set, you have to perform a conversion between the way the data is stored in working memory and the in-memory data structures of that process. And so you’re spending a lot of CPU cycles converting data from one data structure to another.

Another problem is the metadata: being able to describe, from one system to another, what the data is—if it’s a table, what the data types are—in a consistent way. It turns out one of the biggest problems in numerical data is decimals. If you’re dealing with databases, double-precision floating point is often not enough, and you end up down a rabbit hole of decimal representation, scale, and precision, and the different ways decimals are handled. Just being able to say, I have decimal data, and this is how it’s represented, is very, very important. There’s also the issue of who owns the memory, and how you know when you’re done with the data so that it can be evicted to cold storage and moved out of your working set. These are all reasonably unsolved problems in general, though they have solutions within certain subsets of different problem spaces.
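
As a sketch of what consistent decimal metadata can look like, here is how the modern pyarrow API—which postdates this talk—declares an exact decimal column with explicit precision and scale:

```python
import pyarrow as pa

# A schema that states exactly how decimal values are represented, so another
# system can interpret the bytes without guessing. (pyarrow shipped after
# this talk; the field names are illustrative.)
schema = pa.schema([
    ("account_id", pa.int64()),
    ("balance", pa.decimal128(38, 4)),  # 38 digits of precision, 4 after the point
])
print(schema)
```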

Now, if you’ve been in the Python ecosystem for a long time, you’d say, well, we already have a solution to this problem: the solution is NumPy. And for many problems that’s true. If you think about the role that NumPy played in bringing together the Python scientific computing ecosystem, we have this common array data structure—here’s some allocated memory, and here’s information about how to interpret it as a multidimensional array—and we have metadata that we all share, which are the NumPy dtypes. Now, NumPy has a whole computational framework attached to it: if you look inside a NumPy data type, there’s a function table which says, here’s how I sum elements, here’s how I select and index elements from a NumPy array. So there’s computation connected to NumPy data types, but you can really think of them as just providing a description of what the memory is. The way that memory is shared between processes, though, is handled on a case-by-case basis. And there are some things that NumPy doesn’t solve as well: JSON-like data, for instance. In pandas, we’ve had to hack around and build our own missing data representation. Handling of strings and category types—enums, factors—isn’t built into NumPy, although there have been some efforts in the past, where we basically struggled to arrive at consensus about how to deal with enums or categories in NumPy, which is very unfortunate. And there are also memory representations designed for analytical workloads that are common to databases—I think NumPy was never really intended to be part of a database backend—and there’s a lot of interesting research work that’s been done on memory compression and designing for scan-based, database-like workloads.
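
The point that a dtype is just “a description of what the memory is” can be seen directly with plain NumPy: the same raw bytes viewed under two different dtypes, with no copying:

```python
import numpy as np

# The same 16 raw bytes, interpreted under two different dtypes; the dtype
# is metadata describing the memory, not the memory itself.
buf = bytearray(16)
as_floats = np.frombuffer(buf, dtype=np.float64)  # two float64 values
as_ints = np.frombuffer(buf, dtype=np.int32)      # four int32 values, same memory

as_ints[0] = 42
print(as_floats[0])  # the float view reflects the bytes the int view just wrote
```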

If you look at pandas internally, a lot of it looks like a small database. So I’ve been part of a new project called Apache Arrow, which I’ll tell you briefly about, which attempts, for a certain narrow subset of these problems, to build something that will work for the big data ecosystem and that Python can participate in. First of all, we spent a lot of time forging a collaboration amongst a large set of Apache projects—these are mostly Apache projects, although I was involved from pandas, and we’ve gotten some R folks involved. The goal was to create a memory representation for tabular data that, within a particular application, has characteristics that make it very fast for doing analytics in memory, but that we can also use for that shared, common working set: we can move data between systems, or reference data in shared memory between two processes, and we have a consistent way of sharing memory without conversion between systems.

The columnar, tabular part of the problem is really just about CPU efficiency. If you have a table of data, traditionally in databases you have record-oriented data; in older, transaction-oriented database architectures, that’s the way you would want the data stored—for every row, you have that data packed in a tuple and stored on disk. But with columnar data, all of the data belonging to a particular column is stored in a contiguous memory region. Typically you might have very wide tables where you’re only interested in a few columns, and when you address that in memory you’re usually scanning down one column, so the CPU cache gets used much more effectively. You can also do things with in-memory compression and delta encoding—there’s a lot of research on this subject—where you can compress the data in memory and decompress it out of CPU cache fast enough to break through memory bandwidth barriers, which is really interesting, and a lot of analytic databases have done work there.
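
A rough way to see the cache effect with plain NumPy: the same “table” stored row-major versus column-major, scanning a single column of a wide table. The timings are machine-dependent and purely illustrative:

```python
import numpy as np
from timeit import timeit

# The same table in row-oriented (C order) and columnar (Fortran order)
# layouts; scanning one column of a wide table favors the columnar layout
# because the values it touches are contiguous in memory.
rows, cols = 1000000, 20
row_major = np.random.rand(rows, cols)    # one row's values are adjacent
col_major = np.asfortranarray(row_major)  # one column's values are adjacent

print(timeit(lambda: row_major[:, 3].sum(), number=50))  # strided scan
print(timeit(lambda: col_major[:, 3].sum(), number=50))  # contiguous scan
```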

For me, the biggest benefit of this project is addressing the N-choose-two problem of building converters and data adapters to connect system A to system B: every time you add a new storage system or new computational engine to the smorgasbord, you’ve got to figure out, okay, how do I make pandas talk to this file format or this database, and you end up building a bespoke connector from one system to another, with a lot of redundant code spent converting between data structures and memory representations. What I’m hoping to do with Arrow is to simplify the I/O point of contact between pandas users and the rest of the ecosystem: rather than building a custom adapter to Hive or Impala or the Parquet file format, we just get Arrow memory back. There is some cost associated with converting to and from pandas and Arrow, but all things considered, that’s not the end of the world. The devil’s in the details, of course.
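
To put numbers on the N-choose-two problem: with N systems, pairwise adapters scale as N(N-1)/2, while a common format needs only N converters, one per system. A sketch of that single point of contact, using the pyarrow API as it later shipped (so the exact calls postdate this talk):

```python
import pandas as pd
import pyarrow as pa

# One well-tested conversion path in each direction, instead of a bespoke
# adapter for every pair of systems.
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

table = pa.Table.from_pandas(df)  # pandas -> Arrow columnar memory
df_roundtrip = table.to_pandas()  # Arrow -> pandas
```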

One thing that we did, just for fun: I saw Hadley Wickham from the R community in January, and we talked about Arrow and ways to make R and Python more interoperable. Hadley said, well, we don’t have as many binary file formats in R as you do in Python, and it would be interesting if we could share data frames between R and Python a lot more easily. So we designed a small file format called Feather, which uses the Apache Arrow memory layout to put the bytes on disk, and we came up with metadata sufficient for describing schemas in both Python and R, but in a language-agnostic way—so that R can say, this is my factor, and on the Python side I can say, okay, I know how to convert an R factor into a pandas category dtype. Because the file format is so simple, you’re basically just memcpy-ing bytes off of disk; there’s a small amount of conversion, but most of the work of the project is in defining the metadata so that R and Python can communicate the shape of their data to each other.

The code looks basically the same in both languages, so let me just run a demo—I’m supposed to do code demos at a Python conference. Is this working? Okay. Look, I’m running R code at a Python conference. So I’m generating a data frame that’s about a gigabyte and then calling the write_feather function, which writes it to test.feather. And then—okay, there’s my terminal—I’m importing feather in Python and reading the data frame back. That was a gigabyte of data. So here’s R, here’s Python, and we moved a gigabyte of data between R and Python in a matter of seconds. That’s really very nice. And the goal of Feather is not to replace any of the existing data storage tools for pandas, like HDF5, but to build something that enables code sharing and more interoperability between R and Python.
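
For reference, the Python side of the demo looks roughly like this with the original feather-format package; the sizes and column names are illustrative, not the exact demo code:

```python
import numpy as np
import pandas as pd
import feather  # the original feather-format package

# Generate roughly a gigabyte of float64 data (sizes are illustrative).
df = pd.DataFrame(np.random.randn(10000000, 12),
                  columns=["col%d" % i for i in range(12)])

feather.write_dataframe(df, "test.feather")   # Arrow layout straight to disk
df2 = feather.read_dataframe("test.feather")  # R reads the same file with read_feather()
```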

What we have is a small C++ library, written in C++11, that interacts with the file format, with Cython bindings for Python and Rcpp bindings for R. Most of the code is in the core library, and there’s a little bit of data munging to convert between R data frames and pandas data frames. The nice part of the project—and I’m running up against my time limit—is that it’s very fast. It’s not as fast as the native HDF5 pandas tooling, but it’s pretty fast, close to disk speed in a lot of cases. The downside is that you have to convert between pandas data frames and the Arrow memory representation whenever you’re writing to and reading from disk.

A couple of words before I get off the stage. One thing that I’m actively working on right now, with Uwe Korn from Blue Yonder, is an Arrow-based adapter to the Parquet columnar file format in C++. I’ve been working on the Parquet C++ implementation, the Arrow C++ implementation, and the Python interface, and you’ll be able to get pandas data out of Parquet files hopefully in the next month or two, and install the packages—at least initially—from the Apache dev conda channel. So this is very cool, and I’m very grateful to have the help from the Blue Yonder folks. I’ll skip my last slides, but if these sorts of problems are interesting to you, join the community—all the activity is on the mailing list. And that’s all I have. So thank you for listening.
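
A sketch of the intended workflow, using the pyarrow.parquet API as it later shipped; the file name is hypothetical:

```python
import pyarrow.parquet as pq

# Parquet file -> Arrow columnar memory -> pandas DataFrame.
table = pq.read_table("example.parquet")
df = table.to_pandas()
```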

Host: Yeah, I think we have a few minutes for questions if there are any. Can a few of our volunteers grab some mics? Any questions?

Audience member: I had a question about what you talked about in the beginning, about getting the community involved. I often find that for people who are new to projects, it’s usually easiest to get involved by contributing to documentation, but that can be quite daunting because you do need to actually know roughly what the code is doing. So how do you bridge that gap? How do you get new people in quickly but also efficiently, so it doesn’t take too much of the core developers’ time?

Wes McKinney: Yeah, it’s hard. One thing that helps on GitHub is marking issues as ramp-up tasks—whenever you see something that maybe isn’t urgent but is an accessible task, clearly mark it as such, so that when new people come and say, I want to write documentation, or I want to fix some bugs or add some new features, you have some place to point them to get started. But it is hard, because understanding the code base well is often necessary for writing the documentation. I think improving docstrings helps—a lot of documentation is auto-generated by Sphinx from docstrings, and the quality of all of our docstrings could be made way better. It’s just part of learning and using a library to say, okay, this wasn’t clear to me—so at least encourage continuous docstring improvement. In a lot of libraries, half the docstrings are out of date.

Host: Okay. I think that’s a good nudge to everyone: go and fix the docstrings in pandas.

Audience member: First of all, thank you very much for your talk; I also liked that you mentioned social as well as technical issues. My question goes to one of your claims—is that fine? Okay. You made the claim that the number of people with personality disorders among programmers is higher than, I suppose, in the general population. Do you have any evidence for that claim?

Wes McKinney: It’s purely anecdotal, so it’s hard to measure. It would be nice to put some numbers on it, but I find that the field of programming—given that we’re doing bits and bytes and writing programs—does, at least in my experience, attract more people who are, quote unquote, on the spectrum. But, you know, who knows.

Audience member: Sorry—yeah, definitely there’s no selection for social skills, but my anecdotal evidence points in a different direction: most seem rather happy. It could be that the ones with personality disorders cluster in certain places.

Wes McKinney: Yeah, so—sorry, I guess I’m going to get kicked off here—but I think it’s exacerbated by the medium. It’s not that the higher percentage of programmers have personality disorders, but the nature of the online communities means that, for the people who do, it’s very easy for somebody to troll, like Linus Torvalds, and stir up some stuff. If you derive pleasure from driving people crazy, the internet provides you a direct way to get in touch with somebody that you want to troll. So, yeah.

Audience member: Yeah, I thought you meant possibly the core developers.

Host: Okay, let’s thank Wes again.

Wes McKinney: Thank you.