Data Tools and the Data Scientist Shortage
This transcript and summary were AI-generated and may contain errors.
Summary
In this talk at Web Summit 2015, I discuss the data scientist shortage and what can be done about it. I reference McKinsey’s estimate of a shortage of 140,000–190,000 people with advanced analytical skills, and critique the notion that data scientists need to be “magical unicorns” with an impossibly rare skill set. Drawing on Harlan Harris and Marck Vaisman’s “Analyzing the Analyzers” report, I argue that effective data teams need a mix of roles—data researchers, data engineers, data business people—rather than trying to hire people who excel at everything.
I propose three approaches to the shortage: education reform in universities to include more practical data skills, the rise of data science bootcamps as a successful alternative model, and building company cultures that support data teams. But I spend most of my time on tools—my primary focus—arguing that better tools can make existing data scientists more productive. I describe what I call “the great decoupling” of data concerns into storage, computation, and user interface layers, predicting that data scientists will increasingly work only in the user interface layer while storage and compute become transparent. I discuss the hierarchy of needs in data analysis and why so much engineering effort has gone into the lower levels (storage, accessibility, cleaning) rather than analysis and visualization. I close by mentioning Ibis, a project I’m building to bridge Python with big data infrastructure.
Key Quotes
“A data scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician.”
“I like this definition, but I feel like it’s a very narrow characterization of the field and it kind of supports the notion that data scientists need to be these magical unicorns that are in possession of an extremely rare skill set.”
“80% of data science is cleaning data, and the other 20% is complaining about cleaning data.”
“I used to think when I said data preparation or data cleaning that I was saying something that was not very interesting, very valuable, but it turns out that solving those problems well really endears you to people because they spend so much of their time doing that.”
“You, as a data scientist, need to think less about how the data is stored and managed, think less about how things are computed, and really, your job is only to express the high-level model that you’re building, the high-level analytics, the data visualization.”
“Sometimes it’s the simpler models that are the ones that are the most effective… sometimes you see PhDs come into an organization and they want to apply a really fancy model and something really simple can get you there faster.”
“It’s not just a matter of hiring the data scientists and hoping that magic happens. You have to build a data culture, and you have to have managers for your data people that eliminate political barriers.”
“I start calling it the highlander fallacy. There can be only one—and I don’t think that that’s true. There need to be a lot of tools, and the more tools out there creates a productive conversation.”
Transcript
Wes McKinney: Thanks for being here today, and thanks for sticking with the data track through to the end of the day. I think I’m the last one on stage, which means I can talk forever. I won’t do that to you, so I’ll cut right to the chase. This is a really important topic for me, because it relates closely to the things that motivate me in my day-to-day work, and it’s an important topic for the coming years.
So here’s a brief summary of what I’ve been doing with myself for the last seven or eight years. I’m on the product engineering team at Cloudera. If you don’t know Cloudera, it provides the leading open source big data management platform based on Apache Hadoop. And nowadays, when you say Hadoop, you’re referring to a whole open source ecosystem of projects. It’s no longer just the Hadoop distributed file system and Hadoop MapReduce; it’s really 20 or 25 projects that all fall under the Apache software umbrella, and the ecosystem is growing every year as new projects come into the fold.
I’m best known as the creator of pandas, a Python library for data wrangling and analysis. I got started on that as a quant working for AQR in Greenwich, Connecticut, back during the financial crisis. And of course, financial crises are always a really good time to create new software and take a lot of technology risks. But it ended up working out in the end. I went on to found and be the CEO of Datapad out in San Francisco, and I joined Cloudera about a year ago through an acquisition. I also wrote the book Python for Data Analysis, which has been a big part of helping the Python data ecosystem grow and become more popular by enabling people to educate themselves.
So the general focus of my career has been looking at how people interact with data: making it easier to work with data, to express your ideas with it, and to get value out of it faster. I have worked on graphical data interfaces, but I’ve been most interested in how technical people who program can write simpler, faster, and more expressive code on their data.
So let me talk about something for a few minutes here. Only three years ago, and maybe some of you saw this article, DJ Patil proclaimed in the Harvard Business Review that data scientist is now the sexiest job of the 21st century. DJ, as you might know, ran the data team at LinkedIn, and along with Jeff Hammerbacher, who ran the data team at Facebook, was responsible for coining the term data scientist. It’s become such a big deal that DJ has gone on to work in the White House as the first chief data scientist of the United States.
And it’s true that businesses all around the world are reorganizing themselves to collect and store more data than ever before, to use that data to make better decisions, and to create new products that would not have been possible without all of that careful data collection and curation.
And you can hire data scientists, but there’s a hefty price tag, and the money is so good that statisticians and computer scientists everywhere have been rebranding themselves as data scientists in order to get a major salary increase. It’s a pretty successful strategy: you change your job title and you get a 10 or 20 percent pay increase, maybe even more than that.
And it’s given rise to post-graduate data science boot camps, where you spend three or six months learning to be a data scientist, and if you can get somebody to hire you, your tuition for the boot camp is paid for. This has been a very successful model.
The trouble is, when a lot of companies are looking to do more, they hear about data science, they come to the Strata conference or they come here to Web Summit, and they hear all about the kinds of products that are being created with data. They want to create their first data team and hire their first data scientist, and it’s a real struggle to identify and hire the right people.
So it really makes you wonder: what is a data scientist, exactly? The consulting firm McKinsey put a number on it (I don’t remember when the article was, maybe a year ago), claiming a shortage of 140,000 to 190,000 people with advanced analytical skills, and an additional ten times that many who have the ability to understand the results of data analysis. And obviously this is a two-way street: it’s not only the job of the data scientists to do the analysis; they also have to present that analysis to the other business stakeholders in the organization in a way that can be understood, consumed, and acted upon.
And the business stakeholders, for their part, have to put in the time to learn enough about data analysis to make effective decisions. You can’t just show a business person the results of a regression if they don’t know anything about what the regression is or what its significance means, and so you end up with a lot of bad statistics and decision-making. Sometimes you’ll see bad statistics out in the wild, phrases like “results that are very nearly significant,” and if you’re not a statistician, you need to know that very nearly significant is the same thing as not significant. So you need to learn enough to know the difference between actionable, significant statistical results and ones that are just noise that looks significant.
Drew Conway, a well-known data scientist in New York, created a very popular Venn diagram describing what’s come to be the popularly accepted view of what a data scientist is: a special blend of software engineering skills and statistical and machine learning knowledge, married with domain expertise. It’s not enough to be a machine learning expert who can write code; you also need to understand the business problems, the product problems, the marketing problems, the business side of what you’re doing, and how you can apply your skills to the task at hand to achieve results. And sometimes it’s the simpler models that are the ones that are the most effective. Sometimes you see PhDs come into an organization and they want to apply a really fancy model, when something really simple can get you there faster and in a way that can be put into production with much less work, so you can move on to other high-value tasks.
Another well-known data scientist, a friend of mine, Josh Wills, quipped on Twitter that a data scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician. I like this definition, but I feel like it’s a very narrow characterization of the field, and it kind of supports the notion that data scientists need to be these magical unicorns who are in possession of an extremely rare skill set and have invested enough time to build the domain knowledge to be effective in their jobs. The reality is that to really be effective in solving data problems, you need a mix of different skill sets and different kinds of people to build effective data teams.
So I think it was probably a couple of years ago at this point (I’ve lost track of time), Harlan Harris and Marck Vaisman in Washington, D.C. created a data self-identification survey, which they published as a free O’Reilly report called Analyzing the Analyzers. They created a classification of people who identify as data people into four different buckets, with subclassifications of what kind of job title they actually have in practice. A lot of organizations have decided that everyone’s a data scientist now, when in fact maybe you’re a data engineer, or maybe you’re a data business person who builds dashboards and works with other parts of the business. It’s very easy to call yourself a data scientist when maybe you fit into one of these more granular categories.
One of the biggest pitfalls for businesses is focusing too much on hiring roles in just one of these categories. Suppose you don’t have any data scientists in your organization, and you end up hiring a bunch of PhD-level researchers who’ve never worked with product teams, never worked with marketing, never built data products; you end up with a gap between the skills and the domain knowledge. Really, having a mix of people from this list ends up being a very good recipe for success.
So addressing the people shortage is not exactly an easy problem. Education is one part of the solution, and there’s a lot that can be done in universities, which is now increasingly being done. But five or ten years ago, the term data scientist didn’t really exist, and there wasn’t a general awareness in the universities that there was a huge need for statistical and data skills. I’m not suggesting that universities turn themselves into trade schools for industry data scientists, but having graduates of computer science and statistics programs come out with experience working with databases and cleaning messy data sets would be a pretty good place to start.
It’s also complicated by the fact that open source technology moves very quickly, so you can develop material for a course that is no longer current two or three years in the future. That creates the need for a lot more agility in university curriculums, so they can change to keep up with the latest needs out in industry.
There’s some friction, especially in major research universities, which don’t want all of their PhDs to go into industry and take jobs. But that’s kind of a form of denial, given that there are certainly not enough academic jobs for all of the PhDs. You would think it would at least be a good idea that if ten percent of new PhDs get academic jobs, or whatever the percentage is, the other ninety percent are able to go out and be successful in industry.
All of the data science and data engineering boot camps that have cropped up have really been a game changer. Feel how you want about them: yes, they are for-profit, and they have to be sustainable and attract high-quality instructors who’ve been successful data engineers or data scientists out in the field. But it’s a nice model: you spend three or six months turning yourself into a data engineer or a data scientist, you put in the effort, you get yourself a job at the end, and your tuition for the course is covered. And there are more and more of these. It’s been a successful business model.
Company culture has to change, of course. It’s not just a matter of hiring the data scientists and hoping that magic happens. You have to build a data culture, and you have to have managers for your data people that eliminate political barriers to working with the data and exercising their skills in the business. If you talk to somebody like DJ Patil, he’ll credit his success at LinkedIn to hiring some of the best data scientists he could find and then running interference: basically making it so the data scientists didn’t need to think about the politics in the organization, getting access to data, or various security issues, and helping the rest of the organization organize itself to aid the data teams in solving their problems effectively.
The last topic, and what I’ll spend the rest of the talk on, is tools. I’ve done the most work in my career on tools, because I started out doing analytics and working in quant finance, and I found myself attracted to software engineering and creating tools to make myself more productive. Thinking about tools, the idea is that rather than throwing more people at the problem, you take your existing people and make them more productive and more effective. And one of the only ways to do that, outside of eliminating political barriers to doing their jobs, is making the tools more efficient.
And so, to paraphrase Big Data Borat on Twitter, a lot of people complain that 80% of data science is cleaning data, and the other 20% is complaining about cleaning data. It doesn’t need to be exactly that way. I often call pandas, the project I’m well known for, a data preparation library, and I used to think when I said data preparation or data cleaning that I was saying something that was not very interesting, very valuable, but it turns out that solving those problems well really endears you to people because they spend so much of their time doing that.
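As an illustration of the kind of cleaning work this refers to, here is a minimal pandas sketch. The column names and the messy values are invented for the example; the point is how much of routine data preparation is just normalizing inconsistent codes, coercing stringly-typed numbers, and dropping missing and duplicate rows.

```python
import pandas as pd

# Hypothetical messy input: mixed-case codes, a missing value, a duplicate row.
raw = pd.DataFrame({
    "country": ["pt", "PT", "ie", None, "ie"],
    "amount": ["10", "20", "5", "7", "5"],
})

clean = (
    raw.dropna(subset=["country"])                       # drop rows with no country
       .assign(country=lambda d: d["country"].str.upper(),
               amount=lambda d: pd.to_numeric(d["amount"]))
       .drop_duplicates()                                # remove the repeated "ie" row
       .reset_index(drop=True)
)
print(clean)
```

Chaining the steps like this keeps each cleaning decision visible and easy to revisit when a later analysis step turns up a new data problem.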
Data analysis and data science are inherently a process with many steps. This diagram breaks down the data process for, say, exploratory analytics into a few different buckets, and it’s not exactly a linear progression from left to right, where you get the data, you clean it, you look at some graphs, you build a model, you ship it, and then you move on to the next problem. The reality is that you’re going to move back and forth between these stages many times. Maybe during your data visualization or your analysis you find some data cleanliness or data collection issues that you have to go back and fix, or maybe you have to bug somebody else in your organization to go back and fix those problems. More than likely, there are multiple people responsible for the steps in this pipeline, so eliminating friction in the process and barriers to collaboration will make the whole process of jumping back and forth between the steps a lot more productive.
And so there’s been a general trend in open source data technology which I’ve started calling (it’s very technical) the great decoupling of concerns for analytics in industry. I broke it down into three simple categories: data storage, computation, and user interface. When I say user interface, I don’t mean only graphical interfaces, web browsers, and visualizations, although those certainly fit into that bucket. It also means the code that you write. If you write Python code, or R code, or Java or Scala code, in a sense the code that you write is your user interface, and you’re interacting with data that is stored someplace and with compute engines, which may be on your laptop or in some cluster someplace.
And part of the trouble with data science, and the reason for the myth of the unicorn data scientist who’s an expert in big data and software engineering but also knows machine learning and can talk to business people, is that it used to be that mastery of all of these topics was a prerequisite for being effective. You needed to understand how to build data collection pipelines and how to shard and partition the data in the proper way, so that your compute engines would run fast enough to deliver results under some SLA.
But increasingly, these concerns are becoming decoupled. You, as a data scientist, need to think less about how the data is stored and managed, think less about how things are computed, and really, your job is only to express the high-level model that you’re building, the high-level analytics, the data visualization, leaving the rest to translation, compilers, and other concerns that are transparently handled under the hood. This is not exactly perfect right now, and there’s a lot of work to do. But if you think about the long run, about where we’re going to be in five, ten, twenty years, you’d better hope that user time is going to accumulate in the top part of this triangle: that you, as a data scientist or an analyst, are going to be working almost exclusively in the user interface layer, and you would like not to have to think at all about the storage and compute parts of the problem. That hasn’t always been the case. Vertical integration has been the norm for many years, and that is changing, luckily.
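One way to make the decoupling concrete is a small sketch (the function and column names are invented for the example): the storage and compute concerns live behind a loader function, so the analyst’s code expresses only the high-level aggregation and would be unchanged if the loader were swapped to read from HDFS, S3, or a warehouse.

```python
import pandas as pd

def load_events():
    # Storage/compute concern: today this builds a small local DataFrame,
    # but it could just as well read from HDFS, S3, or a SQL engine; the
    # analysis code below would not change.
    return pd.DataFrame({
        "country": ["PT", "PT", "IE", "IE", "US"],
        "revenue": [10.0, 20.0, 5.0, 15.0, 25.0],
    })

# User interface layer: only the high-level analytics are expressed here.
events = load_events()
summary = (
    events.groupby("country", as_index=False)["revenue"].sum()
          .sort_values("revenue", ascending=False)
          .reset_index(drop=True)
)
print(summary)
```

The function boundary is the decoupling: everything below it is the analyst’s “user interface,” and everything above it belongs to the storage and compute layers.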
So my hope for the future, and who knows when it’ll be here, is that we’re observing all of the data, all of the events happening around us, and they’re all being collected in real time and made immediately available for query, analytics, and modeling. You would like not to think about that very much: just pull out your phone and ask Siri to run a query for you. But hopefully you’re not speaking SQL to Siri; it’s more of a high-level business question, which is being translated down into the appropriate queries and data transformations and so forth.
But it’s understandable why that hasn’t really been a focus. Looking at the hierarchy of needs in data analysis, so much of the time has been focused on the bottom part of the pyramid. Just storing data, just making it accessible, just making it clean, are so hard, and still so hard, that most of the engineering investment has gone into the lower parts of the problem. Certainly effort has been invested in data visualization tooling, in analysis, in machine learning and modeling, and programming user interfaces have improved substantially. But if you look at the number of person-hours, person-years, person-centuries that have been invested in data problems, the accumulation looks not unlike this. Luckily, enough of those problems are being better taken care of that more energy can be invested higher in the hierarchy of needs.
And there have been major innovations. This might be the most self-serving slide in the deck, but just in the small part of open source data science that I pay attention to, namely the Python, R, Julia, and JavaScript communities, there have been significant innovations that have transformed the way analysts and researchers can work in these domains.
If you’re working on the web, the way you can create interactive data visualizations has been completely transformed by D3. The way R programmers work today versus five years ago is almost completely different, through projects like ggplot2 and the ongoing work that Hadley Wickham and others are doing in the R community.
If you’re doing deep learning, you have folks at Facebook and elsewhere creating Torch and other accessible projects for deep learning, making cutting-edge research available to data scientists in ways that were never before possible. And more than just the actual computation core of these systems, it’s the user interface, the fact that you can express your models and do your data analysis in a way that’s high-level and expressive, that makes these projects so important.
And so, looking at all tools, it’s important to remember, and I didn’t coin this, but I start calling it the highlander fallacy. I’m sure you know this one, right? There can be only one. I don’t think that that’s true. There need to be a lot of tools, and the more tools out there creates a productive conversation: looking at what is good about a tool, what is bad about a tool, examining the use cases and the places where they’re effective, and also the usability aspect of how productive and enjoyable those tools are to use, and how easy it is, reading the documentation, to go from not knowing about an ecosystem of tools to being able to productively solve real-world problems.
I also think there’s been a resurgence in SQL programming: it was all about NoSQL five years ago, and nowadays it’s SQL everything, SQL everywhere again. Certainly SQL’s not going anywhere, but I think of it as the Fortran of analytics, and we’re still kind of at the punch card stage of evolution, so I do see that changing over time.
I’m running out of time, but there are still lots of SQL engines being created, certainly by the biggest data-driven companies in the world, which are building data software for the rest of the world to use. Nowadays, what I’m personally focused on is bridging the gap between the user interface layers, the data science languages like Python and R, and the distributed compute frameworks: enabling you to work in that domain and get access to distributed compute in a way that is seamless and productive, continuing to use high-level, usable data science languages without having to learn as much about the distributed compute frameworks.
I’m building a project called Ibis, which is a unified front end for Python programmers to access big data infrastructure. If you’re a Python programmer, I’m happy to speak with you about this; it’s what I’m doing to improve the status quo. There’s certainly a lot of work to do, and the data science community is still growing exponentially, which I expect to continue in the upcoming years. But it’s not just about making more data scientists; we also have to make the data scientists more productive.
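To illustrate the idea behind a front end like Ibis, without reproducing its actual API, here is a toy deferred-expression sketch. All class and method names are invented for the illustration: operations are recorded in the user interface layer and only compiled to SQL for some backend when results are needed, which is what lets the analyst stay in Python while a distributed engine does the work.

```python
class TableExpr:
    """Toy deferred table expression: records operations now, compiles
    them to SQL later. (Invented for illustration; not Ibis's real API.)"""

    def __init__(self, name, where=None, by=None, metric=None):
        self.name, self.where, self.by, self.metric = name, where, by, metric

    def filter(self, condition):
        # Record a row filter without executing anything.
        return TableExpr(self.name, condition, self.by, self.metric)

    def aggregate(self, by, metric):
        # Record a grouped aggregation without executing anything.
        return TableExpr(self.name, self.where, by, metric)

    def compile(self):
        # Translate the recorded operations into a SQL string that a
        # backend engine could execute.
        sql = f"SELECT {self.by}, {self.metric} FROM {self.name}"
        if self.where:
            sql += f" WHERE {self.where}"
        return sql + f" GROUP BY {self.by}"

expr = (TableExpr("events")
        .filter("year = 2015")
        .aggregate(by="country", metric="SUM(revenue)"))
print(expr.compile())
# SELECT country, SUM(revenue) FROM events WHERE year = 2015 GROUP BY country
```

The user never writes the SQL; the expression tree is the interface, and the storage and compute layers are free to change underneath it.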
Thank you.
Audience Member: Hi. Thanks for sharing the great insights and very practical, hands-on information. I have one question. I’ve seen big changes in the industry in the last four or five years. I remember five years ago, if you wanted to do data analysis, it was a horrible thing just to set up the Hadoop cluster; it could take a month or two just to get something going. But now it feels like there are a lot of tools you can use out of the box. What do you think the trend is going to be? Is it something like AWS for hosting? Is it going to go in that direction?
Wes McKinney: Yeah, if you look at AWS and the fact that Amazon has kind of done away with a lot of the DevOps, you don’t have to buy your own machines or manage racks, and a lot of the provisioning and managing of machines has been automated. The same thing is certainly happening with big data software, so I expect that trend of big data infrastructure as a service to continue. I think one of the challenges, given that the software is still in a stage of maturation and scaling, is obtaining software distributions that have been curated, that have had all the right patches applied for the problems and bugs that can happen at scale. In some sense, it’s about knowing that somebody is looking after the software that you’re using to manage your most precious data.
Audience Member: Hi, so talking about the shortage of data scientists, what do you think about the shortage of female data scientists in this domain? Do you think the ratio is getting any better or getting worse?
Wes McKinney: I don’t know the data on the gender demographics of the community, but my understanding is that there are the same kinds of general problems in the data science ecosystem as with women in tech and women in software engineering. Given that software engineering is a larger discipline, there’s been more of a focus in social media and in the press on, for example, the “I look like an engineer” movement, shedding light on gender issues, and I think those issues are prevalent in data science as well. But like engineering, I think it varies from company to company, in creating a positive culture and looking to build diverse teams.
Host: All right, we’ll leave it there. Wes, thank you very much.
Wes McKinney: Thank you.