Modern Open Source: Challenges and Opportunities
This transcript and summary were AI-generated and may contain errors.
Summary
In this keynote, I reflect on nearly a decade of open source development, starting with pandas in 2009 and what I’ve learned about building communities and sustainable projects. I distinguish between “industry-led” projects (like TensorFlow from Google, Swift from Apple) and “community-led” projects, each with different trade-offs.
Industry-led projects can move faster since a single company makes decisions, but they risk abandonment when company strategy changes. Community-led projects can be slower due to consensus-building, but they have fewer single points of failure since they don’t depend on any one company’s continued support. The downside is that community maintainers are more prone to burnout, especially when most of their time is unpaid volunteer work.
I share sobering statistics from NumFocus: pandas, NumPy, and Matplotlib together have only about 15 maintainers keeping these widely-used projects running. Most of that time is volunteer work. There’s a toxic dynamic where users develop a sense of entitlement—one pandas user literally described himself as a “customer” of the project, despite never paying anything.
I discuss several myths about open source: that “organic growth” will naturally produce features you need, that burnout isn’t a real problem, and that developers are replaceable parts. In reality, maintainers carry so much project history in their heads that it takes months or years for someone new to become effective. Tidelift found that the two biggest funding sources for open source are “self-funding” (no funding—you pay with your own time) and employer time, both of which are vulnerable to life changes.
I explain why I’ve become a fan of the Apache Software Foundation’s governance model—it emphasizes consensus-driven development, merit-based decision-making, and transparent public discussions. If discussions didn’t happen on public mailing lists, it’s as if they didn’t happen at all.
The talk ends with a call to action: contribute time or money to open source. If every pandas user gave just $1 per year, that would be enough to fund multiple full-time developers. I also announce Ursa Labs, the organization I’d just created to raise money for full-time Apache Arrow development, partnered with RStudio to build infrastructure for data science across R and Python.
Key Quotes
“The best code is the code that you don’t have to write. So if you can avoid reinventing the wheel, then you can spend your time solving other problems.”
“There’s 15 people who are essentially making these three open-source projects that we all depend on work, which is sort of terrifying.”
“I actually had an email thread one time where a pandas user described himself as a customer of the project. And so I had to sort of chew on the word customer for a little while and I was like, well, the check must have gotten lost in the mail.”
“A lot of open-source maintainers carry so much project history and knowledge in their heads that it would take, it could take an individual, months or years to develop the level of project knowledge to be effective as a project maintainer.”
“The two most significant sources of funding for open-source, the number one is self-funding, so no funding. So essentially you are paying for the project indirectly with your time.”
“If everyone gave, if every pandas user gave $1 per year… that would be a lot of money and could allow a lot of people to spend work full time working on pandas.”
“If developers burn out, it can make a project significantly weaker.”
“One of the reasons I like community-led projects is that there are fewer single points of failure. So you aren’t dependent on a single company to continue to support that project.”
“Projects also need to be maintained, and I find that communicating, trying to explain to people all of the work that goes into maintaining a large and complex open-source project, particularly for business people, it can be hard to explain.”
Transcript
Wes McKinney: This talk is about… So I’ve been building software in Python for a little bit over ten years now. You know, it’s been a little bit over ten years since I started programming in Python. About a year and a half into my first job as a software engineer, about a year and a half into that process, so this was at the end of 2009, I started to get involved with the open-source world. So I was working in a financial firm. We decided to open-source the library that was just then beginning to be known as pandas. And at that time, it was a really small project. pandas has become a really big project, and most of the development of pandas has happened since it was initially open-sourced. At that time, I didn’t know very much about what it meant to build open-source software. I didn’t know about building communities or, you know, how to make decisions in public. So over the last eight or nine years, I’ve been learning a great deal about all of the good things and bad things about building open-source projects. And as time has gone on, those challenges have evolved, particularly as things going on in the world of business and the business’s relationship with open-source has changed a great deal in the last decade. So I guess the subtitle for this talk would be Community-Led Open-Source and You. And I will explain what I mean by community-led open-source and why you should, you know, know it and how to distinguish different kinds of open-source projects and how, you know, sort of think about them and talk about them with each other.
So things in the open-source world have changed a lot in the last ten years. GitHub was started in 2008. So, you know, now when you look at how projects are built on the Internet, it’s hard to imagine life without GitHub, right? So, you know, the Python programming language is on GitHub. pandas is on GitHub. You know, there’s so many projects there. And I don’t know for a fact, but I would guess that, you know, more than 50%, probably more than 50% of open-source development now is either directly or indirectly being carried out on GitHub, which is a huge deal. And so GitHub has been really important in making it easier for people to collaborate, easier for people to come together and figure out how to solve problems.
Around ten years ago, we also saw some other very important open-source projects get off the ground. We had the Apache Hadoop project started in 2006 at Yahoo and kicked off a decade of development of projects for dealing with big data. You know, there have been many important projects that have, you know, didn’t exist ten years ago or 12 years ago. And open-source has been particularly important in, you know, the world that I’m in, which is the data science world. Now, obviously, there’s many other parts of technology and programming that have experienced great change. But, you know, data science wasn’t even a term a decade ago. We used to call it statistics or statistical computing. And so sometime in the intervening years, you know, data science came about, and now it’s become one of the most important job functions in many modern businesses. And open-source has played a really important role in enabling people to learn about tools for doing data science, to be able to educate themselves, to obtain the tools, and become productive.
So one question would be, is open-source a cause of all of this progress in data science and other areas, or is it an effect? So did having all these open-source projects, you know, help spawn the field of data science, or was open-source a consequence of the growth of the field? And really, it’s a complicated answer. I think they needed to co-develop. So if you think about how to build complex projects for solving difficult problems, having projects be open-source makes the collaboration process a great deal more productive, because if somebody builds some code to solve part of the problem, you can take that code and reuse it more easily to build new projects, either by forking a code base or building on an existing code base. If the software was closed-source, that kind of code reuse and collaboration, you know, wouldn’t happen. So, you know, you can think about it as, you know, what I like to say is the best code is the code that you don’t have to write. So if you can avoid reinventing the wheel, then you can spend your time solving other problems.
Another factor in the development of open-source is all of the, you know, I don’t know whether we’re on, like, Web 3.0 or what we’re calling it now, but, you know, the rise of, you know, the rise of big data and social media in the last ten years where we have every, you know, every website and every mobile app is now finally instrumented with data collection. So, you know, any action that you take on the Internet is being logged and stored in a Hadoop cluster or in S3 or someplace in the cloud. And so not only is massive amounts of data being collected that was not, that we were not able to collect in the past or there really wasn’t a place to collect it, but then it’s not enough to collect all that data. We need to create systems and tools to manage it and analyze it. And if you think about all of the biggest, you know, tech companies in the world that are collecting and profiting off all of this data, they needed to scale their ability to analyze that data much faster than any commercial software vendor could respond to their needs. And it happened to be that many of these companies, you know, Google, Facebook, you know, Microsoft and so forth, they believed that building the software as open source would help them make progress, help them make progress faster.
Another factor that I think’s influenced the push towards open source is the licensing models related to cloud computing. So that’s one of the biggest things that’s happened, you know, in the last decade is the development of, you know, infrastructure as a service like Amazon Web Services, Google Cloud Platform. So, you know, it used to be, you know, let’s say you wanted to do some data analysis. You would buy a MATLAB license or a SAS license. So now an individual data scientist might, you know, one day maybe they’re using one machine and the next day they’re spinning up a thousand machines on AWS to run some analysis. So whenever you wanted to scale up your analysis, if you needed to call up your software vendor and negotiate more software licenses to be able to do that, that would create a big problem for you. So increasingly, you know, this idea that the software should be free and it should be easy to deploy and install it anywhere has been, I think, really driven by, you know, the cloud model of being able to elastically scale up and scale down your computing.
Another factor has been reproducibility issues in science, and projects like IPython and Jupyter have spent a lot of time talking about reproducible research. And, you know, there have been a lot of high-profile cases where people have made mistakes in Excel spreadsheets in doing research. Or you might publish an important research result, but the data is not made available for scrutiny or the code that they wrote is not made available. And so there have been, I don’t have any to cite, but there have been many cases that you can look up around problems reproducing research results or errors that were found in the analysis, whether it’s deliberate omissions of negative results or kind of other kinds of science problems. So, you know, increasingly there’s, I mean, it hasn’t quite happened yet, but, you know, I think in the future we are on our way towards a world where along with scientific research papers there will also be the expectation that you can provide, you know, a Dockerfile or a Jupyter notebook or some other way for others to see the complete lineage of your analysis so they can look at all the steps you took, all of the data preprocessing, data cleaning that you did so they can understand all the decisions that you made in producing your research. And having the software be open source is an important part of that because it’s not enough to be able to see the code. One also needs to be able to see how the software is implemented to fully understand, you know, top to bottom how the science was produced.
So, you know, one major theme, you know, the biggest and most successful, maybe not the biggest, but the most successful open source project of our lifetimes is definitely Linux. And, you know, making Linux work in an enterprise setting, you know, turned out to be a pretty difficult problem. And it spawned, you know, really one of the most successful open source businesses, which is Red Hat. And there have been a number of other businesses started around Linux. And nowadays there are many companies which are working to, you know, to make a business out of building or providing support for open source software in an enterprise setting. But this has also created complications because in many spheres there is the expectation that the software is open source. You know, particularly in the data science world, a lot of people won’t use software if it isn’t open source because they want to look at the implementation to understand exactly, you know, all of the math and everything that goes into how the software works. But at the same time, you have all of these free software projects and often the work that open source communities do is not enough to meet the needs of enterprises. So things that often fall by the wayside are, you know, security and auditing issues, you know, integration with, you know, data systems or other systems which you really only encounter in large enterprises. And so, you know, from my perspective, having worked in many open source projects, we often don’t really see what it’s like to be inside, you know, a big company which has, you know, very complex needs. And so it’s hard for the open source community to respond to the needs of big companies that want to adopt and use the software. So this creates a bit of a problem in that companies are… people are creating startups and companies to fill in the gaps that… for what enterprises and companies need to make open source work for them. 
And sometimes it’s difficult for those companies to invest back in the underlying open source projects. And so there’s always this tension between doing things to make money and making the software in the community strong.
So this gets to one of the main points of this talk that I want to spend some time on, which is, you know, where does all of this software come from? And, you know, when I talk to people, especially when I talk to big companies, they don’t… a lot of folks don’t spend very much time thinking about how the software was produced. I think there’s this idea of open source developers, you know, like hobbyists working in their attic, spending their nights and weekends, you know, building the code. And, you know, the reality is it’s a great deal more complex than that. I mean, if I were a big company, I’m not sure that I would want to be using software libraries that were primarily somebody’s hobby or somebody’s, you know, nights and weekends project. You know, preferably, you know, the software would be produced with a high degree of professionalism. So this gets to the question of, you know, you have open source projects, but who paid for it? And so I’m going to put that paid in quotations. So there are many ways that software might be paid for. In the case of open source, a lot of… and I have a little bit of data on this in the presentation, but a lot of open source software is paid for out of people’s free time. So, you know, free time isn’t exactly free. So if I take, you know, a half of a day or a day to work on an open source project, even if no one is paying me, like, that work is being paid for out of my time and opportunity cost. I could be spending my time doing something else, which I might make some money for. And so even if I choose to spend my time on an open source project, that work isn’t free. I’m paying for the work, in effect.
And who pays for projects? You know, one of the biggest divides is when you think about how projects are led and how they are governed. And I, you know, many projects, they straddle the two kinds of major projects, but, you know, the two buckets that I spend a lot of time thinking about are projects that are led by industry or that have single corporate sponsors or projects that are primarily driven by the community that are led by a distributed network of individuals around the world. So just to give you an idea of, you know, kind of industry projects and community-led projects. So in the case of industry-led projects, often these are, you know, this is software that was created at a company that was open-sourced later. Sometimes you have a code base that wasn’t created to be an open-source project to begin with, and it gets open-sourced later. And it turns out that open-sourcing a project is a lot of work, and sometimes companies will choose not to open-source something, even if they could, because it’s a huge… the amount of work to take something from an internal code base that works on your infrastructure to something that will work, you know, that can be used, let’s say, on all three, you know, all three major operating systems and deployed in a variety of settings. You know, just in the case of pandas, as an example, if we only needed to use pandas on Linux, it would save the developers a lot of time. Let’s say Linux and Python 3. So just supporting, you know, two flavors of Python and three operating systems creates a lot of extra work for us. And if we were working in a single company, we wouldn’t necessarily need to do all of that, do all of that work.
So some examples of industry-led projects are things you might have heard of, like TensorFlow, which was created by Google, the Swift programming language, which was created recently by Apple, the Rust programming language. I think Rust is now community-governed, but it was created originally as an open-source project by Mozilla. Community-led projects are more complicated. Sometimes they’re started by ambitious individuals, you know, or a collection of developers get together and decide to launch a project they may not be affiliated with any particular company. Sometimes community-led projects will be led by a group of companies getting together and deciding to work together, but where no single company essentially controls the project.
So one part of this is, you know, when you look at, there are so many open-source projects out there, and what motivates developers to create high-quality software? So if you’re building a project for your own purposes, you have the motivation that you make high-quality software because you want it to work for your particular use case and your business. If the code has a lot of bugs and it has a lot of problems, maybe you’ll get fired, so you’re motivated to not get fired. But in the case of open-source projects, in particular community projects, the incentives for producing high-quality software can be a lot more complicated. So you care about things like, will people take us seriously? So maybe if people don’t take you seriously, they might look at your software and say, well, these developers that make this library, they aren’t very good, so we need to solve this problem, so we’re gonna start a new project or we’re gonna fork that code base and ignore those developers. And you do see that kind of thing happen. And I know that I spend time worrying, and a lot of open-source developers do worry about giving people the impression that the software they are producing is high-quality, even if they, you know, they aren’t directly… If bad things… If something bad happens to the project, it won’t always negatively impact the developers directly.
So open-source has also become really important, you know, kind of as I was saying before around trust and transparency, and this idea of software freedom and being able to not have vendor lock-in and being able to walk away from, you know, walk away from a collaboration or walk away from a company that you’re working with on a project. You know, I think in the past, like, people have experienced all the problems that come with vendor lock-in, and open-source has given them a way out of that.
So a variety of problems can occur in these different types of projects. So some of the issues that I’ve seen and other people have seen in corporate-led projects is some… You know, when you have a project that’s being led or driven by a single company, sometimes that company might change their strategy or they might have a, you know, something related to how their business is doing, and so they might either stop working on the project or take the engineers that were working on the project and assign them to a new project. And so when you think about, like, who’s paying for a project, so if people are, you know, if the developers of an open-source project are being paid by a company, there’s the risk that, you know, their boss might say, well, you can’t work on that open-source project anymore. We need your skills and time to work on something else. Sometimes the… Sometimes projects are produced by startups. So I’ve seen this happen many times where there’s a promising open-source project that’s developed by a startup, and then either that startup fails or it gets acquired or it pivots. So in each of those cases, you could have a situation where suddenly the developers of that project are no longer able to continue supporting it, which leaves the users in a bit of a bind.
So this is not to say that community projects are without their problems. So sometimes in community-led projects, progress can be a little bit slower because it takes a longer time to make decisions. So when you have a single company making all of the decisions and effectively owning the project, sometimes it can move faster. In community-led projects, you know, maybe it takes more time to build consensus about major changes in the project. Consensus can also be good, which I’ll say a few words about later. But I find that the developers of community-led projects, because often a much larger proportion of their time is spent, is unpaid time or time that they’re taking out of their lives, that the developers are much more prone to burnout and, you know, essentially burnout and leaving the project or having to take time off from the project. And I’ve experienced, you know, burnout in my own work, and I know many open-source developers have struggled for years with burnout, keeping up with the demands of keeping the user community healthy and happy. Sometimes developers are, if they’re working on the project in their free time, they might get a new full-time job or, you know, they might have a new consulting project that they need to pay their bills that prevents them from working on the project.
Another more insidious problem that I see in community projects is that there can often be an underinvestment in testing, in packaging, and continuous deployment. And so some of the issues that you have with poor testing or poor development infrastructure, they may take a while to present themselves. And really, it ultimately makes the development community less productive because you don’t have as much of that support infrastructure for running a successful project.
So there was a NumFocus. So NumFocus is a nonprofit organization in the United States for funding projects like pandas. So it provides a way for organizations to give money to NumPy, to Matplotlib, to pandas, to many other open-source projects. I don’t know if you can read this, but there was a slide in… I should look up who actually, whose presentation this was, but they looked at three major projects in the Python ecosystem, the pandas project, NumPy, and Matplotlib. They looked at how much, you know, how many times have these been downloaded approximately, how much did these projects cost to build in terms of people time, how many contributors do they have, and then ultimately how many project maintainers are keeping the project running. So I don’t know if you know the difference between project contributors and project maintainers. So the maintainers are typically the core developers who are responsible for all of the flow of code and changes into the project. So contributors might open pull requests on GitHub, but it’s the maintainers that give code review that decide whether a patch or a pull request is good enough and gets merged into the code base. And so, you know, pandas has approximately four maintainers. NumPy has approximately six. Matplotlib has five. And so the tweet that, you know, about it was that, you know, there’s 15 people who are essentially making these three open-source projects that we all depend on work, which is sort of terrifying. And most of the… and I know at least in the case of pandas that most of the time that is spent by the maintainers is volunteer time. And so it’s kind of a terrifying situation when you consider how important pandas has become to the community.
So there can also be, you know, toxic relationships between the user bases and the developers. Like, I find that open-source users often grow a sense of, you know, often grow a sense of entitlement or they feel like, you know, that by using that project that they’re owed something by the developers of it. And so I actually had an email thread one time where a pandas user described himself as a customer of the project. And so, you know, I had to sort of chew on the word customer for a little while and I was like, well, the check must have gotten lost in the mail. And it’s a surprisingly common view that people invest their time in using a project and they feel that the developers owe them. And I’m not, you know, quite sure how true that is.
So there’s a number of myths that surround open-source projects. So one of them is this idea of organic growth. So the idea that, you know, essentially open-source software springs forth from the ground. And if a project doesn’t solve, doesn’t do what you need or doesn’t have the features that you need, that one solution is simply to make the feature requests and to wait and hope that kind of as organic growth proceeds that, you know, someone in the world will build the things that you need. Another myth that I hear is the idea that developer burnout is not really a problem. That essentially, you know, there’s 7 billion people in the world and a certain percentage of people will be interested in building open-source projects. And so if the maintainers of big open-source projects burn out, then as a result of that random process, other people will show up and pick up the slack from them. I’m not sure how true that is because if, you know, you came to work on an open-source project where a bunch of maintainers or developers had burned out, you would probably know and it might, you know, give you pause whether you choose to get involved in that project. So if developers burn out, it can make a project significantly weaker.
Kind of along these lines, this idea that, you know, I think a lot of businesses think about engineers as replaceable parts. So, you know, if one, you know, essentially if somebody gets transferred off a project, we can find somebody on short term to kind of take their place and take up the reins of developing a project. But, you know, the reality is, you know, a lot of open-source maintainers carry so much project history and knowledge in their heads that it would take, it could take an individual, you know, months or years to develop the level of project knowledge to be effective as a project maintainer. So I’m always impressed with Jeff Reback and other pandas maintainers who seem to have an encyclopedic knowledge of the, you know, pandas issue tracker, which has had, you know, 12 or 14,000 issues. And, you know, Jeff will often, you know, an issue comes up and he’ll know 21,316. Somehow he just has all the numbers memorized. It’s pretty impressive.
So there have been some surveys run about how open-source development is funded. So there’s a startup that’s working on open-source economics called Tidelift, and they found that the two most significant sources of funding for open-source, the number one is self-funding, so no funding. So essentially you are paying for the project indirectly with your time. And the second most is from employer time. And if you think about it, both of these, both of these sources of funding are very vulnerable. So if you, you know, at some point you need to make money or you may, you know, something may change in your life where you’re not able to spend, you know, people with families and children. I don’t have any children, but if I did have children, it would be much harder for me to spend time working on open-source. But circumstances change in people’s lives, and so that top one could go away really quickly. And employer support can also be unreliable as, you know, businesses do change over time in their ability to give people the time and space to work on open-source.
The work is also split between maintenance and innovation. I think people tend to understand the idea of building new stuff, so people like it when you build new projects. When pandas came around, I think people were really excited that they could read CSV files now, and I think that probably pandas’ killer app is its read_csv function. If I could cite one thing that made the project successful, it would be that. And so when companies think about paying for open-source, they like the idea of innovation because you’re building new things that weren’t there before. But projects also need to be maintained, and I find that communicating, you know, trying to explain to people all of the work that goes into maintaining a large and complex open-source project, particularly for business people, it can be hard to explain. So in the case of pandas, like just explaining, you know, trying to summarize the last 1,000 issues that were closed to the last 5,000 issues, the kinds of bugs and problems that come up are so difficult to understand if you aren’t working every day inside the code base. And so just that difficulty in explaining all of the time and effort that goes into building the project makes it even harder to convince people to give financial support to do the work. Because it’s easy to say, okay, this money is buying me a new feature that I didn’t have before. And so, you know, the same amount of money might be harder to get to maintain the project exactly as it is now.
So there was a report. So a woman named Nadia Eghbal wrote a report called Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure a few years ago. And it centered on some high-profile problems in open-source projects like the Heartbleed vulnerability in OpenSSL to essentially shed light on the kinds of problems that can impact businesses caused by open-source maintenance problems. So I encourage you to check that out and learn more about that.
So I’d like to take my last few minutes to talk about why I think community, even though community-led open-source projects have a lot of problems, why do they matter? And why should you care? One of the reasons I like community-led projects is that there are fewer single points of failure. So you aren’t dependent on a single company to continue to support that project. So I find that the community can be a great deal more robust, and they work hard. And as a result of that, you know, distributed nature of how the project functions, the developers are more incentivized to recruit new maintainers and new contributors into the project. And so if you’re just, if you’re working inside a company, you have an employment relationship with, you know, with your boss or your management chain around building that project. And so you may not be as motivated to recruit new developers for the project.
Community projects often make decisions based on consensus, and they often document those decisions on public channels. And so it gives greater transparency to people who are using the project, who are in the community, so they can see how the project works and how they can get involved and gain influence in the project. Influence is often gained by doing work, which is also a very good feature: if you want to influence the direction of the project, you can gain karma, gain a say in how the project works, by contributing.
Open source developers often fall into a number of traps when they try to finance their work. So I think the three biggest traps that I see, one is starting a… Great, so I’ve got ten minutes to go. So one of the traps that people fall into is starting companies, doing a startup to support the project. There are a number of problems that can happen there. One of the biggest ones, and I’ve lived in Silicon Valley and had a venture-backed startup, is that startups can create conflicts between the desires of the startup’s founders and employees and the demands of their investors. And so sometimes a company will start an open source project, and then as time goes on, they come under more pressure to monetize and to become profitable, and the time that they’re able to spend on the open source project may dry up or may shift to proprietary software that is built on the open source project, and that can ultimately harm the user community.
Sometimes developers have a single corporate sponsor or work inside a big company that uses the project. As I’ve been talking about, this also poses the risk that they might get transferred to a new project or that the company’s budget for supporting that project may change. Another trap that I’ve personally experienced is doing consulting to support yourself. So you find somebody who uses an open source library, you strike up a consulting project with them, and you use the money that you make from the consulting project to give you time to work on open source. And so this can be problematic because you end up caught up in the hustle of getting contracts and supporting yourself, and the consulting project is often using the open source software to solve a business problem, not developing the project itself. And so I’ve seen plenty of developers start doing consulting and find that they are only able to spend 20% of their time doing open source development, which is too bad.
So just thinking about community-led projects, in the last few years, I’ve gotten involved in the Apache Software Foundation, which is a non-profit organization in the U.S. that has developed a governance framework for open source projects that are community-oriented. And so the Apache Software Foundation, also known as the ASF, does not pay for open source development work, but it provides cultural norms to guide projects in how they should function and how they can build healthy developer communities. These are some of the things I’ve been talking about: the idea that decisions should be made on the basis of consensus, that decision-making power should derive from merit and contributions to the project, and that decisions and communications in the project should be open and transparent and viewable by anyone on the Internet. So I’ve become a big fan of the ASF, and a lot of my work nowadays is happening in Apache projects because I believe that the culture of Apache projects is very effective at creating healthy and robust developer communities.
So the question is, community-led projects are difficult, so what happens if, 10 or 20 years from now, developers by and large find that they’re too difficult and decide we should just focus on projects that are created by Google or Facebook or Microsoft and so forth? And there is that risk that if users lose trust in community-managed projects that don’t have a huge tech company behind them, they might stop using those projects and say, well, if the code isn’t produced by Google, or a company of the size or influence of Google, we don’t want to use it. So there was a movie that came out over 20 years ago called Demolition Man that’s set in the future, and I guess as a result of some things that happen, all the restaurants in the world are Taco Bell. And so there’s a joke in the movie that now all restaurants are Taco Bell. So that’s one possible bad thing that could happen in the future: maybe the only effective or successful open source projects are projects coming from the ten largest tech companies in the world. I would find that to be pretty sad.
So I don’t have a great deal of answers. I’ve been doing open source for a really long time, and so these are the things that I think about because I want to see healthy communities develop, for the users to be productive and for the projects to innovate and become better than they are now. I think we’ve come a long way over the last ten years, and ten years from now, I would like to look back on today and say, wow, we’ve come so far in the last ten years. And I think these sorts of funding and support issues are making things a lot harder.
So there are two big ways to help open source projects, either you or your companies. One way is giving time. So contributing in all the different ways that you can contribute to open source projects, whether contributing code or writing documentation, answering questions on Stack Overflow, doing code reviews, all of that stuff. And if you have the ability to give money, a little bit goes a long way. Like, I think if every pandas user gave $1 per year, and I’m sure that we all get $1 a year of… I guess it’s what, 30, 32 baht? Don’t give it to me. You can give it to the project. But I think, if you think about how much value you get out of these projects, you say, well, sure, I would give a dollar. But even if we wanted to collect a dollar from everyone, it would be complicated to do that. If everyone did give a dollar, that would be a lot of money and could allow a lot of people to work full time on pandas. So every little bit helps.
To help with raising money for open source development and the projects that I care about, I’ve just created a new organization called Ursa Labs. And this was a couple of months ago, so I’m just getting started. Over the last few years, I’ve been working on an open source project called Apache Arrow, which you can read more about on the internet. But the basic idea is that we’re building shared infrastructure for data science to make projects like pandas faster and more scalable. We’re building libraries that can be used in many different programming languages. So we’re working on Python, and we’ll be building R bindings for Arrow. We’ve got JavaScript and Rust and Ruby and C and Java, and there’s a ton of stuff happening. So we’re raising money to hire full-time open source developers for Apache Arrow. The team is embedded within RStudio, which is an interesting company. It’s an R-based company, but we formed an alliance to build infrastructure for data science.
So when I talk to people about giving time or money to open source projects, it helps to have success stories. In my experience, only good things happen when projects get funded. The most common case is that projects don’t get funded, and so you never get to see what might have been. So just as an example, someone I know in the Python community, Nathaniel Smith, has done a lot of work on statistical computing in Python. If any of you use statsmodels or the Patsy project, he created Patsy and has been involved in statsmodels and many other projects. Nathaniel was funded by the Berkeley Institute for Data Science in California for two years. So a single person was funded for two years, and he was able to work full time on scientific projects for Python. He had four big projects and was able to make a huge impact as a single individual. And so when you hear about individuals making so much impact, I think about what it might be like if we had the ability to fund ten times as many people or a hundred times as many people to dedicate their efforts to making the software better. So I think we’ll get there eventually. Thank you for your attention. I hope this gives you some things to think about, and I’m looking forward to working with all of you over the coming years to build better software for data science. Thank you.