Ken’s Nearest Neighbors: The Future of Open Source

Podcast
Event Ken’s Nearest Neighbors
Location Remote
Date March 22, 2024

This transcript and summary were AI-generated and may contain errors.

Summary

In this interview with Ken Jee, I discuss my journey from quantitative finance to creating pandas, and my current work at Posit. I reflect on the early days of building pandas at AQR Capital during the 2008 financial crisis, how I convinced my employers to open-source the project, and the mentorship I received from John Hunter (creator of Matplotlib). I explore the tension between backwards compatibility and innovation in mature open-source projects, the evolution of the Apache Arrow ecosystem, and my role in bringing Python and R communities together. The conversation also covers open-source sustainability, the challenges of “techno-feudalism” in the AI era, and whether open-source can compete with well-funded proprietary AI development.

Key Quotes

“Long story short, after about six months of me whining and complaining, I think they decided that releasing pandas as an open source project was better than having me resign to go and rebuild it from scratch.” — Wes McKinney

“John really took me under his wing and helped orient me in the Python community and taught me about open source community building, kind of looking at what he had done with building Matplotlib and what he’d learned from the NumPy and SciPy and the early scientific Python community.” — Wes McKinney

“I think it’s a sign of maturity and progress in an open source project when the creator is able to take a step back, loosen their grip, and allow other people to step into a leadership role.” — Wes McKinney

“I really developed this mindset of everything must be really easy. If we can get this down to one line of code that’s easy to write, easy to reason about.” — Wes McKinney

“pandas is one of the most successful projects that’s been almost entirely volunteer-based, that’s had very little funding, very little corporate sponsorship.” — Wes McKinney

“One of the things that really helps is just making sure that as much of your project communication as possible is happening in public.” — Wes McKinney

“Open source was supposed to be about freedom, freedom to understand, freedom to modify. And, you know, fortunately or unfortunately, open source has become a go-to-market strategy.” — Wes McKinney

“The goal is to create products and services that are so essential that a large fraction of the world is like their serfs. They are the new feudal lords and they collect rent from their vassals and their serfs.” — Wes McKinney

“How am I supposed to have a life and a family and stuff if I’m working 80 or 100 hours a week for the rest of my life to engage in these open source heroics?” — Wes McKinney

“Moving from a more federated, fragmented ecosystem to one that’s more like working together and more focused on usability and user experience. I think we’re getting there, but it’s taken a lot of labor and a lot of time to make progress toward that.” — Wes McKinney

Transcript

Ken Jee: Wes, welcome to the Ken’s Nearest Neighbors podcast.

Wes McKinney: Thanks for having me.

Ken Jee: Yeah, I’m so excited to have you on. Obviously you’re working with Posit now, but you’re synonymous with pandas and the creation of that beautiful tool that I’ve used quite a bit over the course of my career as a data scientist. I would love to get the audience a little bit more familiar with you, to hear about how you first got interested in data and maybe how that’s played out over time into what you’re working on now.

Wes McKinney: Yeah, it’s been a pretty wild ride up until now. I’ve told the story in many conference talks and podcasts and things over the years, but the basic idea is that I studied pure math in college, and I got a job in quant finance right as the great financial crisis was starting in 2007. And what happened during that era was that there was a huge amount of stress and pressure to be able to analyze data more quickly and turn around research results that you could use to react to what was happening in the markets.

And so not just where I was at AQR, but also across the financial industry, there was this realization that we needed to make the tools, and the programming tools in particular, a lot better, so that you could look at data, get access to data, and manipulate the data a lot more easily. And in my case, I was under a lot of pressure, basically trying not to get fired. We were writing a lot of Excel and SQL and a little smidgen of R; this was pre-RStudio, back in 2007. And I discovered Python and thought, wow, this is a really great scripting language for doing basic stuff, interacting with text files, CSV files. What if we had some tools to do data analysis? What if I could make myself more productive at work using this programming language?

But there was a missing tool set for doing those data manipulations, working with time series data, all the basic table data operations, the data frame operations that you would find in statistical computing languages, in R and in other places. And so it started out as a skunkworks project that I didn’t even have permission for from my boss or anyone I was working with; it was just my personal tool set. And I started socializing it with my coworkers in late 2008, early 2009.

And then at some point during 2009 (by then, some of my colleagues at AQR had started helping me with the project), I realized that what we had built had a lot of potential value to society by making Python more useful for data analysis more generally. And so I started talking with my bosses about, hey, can we open source this?

And financial firms are not known for being the most charitable when it comes to open sourcing things, especially in that era. So, long story short, after about six months of me whining and complaining, I think they decided that releasing pandas as an open source project was better than having me resign to go and rebuild it from scratch. And by the end, they were very enthusiastic about open sourcing it.

And I gave my first talk to the Python community at PyCon 2010, a little over 14 years ago, in Atlanta. That was my first exposure to the open source ecosystem. I got connected with John Hunter, who created Matplotlib. He also worked in finance, and that was partly how we connected, because I was interested in meeting Python programmers who worked in finance. And John really took me under his wing and helped orient me in the Python community and taught me about open source community building, kind of looking at what he had done with building Matplotlib and what he’d learned from the NumPy and SciPy and the early scientific Python community.

And so I had really great mentors and saw the value of open source community building. And really, that set me on the path of figuring out how to rearrange my life to be able to do more open source development. It was a very intense period; I worked pretty intensely on pandas until 2013, and then it became a community project. I got busy doing a startup, and enough of a group of core developers had come out of the woodwork, people I’d met and encouraged to get involved, that I was comfortable handing over the keys to the project for them to lead development.

And so development has been led by a group of core developers over the last 10 or 11 years, and that’s been really fantastic. I think it’s a sign of maturity and progress in an open source project when the creator is able to take a step back, loosen their grip, and allow other people to step into a leadership role. And yeah, that’s enabled me to go off and work on other projects, which is great.

Ken Jee: I love that. Something that I find really interesting, something I’m seeing in a lot of different areas, is that there are largely no new ideas, right? We’re pulling from different areas and things along those lines. What were the models for creating pandas that weren’t necessarily related to Python? Were you pulling from your pure math background? What was the specific need you were seeing that led you to create this infrastructure?

Wes McKinney: Yeah, so I had some colleagues who were using R and doing statistics and modeling work in R. Basically, they were doing linear regression, robust linear regression, getting data out of SQL databases, wrangling it a bit in R, and running regressions to do time series forecasting. So I had that exposure, interacting with the developers who were using R, and I got some R programming experience. I’d also written a lot of SQL by that point, all Microsoft SQL Server. And I’d used spreadsheets, and I had a math background.

But I also was learning about the ethos of the Python community, the Zen of Python, the ideas like practicality beats purity, like there should be one and preferably only one obvious way to do it. And so we started sort of looking at things that were hard to do in my work for my job. I really developed this mindset of everything must be really easy. If we can get this down to one line of code that’s easy to write, easy to reason about. And I think this has driven a lot of other projects in the Python community, just this concept of it needs to be easy, it needs to be obvious.
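That “one line of code” ethos is easy to see in the pandas API as it exists today. As an illustrative sketch (the data and column names here are invented), an aggregation that would take an explicit loop and a dictionary in plain Python collapses into a single readable line:

```python
import pandas as pd

# Hypothetical data; column names are invented for illustration.
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "revenue": [100.0, 200.0, 150.0, 50.0],
})

# The "one obvious line" version: group, select, aggregate.
totals = df.groupby("region")["revenue"].sum()
print(totals)
```

The equivalent hand-rolled loop would be several lines of bookkeeping; the pandas version is both shorter and easier to reason about.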

And so that kind of pushed towards both being pragmatic and not overly wedded to technical purity and making things really easy. That kind of became the Zen of pandas development. Now, of course, data itself is very complicated and messy. And so as a result, pandas has grown into a pretty large and complicated project because it’s trying to accommodate all of these different usage patterns of people doing data analysis.

And so there are many different shapes and varieties of data, and pandas has become one of the ultimate tools for environments with high data heterogeneity, where the type and the shape and the structure of the data vary a lot from use case to use case. There are some tools that are more suited to the kind of data you would find in business intelligence, the kind you manipulate in SQL; you can work with that data in pandas, but if you’re working with financial time series data, you can do that in pandas too. And so we tried to create a very inclusive project that was open to adding new things that would make the tent bigger, a tool that could bring in a wide variety of types of users, if that makes sense.

Ken Jee: Yeah, that makes a tremendous amount of sense. I’m interested, as we’re talking about bringing in new types of users and growing: were there any major growing pains that you had early on? What were some of the challenges you faced? Obviously it didn’t start at the scale it is now, but going from something that just you are using, to a team using, to a whole community using, I would imagine it was probably not the most smooth sailing the whole way.

Wes McKinney: Yeah, some of the earliest challenges were internal design details concerning our relationship with NumPy, and how important NumPy compatibility or interoperability was. And so that led to a bunch of decisions that the community is currently in the process of trying to unwind.

So for example, how does pandas take views, or how does it mutate or interact with data that originated in NumPy? In the 2008, 2009 era, we believed that if pandas was going to be used at all, it needed to have a high degree of interoperability. You needed to be able to jump back and forth between NumPy and pandas as efficiently as possible, with as little impedance mismatch, as little difference in behavior, as possible.

And so this led to designing for NumPy. But now, fast forward 15 years, NumPy is not nearly as important as it once was. A lot of people who use pandas never need to use NumPy directly, or NumPy is only there as an annoying detail. Sometimes you get NumPy arrays out and you’re like, what are these things? What do I do with this? This doesn’t have the same functions and features that pandas does.
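To make the aliasing question concrete, here is a minimal sketch of the kind of design decision being unwound; whether the write below propagates back into the original NumPy array depends on the pandas version and its copy-on-write setting:

```python
import numpy as np
import pandas as pd

# A Series built on top of a NumPy array. Whether the two alias each
# other, and whether writes propagate between them, is exactly the kind
# of early NumPy-interop decision discussed above (cf. the copy-on-write
# work in pandas 2.x and later).
arr = np.ones(3)
s = pd.Series(arr)

# Under copy-on-write this leaves arr untouched; in legacy pandas the
# write could flow back into the original NumPy array.
s.iloc[0] = 42.0
print(arr[0], s.iloc[0])
```

Either behavior can be defended, which is why backing out of the original choice has taken the community years of deprecation cycles.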

So that’s definitely one challenge: what types of use cases do we optimize for? Often, if you want to make something faster, you’re making something else slower in the process. And so choosing which things we care about making fast, that was also a challenge.

I think longer term, since I haven’t been actively involved in day-to-day development since 2013, 2014, it’s been led mostly by people like Jeff Reback, Joris Van den Bossche, Tom Augspurger, and others. There’s a whole cast of pandas core team members who are super active maintaining the project. Marc Garcia has done a huge amount for growing the international pandas developer community, running documentation sprints, and all those kinds of things.

And I think one of the conflicts that I’ve watched play out, and I haven’t gotten involved too much because it’s better for communities to work out their differences on their own, is almost like politics: the progressive camp and the conservative camp. In the conservative camp, there’s a lot of concern about backwards compatibility, supporting legacy code bases, legacy users.

And we recognize that there are whole businesses that are pandas shops, top to bottom, where it’s the basic tool they use for all of their analysis and modeling and statistics. But a lot of people don’t have rigorous tests or other ways to check whether they’re relying on some odd corner of pandas where a non-backwards-compatible change is going to break their models silently.

And so there’s a lot of concern about breaking, backwards-incompatible changes. And that often is in conflict with making important design or structural improvements to the project, improvements that can address design shortcomings or other decisions that were made a long time ago, where maybe now we recognize, yeah, this was a trade-off that was important in 2009, but in 2024 we would make a different decision, because the world has changed a lot.

But still, there’s that push-pull between progress and being willing to break things and say, hey, maybe it’s okay; we’ve warned people for several years that we’re gonna break this, maybe it’s okay that we break it now. But yeah, there’s that back and forth over how much pandas is in maintenance mode, just being stable and reliable and exactly the way it was several years ago, versus identifying the things that are really gonna move the project forward.

I think more recently, some of the work has been oriented at improving memory use and especially working with larger datasets, given that that’s been a problem for the project: it works really well for small datasets, but with larger datasets you can hit out-of-memory errors on your laptop pretty easily. The general wisdom is that you should have five to ten times as much RAM on your computer as the size of your dataset.

And so people sometimes are like, what, I’ve got 64 gigabytes of RAM on my MacBook Pro, why can’t I work with a 10-gigabyte dataset? It’s like, well, you can, but you have to be really careful, because if you do X, Y, and Z, you’re gonna blow out your system’s available memory and have a bad time.
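A rough back-of-the-envelope check of that rule of thumb, using `DataFrame.memory_usage` (the dataset shape here is invented for illustration):

```python
import numpy as np
import pandas as pd

# One million float64 rows across 5 columns:
# 8 bytes * 5 columns * 1,000,000 rows ≈ 40 MB resident.
df = pd.DataFrame(np.zeros((1_000_000, 5)))

mb = df.memory_usage(deep=True).sum() / 1e6
print(f"~{mb:.0f} MB")

# Sorts, joins, and dtype conversions can transiently materialize extra
# copies of this data, which is where the "five to ten times your dataset
# size in RAM" rule of thumb comes from.
```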

So I see the different perspectives there, and, as with any large open-source project, these are the types of things that have to be negotiated and debated. I think one of the great things is that all of the discussion and debate is happening in public, so anybody who’s interested in the project can go on GitHub and see what the developers are saying. And thankfully, this community has, I think, a very positive tone and doesn’t have a lot of bad actors. People are pretty pragmatic and just like to get things done, very rarely making ad hominem attacks on each other. You’ve seen legendary examples of that from the Linux community. So I think it’s a very cordial group of developers, and everyone just wants to protect the users and get things done, even when they disagree.

So yeah, it’s kind of like the Ruby community, where the mantra is, Matz is nice and so we are nice. I dislike conflict, and I tend to avoid people who are disagreeable in open source. So perhaps the reason the pandas community ended up the way it is, very collegial and very positive, was also that if people showed up with a bad attitude, maybe I wouldn’t have been so encouraging of their involvement in the community. The early people who got involved were people running on the same wavelength, of the very down-to-earth kind: let’s just build features, let’s get things done, let’s keep our egos out of it, and realize that we’re all just people trying to build software, and we’re all doing this as volunteers.

pandas is one of the most successful projects that’s been almost entirely volunteer-based, that’s had very little funding, very little corporate sponsorship. In more recent years, I’ve gotten to experience the many different models of providing financial support for open-source development to tackle ambitious new projects, and that’s a big can of worms.

Ken Jee: Oh, I bet. Something I’m interested in: you were obviously able to shape a lot of the early community and culture in the project, and then it moved from being directed by you and a smaller team to being something that’s broadly open source and worked on by a community. How do you make that transition successfully? I think we see a lot of projects go a little bit off the rails in a transition like that, without a clear vision and clear leadership. Is that something that was created by some of the cultural elements you were describing before, or is there some other secret sauce in there?

Wes McKinney: Yeah, I think it did require being mindful of the potential risks that were involved. So I’ll give you a specific example. When Anaconda was founded at the beginning of 2012, it was called Continuum Analytics, and there started to be, for the first time, some significant venture capital in the Python world.

And Anaconda has, over its lifetime, been a tremendous partner to pandas; it has paid developers to work on the project and do maintenance. But there were people who were concerned about the potentially corrupting influence of venture capital in the community, that it might lead to a group of developers working in private, designing things or doing new things in the project without explaining to the other developers what they’re planning or what they intend to do.

And sometimes you do see this in open source projects, this one-sided communication where you have a secret cabal of developers working on things, and every now and then they’ll put up a pull request, but you have no visibility into what’s going on in their heads. What are they planning? What’s motivating this work that they’re doing?

And luckily, we haven’t had too much of that, but there was an active discussion amongst the developers: how can we inure ourselves, how can we protect ourselves against the possible corrupting influence of for-profit corporations getting involved in pandas development?

And I think that one of the things that really helps is just making sure that as much of your project communication as possible is happening in public. And even when it’s tedious, that if you have a video call, that the call is as open as possible, that you publish notes about what you discuss in the call, if you make a decision in the call, that you write up the decision and your rationale for the decision, and you post that in some kind of a public forum.

And in more recent years, I’ve gotten involved in projects in the Apache Software Foundation, and in Apache-branded projects, there’s actually a mandate that all project communications be public for this exact reason, that you want to create this safe space where corporations can collaborate with each other, especially when you have salaried developers, like people who are working on an open-source project full-time or part-time, definitely as part of their employment.

And if you have Oracle over here contributing to this project, and Amazon and Microsoft over there, the corporate interests have their product managers and their internal incentive structures. You want to make sure that everyone’s playing by the same set of rules: if they’re planning a big piece of work or they’re thinking about something in the project, there’s a burden that they articulate what they’re doing and the technical rationale, even if they don’t necessarily say why they’re doing it internally or what they’re going to use it for.

So sometimes you see people building some new feature, and you look at it and it’s curious: why do they care so much about this? What is this for? And often you don’t know what it’s for, but still, there’s an expectation that there’s a technical argument explaining why this is useful, why this should be a part of this open source project, and who else it would be useful for.

And so I’ve really enjoyed the Apache way of building software projects, where all project communications are public and discussions happen in public. Creating that kind of open dialogue also makes it easier for other people to get involved. Imagine how difficult it would be if you have this open source project that you’re using, you’re interested in contributing, you show up, and you’re like, where’s the dialogue happening? You realize the dialogue’s happening in private. Maybe there are GitHub issues and pull requests and stuff, but unless you’re part of the secret club, you’re left out. That’s no way to create an inclusive community. And there are already great inclusivity issues in open source and in the software world more generally, so when you have an exclusive environment, with things happening in private, that makes it all the harder for people to get involved in the project.

So I think in the case of pandas, we stumbled into developing that type of culture from the outset, just because intuitively it seemed like the right thing to do. But more recently, with Apache Arrow and Parquet and some of the other projects I’ve been involved in, it’s an explicit part of the project charter that this is the only way we’re allowed to work. In practice, that has made it safer, and it makes it safer for corporations to feel comfortable collaborating with their competitors. So you can have cloud vendors collaborating with each other on a project. In Apache Arrow, we’ve had developers from Meta and Google and Microsoft and Apple all collaborating, even though different parts of their businesses compete with each other, because they all share this piece of technology that they use in their products. And because there’s this established way of working, we can’t have Apple developers or Microsoft developers working in private. There’s a burden that they go on the mailing list or on GitHub and say, this is what we want to do, and here’s the technical argument, the technical merit, for this work.

So yeah, it’s been very interesting to learn about all that and see what works well and what doesn’t. But open source has become a big business, and that’s something I didn’t expect, to go so far down the rabbit hole of corporate sponsorship of open source. At a certain point, I realized that, as an individual, there’s only so much I can personally do, no amount of heroics and late nights and weekends. How am I supposed to have a life and a family and stuff if I’m working 80 or 100 hours a week for the rest of my life to engage in these open source heroics?

To scale myself and to scale the efforts of the community, you have to recruit other developers. Developers need to get paid as much as possible, so there needs to be money flowing into these projects, so we have to convince corporations why they should be putting money on the table to fund the development of these projects, why they should be donating money to charitable organizations that fund open source development or provide grants for specific projects. I think that has gotten a lot better in the last 10 years, and now there are numerous active grant programs for open source work, and contributing to open source has become an increasingly accepted and standard way of operating in corporations.

It went from being something that was weird, something where you’d think, oh, I’ll never get my boss to let me contribute to this project or allocate 20% or 40% of my time for open source, to something where people have now seen many examples of the value of doing work within the framework of an open source project that you depend on, as opposed to building everything in-house and maybe some day later throwing code over the wall.

Ken Jee: So all of what you’re describing there, it now makes so much sense how Posit fits into this for you. I was wondering if you could expand on that and on the decision, since it seems to align directly with what you were just describing.

Wes McKinney: Yeah, so I first met JJ Allaire, who founded Posit, in 2012, which is a long time ago now. I knew some R, and I met JJ at the R/Finance conference. I was there as the Python heretic.

And so I met Joe Cheng, the CTO of Posit, and JJ, and I was really impressed with what they had built with RStudio. We got to chatting about what, at the time, was the struggle: how do we get more people to use open source for doing data science and statistics work?

And so we realized that we shared this mission of moving the ecosystem away from closed-source computing environments like MATLAB and SPSS and SAS to open source technology. So we really bonded over that in the early days. I think there was a lot of work to do to make the RStudio IDE and the R ecosystem more successful; this was even before Hadley Wickham joined RStudio.

And so that was how we initially got in touch, and I stayed in touch with JJ for the next few years; shortly after that, I started a company, Datapad. And then in 2015, I started putting together an open source project called Apache Arrow. The idea was to create a reusable infrastructure layer for sharing data frames between programming languages and for building high-performance computing engines that could power data frame libraries.

And so Hadley had just spent several years building dplyr, which was emerging as a next-generation API for data frames in R. I got back in touch with JJ and Hadley in early 2016, when we were starting Arrow. Hadley and I got together and built the Feather format, which was the first example of really good data connectivity between Python and R.

And so after a couple of years working on Arrow, with Two Sigma as my big sponsor for Arrow development in those early days, I reached a point where I knew that I needed to scale up Arrow development. And Posit had grown, and Posit’s mission, building open source data science but also building a sustainable business that can fund that open source work, really resonated.

And so I was really attracted to that; I wanted to find a way to be involved in that mission while also funding Arrow development. So I met up with JJ and Hadley, spent a day with them, and we started kicking around ideas. Posit was willing to put serious funding on the table to fund Arrow development and help with R-Python interoperability; end the language wars was the way that we were thinking about it at the time.

But we also wanted to create a structure where I could raise money from other companies to fund Arrow development as well. That’s what became Ursa Labs, and it was my first formal partnership with Posit. During that period, anyone who was part of Ursa Labs was technically also a Posit employee, including myself. They handled the back office of Ursa Labs, and we were able to focus 100% of our time on Arrow development.

Posit was the biggest financial backer of that work. And yeah, we’ve had a long relationship now, and they helped me incubate Voltron Data, the startup that I spent the last several years working on. They enabled us to work for multiple years, 100% on Arrow development, without having to take venture capital and be beholden to investors.

And that enabled us to learn about the ecosystem, learn about what was needed and what the opportunities were, before we decided to spin out and start a company. And so, yeah, after four years of working on that startup, I’d reached a transitional period in my journey as an entrepreneur and decided to come back to Posit in a full-time capacity.

Also because the company had rebranded from RStudio to Posit and expanded its mission to include Python, to become a polyglot computing and data science company. And I really wanted to be a part of that and help it succeed even more than it already has.

Ken Jee: Yeah, well, I don’t think they could have brought on anyone better to help push that polyglot messaging and ideal forward. With you and Hadley, they have probably two of the most prominent people in shaping these two languages, and I think it’s incredible to have you both under one roof, pushing them both forward together. That’s fascinating to me.

I’m interested, if we rewind a little bit, in the open source versus closed source conversation. In machine learning tools, and in data organizational tools and that general space, I think open source has overwhelmingly won, in the sense that most data scientists I know are using pandas, Polars, whatever it might be. They’re also using scikit-learn and some of these other tools for building their models.

When it comes to creating pipelines, maybe there are some other tools being used, et cetera, but broadly speaking, more traditional machine learning is almost exclusively open source, unless you’re at a Meta or a Google or an Amazon, where they have internal tools. But as we’re evolving into what many people would consider an AI age, it seems like privatized models are more commonplace. They’re also more widely used than, for example, Llama or other models that are more open source.

How do you see maybe that playing out and what are the benefits or drawbacks to something that is privatized like this? And does open source even have a chance to compete with some of these behemoths?

Wes McKinney: Yeah, it’s a big question. I mean, think about the motivations or incentives of the players involved: what is the longer-term outcome that the major players in the AI and machine learning space want to achieve?

I read this book recently by the former Greek minister of finance, Yanis Varoufakis, who coined the term techno-feudalism. The idea is that companies, especially the cloud vendors or companies with major cloud operations, aim to create products and services so essential that a large fraction of the world becomes their serfs, so to speak. They are the new feudal lords, and they collect rent from their vassals and serfs.

So in a sense, the cloud vendors are the new cloud landlords, and in using their products and services, even if we are not directly paying, we’re paying with our data, and our engagement is a form of rent that they collect.

I think that’s one of the major motivations driving this, and it explains Microsoft’s close involvement with OpenAI: they want to create essential products and services that people simply can’t live without. That creates more or less an infinite rent stream for these massive, multi-trillion-dollar Magnificent Seven corporations.

I even saw a joke the other day, a sort of “real-world Dune,” likening all the different concepts in the Dune books to TSMC and NVIDIA and everything related to AI, where the spice is high-end silicon.

So there’s a huge market to be controlled and monetized there, and I think that doesn’t bode well for cutting-edge AI being very open source. Maybe components and pieces will be; I do think that many of the frameworks used to build and train the models will be open source.

And I’ve been following pretty closely this new company Modular, Chris Lattner’s company, which is building a new inference engine and deep learning/LLM compiler on top of MLIR, which is a layer on top of LLVM. They say they’re going to open source it, and I hope they do, and they say they’re delivering better performance than PyTorch.

But the idea is that it’s built as a hardware-heterogeneous, or hardware-agnostic, framework so that, using compilers, you can specialize to custom silicon as CPUs change or new architectures appear. We’re about to see whether RISC-V emerges as the new dominant CPU architecture. So we want to be able to incorporate new hardware accelerators, new types of computing chips, and custom ASIC processors into our machine learning and AI stacks.

One of the big themes of Voltron Data, the company I spent the last several years working on, was building hardware-agnostic analytical data processing. GPUs for acceleration, yes, but we also want to be ready to use custom silicon to accelerate feature engineering and data analytic workloads for machine learning and AI.

So I think a lot of the frameworks will remain open source, and the major players will contribute work to keep them open source. My guess is that they’ll release light versions of their models, or just enough software to make it easy for people to become customers of their cloud offerings.

The analogy would be Microsoft’s strategy around VS Code. VS Code is an open source IDE, it’s amazing, it has all of these extensions and integrations, so it’s really eating the IDE space. But while VS Code itself is open source, many of the services that integrate with it are not, or they’re open source but only licensed for use, or only able to use Microsoft cloud services, within VS Code.

It’s an interesting strategy, because it’s open source, but if you want the full experience, to use this or that part of the VS Code stack, you need to be a Microsoft customer.

And given that AI-powered programming, the Copilot use case, is probably the most valuable use case for LLMs, and the main one that will survive the current AI wars, this is very relevant: who is building and controlling the environments where we do development? I use VS Code, I think VS Code is great, I think Copilot is great, but what things look like longer term is a very interesting question.

I think AI-powered software development and AI-powered data analysis are clearly what the future looks like. Who controls that, who profits from it, and how much agency we as individuals have to change these environments or contribute to them without being part of one of the companies building and maintaining the software, that’s a very big and interesting question.

But again, follow the money: who’s profiting, and in what way are the massive cloud companies creating defensible rent streams, whether by monetizing the data they’re generating or through enterprise subscriptions to LLM-powered work services, data intelligence services? I heard that term the other day, data intelligence versus artificial intelligence. It’s almost a better term; it helps you understand why they’re making a given decision: because it will lead to more defensible and future-proof revenue streams for these companies.

Ken Jee: Well, I hadn’t originally thought of it this way; you prompted it in the conversation, but open source is a lot more of a spectrum than I think a lot of people believe. You have all these different companies with varying levels of openness. Microsoft, as you described, completely changed their tune. They used to be quite the opposite: hey, all of our stuff is privatized, you’re going to use it, and it’s only going to operate in our ecosystem.

You have Apple, which sort of does that with their cell phone hardware, but their laptop or desktop operating system is a little more open-source friendly than, say, a Windows machine would be.

And I think a lot of people generalize that to being black or white: oh, they’re a private company doing everything over here, or they’re an open source company where everyone can participate.

And it’s interesting to think about who will be successful in the long run with which models. Obviously Microsoft is partnering very heavily with OpenAI, and Microsoft is probably, of the large tech companies, slightly more on the open source side, but the product OpenAI produces is probably one of the most privatized ones. I think that dichotomy is really interesting as we think about how the landscape evolves.

Wes McKinney: Right, right. Richard Stallman is still alive, but he won’t live forever, and I’m sure he’ll be rolling in his grave for eternity, because the values and principles of GNU and the original free software movement of the 80s haven’t played out the way a lot of people hoped. Open source was supposed to be about freedom: freedom to understand, freedom to modify.

And, fortunately or unfortunately, open source has become a go-to-market strategy. To use the VS Code analogy again: VS Code is open source and intended for mass adoption, but there’s a periphery of products and services that can yield revenue, directly or indirectly, from people using and adopting it.

That said, there’s still plenty of purely community-based, community-backed open source, and I think that will continue. And it’s even more important to support projects that truly are community-led and community-run as much as you can.

A really great book about this whole topic is Nadia Eghbal’s Working in Public. It’s the book I recommend to anybody non-technical. It demystifies the whole world of open source for someone with a business background who maybe doesn’t know a lot about software development, because there is this enduring image of open source as unkempt hackers scraping together some code on their nights and weekends.

But the reality is that, even since the early 2000s, it has become a big business, with entire businesses based on open source and open core products.

Now things are a bit more in flux around licensing, because companies want their projects available on GitHub but don’t want them co-opted, not stolen exactly, but co-opted, by the cloud vendors. There have been countless stories of companies having their lunch eaten by Amazon, or usually AWS specifically.

So now there’s a whole movement to AWS-proof your project. There are these new source-available licenses, which are basically open source eventually: after a time period the code becomes permissively licensed open source. In the meantime, you’re free to do whatever you want with it unless you’re Amazon, unless you operate a cloud service at a certain scale. The license is structured basically to make it so that Amazon can’t take your software and turn it into an AWS product.

So that’s created a new dynamic in the open source world. It’s very interesting, it’s still evolving, and I’m sure in five years things will be different still.

And I think right now we’re in a transitional period. As you pointed out, Microsoft went from the Steve Ballmer era, where the answer to open source was absolutely not, to embracing open source with open arms. I don’t know if Microsoft is still the most valuable company in the world; it’s duking it out with Apple and Google.

But open source is now part of the business strategy, and it’s clearly working. It’s yielded a lot of really good things, but there are also risks involved, so as developers we keep our eyes open. At Posit, I’m really optimistic about JJ’s vision to create a hundred-year company, to be an enduring force for good in building open source data science tools, and to create a sustainable business supporting enterprises doing data science, so that we can make open source work in the enterprise and have sustainable funding for the essential open source projects that make data science free and accessible for everyone.

Ken Jee: Amazing. Well, I’m gonna link Working in Public in the description and in the show notes for everyone. I think that’s an awesome place to start to learn more about the open source communities.

The last question I have is related to some of the challenges that open source faces. Something I’ve observed, mainly in the data space, is that access to compute and access to specific data is what I believe to be the major thing holding open source back: there’s unequal access to these things, and exorbitant costs associated with them. In a perfect world where those were not factors in large language models, I think there would be dramatically more involvement and innovation in the open source space. Is that a solvable problem for open source? What would have to happen to make it less of an issue?

Wes McKinney: That’s a very good question. On social media, I often see people joking about being GPU rich or GPU poor: oh, I’m GPU poor, so I can’t do that.

I think there are some tools and platforms emerging that are helping expand access to GPUs for the GPU poor, people who can’t afford to buy an H100 because they’re $37,000 or whatever.

One company I’ve been excited about is Modal, M-O-D-A-L. That’s Erik Bernhardsson’s company, and they’ve figured out how to get GPU jobs to start with really low latency in a serverless computing environment. So people tinkering with LLM stuff who need access to H100s and are on a modest budget can run those types of workloads at reasonable cost on Modal.

There are obviously other platforms where you can get access to GPUs by the hour at varying costs. But the hardware is in short supply, in very high demand, and very expensive, so the barrier to entry is significant.

So it does beg the question of who can hope to do serious things with LLMs outside of the most valuable companies in the world, which have massive capital to invest in infrastructure. I’ve heard about the costs just associated with doing training runs on some of these LLMs, and it’s like, well, you have to have tens of billions or hundreds of billions of dollars in revenue for this to even be worth it.

So how can anyone else hope to innovate outside of that? And it’s not like things are getting any easier. I’ve also read about how much power consumption is going to grow over the next decade just to support running LLMs. People have been concerned and worried about the power consumption of the Bitcoin network, but compared to what’s going to happen with LLMs, I think people are going to be far more concerned in the future about the carbon footprint of these data centers.

So I don’t know that I have a good answer there, but there is a risk, and I think, unfortunately, the most likely outcome is that outside of the most valuable companies in the world, in China and in the U.S., it just won’t be accessible. If you want to work on cutting-edge AI, you have to go work at one of those companies, which are the new nation-states, in a sense. And I don’t know what else can be done about it.

Ken Jee: Well, I think there is some hope in the long run. Let’s say in two years we can dramatically reduce the costs, to where it’s expensive but not exorbitantly expensive. I think there’s a massive incentive in open source to make things increasingly efficient, whereas in these larger companies, which have access to resources, there’s an existing paradigm of just spending money to train.

And so there’s some hope that, if we work in public, there are efficiencies we can create precisely because of the resource constraints we start with. Not saying that will necessarily happen, but I would imagine there’s a tipping point where the costs have gone down enough that, with the resources we have access to, we can start innovating very quickly because we’re forced to. Right now the required resources are so large that we haven’t reached that threshold, but hopefully there’s some crossover.

Wes McKinney: Yeah. And clearly NVIDIA has the status of primary provider of computing hardware for these workloads, but I wouldn’t bet money on that remaining the case. It’s clear that custom silicon for LLMs is coming. I don’t know what the timeline is for that; I assume people are already working on it, and it just takes-

Ken Jee: Sam Altman’s trying to; he’s reportedly seeking $7 trillion.

Wes McKinney: Yeah, that’s right. So custom silicon is part of the solution. I don’t know if it’s safe to talk about on this podcast, but we have to worry about the geopolitical risk around TSMC and where those chips are going to get fabbed.

But yeah, the status quo, where people are GPU poor and it comes down to access to hardware, clearly has to change. It’s good that people are working on it, and I’m optimistic about where things are going. More hopeful than full of dread, I would say.

Ken Jee: Excellent. I think ending this on a hopeful note is really good. Wes, those are all my questions. Do you have any final thoughts, any words of wisdom?

Wes McKinney: No, I think we’re in a really interesting time right now. The 2010s were pretty intense because we went through this era of making open source work in business, and now the Python ecosystem and open source data science are the incumbents. Of course we’re working in Python and building things in Python; the language of AI and machine learning is Python.

And it’s great that we got all of that behind us, because it used to be that we had to convince people why they should be willing to run their businesses on this technology. So now we can move on to continuing to improve the user experience and make things more efficient.

We didn’t really get into it too much in this podcast, but a lot of my work more recently has been about modularizing, helping with horizontal integration: having really specialized, high-quality components that solve different layers of the data stack. It’s similar to what’s happened with silicon manufacturing: it used to be that everything was very vertically integrated, and now cutting-edge silicon manufacturing is very horizontally integrated.

And I think the same type of thing is happening with data systems, and the projects I’m working on are oriented at helping with that problem. Part of that is ending the language wars: this idea of a shared runtime, a shared stack we can use across all the programming languages, where when we make a performance improvement, make something faster or more scalable, we get that improvement everywhere.

So we’re moving from a more fragmented ecosystem to one that works more closely together and is more focused on usability and user experience. I think we’re getting there, but it’s taken a lot of labor and a lot of time to make progress. So I’m very excited about what’s in store for the rest of this decade, and I definitely feel like I’m in the right place to continue driving this work forward.

Ken Jee: Well, I’d love to continue that conversation in a later episode.

Wes McKinney: Sounds good.

Ken Jee: Thank you so much again for coming on.

Wes McKinney: Thanks for having me. It was a lot of fun.