Kelly Bodwin: Quarto hacks, AI in the classroom, and why R should stay weird
This transcript and summary were AI-generated and may contain errors.
Summary
In this episode of The Test Set, Michael Chow and I talk with Kelly Bodwin, an assistant professor of statistics at Cal Poly San Luis Obispo. Kelly shares her journey from studying math, physics, English, and French in college to becoming a statistics educator after taking a probability class with Joe Blitzstein at Harvard. Her path to data science came through R, particularly after attending an early RStudio conference that she describes as “an awakening.”
We discuss the challenges of staying current as an educator when the industry’s needs evolve faster than academic programs can adapt. Kelly points out that learning new skills like Python or SQL to teach them isn’t part of her paid work—it’s a labor of love. She mentions the potential transition to Polars in her Python courses, and I share some thoughts on how Polars has matured and may be easier to teach than Pandas due to its simpler, more consistent API.
Kelly also describes her applied research collaboration with a history professor studying the Polish Revolution of 1989, working with longitudinal social network data about 500 individuals. We talk about the evolving job market for data science students, the role of AI in education (both as a teaching aid and as something that can ace her conceptual exams), and her experience building a Quarto extension called Flourish using AI to translate R code to JavaScript line by line. The conversation concludes with Kelly’s reflections on her PositConf keynote about “keeping R weird” and the value of welcoming, quirky communities in both R and Python.
Key Quotes
“The need to appear. What’s exciting to me is always the human interaction. I was not a happy person during COVID.” — Kelly Bodwin
“I think they’re kind of two words for the same thing. Certainly, there’s the theoretical math, stat, and probability that’s maybe outside the sphere, but I don’t really make a distinction between statistics and data science.” — Kelly Bodwin
“The only way that I’m going to teach that to a student is I’m going to go out and learn it myself. That’s kind of fun sometimes, but it’s not, you know, I don’t have any extra time in my day to do that. I’m not paid to do that. It really is a labor of love anytime you pick up a new skill.” — Kelly Bodwin
“I’m very resistant to using it for any writing because I’m an English minor. I still am attached to writing my own things.” — Kelly Bodwin
“I didn’t write the whole program in R because it didn’t really make sense but I would say, okay, what I’m trying to do here is loop through all of these strings looking for a regular expression. So I’d write the one line in R and copy that line into ChatGPT and say translate this to JavaScript.” — Kelly Bodwin
“There’s sometimes the industry needs, especially in this field that moves so fast, and the industry needs are not things we can offer for the graduates going into the industry because we don’t know them.” — Kelly Bodwin
“I think one of the good things for education is that Polars is a lot smaller, API is a lot less complex in many ways than Pandas, and so I feel like it may actually be easier to teach and there’s fewer rough edges.” — Wes McKinney
“I knew that it was in development. It was actually a relief for me because when we started the Arrow project about 10 years ago, I thought that I was going to get on the hook to build the next generation Pandas-like library that was based on Arrow, but fortunately Ritchie showed up and started building Polars.” — Wes McKinney
“When you don’t take yourself so seriously it also makes a culture where when a beginner is asking a question you know you don’t jump on them you don’t say what else is in the documentation—a beginner doesn’t know how to read documentation.” — Kelly Bodwin
Transcript
[Podcast intro]
Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning. Digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.
On this episode, we sit down with Kelly Bodwin, board game nerd, candy corn defender, and assistant professor of statistics and data science at Cal Poly.
Michael Chow: Hey Kelly, welcome to The Test Set where we chat with interesting folks in data and see what makes them tick. We’re so happy to have you on. For a little bit of background, Kelly Bodwin is a professor at Cal Poly, which is short for California Polytechnic State University, San Luis Obispo. And we’re also joined by Wes McKinney, who’s a principal architect at Posit and the creator of Pandas and interesting libraries like Ibis in Python. So Kelly, we’re so excited to have you on. We’ve chatted with a few different people who are professors and educators. I feel like I’m so excited to talk because I love the energy you bring any time we see you at conferences. And I know when preparing for this, you mentioned a lot of really interesting stuff about sort of how you got to where you are today and also what makes you tick in terms of collaboration. But to maybe kick it off, I’m curious to hear what gets you out of bed in the morning. And is it, per our conversation, a Coke and orange juice?
Kelly Bodwin: Yeah, I think I put it as a controversial opinion that a morning drink I enjoy. I don’t drink it that much because it’s very sugary. But yeah, half Coke, Coca-Cola, and half orange juice. It’s like Orangina, but with a little caffeine. It looks weird, but it tastes delicious, so don’t knock it till you try it.
Michael Chow: Are you toting this in a Coke bottle or do you have it in a thermos? Do you disguise it?
Kelly Bodwin: It was in college, so it was from the machines. It was in a little plastic cup. I think you have to see the weird color to truly appreciate the drink.
Michael Chow: Yeah, you’re saying this is born out of an undergrad dining hall.
Kelly Bodwin: Exactly. You have the orange juice right here and the Coke right here. I didn’t have a caffeine habit until grad school, and so this much Coca-Cola was enough to wake me up all day.
Michael Chow: I love that. I love that it’s born out of undergrad dining hall ingenuity, which makes so much sense to me. What gets you out of bed in the morning outside of fancy beverages?
Kelly Bodwin: Definitely the main motivator for me is always other people, either excitement to see them or knowing they have expectations on me. This is why the professor job works well for me because I’ve got to get out of bed because I’ve got to teach a class, so I can’t just not go to that. I’m not a morning person, so literally what gets me out of bed is that I have commitments. The need to appear. What’s exciting to me is always the human interaction. I was not a happy person during COVID. I know it’s not possible, but I’d rather be in person with you two as well. I like being in my department. I’m friends with all my colleagues in my department, and we chat in the hallways. Being in the classroom with students is really fun. I really like the lecturing, and I really like chatting with them. Those are the things that motivate me. Outside of the teaching context, even in my personal life, I’m known to not be home in the evenings because I just want to go see people and do the things most of the time.
Michael Chow: You mentioned you’re deep into the quarter. What does your week look like as a professor?
Kelly Bodwin: This quarter, I’m pretty chock-a-block, like 9-6. I have three different classes right now that I’m teaching, which is a typical load for us. It’s usually three sections, but sometimes two will be the same course. After this, I’ve got to rush over and teach four hours of lecture. We just started a master’s program, which is really cool. This is the third year of it, the second year that I’ve supervised students. I’ve got four master’s students. I’ve got meetings with them scattered throughout the week. Lots of admin and grading type stuff throughout the day, too, on committees and whatnot. What I like about the job is that every day looks quite different, and every quarter looks quite different. Right now, it’s week 7 out of 10. We just gave midterms, so I’ve got a pile of midterms sitting over here to grade. That makes it a tough time in the quarter. The average day is just running across campus from a meeting to a lecture to a meeting to office hours.
Michael Chow: Do your master’s students work in different fields?
Kelly Bodwin: It’s a statistics master’s program, so everything is statistics adjacent. There’s quite a variety of projects in the program. Some of them will be applied projects with partners on campus. Last year, three of them were tidyclust R projects, so contributing new algorithms to tidyclust. I shouldn’t say new algorithms. Ones that have not been built into tidyclust yet. It’s a thesis-driven program, so they do independent research. The ones that work with me tend to do R, although I have a couple more applications and stuff. I’ve actually got one this quarter working on something related to my PhD thesis, which I haven’t really touched since I graduated, so that’s very cool. But yeah, it’s all statistics writ large.
Michael Chow: What was your journey like into data science and being a professor? What did that look like?
Kelly Bodwin: I started college thinking I would do either pure math and physics or English and French, so I was all over the place. Then when I took a probability class with Joe Blitzstein, who was my advisor in undergrad, that was what made me say, oh no, I’m going to do statistics. This is so cool. It was very much a focus on statistics. We weren’t really saying data science too much back then anyway. My classes used R, but I didn’t really learn R. Then I did a senior thesis with Joe Blitzstein where I was doing a lot of simulation in R, so that was where I came to it, I guess. I wouldn’t say at that point I loved it. I would say at that point I knew how to use it. Then I went to grad school, and a lot of my work was also R, but again, that wasn’t the main focus. I’m just focusing on R here because my path to data science is definitely as an R person. Then I came to Cal Poly, and my first year at Cal Poly, I went down with a colleague, with Shannon Pileggi, to the RStudio conference, and it was just like that was the awakening. It blew my mind. But even then, I wouldn’t have said data science. Even then, the data science program at Cal Poly, we have a cross-disciplinary studies minor, but it’s more of a double major with CS and that. But I wasn’t teaching in that because that’s largely Python, and also I didn’t know much about predictive modeling and stuff. I still wouldn’t have said data science even though I was doing a lot of R and teaching the R classes. Then eventually, I’d say during COVID, I picked up the Python class and I learned Python. That was when I was teaching in the data science program, so I was like, okay, I guess I’m a data science professor now. But to me, there’s not a difference between data science and statistics. I think they’re kind of two words for the same thing.
Certainly, there’s the theoretical math, stat, and probability that’s maybe outside the sphere, but I don’t really make a distinction between statistics and data science, so I don’t really care.
Wes McKinney: It’s a little bit of a marketing term, I think. There was an effort to position it as statistics done in a business context and by people with domain knowledge. I remember the whole Drew Conway famous Venn diagram. There were people in domain fields doing data science. I don’t necessarily know that it’s accurate to describe all of those people as data scientists. I have worked with biological data often, and I am not a biologist. Because it’s a marketing term, I sometimes think it’s overused as a catch-all for, I touch data with a computer. These days, everybody touches data with a computer. Almost everybody.
Kelly Bodwin: To me, data scientist is more like the set of skills that lets you tackle any data set that comes your way. You could still, I guess, break that set of skills into many stages, and some of them are more computer science oriented, and some of them might be more statistics and analysis oriented. I don’t want to erase computer science from the equation, but it just feels like, here’s some data, how do we make conclusions, is already what statistics is doing. I don’t really make a huge distinction. I have a friend who used to say that we should start calling all our classes data science, and we should call the probability class theoretical data science.
Wes McKinney: I think one of the problems that emerged was that at a certain point, data scientist wasn’t a specific enough term, and it started to include too many skills, and so then businesses were searching for these unicorn individuals who were really good at DevOps and building pipelines and configuring stuff in the cloud, but also knew how to do statistics and do causal inference and all those things, as well as being really good software engineers. But I think the reality is that now things have become more specialized, and so I find that I see fewer and fewer job postings and companies that are searching specifically for a data scientist and more looking for a statistician or a research engineer or an AI infrastructure engineer or a data infrastructure engineer or something like that, which seems like a healthy thing, because then they actually know what they want, rather than expecting this unicorn person who’s an expert at doing everything.
Kelly Bodwin: I’m surprised you see that trend, though, because the students are always asking me, how do I search for jobs, because I search data scientist and then this one wants SQL and this one wants R and this one wants none of the above, and I just tell them, search by skill or language instead.
Wes McKinney: Yeah, I think that’s good advice. It’s interesting. I know dbt at their Coalesce conference had an interesting talk called “Down with Data Science,” which was a little bit more about the broadness of the title versus more specific things. One of the dirty secrets or maybe ugly reality is that a lot of data science roles, especially thinking about the mid-2010s to the late-2010s, a lot of those jobs ended up being essentially doing business intelligence, like building dashboards and counting things and not actually doing that much statistics or real data analysis. Maybe some exploratory data analysis, which is a real thing that does require skill and experience, but a lot of the end deliverables would be building, crafting dashboards and reports and things that were a little bit more mundane, and so now you’re seeing more of those roles being a little bit more reframed. It’s like you’re an analytics engineer or a business intelligence engineer or something like that, more or less a dashboarding specialist, or being able to build all the stuff that goes into producing the final dashboard that gets delivered to the business.
Michael Chow: Have you seen changes over time in what students want to do, whether they think of it as stats or data science?
Kelly Bodwin: I think that longer term, like 10 to 15 years, there’s definitely a move towards more interest in programming and learning R or Python and then using that in the job. Honestly, right now all the students are struggling to get jobs, and so they are not being picky about it. A lot of them are ending up in good jobs that pay them to do data things, and it’s not really necessarily their dream. We’re seeing a few more go into grad school in the last couple of years, go into PhD programs after our master’s or that sort of thing.
Michael Chow: Yeah, that’s an interesting side of it, just the demand for jobs, how that affects students.
Kelly Bodwin: There’s also whatever’s happening in the economy right now, but there was a data science bubble that’s kind of on the edge now, where when I majored in statistics in undergrad, I was one of, I think, eight that graduated my year, and now it’s 500 per year in that program. And then data science obviously wasn’t a thing, and now there’s degrees in data science, so I think there’s also just the delay when the market needs something to when educational programs can deliver that thing, and so now there’s enough data scientists going into the market for the needs of the market, and so it’s harder for students to find specific jobs. You can’t pick and choose quite as much as someone who graduated seven years ago or something.
Michael Chow: Yeah, it’s interesting you mentioned before, too, you said something like people in domains were already doing data science. There were people specialized in things, already analyzing data. It is interesting the shift from that eight-person class that’s more general, like analyzing data less specific to a domain, to kind of like 500 people focused on kind of the broader, I guess the more methodological side.
Kelly Bodwin: Well, I didn’t know that we used statistics programs, so I think it’s still equally general. It’s the same program, just more people wanting a statistics degree than it used to be. It used to be a very niche degree, yeah.
Michael Chow: It’s interesting there’s so many people trying to go that way.
Wes McKinney: Yeah, it is true right now that there’s a little bit of a quiet recession combined with perhaps, depending on who you ask, an AI-driven reduction in entry-level jobs and things like that. It’s been interesting the last couple years talking to folks in education and academia just to see how AI is affecting the students, affecting their psychology. I know among kids in school doing computer science, there’s a lot of existential dread about, is there going to be a job waiting for me when I graduate? I feel like maybe statistics and data science work in general is a little bit more robust to some of that because I feel like data science is still one of the areas where human judgment and having a human in the loop is still really important. So maybe we’ll be slightly more resistant to some of the AI erosion. Maybe somebody doing computer science who was going to build a CRUD data entry app at a Fortune 500 company, maybe those jobs are more likely to be eliminated than somebody with statistics skills who actually needs to analyze data and do nuanced analysis that is more difficult to automate with a coding agent. I don’t know, I’d be curious what you’re seeing or what the vibe is.
Kelly Bodwin: The AI can ace all my exams, including my conceptual exams. So it can figure out what’s the right statistical test or model to use given some data information. It can do a lot of base-level things. The places where I’ve found the human judgment most needed I guess would be the EDA, honestly. AI can produce exploratory plots and so forth or maybe spot little issues but the understanding of the structure of the data, like does this observational unit really address the problem that I’m trying to address? That still feels like it needs a human. And then I’ve found every now and then when I’m stuck on my own applied project and I’m like, what model should I be trying here? If I turn to AI, it usually gives me unnecessarily overcomplicated answers. They’re not wrong per se but they’re more than is needed. I suspect that will continue to improve a little bit but the judgment to know when AI has gone overboard is maybe where a human is needed. Certainly you can’t quite vibe code as directly as you can with some of these coding tasks but I think that those jobs, especially more basic data analysis jobs are still in danger of being taken over.
Michael Chow: I know you’re working on some projects and you mentioned before that you are using Gen AI a little bit and sort of being productive with AI on some projects. You talked a little bit about EDA and that dynamic. I’m curious about some of your other interactions.
Kelly Bodwin: I’m very resistant to using it for any writing because I’m an English minor. I still am attached to writing my own things so I’m resistant to that but on the other hand, the papers I’m getting from my students are much better. I don’t hate that I don’t have to read as bad of writing anymore but for myself, I honestly don’t use it that much in my research. I use it, like I said, sometimes to suggest the next path forward. I’ve used it to unstick myself when I can’t motivate myself to work and more often than not, I’ll say, you know, I’m trying to write this thing. Can you write it for me? And then it writes it and I hate what it’s… right, like code, I mean. And I hate what it’s written. And so then it motivates me because I’m like, this is not correct. Here’s the correct way to do it. So I actually go do the thing.
The place where I really use it, and this might be what you’re referencing, is that a student and I have been working on a Quarto extension. Figuring out the structure to make a Quarto extension was pretty straightforward. I took a really great workshop at PositConf about it, but also there’s good documentation. But the one we wanted to write, it had to be JavaScript. It couldn’t be done in R. And I don’t know JavaScript. So the way that I did this is I literally went line by line. I didn’t write the whole program in R because it didn’t really make sense, but I would say, okay, what I’m trying to do here is, I don’t know, loop through all of these strings looking for a regular expression. So I’d write the one line in R, purrr map, whatever, and copy that line into ChatGPT and say translate this to JavaScript. And then I’d copy that back in and I’d run it on some tests. So it was a very tedious process, and it didn’t work to just design the whole thing and say write me this in JavaScript. Partially because it wasn’t perfect and partially because I don’t have the ability to debug a full program in JavaScript. But I do have the ability to debug line by line, test this line and make sure it does what I want. So that was where I was using it.
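The kind of single step Kelly describes, looping through strings looking for a regular expression and then checking that one step on a small test, might look like the minimal sketch below. It is written in Python purely for illustration (the extension itself is JavaScript, and her one-line drafts were in R); the function name `find_matches`, the sample lines, and the pattern are all hypothetical and are not taken from the extension.

```python
import re


def find_matches(strings, pattern):
    """Return (index, matched_text) for each string that contains the pattern."""
    regex = re.compile(pattern)
    results = []
    for i, s in enumerate(strings):
        m = regex.search(s)  # first match in this string, or None
        if m:
            results.append((i, m.group(0)))
    return results


# Hypothetical input: lines of R-like code; the pattern finds a name followed by "(".
lines = ["x <- mean(y)", "# a comment", "z <- sd(y)"]
hits = find_matches(lines, r"\b\w+(?=\()")
```

Keeping each step this small is what makes it possible to test one line at a time, which is the debugging approach she describes.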
Michael Chow: Yeah, it’s so interesting. And this is the, just for context, you said the Flourish extension. Is that right? To highlight different aspects of code?
Kelly Bodwin: Yeah. I don’t know if you ever used Flair back in the day, but Flair was the version for R Markdown that was totally built in R, and Quarto opens up so many cleaner ways to do what Flair was doing. Flair was very hacky. But what it’s for, for teaching mainly, is so that when you’ve written a code chunk and you show it to the students, you can have it highlight all the functions or something like that. But you still want it to be reproducible so that what you’re running is the output of that code. So yeah, that was Flair. With Flourish, you put it in the Quarto comments and you say which words in the ensuing code you want to be highlighted, and so it still behaves like an ordinary code chunk in every way, but the visual on the screen also has the highlighting in addition to the syntax highlighting. Does that make sense?
Michael Chow: Yeah.
Wes McKinney: Yeah, if I understand, with Flair you kind of got meta: you were calling a function over the code you wanted to format to output.
Kelly Bodwin: Yeah, with Flair I was using like the sub-render tricks and knitr. So I was running the render, pulling the raw HTML that outputs from that render just on the one chunk and then like manually injecting HTML wrappers to make it be highlighted with a yellow background or what have you. So it was very hacky.
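The wrapper-injection idea Kelly describes can be sketched roughly as follows. This is a Python illustration only, not the implementation of either package: it assumes the code is available as a plain string and that the targets are whole-word identifiers, and the function name `highlight` and its `color` parameter are hypothetical.

```python
import html
import re


def highlight(code, words, color="yellow"):
    """Escape code for HTML and wrap whole-word occurrences of `words` in a styled span."""
    if not words:
        return html.escape(code)
    # One capture group, so re.split alternates: plain text, match, plain text, ...
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    parts = pattern.split(code)
    out = []
    for i, part in enumerate(parts):
        text = html.escape(part)  # escape each piece before adding markup
        if i % 2 == 1:  # odd positions hold the matched words
            text = f'<span style="background-color: {color};">{text}</span>'
        out.append(text)
    return "".join(out)


# Hypothetical usage: highlight the function name in one line of R-like code.
snippet = highlight("x <- mean(y)", ["mean"])
```

Splitting the raw code first and escaping each piece afterwards keeps the injected markup from colliding with escaped characters such as `<`.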
Michael Chow: I know you mentioned like really enjoying collaboration and being around people. You mentioned you have some applied projects with friends on campus. Is this like projects with other professors or?
Kelly Bodwin: Yeah, I think the one that is the most fun to talk about is my collaborator in history. His name is Greg Domber and he works on the Polish Revolution. So like in 1989, you know, in Poland, this group of 500-and-some people met at a series of meetings and like peacefully transitioned from communism, or dictatorship, to democracy. And so he has collected like painstakingly by hand over a decade data about those 500 people and what they were doing since 1950. So he has this like wonderful longitudinal social network data that is, you know, these two people were in this organization together. These two people co-signed this protest letter, etc. Yeah, and then I’ve been working for like eight years now on that data. You know, the first thing was cleaning it, which was big. I give him credit because he wrote it all down in a spreadsheet and that’s like very impressive for a historian. But there was a lot of work to do. And then we made a Shiny app where you can explore these and we presented it in Poland a year and a half ago, which was very cool. And now we’re working on kind of doing some like modeling analysis to talk about, you know, people that were in this one organization. Were they in some way more impactful than people in this other organization? That sort of thing.
Michael Chow: How did you get involved with that? Like how did you get in the door on that?
Kelly Bodwin: Like the second week that I landed on campus, he sent me an email because I had listed my research interests on the website and I put digital humanities because I really like, you know, text analysis, literary analysis, that sort of thing. And, you know, we got coffee on campus and he was telling me all this, like, I have this idea that if we had, you know, these connections, we could study it. And he had a really good vision for it as someone who isn’t a data scientist. I mean, he might be now, but he wasn’t then. And I was getting more and more skeptical. Like, okay, plenty of people have these aspirations where they think that statistical analysis is just like throw everything in a blender and you get a magic answer. And then he opened his spreadsheet and I was like, oh, you have data. You’ve actually collected this data and structured it in a clean, not clean, but a consistent form anyway. So yeah, it was very cool. And so we’ve just been working on it ever since and become great friends.
Michael Chow: Yeah, it’s so interesting to hear. I’m curious just about how you choose projects, or like how projects come across your desk as a statistics professor.
Kelly Bodwin: There are way more available than we have time to do. There’s plenty of people on campus who want help with their data. Our department has a consulting service that sometimes you like make connections through that. You know, Greg obviously just reached out to me. My other applied project right now with my friend Katie Watts was actually a data science capstone sequence where the students get in groups and work on a real project. And so we solicit projects from campus for that. And so she brought her data to that. And then, you know, that made a lot of progress. And then one of those students is now continuing that progress as a master’s project with me. It just builds from there.
Michael Chow: One thing I’m curious about is like how you pick up new skills or how you spend time staying current. And I know you mentioned like some of the challenges of being an educator.
Kelly Bodwin: Yeah, I mean, I think like a thing that I wish people thought about more as far as educators and especially educators. You know, my job is not a research job. There’s a little bit of research element, but it’s a teaching job. You know, people in the industry say, oh, like Python is now a little more of a lingua franca. We want Python or we want SQL or we want whatever. And how am I going to teach that to a student if I didn’t learn it? And so the only way that I’m going to teach that to a student is I’m going to go out and learn it myself. That’s kind of fun sometimes, but it’s not, you know, I don’t have any extra time in my day to do that. I’m not paid to do that. It really is a labor of love anytime you pick up a new skill. So there’s this disconnect where there’s sometimes the industry needs, especially in this field that moves so fast, and the industry needs are not things we can offer for the graduates going into the industry because we don’t know them. So, you know, when I picked up Python, that was literally just I learned it. Like I did my own class. I inherited some material from someone else, worked through them, and then tried to teach them. And it was rough the first time. And there’s a big demand for SQL right now in our department. I would be happy to learn SQL, but I don’t have time. So that’s tough.
So as far as new skills that I pick up for fun for myself, you know, I think of a lot of the R work I do not as work. I think of it as a thing I do for fun. So when something exciting, you know, comes across my Bluesky or I find out at a conference or something and then I like get an idea and sit down and do it, that doesn’t feel like a work task. It feels like the same as when I cut stickers on my Cricut. You know, it’s like a hobby. So I pick up new skills in the R world just because I’m excited to. But when it’s like skills needed for class.
And then there’s this decision in the classroom, too, of there’s these skills that I’ve picked up. Do I build them into the class? You know, I mean, even the moment where we converted our R class from base R primary to tidyverse primary, that was a lot of work, even though I knew the tidyverse. There’s the conversion and, actually, Wes, you might have an interesting thought. I went back and forth this fall on whether I needed to rebuild my Python class with Polars and I ended up not doing it this time. But I feel like I need to build it in there soon. And then there’s the whole, you know, DuckDB Arrow world. Do I build that in at some point? So even the stuff that I know how to do, that’s a different thing than having built the materials to teach it, which also is in my free time. So, yeah.
Wes McKinney: I’m having a conversation with my editors at O’Reilly about Python for Data Analysis, which is now a 13-year-old book and we’re talking about a fourth edition. And so I think we’re going to do a fourth edition that still uses Pandas, but I’m using Polars more and more. And maybe a few years ago I used Polars a little bit and it would run into bugs and it would crash and it felt like it wasn’t ready for primetime. Whereas now I’m building stuff with Polars. I’m not writing that much Polars code directly, I’m having Claude Code write a lot of the Polars code. But it’s fast, it’s got an API that’s, especially for dealing with complex datasets and dealing with more Arrow-like data, like JSON data and stuff like that, it’s definitely a lot more powerful and expressive for some of those more complex transformations. But yeah, it’s this transitional period where people need to know Pandas and maybe Polars and then maybe eventually things will become more Polars-native in the future. One of the good things for education is that Polars is a lot smaller, API is a lot less complex in many ways than Pandas, and so I feel like it may actually be easier to teach and there’s fewer rough edges. You don’t have to care about indexes, there’s no indexes, and I know that indexes is one of those complexities that people coming from the R world find very tedious and unintuitive in Pandas, and so I totally feel that pain.
Kelly Bodwin: Yeah, and it is interesting to Wes’s point about Polars, sometimes these tools, if they do offer a simplification maybe the whole class gets easier versus shifting to something.
Kelly Bodwin: Yeah, and that’s definitely a pro. A simpler API is one version of better. I’m not always attached to speed or breadth of abilities, not from the teaching perspective. I want them to have the skills that they need going forward, which right now is Pandas.
Kelly Bodwin: I’m trying really hard to say Pandas, I get really made fun of because for some reason I got in the habit of saying Pandas like it’s a Greek word.
Wes McKinney: I actually say Pandas more like the Greek version, yeah.
Kelly Bodwin: And the man himself, stamp of approval. You heard it here, yeah.
Wes McKinney: If you watch old early videos of me talking about the project, it’s definitely more Pandas and not pandas, yeah.
Kelly Bodwin: It is like this exotic Pandas library. I get everyone saying data.table and then I’ll stop being made fun of.
Kelly Bodwin: Oh yeah, so in the jobs that they’re going into, I have the impression that the data science teams are still using Pandas over Polars at the moment, so I’d rather send them out able to use Pandas. We’ll see.
Michael Chow: Okay, so maybe this is a good question for Wes. We talked a little bit about staying current, and I saw one thing you had asked us before this is how do we stay current or up to date on things? Wes, do you have any takes on how do you stay on top of technology or pick up new tools?
Wes McKinney: It’s hard. It’s really overwhelming to know what to pay attention to at any given time, because there’s only so much new information you can take in and internalize, and so just choosing what new things to learn about and get your mind wrapped around, I find, is really difficult. I had hardly used Polars at all until maybe a year or two ago. I knew that it existed. I knew that it was in development. It was actually a relief for me, because when we started the Arrow project about 10 years ago, I thought that I was going to be on the hook to build the next-generation Pandas-like library that was based on Arrow. But fortunately Ritchie showed up and started building Polars, and it’s as good or better than anything I could have built if I had decided to work on that thing.
I was a little bit AI-skeptical for a long time, and so I turned a blind eye for a really long period of time to everything AI-related in the Python ecosystem and Python has become a little bit taken over by the AI world in recent years, and so I was looking at the top 100 most popular, at least based on GitHub stars, projects on GitHub and half the projects are projects that I’ve never heard of. I’m like, gosh, there’s this project that I’ve never heard of and it’s got 50,000 stars on GitHub. I feel like I don’t know anything.
It’s hard, and I’m sure the people listening also identify with that struggle that it seems like every day you go to Hacker News or you go to Reddit and there’s a new project that is getting a lot of attention that you haven’t heard of, but I think what’s interesting is that when I started out working in this area, and this is probably true of all of us, is that data science and Python and R were not mainstream at all and now it’s become so mainstream and so that’s just increased the volume of new projects and things going on so it’s 100 times harder to keep in touch and to be aware of all the things that are going on compared with what it was like 10 or 15 years ago.
Michael Chow: It’s interesting being on the open source team I think similar to teaching a course, if you’re teaching a Python course, it’s a good excuse to learn Python. I feel like a lot of that happens in open source too where you’re like, if I could build some useful open source tool that uses some technology, it’s like a good excuse for me to kind of familiarize myself with that technology. But it is hard because then it has to be maintained forever so there’s a sort of trade-off.
Kelly Bodwin: I’m curious about that from both of you. It could be a full-time job just maintaining your existing projects, so how do you decide where to split your brain power between making sure that project stays current versus any new ideas that you might have?
Wes McKinney: Fortunately for me, it seems that the pattern that I’ve developed over the years is that I start or help start projects and then get them to a place of critical mass, where I recruit developers to join the core team to maintain the project. Then, when the project is being sufficiently well looked after and maintained, I move on to start the next project. A fortunate side effect for me is that I’m not actively maintaining any open source projects right now. Of course there’s a large team maintaining Pandas, and there’s a team of people maintaining Arrow, and Ibis is in need of more maintainers, so if anybody listening is interested in helping maintain Ibis... I may get back involved in the project, because I think it’s an important project that has a bright future.
It’s hard, and another thing that I spoke about in a talk recently was that now, with generative AI and ChatGPT and Claude and all these things, it’s going to lead to much more use of these open source projects, because ChatGPT and friends are all really good at writing Pandas code and they’re all really good at writing Polars code. So the addressable user base of people who can become Polars users or Pandas users is probably 10 or 100 times what it was before. We could go from 10 million global Pandas users to half a billion Pandas users who are now running into the same bugs and issues that everyone else is running into, but OpenAI isn’t helping maintain Pandas, so what the heck? I mean, maybe at some point they should, but they’re not right now.
Michael Chow: I wonder, are there Python users who don’t know they’re Python users?
Wes McKinney: My friend just showed me, he’s planning his wedding and he’s like, “Oh, I just asked ChatGPT to code me an app that I can use for tracking guests and so forth.” He showed me this app and I said, “What language is this app in?” And he’s like, “Oh, you know...” He never saw the back end of the app, it’s just built and deployed. LLMs are not good at doing data analysis directly, so they need to offload all of that onto R code or Pandas code or whatnot, and so anybody who’s asking an LLM to do any kind of analysis is going to end up touching one of these libraries. If we had had perfect knowledge of the future back when we created these open source licenses, we could have baked some kind of AI-use royalty payment into the license: it’s free to do whatever you want, but if you use code that involves this library to train your models and then you offer suggestions, you have to provide... I mean, even if ChatGPT provided a hundredth of a cent or a thousandth of a cent for every response that uses Pandas code, I’m pretty confident that would generate a lot of revenue that would help fund maintenance of the project. But we’ll never retroactively be able to do something like that, and I don’t even think that would be the right thing to do. It’s best for the software to be free, free as in freedom and free as in beer.
Michael Chow: It is interesting though, I feel like anytime ChatGPT starts rolling Pandas it is kind of surprising to think about that this tool is just how it’s thinking about data basically, that’s how it works through data.
Wes McKinney: Well, the flip side of that question, and the thing that I’ve been thinking a lot about lately and ask everyone about, is that it will be hard to get people to use new open source projects when their AI assistants don’t know how to use them yet. It creates this chicken-and-egg problem. ChatGPT is really good at using Pandas because there’s huge amounts of training data available on GitHub and all over the internet of people using Pandas to solve problems, so the models have been able to take in all of this data and become really good at solving problems with Pandas. But suppose a new tool comes along and there’s just not a corpus of training data to teach the LLMs how to use it. If people become utterly dependent on AI tools to do anything, then new projects will never get used and they’ll never get training data generated. I mean, maybe the LLMs will read the documentation and experiment and build their own training datasets, I don’t know. But it’s definitely a weird future that we’re entering, where the whole model of how training data for AI gets generated is going to change fundamentally, and so is how people discover and use new tools. A lot of people are just not going to be motivated to learn how to use things the old-fashioned way, where they read the documentation and fumble around in their text editor and experiment and learn how to use it themselves. They’ll say, “Oh well, ChatGPT doesn’t know how to use it, so I guess I’ll use something else.” That’s probably what’s already happening, I would think.
Kelly Bodwin: I think, you know, when I watch my students try to adopt a new function, they certainly turn to AI first, and it’s not necessarily a bad thing. I would actually say the AI explaining the function tends to be more digestible than reading the documentation. But there are two skills that they don’t seem to have right now, and I don’t know if any of us are experts in this. One is doing the right query, right? If you ask, “Here’s this function, show me some examples and explain why they work,” that’s going to get you much further than “What is this function?” So the prompt engineering, which is a very fancy way of saying it, but what should you ask, is hard. And then there’s the checking afterwards, not just because AI isn’t always accurate, but because it’s still good to look and say, “This argument needs to be a data frame,” or whatever. I’m seeing them not do those steps, and I’m getting a lot more cases where they have found a cool function that will solve their problem and then it’s not working, and the reason it’s not working is they put in an array and it wanted a data frame or something like that, and they don’t know how to fix that problem.
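The array-versus-data-frame mismatch described above can be sketched in a few lines. This is an illustration only: `column_means` is a hypothetical helper invented for this example, not a function from any library or from the episode, and the column names `x` and `y` are made up.

```python
import numpy as np
import pandas as pd


def column_means(table: pd.DataFrame) -> pd.Series:
    """Hypothetical helper that expects a data frame, not an array."""
    if not isinstance(table, pd.DataFrame):
        raise TypeError(
            f"expected a pandas DataFrame, got {type(table).__name__}"
        )
    return table.mean(numeric_only=True)


# The student has a NumPy array, but the function wants a data frame.
values = np.array([[1.0, 2.0], [3.0, 4.0]])

try:
    column_means(values)
except TypeError as err:
    message = str(err)
    print(message)

# The checking step: read what type the argument needs to be, then
# convert, supplying column names explicitly.
frame = pd.DataFrame(values, columns=["x", "y"])
means = column_means(frame)
print(means.to_dict())
```

Many real functions do not check their inputs this explicitly, so the failure often shows up as a less direct error further inside the call, which is part of why the fix is hard to spot.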
Michael Chow: One thing I really wanted to be sure to loop back to is I love that you gave a keynote at useR! on keeping R weird. I was wondering if you could just say a little bit about that, and maybe after, what do you think are the key ingredients to keep Python weird?
Kelly Bodwin: That was so cool of a thing to get to do. I’m not really a nervous speaker, and you can really hear that I started that one out super nervous, just because I looked out and I was like, all my heroes are in this audience, plus a thousand other people. It was really intimidating. But no, it was a really fun talk to put together, because there are kind of two types of weirdness, right? There’s the language, which is a weird language, and I didn’t know that because it was my first fluency. Well, it wasn’t my first language, but it was the first one that I was really embedded in, you know, and I’m not a computer scientist. So learning from people, as I got deeper into R, how it’s weird was fun. But then the community is very quirky, especially the sort of tidyverse pocket, although I think that now it’s just kind of a mix of everyone. Yeah, I mean, I don’t know if Python... maybe Python’s already weird.
I do think the reputation from the outside for Python has changed a lot. Maybe 10 years ago or something it was a bro language, and I had no real interest in getting involved in that community. Now it feels more like the R community: there’s PyLadies and there are fun stickers, and I met this woman at useR! this past summer who had a really cool Python tattoo. All these things that feel special to me seem to be existing, and that makes me more excited to get involved with Python, even though I don’t love writing Python code. I prefer not to. So I hope that that’s spreading, and I think a lot of that comes with an openness. When you don’t take yourself so seriously, it also makes a culture where, when a beginner is asking a question, you don’t jump on them. You don’t say, “What else is in the documentation?” A beginner doesn’t know how to read documentation. That’s what I’ve always really appreciated about specifically the tidyverse pocket of the R community, where back in the old days of Rstats Twitter someone would ask a very basic question and it might be Hadley or Jenny or Mine who would jump in and answer it. That feeling of “my question is so not stupid that a name that I recognize will answer me” is so important to making people excited to work on it. That’s where my perception, whether it was true or not, of the bro culture back in the day for Python has changed. I now feel like it’s a language that is welcoming to beginners, and I’m not sure what shifted that. This is all my semi-outsider perspective. But yeah, I think that’s the value of turning into one big party instead of this very serious, you know, “let me think about the stack and the heap” sort of approach to open source.
Wes McKinney: On the Python side, compared with R, and not to say it’s 100% like this, the R community is a lot more culturally homogeneous, in that a lot of the R community comes from a statistical origin. People entered into learning R because they started out in statistics or biostatistics or more of a statistical field, and so that’s the foundation or core of the R ecosystem. Python started out being a little bit more Unix sysadmin, and then it developed a web development ecosystem, but there was always this scientific computing thing off on the side. That turned into a data thing, which now has gotten really big, and the data thing grew an AI appendage. The web developers were a little bit uncomfortable with the rapid growth of all the data people, so there was this weird period in the mid-2010s where I feel like the Python web community wasn’t sure what to make of PyCon becoming a lot more about data and science and machine learning and AI. The modern Python ecosystem, I do think it’s gotten a lot more accessible, welcoming to beginners, and more focused on having a big tent and not having exclusionary behavior like, “Oh, you haven’t been doing Python since the 1990s, haha, get out of here, noob.” I don’t really see any of that happening in the Python world like you see in some other, more prickly programmer communities. So it’s very welcoming, but it’s still very feudal and federated, and there are cliques and pockets of people that are separated by the type of work that they do. I think that the AI and machine learning people are very different from the data science people and even more different from the web developers, but there are also people who are primarily using Python to do Arduino or robotics projects and all kinds of stuff. So it’s a strange ecosystem, but the fact that it’s become so mainstream is also, for me, the biggest change. I have to remind myself, oh, Python is a mainstream language now. I used to feel like Python was not a mainstream language, and now it’s very mainstream.
Kelly Bodwin: I take your point about web developers being very federated and different from the data science or AI people, but I would push back. Most people come to Python through computer science, that’s absolutely true. Most people come to R through some variety of statistics, but they’re not necessarily statisticians. A lot of people are coming to R because they have data in biology, or they have data in history now, or they have data and they’re like, “This is the language people told me I should use to run my t-test.” The statistics part is true, but it’s not statisticians, and I wouldn’t say even the majority of R users are statisticians. I think it would be more fair to say data than statistics, to correct myself. So on the homogeneity, maybe R is less siloed than the Python community, but I think that the variability in the particular skills or expertise of the people coming to the language is maybe higher. I think most people coming to Python know what a for loop is. Many people are using R successfully and don’t know what a for loop is. So there’s a background heterogeneity that makes the R community a little more, you just get a lot more different perspectives.
Michael Chow: Perhaps it’s fair to say data analysis spans so many domains that it really brings in a pretty wide net. I remember for myself, I was using R so heavily to analyze psychology experiments, and then once I realized I might want to go into industry and maybe at least open up software engineering as an option, I just sort of switched immediately to Python. But I do feel like it is an interesting comparison: people in different domains analyzing data versus more software-engineering-centric, computer-science-centric folks, and what they’re drawn to.
Michael Chow: This brings up what I think is the most controversial take that you gave which is I saw you listed candy corn as a top tier candy and I actually have so many questions by top tier are you saying that it’s an all year candy?
Kelly Bodwin: Absolutely, it’s like red vines, candy corn brought out of the seasonal when I was in high school I was playing a lot of sports so I ate like a vacuum and I would come home from school from my back to back practices and I would eat a bag of candy corn and drink a full Gatorade.
Michael Chow: Oh my gosh, they need candy corn flavored Gatorade, that’s the solution. I don’t know, my slime Gatorade was discontinued, I guess. I don’t know how universal candy corn is, so if you had to describe candy corn for someone who’s mentally picturing a tiny piece of corn that’s been candied, how would you paint the picture of what a candy corn is, just to be sure people know what we’re talking about?
Kelly Bodwin: It is sugar in triangular prism form, and the important thing about candy corn (any sugar is sugar) is the texture. It has the perfect texture to make your teeth feel good chomping on it without it sticking. That’s why it’s top tier, and the texture, I guess, is inspired by wax in a way.
Michael Chow: You’re going to tell me that you’ve never taken the wax from one of those little cheese things with the wax on the outside and chewed on it as a kid?
Kelly Bodwin: No, it’s even more controversial I’m not recommending that one eat it.
Michael Chow: I’m open to it as an adult next episode I’m just going to be chomping on one of those wrappers and we’ll see what happens.
Michael Chow: Well Kelly, thanks so much for coming on. I just love the energy you bring to the R community. Anytime I see you at a conference and you’re wheeling, I’m actually so sad I didn’t have them back here, you were slinging earrings, laser cut. I so deeply appreciate the community and welcoming, and also the level of pump you bring into situations.
Kelly Bodwin: It’s fun. For me PositConf is a highlight of the year. We go to board game conventions and then we go to our convention. I think I planned my wedding so I could go to the conference last year. Priorities. Having gone for 8 years, not counting COVID, these people are my real friends; it’s like a reunion. I don’t know, I appreciate you saying you like the enthusiasm. I don’t view it as work in the community. I didn’t start the earring thing, I didn’t start the crafting, I didn’t start the hex stickers, I just get to benefit from that. I feel very grateful for all the things that the people have built, both the actual technological tools and the community that’s formed. I think that really shines through. I’m so excited for years and years of keeping R weird.
Kelly Bodwin: I learned that I didn’t coin that. I don’t know how that phrase got stuck in my head, but Hadley dug up some talk he gave from way back when where he had a keep R weird slide, and I feel like it’s almost certain that I saw that and it stuck in my head. So everything I do is derivative.
Michael Chow: I think taking a slide to a keynote is such a beautiful thing. Someone had to expound, write a thesis on the keeping of weirdness, and I’m glad it was you. Thanks so much for coming on. I’ll see you at the next conference, and on the internet.
Kelly Bodwin: Thanks for having me, and I’ll see you, if not sooner, at the next conference. And Wes, great to talk to you, really appreciate everything you’ve done for Python and the WordPress platform.
[Podcast outro]
The Test Set is a production of Posit PBC, an open-source and enterprise tooling data science software company. This episode was produced in collaboration with Creative Studio AI. For more episodes, visit thetestset.co or find us on your favorite podcast platform.