How Open Source, Python and AI Are Shaping the Data Future
This transcript and summary were AI-generated and may contain errors.
Summary
In this conversation with Anthony Deighton on Tamr’s Data Masters podcast, I discuss the origins of pandas and how Python became the dominant language for data science—partly through timing and partly through the fortunate confluence of open-source projects like NumPy, SciPy, IPython/Jupyter, and matplotlib that together created a complete data analysis stack by 2013.
I explain the composable data stack and Apache Arrow’s role in it. The early 2010s were the “food, water, and shelter” era of big data where communities built vertically integrated solutions. Arrow emerged to address the cost of building pairwise interfaces between systems by providing a standard for tabular data interoperability. This has evolved into what we call the “deconstructed database”—modular, interchangeable layers connected by open standards. Projects like DuckDB and DataFusion represent different points on the spectrum from batteries-included to highly customizable.
On LLMs and data work: I argue that while LLMs will increasingly handle routine dashboard-building and BI work (especially with semantic layers), the more nuanced statistical and modeling work requiring domain expertise and human judgment will remain human-driven for now. LLMs struggle with tabular data fundamentally—they have trouble with basic retrieval, counting, and arithmetic. MCP isn’t an efficient way to expose large datasets to models. We’re still using “caveman tools” for AI-data interaction compared to the engineering we’ve done on Arrow.
I share my experience using Claude Code daily. It’s made me more productive but not obsolete—my experience reviewing code feels essential. I treat agent output like work from a motivated but error-prone junior developer. Test-driven development has become more important as a defensive practice. The existential question is how junior developers will become senior developers if they’re not doing the foundational work. There’s also the risk of “vibe coding”—deploying AI-generated code without proper review—leading to business losses. Code review remains a human bottleneck, and AI-generated codebases tend to be larger than necessary because agents often miss the more elegant solutions.
Key Quotes
“We were all rolling around our little snowballs and suddenly the snowballs merged together and became one really gigantic snowman that now powers the world.” — Wes McKinney, on how Python became the dominant data science language
“I mean people, some people underestimate how important it is to be able to read CSV files, but it turned out that just being able to point pandas at a CSV file was its initial killer app.” — Wes McKinney
“The cost of building one-to-one interfaces, pairwise interfaces between programming language A and computing system B, was not only hampering the performance of systems by introducing a lot of serialization and interoperability overhead but was also fragmenting effort and weighing down progress in the overall open source data ecosystem.” — Wes McKinney, on the problem Arrow was designed to solve
“Even MCP which was developed to provide a standardized interface for LLMs to interact with external systems and tools—it’s not an especially efficient way to expose data to an LLM… the AI equivalent of how do we expose data to an LLM looks like caveman tools by comparison.” — Wes McKinney
“I tell myself when I’m working with Claude Code like treat all the work that is coming out of this agent like a very motivated, very productive junior developer who’s prone to errors and frankly creating messes.” — Wes McKinney
“To be able to spot design, architectural problems, things that need to be refactored, code duplication, code smells—all that feels essential to getting the most value out of these tools.” — Wes McKinney
“We’re likely to have a vibe coding epidemic… there’s likely to be substantial business losses because developers are deploying vibe-coded software into production without sufficient code review.” — Wes McKinney
“One of the existential problems is how will junior developers become senior developers? The old working model was that you become a senior developer by doing junior work over a long period of time.” — Anthony Deighton
“Sometimes with an AI agent you can generate an impressive amount of code, thousands and thousands of lines of code in an afternoon to hack your way around a problem. But maybe there’s a more elegant solution that the AI agent just didn’t see.” — Wes McKinney
“Even with their massive context windows, they have the memory of a goldfish.” — Wes McKinney, on AI agents forgetting instructions
Transcript
[Podcast intro]
You’re tuned in to the Data Masters podcast. In each episode, we dissect the complexities of data management and discuss the data strategies that fuel innovation, growth, and efficiency. We speak with industry leaders who share how their modern approaches to data management help their organizations succeed. Let’s dive straight into today’s episode with Anthony Deighton.
Anthony Deighton: Welcome back to Data Masters. If you’ve ever written import pandas as pd in Python, then our guest today needs no introduction. He is the creator of pandas, the seminal open-source library that practically defined Python as a language for data science. But he didn’t stop there. He is also the co-creator of Apache Arrow, which created the de facto standard for how high-performance systems exchange data, in particular tabular data, and he is currently a principal architect at Posit, the chief scientist at Voltron Data, and a general partner at Composed Ventures. Wes McKinney has spent nearly two decades building the foundational infrastructure that powers our industry. And today we’re going to talk about what comes next, from the rise of the composable data stack to his contrarian view on why LLMs might be heading for a trough of disillusionment when it comes to actual analytics. Wes, welcome to Data Masters.
Wes McKinney: Thanks for having me on the podcast.
Anthony Deighton: So, I wanted to maybe start by going back in history a little bit. And I know this is almost an unfair question to sort of cast your eye back 15 years ago when you started the pandas project. But I think it’s fair to say, and again with the benefit of hindsight, that it really built this Python-based data science ecosystem. And in that sense, Python has become the de facto standard for a whole bunch of AI platforms, TensorFlow, PyTorch, etc. So, I’m just curious from your vantage point sitting here today looking back, if when you were first building pandas, did you have a sense that it would become the sort of foundational layer for AI?
Wes McKinney: It would be hard to predict that that far into the future, but I think I definitely saw there was a lot of untapped potential in Python. And if only there was a toolkit for basic data manipulation, data wrangling, then that would help unlock whatever was the potential or potential future for Python as a mainstream data language. But back in 2008 when I started working on pandas, it was not at all the case and was not a foregone conclusion. Even using Python for doing professional business data work was seen as fairly risky at the time because Python was unproven. It had a fairly immature open-source ecosystem for statistics and data analysis work.
The idea initially—pandas started out as a toolkit for myself to do my work at my job at a quant hedge fund. And I enjoyed building the toolkit for myself. And then eventually I was building it for my colleagues who were excited about using it. And eventually we open sourced the project and I started engaging with the Python community and seeing like is there an appetite for this? Do people want this? Is this something that the world needs? And eventually it turned out that the answer was yes—it was a little bit of being at the right place at the right time around 2011-2012, where people were starting to talk about data science and big data and there was suddenly a massive need for people with data skills. And Python was an open-source, accessible programming language that people could learn easily, and the sudden availability of a toolkit to read data out of databases, load CSV files, read Excel files, and then do meaningful work with that data with code that was easy to write and easy to reason about was one of the things that helped unlock Python as a language that could be accepted in the business world.
And I don’t know if this was causal or something that really factored in to Google choosing Python as the language for TensorFlow. I think it was a little bit of an accident. I was partly inspired by the fact that Google used Python as one of their three languages—the other two being C++ and Java. So, Python was their scripting language that they would use to build interpreted interfaces on top of mainly C++ libraries using SWIG and other wrapper generators. So, that was probably the main reason why Google chose to do TensorFlow in Python. And eventually Meta started building PyTorch—initially Torch was all in Lua I believe and eventually they migrated that to Python to create PyTorch.
So there’s a combination of being lucky and making the right prediction but also this lucky confluence of open source projects and then major AI research labs needing to choose a programming language to build their AI frameworks in. It just happened to be that everyone chose Python. And so, you know, we were all rolling around our little snowballs and suddenly the snowballs merged together and became one really gigantic snowman that now powers the world. So, it’s been an interesting time, but I’ve been resistant to patting myself on the back and saying like, “Oh, I predicted this was going to happen. I knew it was going to end up like this.” Because that’s definitely not the case. I was hopeful that things would end up like this, but I would have been satisfied with a much less successful outcome.
Anthony Deighton: Well, I can pat you on the back. So, how about that? So, I think that’s fair. In that same spirit though, and maybe this is an unfair question, but just to throw it out there: were there other features of Python that lent themselves to this? Clearly you solved the big unmet need around access to data and also munging that data, to use your term. But were there other features of Python that made it particularly relevant for this use case?
Wes McKinney: I think because Python is really easy—it was originally created as a teaching language. So it’s easy to learn. It’s easy to read. People back in the day would often describe Python as readable pseudocode, similar to the code that you would write to describe algorithms. So you could hand a piece of Python code to somebody who’d never written Python before and they would pretty much be able to get the idea of what the code was doing, without a lot of types. And of course now Python has types and so that’s changing, but the language had an accessibility and a readability aspect that made it really appealing to do scientific work in.
Python also had an existing numerical computing ecosystem. So there was a group of folks that were essentially building an alternative to working in MATLAB. So if you were doing neuroscience research or physics research or things like that, you had NumPy and SciPy as a basic computing foundation for numerical algorithms, optimization, linear algebra, the essential things that you would need to begin to start doing statistics and data analysis work. Like whenever I needed to run a linear regression in Python, I didn’t have to wrap linear algebra libraries myself. That work was already done. So that was an essential factor.
I think another thing that really helped tie the room together was the computing and development environment which initially started out as being the IPython shell and eventually the IPython notebook which turned into the Jupyter notebook and now the Jupyter ecosystem and Jupyter Lab. And so that also gave people—that was one of the first mainstream open-source computing notebooks. People were familiar with Mathematica and other closed source computing notebooks in the past but this one was inspired by things that had come before out of Mathematica and MATLAB and things like that but it was open source and just worked with matplotlib and the plotting libraries and all the things that existed at the time.
So essentially very quickly by 2013 or so we had this full stack environment that had the bare essentials: an interactive computing environment through the IPython notebook/Jupyter, numerical computing through NumPy and SciPy, plotting through matplotlib, and data wrangling through pandas. And so if you were an aspiring data scientist or somebody who was looking to do business analytics or build a data analysis application in a business setting, at that point you could credibly make the case to your colleagues and your boss that you had the tools at your disposal to be able to build something without getting pulled down into a rabbit hole having to build essential components from scratch.
I mean people, some people underestimate how important it is to be able to read CSV files, but it turned out that just being able to point pandas at a CSV file was its initial killer app. Like, “Oh, I have a data file. I can say pd.read_csv and read it.” People take that for granted now, but circa 2012, that was a big level up for Python at the time.
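For readers who haven’t seen it, this is roughly what that “killer app” looks like in practice: a minimal sketch with a hypothetical file and columns, not taken from the conversation.

```python
import pandas as pd

# Hypothetical CSV of trades; pandas infers column types and parses the dates.
df = pd.read_csv("trades.csv", parse_dates=["date"])

print(df.head())                           # quick look at the first rows
print(df.groupby("ticker")["pnl"].sum())   # simple split-apply-combine summary
```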
Anthony Deighton: No, I mean that alone—the hell that is parsing CSV files—being able to abstract that away and handle the common use case is huge. I would venture to bet 98% of use cases were loading CSVs off a local file system, even for very large amounts of data.
Absolutely. So shifting closer to the present—you’ve also shifted your energy and focus towards Apache Arrow, and I think it would be fair to say that you would frame Arrow as the foundation for the composable data stack. And there are tools like DuckDB and others that are sort of based on that. So talk a little bit about what the composable data stack is for those who are not familiar, maybe building on these lessons you talked about—building the bigger snowball, this idea of interoperability and building an ecosystem—and how Apache Arrow fits into that as well.
Wes McKinney: Yeah. So, well, you can think about the late 2000s, early 2010s as being a little bit of like the food, water, and shelter era of big data and open-source data processing in general. And so, you had disparate communities building solutions to solve the immediate essential problems that were right in front of them. And so there was relatively little consideration to building larger more heterogeneous data stacks full of multiple programming languages, different data analysis tools, processing engines, storage engines.
And so people were in general building vertically integrated systems where they would build a solution for a particular problem with a layer of technologies where classically you would have a database system that has everything tightly integrated together. So the storage engine, the data ingestion, query planning and optimization, query execution, as well as the SQL front end—that would all be present within a full stack vertically integrated system.
But one of the key ideas from the Hadoop and big data era which was originally kicked off by Google’s MapReduce paper was this idea of disaggregated storage or storage being decoupled from compute. And so you can start to think about okay how can we store data in a way that multiple compute engines can work on it. So that gets you thinking about standardizing data formats. Having open standards, open-source standards for data formats. And that was stage one of what was happening in the big data ecosystem.
But as time dragged on there was a collective realization in the mid-2010s that the cost of building one-to-one interfaces, pairwise interfaces between programming language A and computing system or storage system B, was not only hampering the performance of systems by introducing a lot of serialization and interoperability overhead but was also fragmenting effort and weighing down progress in the overall open source data ecosystem.
So the original idea of Arrow was if we could define an open standard for data interoperability for tabular data, in particular column-oriented data similar to what you would find in pandas or in an analytic database, and we could use that format for moving data efficiently between programming languages, between compute engines, between storage systems and execution layers—that would not only improve performance but also reduce the amount of glue code that has to be written by developers to make these systems work.
And so we initially started with this data interoperability problem of essentially having a better wire protocol for hooking together systems. But what was interesting is that once we had that—and that came through the Arrow project—we already had some standardized file formats like Parquet and ORC at the time and Arrow was designed to work really well with those as a companion technology. But what we’ve seen as time has moved on is that we can start to modularize and decouple the other layers of the stack. And so we can start to think about modular execution engines or decoupled frontends—different types of user interfaces for interacting with compute engines, not only SQL but also more DataFrame-like interfaces like what you would see from using pandas.
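As a concrete sketch of what that interoperability looks like from Python (the file names and values here are hypothetical), the same Arrow table can be handed to Parquet on disk or written in the Arrow IPC format that other engines and languages can read:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

df = pd.DataFrame({"id": [1, 2, 3], "value": [10.5, 20.1, 0.3]})
table = pa.Table.from_pandas(df)          # pandas -> Arrow columnar memory

pq.write_table(table, "values.parquet")   # Parquet as the companion on-disk format

with ipc.new_file("values.arrow", table.schema) as writer:
    writer.write_table(table)             # Arrow IPC: the standardized format that
                                          # other engines and languages can consume
```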
And so another way that we described this composable data stack was the “deconstructed database.” You take the architecture of a traditional vertically integrated database system, separate out all the different layers, and build open protocols and open standards to connect those pieces with each other to enable interchangeability—so that if somebody comes along and develops a better storage format, we can incorporate it. Maybe that storage format only works really well for certain types of use cases, so you can choose to use Parquet files for one set of use cases, whereas now there are new file formats specialized for multimodal AI data, including images and video, like the Lance format. And you’d like to be able to take advantage of those new developments in the ecosystem without having to do a full tear-out of your system and basically throw the baby out with the bathwater in order to get access to new or enhanced functionality in one layer of the stack.
Anthony Deighton: That makes sense. So what I’m curious about is how this interacts with the more traditional closed source database and analytic software providers. Have you found support in that or is this felt like competition? How do you get folks like that on board? Or is that not even a consideration and it’s really just an orthogonal standard?
Wes McKinney: Well, I think it’s one of those like the open-source model succeeding stories in the sense that a lot of the adoption and growth and success of the composable data stack—so essentially Arrow, DuckDB, DataFusion, and a collection of related open source technologies—has been really driven by open-source adoption and grassroots bottom-up adoption and pressure on some of the larger, more powerful forces in the ecosystem.
So DataFusion for example is a modular customizable columnar query engine similar to DuckDB. It’s written in Rust but it’s really designed for customizability so rather than being a batteries-included, ready to go system which is more like DuckDB, DataFusion is meant to be customized. It wants you to mess around and modify and add operators to its logical planner layer. It wants you to be able to hack on the optimizer to introduce new features in its SQL dialect.
And the idea of DataFusion is that you could take this off-the-shelf high performance Arrow-native query engine and use it to build your custom database engine or your custom query processing solution. And as time went on, DataFusion just got more and more popular. It got better and better in terms of performance and extensibility. To the point where Apple decided to go all-in on using DataFusion to build its Spark accelerator layer called DataFusion Comet. So now there’s a team at Apple building—accelerating Apache Spark with DataFusion. The creator of DataFusion, Andy Grove, works for Apple and leads that team there.
But that was something where it wasn’t a top-down thing necessarily, but rather people looking in on the evolution of the data stack and seeing like okay these technologies are becoming integrated all over the place and they’re getting better and better over time and they’re attracting more and more contributions from the open source ecosystem. And it’s better to get involved in these projects, hire people to work on them and influence their direction in a beneficial way rather than go some totally different way or build something that’s completely proprietary.
I think another thing that has driven the adoption of these composable data stack technologies has been the trend in open data lakes, or what is now called the lakehouse architecture, where you have a structured data lake with a scalable metadata store like Apache Iceberg. You can think about it as an evolution or a formalization of some of the ideas from the Hadoop era. Originally, the way that datasets and their metadata were managed in Hadoop was you stored the data in HDFS and then there was a metastore called the Hive metastore—basically a MySQL or Postgres database that contained all of the details about what constituted a table. And whenever you were planning a query you would read data from the Hive metastore and that would tell you what files you needed to read to run a particular SQL query with your chosen compute engine. But the Hive metastore ran into scalability challenges, especially in very large data lakes, and so that led to the creation of these new open data lake or lakehouse technologies like Iceberg, Hudi, and Delta Lake, which provide a more scalable and high-performance approach for some of these really massive data lakes that you find among the biggest companies in the world.
So it’s been interesting, but again it’s the success of the open-source model. And I think it’s also interesting to see cutting-edge research labs—CWI, the birthplace of analytic columnar databases, basically CWI and MIT and a handful of academic database research labs—choose to build DuckDB as an open-source project. They could have gone and built another commercial analytic database company, but they chose to build a SQLite-type system for analytics, to be one of the best batteries-included embeddable SQL engines out there. And now they have a project that is open source, has massive adoption, and you’d be crazy to go and build a brand new embedded database engine from scratch now. If you need something customizable and you want to work in Rust, you can use DataFusion. If you want something that’s batteries included and ready to go out of the box, you choose DuckDB. And that’s what we’re seeing kind of across the board.
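To make the batteries-included point concrete, here is a minimal, hypothetical example of DuckDB’s Python API querying an in-memory pandas DataFrame with SQL and handing the result back as pandas or Arrow:

```python
import duckdb
import pandas as pd

# Hypothetical table; DuckDB can query pandas DataFrames (or Parquet files,
# or Arrow tables) in place, with no separate loading step.
payments = pd.DataFrame({"account": ["a", "a", "b"], "amount": [10.0, 5.0, 7.5]})

rel = duckdb.sql(
    "SELECT account, SUM(amount) AS total FROM payments GROUP BY account ORDER BY account"
)
print(rel.df())      # results as a pandas DataFrame
print(rel.arrow())   # or as an Arrow table, with no bespoke glue code
```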
Anthony Deighton: Yeah. So the open-source strategy ultimately sort of trumps because it allows for a full range of highly customizable through to batteries included or fully functioning. So let’s shift the conversation a little bit and not totally cast your eye forward but maybe cast your eye to the present which is this whole idea of building on top of this stack, how people generate insights and conclusions off this data.
You know, there’s maybe a working theory that there is no future for data scientists and data engineers because smart models—and being careful not to say LLMs but maybe LLMs—are going to become so good at understanding and working with data that the notion of asking an analyst to tackle a problem will seem silly and rather we’ll just ask our smart agent to figure it out themselves. And you have a bit of a contrarian view here. And my sense, but I don’t want to lead the witness as it were, is that it comes from your experience in dealing with tabular data with complex business logic and both pandas and Arrow. And just trying to connect what we were just talking about to this challenge.
So let’s start with the question of whether you think we’re on the verge of firing all the data analysts and turning it over to ChatGPT or whatever and then maybe we can talk about why or what about tabular data.
Wes McKinney: Yeah, there’s a lot to that question, many layers to unpack. First, one of the things that I’ve observed—maybe one of the dirty secrets of the term data science, and maybe why we’re hearing the terms data science and data scientist thrown around less and less these days—is that a lot of the data science teams people were trying to hire for and build in the early and mid-2010s ended up essentially doing business intelligence: engineering data pipelines, doing ETL or reverse ETL or data plumbing, ultimately to create a dashboard or a series of dashboards that could be updated. So it was an evolution of the traditional old-school BI engineer or ETL engineer building a database to power somebody’s Tableau server instance.
And so I do think that a lot of that work—the mundane building of dashboards, and allowing business users to ask in natural language for custom, bespoke dashboards that answer the exact question they need—is increasingly going to be done by LLMs, especially as there’s a lot of work happening recently on semantic layers, which I think are frankly necessary in a lot of cases to make LLMs effective at reasoning about the relationships between tables and generating correct queries against the data. Without a semantic layer, there are lots of examples that have been shown of LLMs reasoning incorrectly about the join relationships between tables and making the kinds of errors in writing SQL queries that a first-year analyst would make—double counting and things like that.
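To make that double-counting failure mode concrete, here is a tiny, hypothetical pandas example of the kind of one-to-many join mistake a naive query (human- or LLM-written) can make:

```python
import pandas as pd

# One order worth $100, shipped in two packages (a one-to-many relationship).
orders = pd.DataFrame({"order_id": [1], "order_total": [100.0]})
shipments = pd.DataFrame({"order_id": [1, 1], "shipment_id": [10, 11]})

# Naive join-then-sum repeats order_total once per shipment: revenue is
# double counted (200 instead of 100).
joined = orders.merge(shipments, on="order_id")
print(joined["order_total"].sum())  # 200.0 -- wrong

# Deduplicating on the order grain (or aggregating before the join) fixes it;
# a semantic layer encodes exactly these grain and join relationships.
print(joined.drop_duplicates("order_id")["order_total"].sum())  # 100.0
```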
But I think that that ecosystem will become more mature, that semantic layers will become more widely deployed and standardized, and more and more of that dashboard building and custom dashboard building work will be taken care of by agents.
But I think there is still data science work which involves modeling and asking nuanced or subtle questions that require judgment and intuition, having domain expertise and understanding of the business context where the questions are being asked, and choosing the right techniques to build a statistical model or a machine learning model. Maybe you’re trying to determine a causal relationship or do some type of forecasting. A lot of this type of data science work is part science and part art, relying on experience from past modeling or statistics work. And I do think that that kind of statistical and data science work still requires a lot of human judgment.
And I think—maybe eventually once we have AGI it will be taken over by an AGI statistician, we’ll see—but in the short term I think there still is a need for that. That’s one of the areas where the most human judgment is needed, where if you just turn over all of that work to an agentic data scientist it’s likely to run into pitfalls or only explore the types of questions or analyses that the LLMs are well suited for. And for the more complex work—there might be some study that you need to run that requires dozens of queries, and the results need to be stitched together and compared in lots of different ways. LLMs are still at a stage where they often struggle to count, so asking them to reason about something like “you ran these 35 queries, now stitch together the results and reason about them” is difficult. Right now a lot of that work is being offloaded to tool calling because LLMs are not great at looking at datasets, which we can talk more about.
But we’re still I think at early days in terms of data scientists being put out of a job by AI agents but we’ll see where things land in a few years.
Anthony Deighton: So I want to maybe make this super basic. It strikes me that large language models—I mean the answer is in the name. It’s a language model and language by its nature is sequential. It also has these slightly arcane rules. I mean all languages have a grammar etc. And they are decidedly not tabular data. And they’re not really even code. So like to the extent that we believe LLMs are good at writing code for example, code is a much closer analogy to language. In fact it’s in a way a simpler version of language because it has a very tight grammar, a very tight syntax—whereas, as many people learning the English language for the first time will note, there’s lots of poorly implemented rules in languages. Whereas tabular data has a very specific set of features that LLMs are just not at all competent with. And you gave a very simple example of counting but even simple notions of sorting and querying and filtering and like basic behaviors are things that are just a mystery to it. Is that—am I framing it fairly?
Wes McKinney: Yeah. Yeah. I haven’t run any studies myself or set up structured evals to get the accuracy data for myself, but I remember early on—I think it was in the Claude 3.5 Sonnet or maybe Claude Sonnet 4 era—I built a little system to collect and summarize data from git repository history, and it would create little tables. Because it was a lot of data to summarize, I asked it to analyze the data that was summarized in the tables, and it would struggle to do basic arithmetic in combining really small tables of data together. And so that was the first time that a light bulb went off in my head. I was like, “Oh gosh.” These models are for language. They struggle to do things like adding or combining datasets unless all the work is delegated to tool calling or writing Python. They’d be better off writing Python code to do the work than actually trying to do the arithmetic or the logic in the language model itself.
But yeah, you’re absolutely right, and there has been some research lately around the retrieval problem. The basic idea of the retrieval problem is that you present a table—say a spreadsheet of students in a class, with a bunch of columns of attributes for those students; let’s say every student has 10 attributes stored in the table—and you ask the model, for this student, can you tell me their attribute C or their attribute F? So it’s essentially just looking at the table and straight up looking up the value in the table.
Anthony Deighton: And the frontier models have gotten to a point where they have high accuracy, over 90% accuracy in the retrieval problem, but the smaller models fail catastrophically at this retrieval problem.
Wes McKinney: And I’m blanking on the name of the blog post, but there have been people that have done studies even trying to determine what’s the best data format to present a table to an LLM, especially the smaller, more efficient, cost-efficient models, to get the best accuracy for the retrieval problem. And it’s a little bit surprising the results—like you would think that CSV format would be a good format for an LLM to look at the data but it turns out that amongst the 10 or 15 different ways you could format a tabular dataset, there’s some weird formats—basically markdown key values I think was one that I saw which was a format that I’d never heard of. I think it was invented for the study where it turned out that presenting the data in this markdown format with each row as a markdown section would yield better retrieval than putting the data in XML or in JSON in the prompt.
So, I mean, I’m not an AI scientist, but some of it may have to do with the autoregressive, next-token-prediction design of these models. And I’m sure that they will get better over time, but these large frontier models are really expensive to run, and so the hope is that as LLMs advance, the small models become more and more effective, so we can run the model at the edge, on our phone, or on local hardware—I just got myself an NVIDIA DGX Spark, one of their little AI mini computers, to experiment on and to build some of my own fine-tuned models. And I’d be really excited if we could do really great work with local LLMs not requiring $30,000 or $40,000 GPUs, but to really get the performance, you’ve got to run on these really expensive hardware configurations. So cutting-edge inference is quite expensive these days.
So it’s an interesting problem. I think there are a number of companies that are working on foundation models specifically for tabular data—basically an AI approach to prediction and forecasting and regression and things like that. And so I think we’ll definitely see more interest in that area. I’m surprised that the frontier AI research labs aren’t doing more of this—maybe they’ve got internal research projects that they haven’t announced. But maybe as some of the hype shifts away from chatbots, more work might shift towards building foundation models for tabular data, because ultimately a lot of the value to unlock for businesses does lie in their data. And so to get the most value out of AI in a business context, somehow we’ve got to reconcile this incongruence between current-generation LLMs and interacting with tabular datasets.
Even MCP which was developed to provide a standardized interface for LLMs to interact with external systems and tools—it’s not an especially efficient way to expose data to an LLM. Even if LLMs were really good at looking at datasets, MCP is not the interface that you would want to provide a 100,000 row table to a model to look at. And so we’re far—just thinking about all the work that we’ve done, the engineering work that we’ve done on Arrow to achieve high performance interoperability in all these contexts—the AI equivalent of how do we expose data to an LLM looks like caveman tools by comparison.
Anthony Deighton: Well, which is exactly the connection I was hoping you would draw because it’s like we’ve done 10-15 years worth of work to make data interoperable and then we turn to this new world and we just start from scratch. Like it doesn’t make any sense.
Also I was going to ask you—the other failure point it seems to me with these models is that they aren’t deterministic. Like the one thing we can say confidently for any analytical problem is that if you run the analysis twice, you should get the same answer. Like it’s not okay to be like well it’s sort of around two, or today it’s two and tomorrow it’s 20.
Wes McKinney: Yeah. Yeah.
Anthony Deighton: Whereas that’s actually a feature I think of these language models which is—obviously you can adjust the temperature but I think something people like about it is that it gives you different answers every time. Otherwise, it’s just a rules-based system. But I don’t know if that resonates with you.
Wes McKinney: It does. Throughout all of last year I was pretty AI skeptical, let’s put it that way. And this year that’s changed—I think in part because the models have gotten a lot better, and also because of the emergence of CLI coding agents like Claude Code. For me that was a big unlock. I wasn’t particularly enthused about using the AI IDEs like Cursor and Windsurf, but within a couple of weeks of using Claude Code to delegate mundane coding work, refactoring, just stuff that was taking up my time that didn’t seem particularly high value, and seeing really quick returns and productivity benefits—I’ve become a big believer. Now I use Claude Code almost every day and I’m definitely costing Anthropic money on my Max plan because of the couple billion tokens that I consume every month.
But at the same time, the imperfection—the inconsistency, the non-determinism, especially if you don’t catch it—is problematic for data work where you need to have an exact answer every time you run the system. If you have the model call tools and you’re leaving some interpretation of the results to the model, and one day you get one answer and another day from the same input you get a different answer, that could lead to a business decision being taken that is detrimental to the business. And that’s problematic.
And I often find myself playing whack-a-mole with the prompts to get consistent behavior out of the models, like particularly creating a consistent development environment where each time I pick up Claude Code that I can count on the agent to predictably do the same things whether it’s mundane things like making sure that the style checks run. But it seems that 20 to 50% of the time, from day to day, without modifying the prompts or the CLAUDE.md or any of the stuff at all, the next day I’ll open Claude Code and it will forget to do things. And I’ll say, “Hey, you forgot to do this. Like CI is failing.” And it’ll be like, “You’re absolutely right. I ignored your instructions that we wrote in CLAUDE.md.”
And so these tools can just casually forget things—even with their massive context windows, they have the memory of a goldfish. I’m sure that it will get better. And I use these tools every day—they bring a lot of value for me—but I’m also not quite drinking the Kool-Aid and believing that this is the next great step for humanity that’s going to lead to a world without work.
Anthony Deighton: Yeah. And I mean maybe to say it in a really simple way I think you would agree that it’s made you much more productive but it’s not made you obsolete.
Wes McKinney: Obsolete, yeah. No, if anything my experience building software feels essential when I’m using these tools, because I need to be able to read the code, to review it as though I was reviewing the work of a junior developer, and tell it all of the things that it messed up—its architectural problems, the incorrect or missing unit tests, incorrect documentation, incorrect implementations. I have a lot of experience reviewing other people’s code and I feel like that is one of my biggest assets when I’m working with these coding agents. I tell myself when I’m working with Claude Code: treat all the work that is coming out of this agent like a very motivated, very productive junior developer who’s prone to errors and frankly creating messes.
And so to be able to spot, at a glance, design and architectural problems, things that need to be refactored, code duplication, code smells—all that feels essential to getting the most value out of these tools. And I’ve heard from talking to other people that the users who are able to get the most value out of the coding agents are the most experienced developers, who are able to bring their experience and judgment not only to write better prompts—to be very specific and articulate about what you’re asking for—but also to judge the output and give high-quality feedback so that you can corral things, basically move them in the right direction to get what you want.
But I do think that we’re likely to have like a vibe coding epidemic of—you know kind of amplifying Dunning-Kruger syndrome of people building software, AI slop, not reading the output carefully, not doing code review, basically just letting Codex or Claude Code do its thing and then slapping up a pull request without giving it a second thought. And there’s likely to be substantial business losses because developers are deploying vibe-coded software into production without sufficient code review.
Of course you can protect yourself against some of this by taking a test-driven development approach and asking the agent to build a test suite—which of course you have to review—before you set it to work on the implementation. In the past I was never really a hardcore test-driven development (TDD) adherent, but now, using coding agents, I’ve become much more so, because each time I sit down with Claude Code I treat it as a defensive exercise: how do I protect myself from the agent doing incomplete work, or insisting that it’s solved the problem in the way that I asked for when it has really deceived itself into believing that it’s finished.
And so the more test coverage—whether it’s test coverage, automated checks, benchmarking suites—all the defensive things that you would need to do already to create a piece of production software, it’s even more important to do that with these agents. Otherwise any software that they create becomes a huge liability.
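A minimal sketch of that defensive, test-first loop, with a hypothetical function (not from Wes’s MoneyFlow project): the human writes and reviews the tests, then asks the agent to supply an implementation that makes them pass.

```python
# test_categorize.py -- written and reviewed by the human before any
# implementation work is delegated to the coding agent (hypothetical example).
import pytest


def categorize_transaction(description: str, amount: float) -> str:
    # Placeholder: the agent is asked to replace this with a real
    # implementation that makes the tests below pass (the TDD "red" phase).
    raise NotImplementedError


def test_groceries_are_categorized():
    assert categorize_transaction("WHOLE FOODS #123", amount=-54.10) == "groceries"


def test_unknown_merchants_fall_back_to_other():
    assert categorize_transaction("XYZZY LLC", amount=-12.00) == "other"


def test_zero_amount_rows_are_rejected():
    with pytest.raises(ValueError):
        categorize_transaction("WHOLE FOODS #123", amount=0.0)
```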
Anthony Deighton: Yeah, I think that’s a wonderful insight actually—these practices become more important. And I also want to loop it back to something you said before, which is that the future is a smaller number of really experienced, thoughtful architect-engineer types marshalling an army of these agents but then, to your point, layering in these defensive coding practices. Maybe that model starts to feel closer to the future.
I sort of am waiting for the first spaceship crashing into Mars due to a vibe coding error as opposed to a units conversion problem.
Wes McKinney: Right. Right.
Anthony Deighton: Yeah. It’s interesting. I think one of the existential problems is how will junior developers become senior developers? And so the old working model was that you become a senior developer by doing junior work over a long period of time getting good at not only writing code but reviewing code and seeing what good code looks like. And then as your career progresses more of your work transitions from—maybe as a junior developer you’re doing 90% writing code, 10% code review. Maybe a senior developer is doing 10% development, 90% code review.
And so now delegating more of the coding work to these agents means that more and more of our work is shifting to code review. And I think the code review is still going to be a bottleneck. And so even having a senior developer with an army of AI agents, who’s going to review all of that code? I mean, you can have the agents review their own work or you can have—
Wes McKinney: Good luck with that.
Anthony Deighton: Yeah. I mean I have friends that have told me about having Claude Code implement and Codex review, or vice versa—basically using the agents to review each other’s slop and make it better. But yeah, I think on the whole we’re going to have fewer software engineers, and especially the really experienced engineers will spend more of their time reviewing the work, the output of agents.
Wes McKinney: But still there’s a human bottleneck of being able to do code review, assess the work, and determine whether or not it should be accepted—or whether, as in many cases, these AI-generated pull requests and patches should just be thrown out altogether and started over. One of the nice things is that the cost of starting over is so much smaller. And sometimes you see an approach to a problem and it’s basically like the Panama Canal approach versus going around the bottom of South America.
And sometimes with an AI agent you can generate an impressive amount of code, thousands and thousands of lines of code in an afternoon, to hack your way around a problem. But maybe there’s a simpler, more elegant solution that the AI agent just didn’t see, or that wasn’t obvious given all the data in its training set. And so if everyone is always using the agents and creating solutions which are sometimes circuitous, or kind of missing the more elegant, more maintainable, more sustainable approach, that’s going to lead to these projects where you have 100,000 lines of code that maybe should only be 15 or 20,000. If written by a human, it might be a much smaller codebase that’s easier to maintain and more robust over time.
Whereas with this large codebase, you start to reach a certain limit where even having the agent look at the codebase becomes unwieldy—with files of thousands of lines, it starts to choke on the input pretty quickly. I recently created a little personal finance tool with Claude Code called MoneyFlow, for interacting with Monarch Money, the personal finance tool I use. It’s a project that I created from scratch with Claude Code—so it’s 95 to 99% created with Claude Code, with a lot of feedback and code review from me—and, including the test suite and all of the infrastructure and everything, it’s pushing 40,000 lines of code. If I had written it by hand, it probably would be a lot smaller, in part because I couldn’t spend nearly the amount of time that it would take to write a codebase that large, and I would have cut more corners or made simplifying assumptions in order to accomplish the same things with a lot less code.
So yeah, it’s interesting, but I’m learning day-to-day and I’m no expert by any means, but I feel like the more I use these tools, the better I understand them and the more I get out of them.
Anthony Deighton: Well, Wes, really appreciate you taking the time. This was a bit of a journey from where you started and through to the present day and also thinking about how to think about a future where we’re marshalling agents on our behalf. I really appreciate you taking the time and sharing your thoughts and insights.
Wes McKinney: Yeah, thanks again for having me on.
Anthony Deighton: Thanks for joining us for the latest episode of the Data Masters podcast. You’ll find links in the show notes to any resources mentioned on today’s show. And if you’re enjoying our podcast, please subscribe so you never miss an episode.