Supporting the Arrow Ecosystem at Voltron Data
This transcript was obtained from the official Data Engineering Podcast website. The summary below is AI-generated.
Summary
In this episode of the Data Engineering Podcast, I join host Tobias Macey to discuss my work at Voltron Data, the Apache Arrow project, and the vision for a more modular and composable data analytics stack.
Voltron Data’s Mission
Voltron Data was formed by bringing together several groups: my team from Ursa Labs/Ursa Computing, leadership from NVIDIA’s RAPIDS projects, and the BlazingSQL project. The company’s mission is making the modern data analytics stack more modular and composable, enabling developers to unlock the value of modern hardware including GPUs, FPGAs, and custom silicon.
Apache Arrow’s Evolution
Arrow started in 2015-2016 as a standardized way to represent tabular data in a language-agnostic fashion. The project has grown from defining a data format to building a complete “multi-language toolbox for building analytical data processing systems.” This includes data serialization, RPC frameworks (Flight), database middlewares (Flight SQL, ADBC), and compute engines (Acero in C++, DataFusion in Rust).
Substrait: A New Intermediate Representation
We’re developing Substrait, an intermediate representation for data analytics operations that sits at a lower level than SQL. It can represent tabular data and dataframe operations beyond what SQL can express, connecting user interfaces to compute engines on the backend. This gives developers a choice of engine without having to build custom integrations for each backend.
The Impact of Arrow
Early adopters who have spent 2-3 years building Arrow-native systems are seeing significant business impact: lower resource utilization, better interoperability, improved efficiency and performance, and lower latency. Seeing the dream of Arrow become reality in large-scale data platforms is deeply validating.
Key Lessons
Relationships really matter in building large open source projects. The social dimension is the most difficult but also the most important for building something sustainable over a long period of time.
Key Quotes
“For me, it started out as a personal challenge to see if I could create tools for myself to enhance my productivity because I found my job to be quite tedious and working with data to be much more difficult than I thought it should be.” — Wes McKinney
“The mission of the company, the heart or the soul of the company is making the modern data analytics stack more modular and composable to make it easier for developers and users of analytics or data engineering tools to unlock the value of modern hardware.” — Wes McKinney
“The way that we describe the project these days is we describe the Arrow project as being a multi-language toolbox for building analytical data processing systems.” — Wes McKinney
“We were able to make custom code running in PySpark 10 to 100 times faster in some cases.” — Wes McKinney
“You can think about Substrait as being something that’s lower level than SQL and can be used to represent tabular data or data frame operations that go outside of what is expressible in SQL.” — Wes McKinney
“The lessons learned are that relationships really matter. It’s not just about writing code and pushing code to GitHub. The social dimension of building these types of projects is the most difficult, but also the most important if your goal is to build something that has large scope and that you want to be sustainable over a long period of time.” — Wes McKinney
“We see the Arrow project as being an agent of change and progress in the open source ecosystem.” — Wes McKinney
“I’m looking forward to a day when we won’t have to think about that. We’ll have written some of our last data connectors, and we can just think about Arrow, and that will make our lives a lot easier.” — Wes McKinney
Transcript
Tobias Macey: Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey. And today, I’m interviewing Wes McKinney about his work at Voltron Data and on the Arrow project and its surrounding ecosystem. So, Wes, can you start by introducing yourself?
Wes McKinney: Yeah. Sure. Thanks for having me. I’m Wes McKinney. Many people know me as the creator of the Python pandas project, which I started almost 15 years ago, but over the last 7 years, I’ve been primarily focused on the Apache Arrow project and the surrounding open source ecosystem. More recently, I’m the CTO and cofounder of Voltron Data, a data analytics startup where we’re offering enterprise support and services around Apache Arrow and doing a substantial amount of open source development in the ecosystem.
Tobias Macey: And do you remember how you first got started working in data?
Wes McKinney: I’ve told the story many times, but I was working in quantitative finance right out of college. I had a math degree, and I thought I was gonna be doing math and solving partial differential equations and that sort of thing, but it turned out that I was mostly doing data analysis and writing SQL queries and using data frames and things like that. And so I started to get interested in the tools for doing data analysis because I wanted to make myself more productive, because I found my job to be quite tedious and working with data to be much more difficult than I thought it should be. For me, it started out as a personal challenge to see if I could create tools for myself to enhance my productivity. And I found that I enjoyed building tools, and I became very passionate about open source software. You know, I love working in the community, building projects, and helping progress happen faster.
Tobias Macey: In terms of the Voltron Data business and the kind of focus of it, I’m wondering if you can give some of the overview and some of the story behind how it came to be.
Wes McKinney: The Apache Arrow project started when we got the initial group of developers together in 2015, and we formally launched it as a top-level project in the Apache Software Foundation in 2016. We set about, you know, growing the different layers of the stack. And as time went by, we started to observe more general trends in the interplay between programming languages, data storage, data access, and the data analytics stack, and the role of the evolution of computing hardware, in particular things like GPUs, FPGAs, and custom silicon.
There were many different groups of developers working on different layers of the stack in and around the Apache Arrow ecosystem, and we saw an opportunity to build a unified computing company bringing several of those groups of people together: myself and my team from Ursa Labs, which became Ursa Computing, a group of leadership from the RAPIDS projects, which had been started at NVIDIA, and the BlazingSQL project, which is a SQL engine built on top of RAPIDS. We reasoned that we could build a more integrated and more successful company working together under one roof than trying to grow our different slices of the pie, so to speak. So that’s how the company came together at the beginning of last year, you know, to build a large team.
Thankfully, we were able to assemble quite a bit of investor capital before the market turned south earlier this year. We’ve been really just heads down building for the last year and a half, which has been really exciting.
Tobias Macey: One of the, I guess, kind of meta notes that I’m curious about is how you settled on Voltron as the company name, and how often people wonder what it’s in reference to.
Wes McKinney: You know, we like the name Voltron Data. We wanted to evoke the feeling that what we’re building is something where the whole is greater than the sum of its parts. And I think the mission of the company, kind of the heart or the soul of the company, is making the modern data analytics stack more modular and composable, to make it easier for developers and users of analytics or data engineering tools to unlock the value of modern hardware and to take advantage of advances in computing capabilities as they become available.
In the world of machine learning and AI, you know, deep learning training and that sort of thing, we’ve seen significant change to the technology landscape through the use of hardware acceleration, first through GPUs, and now TPUs and custom chips for accelerating machine learning. The same kind of innovation and improvement in computing efficiency can be brought to the other layers of the data processing stack: analytics, machine learning preprocessing, ETL.
We’re really focused on improving the protocols and standards, the fundamental technologies that enable that kind of modularity and composability at the language, data, and hardware level. And so if developers observe the work that we’re doing, not only in Apache Arrow but in some of the surrounding projects like Substrait and Ibis, which we can dig more into in this podcast, you can see how we’re working on really hardening these interfaces and protocols between the different layers of the stack, to make it easier for developers to swap out components and develop in a more framework-agnostic or engine-agnostic fashion, if that makes sense.
Tobias Macey: As far as the broader vision of Arrow, you know, it has these immediate benefits of being able to operate as an interchange format between different languages and run times and frameworks, and it has been growing in terms of its scope and its capabilities. And I’m curious if you have any overarching vision for Arrow and its potential impact on the broader data ecosystem and some of the ways that the work that you’re doing at Voltron is aimed at helping to bring forth the realization of that vision.
Wes McKinney: You know, going back 6, 7 years, when we started the Arrow project, I did always have the aspiration of building a more modern computing foundation for data frames and tabular data processing. And so for me, expanding the scope of what we call Apache Arrow has always been something that I’ve been really motivated to do. But when we started the project, we had to start small: can we, as a community, come to an agreement around how we represent tabular data in a framework- and language-agnostic fashion, such that we can achieve this concept of a universal data frame, which can be used portably across computing frameworks, programming languages, and different processing environments? That gave us a basis for beginning to think about that kind of frictionless modularity and composability at the data level.
Once we did that, we had to move on to building the other layers of the stack which are necessary to build Arrow-native applications. So that’s, you know, data serialization and building RPC for moving around data efficiently in a distributed system. More recently, we’ve been looking at the protocols and interfaces for interacting with databases in an Arrow-native way, and so we’ve got subprojects which are specifically focused on integrating Arrow more natively into database systems, to make it easier to push Arrow-based datasets in and out of databases.
The other dimension is that, beyond having a universal data format and the protocols and interfaces for moving it around and connecting systems together in an Arrow-native way, we also needed to build compute engines to process Arrow data, engines that we can embed into different systems to do data cleaning, data preparation, feature engineering for machine learning, analytics, all those things that you would do with a SQL engine or a data frame library, that sort of thing. And so over time, the work in the Arrow project has moved from building these fundamental protocols and interfaces to more of the modular, embeddable compute engine development, which has been really, really exciting to see.
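To make the layers Wes lists here a bit more concrete, here is a minimal Python sketch using the pyarrow library; the table, column names, and values are invented for illustration, and this shows just one entry point into the toolbox, not its whole surface:

```python
# A minimal sketch of Arrow-native compute and IPC using pyarrow.
# The table, column names, and values are hypothetical examples.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.ipc as ipc

table = pa.table({"city": ["NYC", "NYC", "SF"], "sales": [10.0, 20.0, 15.0]})

# Compute kernels operate directly on columnar Arrow memory.
big_sales = table.filter(pc.greater(table["sales"], 12.0))
totals = table.group_by("city").aggregate([("sales", "sum")])

# Arrow IPC serializes record batches without any format conversion,
# so another process (or language) can consume the bytes cheaply.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
roundtripped = ipc.open_stream(sink.getvalue()).read_all()
assert roundtripped.equals(table)
```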
Tobias Macey: One of the initial motivations for Arrow was to cut down on some of the inefficiencies of that data interchange. I think one of the most notable examples is using the PySpark library to interact with the Spark runtime and having to serialize and deserialize the data in between those interfaces, as well as having to translate between the representations of information between Java and Python. And I’m wondering if you can give an overview of the types and scope of impact on engineering productivity and compute efficiency that the Arrow project and its growth are intended to address.
Wes McKinney: I think the Spark example is a really motivating one, because that was one of the first practical problems that we focused on solving with the Arrow project: making Python on Spark a lot faster. If you look at using PySpark versus using the Spark Java or Spark Scala APIs, there was a significant performance penalty in using Python whenever you wanted to extend Spark with custom Python code that might use pandas or scikit-learn or something else in the Python ecosystem. So we defined a column-oriented data format which could be constructed on the JVM side, inside the Spark runtime, and then sent over to the Python side for executing custom code, which gave us not only a more efficient data format to move across, but also something that could be interacted with very cheaply on both sides without additional conversion or serialization.
This was with my colleagues at Two Sigma and collaborators at IBM, and we were able to make custom code running in PySpark 10 to 100 times faster in some cases. Now, of course, many workloads in Spark have shifted to use the Spark DataFrame API, where under the hood, the Python, Java, or Scala code gets translated into effectively a SQL query, which gets run by Spark SQL, so there’s no need for data to ever be transferred into Python. But there still are plenty of use cases where it’s necessary to run custom code, and Spark is used in many cases as a convenience layer for doing parallel and distributed computing with Python. Users shouldn’t have to pay an enormous penalty to have that privilege, and so Arrow has really helped in reducing the overhead, the impedance between those systems, in those cases.
That being said, Spark and Spark SQL are systems that have been around for a long time, and Spark SQL was built before Arrow existed. So, internally, it is not an Arrow-native system, so to speak: it represents the data that flows around inside Spark SQL in a data format that is not the same as the Arrow format. What’s really interesting for thinking about the future is having Spark-like distributed systems for large-scale tabular data processing that are fully Arrow-native end to end, where you have the ability to extend those systems with custom code written, in principle, in any programming language that knows about Arrow. That would enable a much more fair and consistent polyglot experience across the stack, where no programming language is unfairly penalized as a result of having to do expensive data serialization at the programming language boundary.
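As a hedged illustration of the PySpark path discussed above, the sketch below uses the pandas UDF API, which transfers batches between the JVM and Python as Arrow record batches; the column name and the demean function are invented for the example, and exact configuration keys can vary across Spark versions:

```python
# A minimal sketch of an Arrow-accelerated pandas UDF in PySpark.
# The column "x" and the demean logic are hypothetical examples.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
# Enables Arrow for toPandas()/createDataFrame() conversions as well;
# pandas UDFs themselves move data as Arrow record batches.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

@pandas_udf("double")
def demean(v: pd.Series) -> pd.Series:
    # Operates on a whole columnar batch instead of pickling row by row.
    return v - v.mean()

df = spark.range(1_000_000).selectExpr("cast(id as double) as x")
df.select(demean("x").alias("x_demeaned")).show(3)
```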
Tobias Macey: Because of the fact that you aren’t paying a penalty by virtue of the language or the runtime that you’re choosing, I’m curious how you see that influence the decisions that engineering teams make as to how they want to compose their stack and compose their analyses and some of the ways that that reflects in terms of the skill sets that are necessary to be able to build and maintain these analytical systems.
Wes McKinney: For me, what’s exciting and motivating is for users to have the choice, to be able to choose the programming language and the types of APIs and user interfaces that make the most sense for the systems that they are building, and to have a more natural, let’s call it language-integrated, querying capability. So I think part of the challenge that we have as system developers is to make it easier for the programming language interfaces to evolve and innovate independently from the back-end compute engines. Our goal with what we’ve been building in Arrow is to make very fast Arrow-native data processing available in a form factor where it can go everywhere, so it can be deployed in heavily resource-constrained environments where having very low-latency, efficient tabular data processing in-process is highly desirable.
But also, using the same APIs and user interfaces that we use to do local, small-scale computing at the edge, so to speak, we can build descriptions of our workloads or our data transformations in a form where they can be serialized and sent into large clusters for doing larger-scale data processing. And so that’s one of the reasons that we’ve been investing pretty heavily in this new project called Substrait, which is building an intermediate representation for data analytics operations that can be used to connect user interfaces and compute engines on the back end.
So you can think about Substrait as being something that’s lower level than SQL and can be used to represent tabular data or data frame operations that go outside of what is expressible in SQL. It’s our hope that by hardening that interface, we make it straightforward for compute engine builders to focus on building a Substrait interface to their engine.

Then, from the standpoint of the API developer, the user interface developer building Python or Go or Java or Rust libraries, we can just focus on generating Substrait rather than having to think about how to build an integration with each particular compute engine. Otherwise, whenever there’s a new compute engine that you want to make use of, maybe to accelerate some part of your data processing workload, you’ve got to build a new interface to that engine. By reducing the surface area of the problem to just thinking about the world in terms of this Substrait intermediate representation, it becomes so much easier for us as API developers to build the user experience, because we have just one intermediate representation to generate. And then on the back end, the engines can decide how to most efficiently execute the Substrait plan.
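One hedged, concrete way to see the producer/consumer split Wes describes: DuckDB ships an optional Substrait extension that can both emit and execute serialized plans. The sketch below is illustrative only; the table and query are invented, and extension installation details vary by DuckDB version:

```python
# A minimal sketch of producing and consuming a Substrait plan with DuckDB.
# Assumes the optional "substrait" extension is available for your DuckDB build.
import duckdb

con = duckdb.connect()
con.install_extension("substrait")  # may require the community extension repo
con.load_extension("substrait")
con.execute("CREATE TABLE t AS SELECT * FROM range(10) r(x)")

# Compile SQL into a serialized Substrait plan (protobuf bytes)...
plan = con.get_substrait("SELECT x, x * 2 AS doubled FROM t WHERE x > 3").fetchone()[0]

# ...and hand that same plan to any engine that consumes Substrait.
# Here we feed it back to DuckDB itself as the consumer.
print(con.from_substrait(plan).fetchall())
```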
Tobias Macey: As far as the overall scope of the Arrow project itself, what is actually contained within the code repository, and the growth of the broader ecosystem around it, they have definitely grown substantially, and I think the most recent release of Arrow right now is version 10, so definitely a lot of development happening there. And I’m wondering if you can give an overview of the current set of capabilities and the, I guess, features that are targeted with Arrow and the related projects, beyond just that in-memory columnar data representation.
Wes McKinney: The way that we describe the project these days is as a multi-language toolbox for building analytical data processing systems. One important part of the project is the Arrow columnar data format. From there, there’s a whole set of different software components which enable you to do things with the Arrow data format: that includes data serialization, inter-process communication, and remote procedure calls, so building services that need to send and receive Arrow data. There’s a framework called Flight for building Arrow-native data services.
We’ve started building some database middlewares for integrating Arrow into database systems. There’s a SQL database protocol project called Flight SQL, which provides basically a wire protocol for talking to SQL databases over gRPC using the Flight framework. Another project, called ADBC, is a standardized API for database drivers to provide Arrow-native data access. It’s kind of orthogonal to Flight SQL in that it has nothing to do with the wire protocol; it’s more about having a standardized API for inserting and selecting Arrow datasets from SQL-based systems.
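For a sense of what an Arrow-native data service looks like, here is a minimal, hedged pyarrow Flight sketch; the ticket name, port, and table contents are invented for the example:

```python
# A minimal sketch of an Arrow Flight service with pyarrow.
# The ticket name "example", the port, and the table are hypothetical.
import pyarrow as pa
import pyarrow.flight as flight

class TinyFlightServer(flight.FlightServerBase):
    def __init__(self, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        self._tables = {b"example": pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})}

    def do_get(self, context, ticket):
        # Stream the requested table back as Arrow record batches over gRPC.
        return flight.RecordBatchStream(self._tables[ticket.ticket])

# In one process: TinyFlightServer().serve()
# In another, fetch the table with no row-by-row conversion:
client = flight.FlightClient("grpc://localhost:8815")
table = client.do_get(flight.Ticket(b"example")).read_all()
```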
We’ve also got compute engines; there are multiple compute engine projects. There’s Acero, the C++ compute engine, and there’s DataFusion, the Rust compute engine, so there are embeddable compute engines in multiple programming languages designed for different use cases. It’s an increasingly diverse and federated community of subprojects, not just the core Arrow 10.0. If you go to apache/arrow on GitHub, you’ll see a very large polyglot Git repository.
But we’ve grown several other repositories that house a number of the Rust subprojects, and the Julia Arrow project lives in its own repository nowadays as well. We have support for around a dozen programming languages, and within each one, a stack of libraries which make it possible for you to build systems that use the Arrow format or connect to other systems that use Arrow. Of course, those libraries are at different levels of maturity: the Rust and C++ libraries are generally the most featureful and mature, but we’re growing an increasing amount of support in Go and Java. Initially, the project was just C++ and Java, but it’s expanded significantly since then. It can be a little bit difficult for a newcomer to navigate, but maybe around a thousand different people have contributed to the project over the last 7 years. As a developer community, we’ve put in a lot of effort to make the project accessible to new contributors, through developer documentation and efforts to engage and grow the open source community around it. So it’s not just a small, insular group of developers building all of these things; we’re actively trying to make the developer community larger to share the burden of maintaining all of these different software components, which need bug fixes and security fixes and periodic releases.
Tobias Macey: In your work of building this business and helping to create this project and grow it, what are some of the most interesting or unexpected or challenging lessons that you’ve learned in the process?
Wes McKinney: I would say that building large open source projects that come to be depended on by a large fraction of the ecosystem is always very difficult. The early bootstrapping of the Arrow project was definitely not easy; it required a lot of personal and professional sacrifices on my part. I was lucky to have some passionate true believers supporting the work: for example, folks over at Two Sigma, who employed me and were also supporters of Ursa Labs for a couple of years before we became Voltron Data, and folks at RStudio, which is now Posit, who saw the potential for building a more polyglot data science stack.
And so I think the lessons learned are that relationships really matter. It’s not just about writing code and pushing code to GitHub. The social dimension of building these types of projects is the most difficult, but also the most important if your goal is to build something that has large scope and that you want to be sustainable over a long period of time. We still have social and sustainability challenges in the Arrow project. At Voltron Data, we’re taking on a large amount of the maintenance and systems burden of supporting the Arrow project, which has been great in the sense that we’re pumping out releases and improving the CI/CD infrastructure.
Our testing and continuous delivery for the project is better than ever. But then other people in the open source project would be justifiably concerned: can we count on Voltron Data to be around and providing this level of support forever? There’s a justifiable suspicion about companies, especially large companies, being involved in open source projects. But our goal is to always do right by the community; I’ve always been very community-minded in thinking about building these projects. So it’s been interesting and challenging and stressful at times, but it’s also very rewarding. Ultimately, we see the Arrow project as being an agent of change and progress in the open source ecosystem, and we’re excited to keep rolling the ball forward, supporting growth of the ecosystem, and making sure that developers and users can be successful building on this new computing stack.
Tobias Macey: Given the fact that Arrow isn’t even necessarily an end-user-selected utility, this question might be nonsensical. But what are the cases where Arrow or its related projects might be the wrong choice?
Wes McKinney: A question that we answer a lot is whether Arrow is a storage format or a format for data warehousing. Arrow is not designed to be a competitor with or a replacement for Parquet, for example. Sometimes people do come to the project thinking, oh, I’ve heard about Arrow, it’s a data format, right? So can I use it to build my data warehouse or build my data lake? So, occasionally, there’s some confusion around the purpose of the project. But we’ve been improving our developer content to help folks understand that we’re building a companion technology to storage: storage systems, file formats like Parquet, and large-scale dataset and metadata management systems like Iceberg.
I think that’s becoming more clear to users. And, certainly, there are people doing data engineering or machine learning that primarily deals with text or unstructured data, and in some of those instances, Arrow may not provide a lot of value, depending on the nature of the work. But fortunately, a lot of the data that’s processed in the world is fundamentally tabular, or at least representable in a tabular format. Most data generated by modern web applications and mobile applications can be represented and processed in a tabular format.
And so even though we don’t strive to be all things to all people, there’s a large fraction of data analytics and data engineering where Arrow is a relevant technology that can make things faster, simpler, and more efficient.
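The “companion to storage” relationship described above is easy to show in a few lines; this is a hedged Python sketch with an invented file name and columns:

```python
# A minimal sketch of Arrow as a companion to a storage format like Parquet.
# The file name and columns are hypothetical examples.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "clicks": [10, 4, 7]})
pq.write_table(table, "events.parquet")       # Parquet covers data at rest...
in_memory = pq.read_table("events.parquet")   # ...Arrow covers it in memory.
```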
Tobias Macey: As you continue to build and iterate on the Arrow project and invest in that ecosystem and help to grow the degree of integrations that are available, what are some of the things you have planned for the near to medium term or any projects you’re excited to dig into?
Wes McKinney: Right now, well, we talked a lot about Substrait, and I’m very pumped about that. Another project that I’m really excited about is an effort in Arrow called nanoarrow, which is building a small implementation of the Arrow format and protocol for embedded use. So if you have a system like a database or a microservice, really anything where you want to add the ability to send and receive Arrow data but you don’t want to take on new library dependencies, this is a project that can be dropped in and copied into a project, in principle in any programming language, as long as you have the ability to call C code. We think that will help expand the adoption of Arrow into places where it has not reached yet, so we’re pretty excited about that project. I’m also really excited about ADBC, the standardized database API, to be used alongside existing JDBC and ODBC interfaces for talking to databases.
I’ve always had the desire to make it easier to talk to databases, and for applications and users to not have to write so much custom code just to get data in and out of SQL databases. I think the ADBC effort gives us a path to making that a reality, so that we can just think about tables and data frames and not so much about how to translate between a given database’s wire protocol and my data frame data structure. Because, god knows, I’ve written, and folks in all of these ecosystems have written, a ton of code just dealing with converting between data formats. And so I’m looking forward to a day when we won’t have to think about that. We’ll have written some of our last data connectors, and we can just think about Arrow, and that will make our lives a lot easier.
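To ground the ADBC idea, here is a hedged sketch using the SQLite driver package; driver names and connection details vary per database, and this is illustrative rather than canonical:

```python
# A minimal sketch of ADBC: a DBAPI-style interface whose results
# arrive as Arrow tables. Assumes the adbc-driver-sqlite package.
import adbc_driver_sqlite.dbapi

# connect() with no arguments opens an in-memory SQLite database.
with adbc_driver_sqlite.dbapi.connect() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT 1 AS answer, 'arrow' AS fmt")
        table = cur.fetch_arrow_table()  # a pyarrow.Table, no row-wise conversion
print(table)
```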
Tobias Macey: Well, thank you very much for taking the time today to join me and share the work that you’re doing at Voltron Data and your experience working on the Arrow project, and for helping to illuminate some of the ways that it is being used, the surrounding projects, and its growth in the ecosystem. I appreciate all the time and energy that you and the other members of the Voltron Data and Arrow teams are putting in. So thank you again for your efforts, and I hope you enjoy the rest of your day.
Wes McKinney: Thanks. Thanks for having me.