Aligning Velox and Apache Arrow: Towards composable data management

neub9
By neub9
4 Min Read

“`html

We’ve partnered with Voltron Data and the Arrow community to align and converge Apache Arrow with Velox, Meta’s open source execution engine. Apache Arrow 15 includes three new format layouts developed through this partnership: StringView, ListView, and Run-End-Encoding (REE). This new convergence helps Meta and the larger community build data management systems that are unified, more efficient, and composable. Meta’s Data Infrastructure teams have been rethinking how data management systems are designed. We want to make our data management systems more composable – meaning that instead of individually developing systems as monoliths we identify common components, factor them out as reusable libraries, and leverage common APIs and standards to increase the interoperability between them.

As we decompose our large, monolithic systems into a more modular stack of reusable components, open standards, such as Apache Arrow, play an important role for interoperability of these components. To further our efforts in creating a more unified data landscape for our systems as well as those in the larger community, we’ve partnered with Voltron Data and the Arrow community to converge Apache Arrow’s open source columnar layouts with Velox, Meta’s open source execution engine. The result combines the efficiency and agility offered by Velox with the widely-used Apache standard

Meta’s data engines support large-scale workloads that include processing large datasets offline (ETL), interactive dashboard generation, ad hoc data exploration, and stream processing. More recently, a variety of feature engineering, data preprocessing, and training systems were built to support our rapidly expanding AI/ML infrastructure. To ensure our engineering teams can efficiently maintain and enhance these engines as our products evolve, Meta has started a series of projects aimed at increasing our engineering efficiency by minimizing the duplication of work, improving the experience of internal data users through more consistent semantics across these engines, and, ultimately, accelerating the pace of innovation in data management

Velox is the first project in our composable data management system program. It’s a unified execution engine, implemented as a C++ library, aimed at replacing the very processing core of many of these data management systems – their execution engine. Velox improves the efficiency of these systems by providing a unified, state-of-the-art implementation of features and optimizations that were previously only available in individual engines. It also improves the engineering efficiency of our organization since these features can now be written once, in a single library, and be (re-)used everywhere. Velox is currently in different stages of integration in more than 10 of Meta’s data systems. We have observed 3-10x efficiency improvements in integrations with well-known systems in the industry like Apache Spark and Presto. We open-sourced Velox in 2022. Today, it is developed in collaboration with more than 200 individual contributors around the world from more than 20 companies

In order to enable interoperability with other components, a composable data management system has to understand common storage (file) formats, network serialization protocols, table APIs, and have a unified way of expressing computation. Oftentimes these components have to directly share in-memory datasets with each other, for example, when transferring data across language boundaries (C++ to Java or Python) for efficient UDF support. Our focus is to use open standards in these APIs as often as possible. Apache Arrow is an open source in-memory layout standard for columnar data that has been widely adopted in the industry

“`

Share This Article
Leave a comment

Leave a Reply

Your email address will not be published. Required fields are marked *