Apache Arrow and Buzz

Have you heard about Apache Arrow? Yes, the cross-language development platform for in-memory data. Hmm...

Written on May 1, 2020

First, a few words about Arrow

Apache Arrow [1] is a very interesting initiative to standardize the in-memory data layer of data processing engines.

(Diagrams: without Apache Arrow, each system has its own in-memory format and data is constantly copied and converted between them; with Apache Arrow, all systems share the same format and those copies and conversions disappear.)

It has multiple interesting benefits:

- Promote an efficient data layout in memory. Having data organized in well-aligned columns allows compilers that support SIMD to find optimizations that can increase processing speed by a factor of up to 4 on older processors (SSE instructions) and up to 16 on the latest CPUs (AVX-512). If you are interested in a gentle introduction to SIMD programming, take a look at [2]. A tiny illustration of this kind of loop follows this list.

- Pool the development effort on common file-format and infrastructure readers. For instance, efficiently reading Parquet files from AWS S3 is a common use case that can be heavily optimized. It would be a shame for each engine to keep investing in its own reader! A sketch of such a shared reader also follows this list.
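To make the first point concrete, here is a minimal Rust sketch (an assumed setup, not Buzz code) of the kind of kernel that benefits: summing a column of integers. Because integer addition is associative and the values sit in one contiguous, well-aligned slice, an optimizing compiler can auto-vectorize the loop with SSE or AVX instructions.

```rust
/// Sum a column of 64-bit integers. Because integer addition is
/// associative and the data is a contiguous, well-aligned slice, an
/// optimizing compiler (e.g. `rustc -O`) can auto-vectorize this loop,
/// processing several values per instruction with SSE/AVX.
fn sum_column(values: &[i64]) -> i64 {
    values.iter().sum()
}

fn main() {
    // A toy column: a real engine would hand Arrow buffers to such kernels.
    let column: Vec<i64> = (0..1_000_000).collect();
    println!("sum = {}", sum_column(&column));
}
```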
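For the second point, here is a minimal sketch of what a shared reader buys you, using the Arrow integration of the Rust parquet crate (API names as found in recent versions of the crate; the file path is made up). Any engine building on Arrow gets this reader, and all the optimization work behind it, for free.

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Made-up local path; the same reader can be pointed at object stores.
    let file = File::open("data.parquet")?;

    // Decode Parquet directly into Arrow record batches.
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_batch_size(8192)
        .build()?;

    for batch in reader {
        let batch = batch?;
        println!("read a batch of {} rows", batch.num_rows());
    }
    Ok(())
}
```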

The Buzz engine

The Buzz engine is the central project of cloudfuse. The goal is to query massive amounts of data in cold storage with minimal latency, thus decoupling storage and compute resources for good. This is a true revolution, as compute and storage are the two fundamental building blocks of big data, yet they have very different usage cycles.

Most current engines were initially conceived to query on-premises data on local storage systems such as HDFS or HBase. Many of them were later adapted and optimized to also work on cloud storage. For Hadoop technologies (compatible with AWS S3 since Hadoop 2.6 and with Azure Blob Storage since 2.7), this transformation required little development effort because the APIs of cloud storage services are quite similar to that of HDFS.

But sadly, the limited performance of cloud storage greatly degrades the interactivity of those systems. Huge amounts of data need to be transferred from raw storage to the compute resources, and cloud storage is typically accessed through the network with very limited bandwidth and high latency. This can be overcome by scaling the compute layer massively to increase aggregate bandwidth.

But another problem arises with traditional analytics systems if you try to scale them aggressively. Because they were meant to run on the same machines as the storage, they shared its lifecycle and were conceived as long-running processes, so boot time was never really taken into account in their design. Major cloud providers circumvent this problem with managed services that give on-the-fly access to already running clusters of analytical engines, for instance Amazon Athena. Sadly, those typically work as black boxes, preventing users from tuning internal caching and scaling parameters to their use case, which in turn also increases latency and hurts interactivity.

Recently, cloud providers have put a lot of effort into improving virtualization technologies; Amazon's Firecracker is a good example [3]. This allows cloud users to provision compute resources much faster. In particular, cloud functions let users spin up thousands of cores and terabytes of RAM in a matter of hundreds of milliseconds. In this new paradigm, the ability to provision very short-lived compute resources (AWS Lambda and Google Cloud Functions, for instance, are billed in 100ms increments) makes it possible to compensate for the limited performance of cloud storage by massively parallelizing the data collection process. Instead of a few dozen machines adding up their limited network access to the cloud storage, we could now potentially provision thousands of lightweight virtual machines distributed all over the cloud provider's datacenter that scan the data in parallel.
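As a rough sketch of that fan-out idea, the snippet below splits an object into byte ranges and scans them in parallel. The fetch_range function is hypothetical: in Buzz's setting each range would be served by an S3 ranged GET issued from its own short-lived cloud function, and plain threads stand in for those functions here.

```rust
use std::thread;

/// Hypothetical fetch: a real implementation would issue an HTTP
/// `Range: bytes=start-end` request against the object store, from
/// inside a dedicated short-lived cloud function.
fn fetch_range(object: &str, start: u64, end: u64) -> Vec<u8> {
    println!("scanning bytes [{}, {}) of {}", start, end, object);
    vec![0; (end - start) as usize] // placeholder payload
}

fn main() {
    let object = "s3://bucket/huge-file.parquet"; // made-up object name
    let size: u64 = 1 << 30; // assume a 1 GiB object
    let workers: u64 = 16; // stand-in for thousands of cloud functions
    let chunk = size / workers;

    // Fan out: each worker scans its own byte range in parallel,
    // multiplying the aggregate bandwidth to the object store.
    let handles: Vec<_> = (0..workers)
        .map(|i| {
            let object = object.to_string();
            thread::spawn(move || fetch_range(&object, i * chunk, (i + 1) * chunk))
        })
        .collect();

    let total: usize = handles.into_iter().map(|h| h.join().unwrap().len()).sum();
    println!("scanned {} bytes in total", total);
}
```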

Mix and match

The shift to this new paradigm, where ephemeral cloud functions serve as the runtime, breaks the major design assumptions of legacy analytics engines. This prevents us from simply forking an existing engine to make it run on cloud functions. But we can't rewrite everything from scratch either, so it is crucial to find a high-quality code base that can serve as a foundation. Arrow is the perfect candidate:

- It is meant to be a library, so it is architected to be re-used in various contexts (a small example follows this list).

- It is implemented in low-level languages such as C++ and Rust, which is a perfect fit for the resource constraints of cloud functions.

- It is a very active project, which means it can only get better!
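To give a flavor of what re-using Arrow as a library looks like, here is a minimal sketch with the Rust arrow crate that assembles a record batch in Arrow's standardized in-memory format (the schema and values are made up for illustration).

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    // A made-up two-column schema: the names are illustrative only.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("city", DataType::Utf8, false),
    ]));

    // Columnar data laid out in Arrow's shared in-memory format.
    let ids = Int32Array::from(vec![1, 2, 3]);
    let cities = StringArray::from(vec!["Paris", "Lyon", "Nice"]);
    let columns: Vec<ArrayRef> = vec![Arc::new(ids), Arc::new(cities)];

    let batch = RecordBatch::try_new(schema, columns)?;
    println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    Ok(())
}
```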

We are really eager to see how Apache Arrow evolves in the months to come and hope to start contributing actively as soon as possible!

[1] Apache Arrow Website

[2] SIMD tutorial

[3] Firecracker announcement