Custom allocators for Apache Arrow

Analytics engines are all about moving huge amounts of data from one place to another. Let's study the benefits of using a custom allocator in this context.

Written on June 6, 2020

In previous posts, we talked about the benefits of Apache Arrow and the challenges of memory allocation in a cloud function environment. We are now going to look at application-side mechanisms we can implement to adapt to these constraints.

Do we need an allocator?

When loading data into memory, Apache Arrow can usually compute upfront how much memory it will need to store a new column. This is particularly easy with fixed-size types such as integers or floats, where it can simply preallocate the right amount of contiguous space in virtual memory. But sometimes Apache Arrow cannot predict the size of the output. This happens for instance when materializing an array of strings, in particular when the original data is dictionary encoded (e.g. Parquet). In that case, it uses realloc(3) to grow the existing allocation. There are two possible outcomes when that happens (the sketch after the list illustrates both):

  • The allocator can simply extend the allocated range, because the contiguous addresses that follow it in virtual memory are still available.

  • The allocator needs to allocate a block of the new size at another location in virtual memory, copy the existing bytes over from the old section (or remap them), and free the old allocation.
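
To make these two outcomes concrete, here is a minimal standalone sketch (plain C library calls, nothing Arrow-specific) that doubles a buffer repeatedly and reports whether realloc(3) extended it in place or had to move it:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
  size_t size = 1 << 20;  // start with 1 MiB
  char* buf = static_cast<char*>(std::malloc(size));
  if (buf == nullptr) return 1;
  std::memset(buf, 0xAB, size);  // touch the pages so a move has real bytes to copy

  for (int i = 0; i < 8; ++i) {
    size_t new_size = size * 2;
    char* grown = static_cast<char*>(std::realloc(buf, new_size));
    if (grown == nullptr) { std::free(buf); return 1; }
    std::printf("%6zu KiB -> %7zu KiB: %s\n", size >> 10, new_size >> 10,
                grown == buf ? "extended in place" : "moved (copy happened)");
    buf = grown;
    size = new_size;
  }
  std::free(buf);
  return 0;
}
```

Which outcome you get depends entirely on what the allocator and the kernel happen to have placed after the buffer, which is precisely why the copies are hard to predict from application code.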

Of course, when a memory copy kicks in, the performance impact can be significant. We mentioned earlier that a typical speed for mapping to physical memory in AWS Lambda is around 1GB/s: at that rate, moving a buffer that has grown to a few gigabytes costs whole seconds, and those are precious, highly priced seconds being wasted.

What is an allocator exactly?

An allocator is a memory management library that translates application memory needs (malloc/free in C, new/delete in C++) into system calls. The usual general-purpose allocators, such as the system allocator or Jemalloc, rely on a number of complex mechanisms to avoid memory fragmentation and unnecessary locking in multi-threaded contexts [1]. These allocators make compromises so that the most common use cases perform well. In many domains, especially performance-sensitive ones, it is common to write custom allocators for specific workloads. For instance, real-time applications often create a memory pool at startup to avoid performing expensive OS calls at runtime (a toy version of this pattern is sketched below). Indeed, knowing the memory usage pattern of your application, you can make better assumptions than a general-purpose allocator.
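
As an illustration, here is a toy version of that startup-pool pattern in C++ (the BumpPool name and the fixed 8-byte default alignment are our own choices for the sketch):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy bump allocator over a pool grabbed once at startup, the pattern
// real-time systems use to avoid OS calls on the hot path. Individual
// frees are a no-op; everything is released when the pool is destroyed.
class BumpPool {
 public:
  explicit BumpPool(size_t capacity) : pool_(capacity), offset_(0) {}

  void* Allocate(size_t size, size_t alignment = 8) {
    // round the bump pointer up to the requested (power-of-two) alignment
    size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
    if (aligned + size > pool_.size()) return nullptr;  // pool exhausted
    offset_ = aligned + size;
    return pool_.data() + aligned;
  }

 private:
  std::vector<uint8_t> pool_;  // acquired once, up front
  size_t offset_;              // current bump position
};
```

A real pool would of course need a deallocation strategy, but even this toy shows the trade: by assuming a known usage pattern, allocation collapses to a pointer bump.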

Because memory is the cornerstone of Apache Arrow, it offers a choice between several general-purpose allocators: the default system allocator, but also Jemalloc or Mimalloc. On top of that, all its allocations go through a MemoryPool interface. This class is easy to extend because it revolves around only three methods: Allocate, Reallocate and Free.
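
To give an idea of the surface involved, here is a sketch of a custom pool that simply wraps another pool and counts reallocations. The method signatures follow the Arrow C++ API at the time of writing; they vary slightly across releases, so check the MemoryPool header of your version before copying this:

```cpp
#include <arrow/memory_pool.h>
#include <arrow/status.h>

#include <atomic>
#include <cstdint>
#include <string>

// A MemoryPool that delegates to a wrapped pool (e.g. the default
// system pool) while counting how many reallocations actually happen.
class CountingPool : public arrow::MemoryPool {
 public:
  explicit CountingPool(arrow::MemoryPool* wrapped) : wrapped_(wrapped) {}

  arrow::Status Allocate(int64_t size, uint8_t** out) override {
    return wrapped_->Allocate(size, out);
  }

  arrow::Status Reallocate(int64_t old_size, int64_t new_size,
                           uint8_t** ptr) override {
    reallocations_.fetch_add(1, std::memory_order_relaxed);
    return wrapped_->Reallocate(old_size, new_size, ptr);
  }

  void Free(uint8_t* buffer, int64_t size) override {
    wrapped_->Free(buffer, size);
  }

  int64_t bytes_allocated() const override {
    return wrapped_->bytes_allocated();
  }
  std::string backend_name() const override { return "counting"; }

  int64_t reallocations() const { return reallocations_.load(); }

 private:
  arrow::MemoryPool* wrapped_;
  std::atomic<int64_t> reallocations_{0};
};
```

An instance wrapping arrow::default_memory_pool() can then be passed to any Arrow builder or reader that accepts a MemoryPool*, which makes it easy to measure how often the reallocation path from the previous section is actually hit.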

Specifications of our custom allocator

Because reallocations are the norm for us, we need an allocator that can handle them with a minimal number of copies. One good way to do that is to detect allocations that are likely to grow and mmap(2) a huge chunk of virtual memory for them right away (e.g. the total memory size of the cloud function). This costs nothing extra, because physical memory is only paged in when the memory is actually touched. This way, we guarantee that the column will have room to grow in virtual memory, as the sketch below shows.
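
A minimal sketch of that trick, assuming Linux and an illustrative 3GB reservation (the ReserveRegion helper is our own naming):

```cpp
#include <sys/mman.h>

#include <cstddef>
#include <cstdint>

// Reserve a large virtual range up front. No physical memory is
// committed until pages are actually touched, so over-reserving is
// cheap; MAP_NORESERVE also asks the kernel not to account swap.
uint8_t* ReserveRegion(size_t capacity) {
  void* addr = mmap(nullptr, capacity, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  return addr == MAP_FAILED ? nullptr : static_cast<uint8_t*>(addr);
}

int main() {
  const size_t kCapacity = 3ULL << 30;  // e.g. the whole function's 3 GiB
  uint8_t* base = ReserveRegion(kCapacity);
  if (base == nullptr) return 1;
  // "Growing" the buffer now just means handing out more of the range;
  // the data never moves, so no copy is ever needed.
  base[0] = 1;              // faults in the first page only
  base[kCapacity - 1] = 1;  // ... and the last page
  munmap(base, kCapacity);
  return 0;
}
```

Within the reserved range, growing an allocation becomes pure bookkeeping for the allocator.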

Another feature we want from our allocator is the ability to create mappings to physical memory proactively. This can be done by preallocating memory and touching it artificially with a call such as memset(3) (at cloudfuse we call this process "memory painting" ;). For instance, during the first 100ms of its life, the function usually needs to connect to external services to get its task or fetch metadata such as the Parquet footer. During this time it has basically no need for memory, so we can use a background thread to paint as much memory as we can!
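
Here is what that could look like as a sketch (the StartPainting name and the 64MB stride are illustrative choices of ours):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <thread>

// "Memory painting": touch a preallocated range from a background
// thread so the page faults are paid while the main thread is busy
// talking to external services, not while it is scanning data.
std::thread StartPainting(uint8_t* base, size_t size) {
  return std::thread([base, size] {
    const size_t kChunk = 64 << 20;  // paint in 64 MiB strides
    for (size_t off = 0; off < size; off += kChunk) {
      size_t len = std::min(kChunk, size - off);
      std::memset(base + off, 0, len);  // memset forces each page in
    }
  });
}
```

The main thread kicks this off right after startup and joins it before the memory-hungry part of the query begins, so the painted pages are ready exactly when the scan needs them.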

Wrap up

Time inside our cloud function is very precious, because it is the component that we scale massively. So beyond controlling latency, it is the cost of the query engine that we reduce by cutting out lost milliseconds. Here we saw that better memory management alone can squeeze out quite a lot of performance.