SIMD support on Lambda

SIMD has a drastic impact on performance, but not every instruction set is supported by all hardware.

Written on May 20, 2020 

A few weeks ago, we talked about Arrow and how it optimizes its memory layout to benefit from SIMD. We mentioned that not all hardware supports all instruction sets:

Each new version supports more datatypes and/or processes more scalars in parallel (128-bit registers for SSE, 256-bit for AVX2, and surprise... 512-bit for AVX512!). Of course, it takes years for processors with the new instruction sets to arrive at your hardware store and to become generally available in the cloud data-centers. For instance, AWS first communicated in 2017 the availability of AVX512 only on its compute optimized instances [2].
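To make the register widths above concrete, here is a small back-of-the-envelope sketch of how many scalars fit in one register. The helper name `lanes` is ours, for illustration only, and the numbers ignore everything but register width:

```python
# Register widths in bits, as quoted above.
REGISTER_BITS = {"SSE": 128, "AVX2": 256, "AVX512": 512}


def lanes(register_bits, scalar_bits):
    """Number of scalars of the given width that fit in one SIMD register."""
    return register_bits // scalar_bits


for name, bits in REGISTER_BITS.items():
    # 32-bit floats and 64-bit floats are the most common cases.
    print(f"{name}: {lanes(bits, 32)} x float32, {lanes(bits, 64)} x float64")
```

Each doubling of the register width doubles the number of values a single instruction can process, which is why the newer instruction sets are so attractive.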

On AWS Lambda, you typically have no guarantee about the type of hardware your function will run on. For this reason, we want to avoid hard dependencies on specific instruction sets. Currently, Arrow does not make it possible to choose the instruction set at runtime, and by default it builds with SSE4.2. For now, we have decided to keep this default, as it has never caused problems when running on Lambda. This instruction set is more than 10 years old and probably one of the most widespread, so it is an acceptable risk for now.

We regularly check the availability of the different instruction sets within the lambdas we run. You can do so very easily by running a simple command from Python:

import json
import subprocess


def lambda_handler(event, context):
    # Dump the CPU description to the function logs.
    cmdline = ["cat", "/proc/cpuinfo"]
    print("Run CMD: ", cmdline)
    subprocess.check_call(cmdline, shell=False, stderr=subprocess.STDOUT)

The "flags" section contains the supported instruction sets:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd 

Indeed, AVX2 and AVX512 are not listed (only the original AVX is). Limiting ourselves to SSE4.2 is a bit restrictive, but it is sufficient for now. The good news is that runtime SIMD dispatching is a hot topic in the Arrow community, which makes it likely that we will soon be able to adapt to the instruction sets available during each individual lambda run!
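Rather than reading the flags by eye, the check can be automated. The following is a minimal sketch, assuming an x86 Linux environment where /proc/cpuinfo exposes a "flags" line. The function names and the `SIMD_LEVELS` table are ours, for illustration only, and are not part of Arrow or of the Lambda runtime:

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# Widest instruction set first, with the flag name Linux reports for it.
SIMD_LEVELS = [
    ("AVX512", "avx512f"),
    ("AVX2", "avx2"),
    ("AVX", "avx"),
    ("SSE4.2", "sse4_2"),
]


def best_simd_level(flags):
    """Return the name of the widest listed instruction set, or None."""
    for name, flag in SIMD_LEVELS:
        if flag in flags:
            return name
    return None


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read and parse the flags of the machine we are running on."""
    with open(path) as f:
        return parse_cpu_flags(f.read())
```

Calling `best_simd_level(read_cpu_flags())` inside the handler and logging the result gives a cheap way to track which instruction sets actually show up across invocations.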


[1] ILLIAC IV first massively parallel computer

[2] AVX512 on EC2 C5 instances