Processing hardware on AWS Lambda

So-called serverless is a very interesting high-level abstraction. But what kind of hardware is actually allocated when we run cloud functions?

Written on June 24, 2020

In recent posts, we talked a lot about memory and allocation strategies. The time has come to look at the particularities of the processing unit in cloud functions, and in AWS Lambda in particular.

Lambda's CPU according to the documentation

When you set up a cloud function, you typically have quite a few configurations to tweak. You can set environment variables, timeout, IAM roles, VPC networking... But when it comes to the actual performance of the function, you have only one parameter: memory. AWS has decided to simplify things a little by making all the other resources (CPU, network...) proportional to this one [1]. So according to the documentation, a 128MB function gets a tiny fraction of a CPU, 1792MB guarantees you one full vCPU, and with the maximum configuration of 3008MB you should get almost two vCPUs.
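To make the proportionality concrete, here is a minimal sketch of the arithmetic. It assumes the CPU share scales strictly linearly with memory, anchored at the 1792MB = one vCPU figure quoted above; the helper name estimated_vcpus is ours and not part of any AWS API.

```cpp
#include <cassert>

// Assumption: CPU share scales linearly with memory, and 1792 MB
// corresponds to exactly one vCPU (the figure quoted from the documentation).
constexpr double kMemoryPerVcpuMb = 1792.0;

// Hypothetical helper: estimated vCPU share for a memory setting given in MB.
double estimated_vcpus(int memory_mb) {
    assert(memory_mb > 0);
    return memory_mb / kMemoryPerVcpuMb;
}
```

Under this assumption, estimated_vcpus(128) is about 0.07 and estimated_vcpus(3008) is about 1.68.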

Looking at the hardware

Let's take a look at the CPU information exposed by the Lambda runtime:

cat /proc/cpuinfo

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) Processor @ 2.50GHz
stepping : 4
microcode : 0x1
cpu MHz : 2500.010
cache size : 33792 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust smep erms smap xsaveopt arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips : 5000.02
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) Processor @ 2.50GHz
stepping : 4
microcode : 0x1
cpu MHz : 2500.010
cache size : 33792 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust smep erms smap xsaveopt arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips : 5000.02
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

This information should be taken with care because Firecracker can be configured to expose customized information about the CPU to the guest [2]. It is also interesting to note that cpuinfo displays the same information whether you run a 128MB function or a 3008MB function.

We can nevertheless see that the advertised topology has two physical cores (processor 0 has core id 0 and processor 1 has core id 1). The actual throttling for smaller configurations is enforced at the level of the encapsulating cgroup [2]. This is interesting because it means that even with a configuration that entitles you to one vCPU or less (1792MB of memory or less), a multithreaded application can run on two different cores at the same time; the throttling in this case is time-based.
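The processor-to-core mapping read off the cpuinfo output can also be extracted programmatically. The following is a minimal sketch that assumes the usual "key : value" layout of /proc/cpuinfo on x86 Linux; the helper names are ours, and the parser only looks at the "processor" and "core id" fields.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Strip leading and trailing spaces and tabs.
static std::string trim(const std::string& s) {
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Parse a numeric field; the fields we read are expected to be non-empty.
static int parse_int_field(const std::string& value) {
    assert(!value.empty());
    return std::stoi(value);
}

// Hypothetical helper: maps each "processor" entry of a cpuinfo dump
// to the "core id" listed in the same entry.
std::map<int, int> core_ids_by_processor(const std::string& cpuinfo_text) {
    std::map<int, int> result;
    std::istringstream input(cpuinfo_text);
    int current_processor = -1;
    std::string line;
    while (std::getline(input, line)) {
        const auto colon = line.find(':');
        if (colon == std::string::npos) continue;  // blank separator lines
        const std::string key = trim(line.substr(0, colon));
        const std::string value = trim(line.substr(colon + 1));
        if (key == "processor") {
            current_processor = parse_int_field(value);
        } else if (key == "core id" && current_processor >= 0) {
            result[current_processor] = parse_int_field(value);
        }
    }
    return result;
}
```

On a live system, the text would come from reading /proc/cpuinfo (for instance with std::ifstream). Two distinct core ids in the result correspond to the two-core topology described above.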

To test whether the actual CPU topology matches what cpuinfo announces, we built a small benchmark using the core affinity function pthread_setaffinity_np and some toy computationally intensive operations (e.g. std::sin):

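// needs <chrono>, <cmath>, <cstdint>, <iostream>, <thread>, <vector> and <pthread.h>
std::vector<std::thread> threads(2);
// allocate data
auto values = new int64_t[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
    values[i] = i;
}
for (unsigned thread_nb = 0; thread_nb < 2; ++thread_nb) {
    // setup thread
    threads[thread_nb] = std::thread([values, thread_nb] {
        auto agg_start_time = std::chrono::high_resolution_clock::now();
        int64_t sum = 0;
        for (int i = 0; i < ARRAY_SIZE; i++) {
            sum += std::sin(values[i]);
        }
        auto agg_end_time = std::chrono::high_resolution_clock::now();
        // report timing; printing sum keeps the loop from being optimized away
        std::chrono::duration<double> elapsed = agg_end_time - agg_start_time;
        std::cout << "thread " << thread_nb << ": sum=" << sum
                  << " in " << elapsed.count() << "s\n";
    });
    // set thread affinity
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(thread_nb * CPU_FOR_SECOND_THREAD, &cpuset);
    int rc = pthread_setaffinity_np(threads[thread_nb].native_handle(),
                                    sizeof(cpu_set_t), &cpuset);
    if (rc != 0) {
        std::cerr << "Error calling pthread_setaffinity_np: " << rc << "\n";
    }
}
// wait for both threads to finish before releasing the data
for (auto& thread : threads) {
    thread.join();
}
delete[] values;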
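In the listing above, the affinity is applied from the parent after the thread has already started, so the first iterations may run before the pinning takes effect. One possible variant, sketched here and not taken from the original benchmark, is to let each thread pin itself as its first statement using the Linux-specific sched_setaffinity call. The helper names are ours.

```cpp
#include <cassert>
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros (Linux/glibc)

// Hypothetical helper: lowest-numbered CPU the calling thread is allowed
// to run on, or -1 if the affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) return cpu;
    }
    return -1;
}

// Hypothetical helper: restricts the calling thread to a single CPU.
// Returns true on success.
bool pin_current_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    // A pid of 0 designates the calling thread.
    return sched_setaffinity(0, sizeof(cpuset), &cpuset) == 0;
}
```

Calling pin_current_thread(thread_nb * CPU_FOR_SECOND_THREAD) at the top of the thread body would then replace the pthread_setaffinity_np call made from the parent.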

Setting CPU_FOR_SECOND_THREAD to 0 forces the two threads to run on the same CPU, while setting it to 1 makes the second thread run on processor 1. For configurations with less than 1792MB of RAM, the CPU assignment has no impact: the data is processed at approximately 100MB/s in each thread. With larger memory configurations, the execution time remains the same as with 1792MB if the two threads are pinned to the same core. We only see the expected speedup when the threads run on different CPUs. The 3008MB configuration on distinct CPUs runs the aggregation at approximately 175MB/s per thread (a 1.75x speedup, close to the 3008/1792 ≈ 1.68 ratio that linear scaling would predict), which hints that it might not have two full vCPUs.
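For reference, per-thread throughput figures like these can be derived from the measured time as follows. This is a minimal sketch with a helper name of our own, and it assumes 1 MB = 10^6 bytes, since the convention behind the numbers above is not stated.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: throughput in MB/s for element_count elements of
// element_size bytes processed in the given time.
// Assumption: 1 MB = 10^6 bytes.
double throughput_mb_per_s(std::size_t element_count, std::size_t element_size,
                           std::chrono::duration<double> elapsed) {
    assert(elapsed.count() > 0.0);
    const double bytes = static_cast<double>(element_count) * element_size;
    return bytes / 1e6 / elapsed.count();
}
```

In the benchmark, this would be called as throughput_mb_per_s(ARRAY_SIZE, sizeof(int64_t), agg_end_time - agg_start_time); the clock's duration converts implicitly to std::chrono::duration<double>.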

Wrap up

It is interesting to see how AWS structures the VMs to provide the advertised performance. For CPU allocation, the documentation is quite precise and our tests gave results consistent with the specification. It is worth pointing out that for other resources, such as network bandwidth, it can be much harder to get precise specifications. Some papers such as [3] provide interesting benchmarks, and we are also building our own at cloudfuse. Feel free to contact us if you are interested!