Our Guinea Pig: Apache Spark
Running a large beast like Spark in a cloud function, is that possible?
Written on September 22, 2022
In our blog post about serverless data we introduced how cloud functions can enable you to interactively scale data processing tools. We presented it as an optimal middle ground between operating a long running cluster of machines and giving up the control of your data to a managed data processing service. This new post is about our first experiment with Apache Spark.
We use Lambda, the cloud function offering of AWS, to get started with our experiments. As an early stage startup, we need to exercise a certain level of focus to be able to create our minimum lovable product as quickly as possible. This means that exploring multiple cloud providers simultaneously is out of the picture. If we need to choose a cloud provider, which one should it be? We are also an open source startup, which means that we build our community as we are developing the very first iterations of our project. So we made the assumption that the bigger the potential user base, the easier it will be to interest community member. As the biggest player by far , AWS steps in as a clear winner. But market size is not our unique argument. AWS has been pushing cloud functions particularly hard the past few years, constantly introducing innovations such as Firecracker , compatibility with containerized images, increased function sizes (up to 10GB)... Even though query engines could be deployed on cloud functions that don't have these features, they definitively make it easier and more performant. Our bet is that extending our work to other cloud providers will be a "simple" retro-fitting work as we grow.
The ecosystem of data processing engines is very dense. We decided to start by listing the query engines that could natively query data from the AWS object store, S3. We narrowed down the list to a small dozen of engines. Apache Spark is probably one of the most heavily used. If we take the number of Github stars, a popular KPI for open source projects, Spark ranks first by far with nearly 35k stars. It is also JVM based, which makes it a pretty bad candidate for ad hoc initialization. But we figured: if we can make Spark work, more or less all engines should work!
As some code is worth many discourses, you should definitively check out our integration repository . We will discuss how this tooling works more in details in a later blog post. Let us walk you through the key takeaways for this time:
We use Lambda Container images as it makes it simpler to package Spark and its dependency. This enables us to simply re-use the official Spark image  as base, a self contained way to run Spark
Unfortunately, the official Spark image doesn't bundle the necessary dependency to query AWS S3. After some trial an error we managed to gather the minimal list of jar dependencies to include
AWS Lambda imposes some limitations on the running container. For instance, the file system is read only, except for the /tmp directory. Another related example is process substitution being disabled (impossible to write to /dev/fd/63).
For now we run Spark in a standalone single node mode. Even in this setup we meet some restrictions on the networking capabilities. The example here is the bind address (spark.driver.bindAddress) that needs to be explicitly set to localhost.
Even though we tried to limit the amount of dependencies that we added to the base Spark image, the final Lambda container image ended up being close to 650MB. With such an important size, we would have expected the provisioning and the download of the image to take a few seconds. We were pleasantly surprised to see that lambda is actually able to get your code running within 2 seconds, usually even closer to 1 second.
To raise the stakes a little bit, we decided that we wanted to measure the performances on a real world use case. For that, we use the NYC Taxi dataset stored on AWS S3. We run a GROUP BY query with an aggregation on one month of data stored in a 120MB Parquet file. This led us to our second noteworthy result: the "cold" vs "warm" duration. With cloud functions, we usually qualify as cold starts executions where the resource that runs the function wasn't provisioned and initialized before its invocation.
We cannot say that 30 seconds is fast for querying a medium sized Parquet file. But when you think about the machinery that is actually started behind the scenes by Spark, this duration becomes almost positively surprising. We are convinced that this duration could be cut at least in half by activating the right tuning nobs. For instance, we could consider taking a snapshot of the Spark context when building the image to load it from disk instead of re-computing it entirely.
It is worth noting that the results, especially for the cold start, are pretty consistent. Over 30 runs spread over almost 24 hours, the gap between the smallest and largest runtime is less than 20%. We can argue this variation is below what a user would usually notice.
Finally the warm start duration is around 5 seconds. We ran this second query on a different but similar Parquet file from the first one, to avoid that caching mechanisms of the engine provide results excessively easily. This order of magnitude is withing the expectations we might have from an interactive query engine, meaning that the user can usually wait as much without being frustrated.