How we designed and open sourced a framework to test our hypotheses about cloud scalability.
Written on October 27, 2022
As discussed in our earlier blog post, our mission is to enable users to run open source query engines in a serverless fashion, but in their own cloud accounts. To achieve this, we use the most advanced provisioning capabilities offered by cloud providers. But for the solution to compete with managed data services, we need to ensure that our setup can offer the low latency expected by many interactive data analytics tasks. This requires validating many hypotheses about cloud resources, both in terms of raw performance and consistency. For instance:
Test that the hardware provisioned is consistent and performant enough to start query engines within seconds
Verify that we can provision hundreds of servers within seconds, at any time of the day
Ensure that the network performance to the cloud storage is consistent and scalable
This is why we are working on an open source framework that automates the deployment of multiple cloud-based benchmarks and schedules their execution throughout the day.
We chose Terraform and Terragrunt to manage our infrastructure.
CDK, the toolkit created by AWS for expressing infrastructure stacks in code, is very capable. But we need to conduct our tests across multiple cloud providers, and this is really where Terraform shines.
Terraform by itself does not handle well stacks that need to be broken down into pieces with very different lifecycles. Terragrunt is a thin wrapper that helps keep the code DRY in this case.
We have great hopes for the CDK adaptation for Terraform (CDKTF), but we judged that it still lacks a bit of maturity; we hope to migrate to it soon.
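To illustrate how Terragrunt keeps a split-up stack DRY, here is a minimal sketch (the file layout, bucket name, and region are hypothetical, not our actual configuration): the backend settings are written once in a root file, and each independently deployable stack simply includes it.

```hcl
# terragrunt.hcl at the repository root (hypothetical layout):
# shared remote state configuration, written once.
remote_state {
  backend = "s3"
  config = {
    bucket = "my-benchmark-tf-state"  # placeholder bucket name
    key    = "${path_relative_to_include()}/terraform.tfstate"
    region = "us-east-1"
  }
}

# benchmarks/lambda/terragrunt.hcl (a stack with its own lifecycle):
# inherit the root configuration instead of repeating it.
include "root" {
  path = find_in_parent_folders()
}
```

Each stack can then be planned and applied independently with `terragrunt plan` / `terragrunt apply` from its own directory, while the state configuration lives in a single place.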
Just as our code is open, we want to share the best possible representation of our experimental results. We believe that this is best achieved through public interactive dashboards. Here are two examples of dashboards we are maintaining:
Standalone query engine durations on AWS Lambda - execution durations
We explored many dashboarding solutions with the following criteria:
as easy as possible to operate
our experiments generate relatively few datapoints, so the solution should be fast and economical
it should be safe to expose publicly
We tried multiple solutions, such as Grafana with AWS managed backends. The only solution that met all 3 criteria was BigQuery + Data Studio (now Looker Studio):
everything is serverless
the dashboards are relatively interactive, and costs scale with data volume, so in our case they are negligible
Data Studio has the same sharing mechanism as the rest of the Google suite
The Docker shell
We want to make our benchmarks as easy to reproduce as possible. Terraform, Terragrunt, the AWS CLI, the GCP CLI... dependencies quickly add up. For this reason, we provide a Docker image with this whole environment pre-configured, so you don't have to bother with the nitty-gritty details! This will come in handy when we shift the benchmarks to run on a schedule in the cloud: we'll just reuse that same Docker image, and the cloud resources will have a faithful copy of our local environment. The same applies to GitOps: reuse the image, and you can drive the benchmark infrastructure from CI runners.
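A Dockerfile for such an image could look roughly like this. This is a sketch only: the base image, tool versions, and install paths are assumptions for illustration, not our published image.

```dockerfile
# Hypothetical sketch: pin the CLIs the benchmarks need so local runs,
# scheduled cloud runs, and CI runners all share one environment.
FROM ubuntu:22.04

RUN apt-get update && apt-get install -y curl unzip git

# Terraform and Terragrunt (versions are illustrative)
RUN curl -fsSL -o /tmp/terraform.zip \
      https://releases.hashicorp.com/terraform/1.3.3/terraform_1.3.3_linux_amd64.zip \
 && unzip /tmp/terraform.zip -d /usr/local/bin
RUN curl -fsSL -o /usr/local/bin/terragrunt \
      https://github.com/gruntwork-io/terragrunt/releases/download/v0.40.2/terragrunt_linux_amd64 \
 && chmod +x /usr/local/bin/terragrunt

# AWS CLI v2
RUN curl -fsSL -o /tmp/awscli.zip \
      https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip \
 && unzip /tmp/awscli.zip -d /tmp && /tmp/aws/install

# Google Cloud CLI
RUN curl -fsSL https://sdk.cloud.google.com > /tmp/gcloud.sh \
 && bash /tmp/gcloud.sh --install-dir=/opt --disable-prompts
ENV PATH="/opt/google-cloud-sdk/bin:${PATH}"

WORKDIR /workspace
```

Mounting the repository into a container built from this image, e.g. `docker run -it -v "$(pwd)":/workspace benchmark-shell` (image name hypothetical), then gives every environment the exact same toolchain.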