Every story starts somewhere. Cloudfuse's started at Pubstack.
Written on April 2, 2020
Pubstack is a young AdTech startup that helps publishers understand their monetization stack. Online advertising is a high-volume business, which led Pubstack to manage quickly growing data streams at a very early stage. Barely six months after its creation, Pubstack was already handling a billion events each day.
At a high level, what Pubstack does is the same as many businesses. They ingest tons of data in real time, transform it a bit, store it and then extract valuable information from it when needed. One of the features that proved to be game-changing for Pubstack was the low latency with which its users could query their data. Exploring your data interactively, with metrics appearing within 1 or 2 seconds, is a very different experience from having to think carefully about your next query because you know you will have to wait for 15 seconds, 30 seconds, 1 minute…
The life of an early entrepreneur is a race against time. Founders need to prove that their product delivers value quickly and with very limited resources. This means that for a big data startup to be able to bootstrap itself, it needs to choose its tech stack carefully, so that it neither burns its cash nor wastes its founders' precious time on technical details. With these constraints, the cloud appears as a natural choice. But it barely narrows down the possible technological choices, even when restricting yourself to one specific provider. There are so many open source solutions, and so many resellers that package them and offer them on the major cloud providers, that you could spend your life benchmarking them. The hardest thing in this exploration is that very often, these solutions maintain a veil of vagueness around their limitations, which you can only lift by trying them out. Sadly, the magic of big data is that nothing behaves the same at scale on real world data as it does on your most elaborate benchmark. This means that at some point, you have to make a crucial choice about your stack, knowing that it will likely be very far from optimal. But that’s life!
At Pubstack we made good enough choices, because they allowed us to start a business that is still growing quickly today. It was probably a good mix of luck and hard work; I don’t know in what proportion! The first choice was to focus on the largest cloud providers, AWS, GCP and Azure, capitalizing mostly on AWS. Their leading position in the market allows them to provide a very rich ecosystem of managed services that really do scale. On top of that, those services are always evolving and improving, which brings performance boosts with very little work. The second choice was to forbid ourselves from using self-managed technologies. Setting up and operating clusters can be time consuming. We could not afford to build up a large operational debt that would blow up on us when the data volumes grew. We knew that in our business, that would have happened too soon for us to recover from it!
It is in this setting that the main ideas for Pika and Buzz were born. In all the technologies that we considered, we always had the feeling that something wasn’t quite right. The flexibility of the cloud was underused. In particular there was one feature available at all cloud providers that drew our attention: object storage (S3, GCS…). It is so beautifully simple and versatile. In practice it scales almost without limit, because everybody is using it for tons of different use cases, which implies that cloud providers have provisioned tons of it. But everywhere we looked, every low latency data querying solution said the same thing: “Object storage, of course! Let’s put your backup and cold storage there. It’s great but too slow!”. We ended up using Elasticsearch to store our metrics, with high performance SSDs. With its great distribution of computations, its great caching, and its data locality, we easily reached the query latency we were looking for. We unlocked our business case and created the very interactive platform our clients needed. We left the object storage for use cases where latency is less of a concern. But having a cluster that is always on because it scales slowly is very wasteful. And there is something that seems wrong in the statement that “cloud storage is slow”. It does have a bit of latency, and the bandwidth you get from a typical VM is limited. But isn’t it the core idea of most modern data technologies to use commodity hardware and compensate by parallelizing? What if we could read from it with something that scales instantly to thousands of VMs? At AWS there is a good candidate for that role: Lambda.
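To make that intuition concrete, here is a back-of-the-envelope sketch. It is a deliberately naive model, not a description of our benchmarks: the function name, the dataset size, the per-worker bandwidth and the first-byte latency are all illustrative assumptions, and it ignores worker startup time, stragglers and the cost of merging results.

```python
def estimated_scan_seconds(total_bytes, workers, bytes_per_second_per_worker,
                           first_byte_latency_seconds):
    """Naive estimate of the time to scan a dataset from object storage.

    Assumes the data is split evenly across workers, that every worker reads
    its share in parallel at the same bandwidth, and that each worker pays
    the first-byte latency exactly once.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    bytes_per_worker = total_bytes / workers
    return first_byte_latency_seconds + bytes_per_worker / bytes_per_second_per_worker


# Hypothetical numbers, chosen only to illustrate the scaling behaviour.
DATASET_BYTES = 100e9          # assumed: 100 GB to scan
WORKER_BANDWIDTH = 100e6       # assumed: 100 MB/s per worker
FIRST_BYTE_LATENCY = 0.1       # assumed: 100 ms before the first byte arrives

for worker_count in (1, 10, 100, 1000):
    seconds = estimated_scan_seconds(
        DATASET_BYTES, worker_count, WORKER_BANDWIDTH, FIRST_BYTE_LATENCY
    )
    print(f"{worker_count:>5} workers: about {seconds:.1f} s")
```

Under these assumptions the transfer time shrinks in proportion to the number of workers, while the latency term stays constant. That is the whole bet: the latency of object storage is a fixed cost, and the bandwidth limit is something massive parallelism can pay for.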
After conducting our first benchmarks, we confirmed that the S3/Lambda duo was promising. We also discovered that there was quite a lot of research on the topic, with both a team from MIT and a team from ETHZ publishing very promising papers on the question (links below).
But pushing a new technology like this is outside the scope of a young and highly specialized company like Pubstack. An early stage startup needs focus to grow quickly and conquer its market! This is why we are here today, trying to see how far we can push this idea. We really believe there is a huge opportunity, and we are going to try not to let it slip away!