Modeling and managing time series
Historizing events is a very common problem. But it proves difficult to get a good understanding of the ecosystem that addresses it.
Written on September 3, 2020
According to the current Wikipedia definition : “A time series database (TSDB) is a software system that is optimized for storing and serving time series through associated pairs of time(s) and value(s).” This is a generic but efficient definition, and a lot of use cases fall under it. For instance, systems specialized in log management comply with this definition. Their only specificity is that what they consider as “values” is highly unstructured compared to other systems that, for instance, focus on storing series of simple scalar metrics.
Something we find striking at Cloudfuse is that databases that aim at collecting logs, and in particular Elasticsearch, rarely advertise themselves as time series databases. Let’s try to understand the reason for that.
Instead of comparing databases by their high-level capabilities, which is often deceiving because you find yourself muddled by the wonderful promises of the database marketing teams, we are going to discuss their storage models to give ourselves a better idea of their structural limits. This will help us understand the weaknesses we will encounter when we will scale the systems into production.
Types of series
Let’s try to formalize all the possible forms of time series.
At the core of time series, there is the notion of datapoint that represents an event in time. The only piece of information that is required to define a datapoint is a timestamp. Usually though, you will want to discriminate different kinds of events, so you have to associate your datapoints something like a name.
With a single timestamp, we can only represent punctual events. You have many ways to represent events that are lasting in time:
splitting them into two datapoints, one for the start and one for the end
specifying the event duration or end timestamp in the datapoint
Even though the actual information stored ends up being the same, choosing the representation is usually not trivial because you might make the global event harder to recover.
Then you have the rest of the information conveyed by the datapoint. It can carry pretty much any kind of data. We could try to categorize all this information in many ways. Some systems, like Elasticsearch use data types (Boolean, string, float…). Others, like InfluxDB or Prometheus, consider a datapoint to be a metric name, a set of tags (key value pairs) and a value. This categorization and the decision of how to organize the storage according to these categories is what we call time series modelling. Sadly, there is no silver bullet, every decision will have upsides and downsides, in particular in terms of storage and query performance. This is why the first step in order to model the data correctly is to ask ourselves:
How much can I spend on my system and how can this spending evolve over time?
How will the system be queried and what is the expected latency?
The same way there is an infinity of types of data you might want to store in a datapoint, there are many different types of queries you might want to submit to your time series system. We can conceptualize the set of queries as a spectrum ranging from the following extremes:
Search. You are looking the ability to retrieve one exact datapoints you stored. As your data grows larger and larger, it becomes impossible to retrieve all the data you stored at once so you will need your system to have the capacity to search for the datapoints you are interested in according to a given set of filtering criteria. An example could be the retrieval of the log corresponding to creation of the account for a given user id.
Aggregation: You are looking for the ability to compute KPIs that summarize all your datapoints. You are not interested in extracting concrete datapoints but rather a valuable information that is the sum of a fraction of the information of each of your datapoints. An example could be the retrieval of the total amount of sales realized by your company.
And of course, between these two extremes, you have any possible combination of the two. Typically, you are going to want to extract partially aggregated information on a subset of datapoints. An example would be a query for the top sales by user in December last year. Slicing and dicing the data can be seen as the action of walking on the spectrum interactively!
Modeling for each use cases
For very specific and well-defined use cases, you can build a very simple system that serves you as a datastore. For instance, if you are tracking events, but you will only ever be interested by the total number of events that occurred since you started your system, a simple counter will be enough.
Sadly, your use case will often be more complex and more importantly, the usage of the data is very likely to evolve. This is why we are going to focus here on more general-purpose tools.
We can identify two common ways of storing the data that is commonly used for time series.
Examples: KairosDB, Prometheus, Influxdb
These are the systems that officially advertise themselves as time series databases. They often share a similar data model. They consider a datapoint to be a metric name, a set of tags (key value pairs) and one or multiple values. Values are mostly numerical (float/int). The idea behind this model is to group the values that have the same metric name and tags together and physically store the values and their timestamp contiguously in “rows”. The hypothesis here is that time series correspond to measurements from a fixed number of sources, and one source will emit many datapoints through time. This physical storage has two benefits:
Compact because you store the metadata (metric name + tags) only once for many datapoints
High throughput retrieval if you want many datapoint from the same source because they are stored contiguously
This model is often highly efficient except if the cardinality of the metadata becomes high. In that case you might end up with only few datapoints per “row” and the data ends up scattered on the disk and expensive to retrieve.
Examples: Elasticsearch, Presto/Impala
Here the model is to store every different field as columns of given types. The timestamp is just one column, so are the metric name, the tags and the values. This means that even though you have only two different metrics in your database, you will need to store the metric name for every single datapoint. Luckily, this can be done in a very compact fashion because each field stores its values together in columns. So if you store only two metrics, you can encode each value in with only one bit, the original values can be retrieved with a dictionary, and the bit array can be further compressed with various techniques such as run length encoding .
This columnar organization makes the data less sensitive to the explosion of cardinality, because columns remain independent. If you have a column that represents a tag with many complicated values, this will barely impact queries as long as they don’t use this column. On top of that, you can speed up queries on that complicated column by adding the appropriate indexes (e.g. reverse indices for full text search).
This flexibility comes with a cost compared with metric databases. If most of your queries target only one source (one given metric name + unique combination of tags), you will still need to scan all the datapoints, even if you will only need to read the columns that are used by your query. This issue can be partially mitigated by partitioning the data. Data should always be partitioned by time, otherwise queries, no matter their complexity, would only become slower and slower through time as the dataset grows. You can add one or two more partitioning(s) if you know that some of the columns have limited cardinalities and you know that they will always be used to filter the queries. One example could be the “customer” dimension in a SaaS analytics platform, as each customer would only query its own data.
On the other hand, if you need to slice and dice your data across many dimensions, columnar databases will provide pretty consistent performances because you would keep scanning a similar amount of data. This is very valuable for use cases where you want your users to explore the data more freely. If you manage to provision sufficient compute and bandwidth to the data storage to read the targeted partitions with the expected latency, you secure a very flexible way to expose your time series data.
Even though the use of quickly scaling cloud functions is independent from the modeling, we decided to focus mainly on the columnar data model. We just love the slice and dice flexibility it provides. Our challenge is to make the extra cost in terms of resources acceptable by allocating instantly the right amount of compute for each query.