Time zones, offsets, and fast time series aggregations
November 2023
My research group builds sensors to measure CO2 concentrations and air pollution around Munich. The resulting measurements are typical time series data. If we aggregate a sensor's measurements hourly, we get something like this:
This chart looks simple at first glance. We group the measurements between 07:00 and 08:00 and take e.g. the average. Then we group the measurements between 08:00 and 09:00 etc.
Now, imagine that we have two sensors and we want the same chart calculated over both their measurements combined. What if these two sensors are in different time zones? What if we ourselves are in yet another one? How does an aggregation work in that case? Which measurements do we group together?
You're probably thinking: Felix, all your sensors are in Munich ... and you are, too. What's your problem? I should probably read up on this YAGNI thing, or KISS, or whatever ... Let's dive in!
Definition: time zones and offsets
In the following, we're going to talk a lot about time zones and offsets. These two are not the same! Time zones specify locations (e.g. “Europe/Madrid”); offsets specify divergences from UTC (e.g. “UTC+02”). The time zone of a location usually stays the same while a location's offset can change, e.g. due to daylight saving time.
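To make the distinction concrete, here is a minimal sketch using Python's standard `zoneinfo` module. The dates are arbitrary ones I picked for illustration, and the snippet assumes the IANA time zone database is available on your system.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# A time zone names a location and stays the same all year.
madrid = ZoneInfo("Europe/Madrid")

# The offset of that time zone depends on the date we ask about.
summer = datetime(2023, 7, 1, 12, 0, tzinfo=madrid)
winter = datetime(2023, 12, 1, 12, 0, tzinfo=madrid)

summer_offset = summer.utcoffset()  # UTC+02 during daylight saving time
winter_offset = winter.utcoffset()  # UTC+01 during standard time

print("summer:", summer_offset)
print("winter:", winter_offset)
```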
In general, we have two options to aggregate time series data when different time zones are involved: We can aggregate in the user's time zone or in the data's time zone.
Aggregating in the user's time zone
Let's say our first sensor is in Madrid and the second one is in Athens. Imagine it's summer, then Madrid has an offset of UTC+02 and Athens has an offset of UTC+03. We're looking at the data from London, which has an offset of UTC+01 in the summer.
To aggregate in the user's time zone, we group the data by what our local clock in London shows at the time of measurement. Purple crosses represent measurements; the aggregation intervals are shown in green.
To calculate the value for 10:00, we aggregate measurements from when our local clock in London shows a time between 10:00 and 11:00. Note that this is not what the clocks in Madrid and Athens show!
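In code, this could look roughly like the following sketch. It assumes that measurements are stored as (UTC timestamp, value) pairs; the function name `aggregate_in_user_zone` and the readings are made up for illustration and are not taken from Tenta.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean
from zoneinfo import ZoneInfo


def aggregate_in_user_zone(measurements, user_zone):
    """Average the values per hour on the user's local clock."""
    buckets = defaultdict(list)
    for timestamp, value in measurements:
        local = timestamp.astimezone(user_zone)
        buckets[local.replace(minute=0, second=0, microsecond=0)].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Made-up readings from both sensors on a summer day, stored in UTC.
measurements = [
    (datetime(2023, 7, 1, 9, 10, tzinfo=timezone.utc), 400.0),  # Madrid
    (datetime(2023, 7, 1, 9, 40, tzinfo=timezone.utc), 420.0),  # Athens
    (datetime(2023, 7, 1, 10, 20, tzinfo=timezone.utc), 450.0),  # Madrid
]

london = ZoneInfo("Europe/London")
result = aggregate_in_user_zone(measurements, london)
for hour, average in result.items():
    print(hour.strftime("%H:%M"), average)
```

The sensor locations play no role in the grouping here. Only the user's time zone decides which bucket a measurement lands in.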
Aggregating in the data's time zone
To aggregate in the data's time zone, we group measurements from when the clocks in the different sensor locations show the same time.
Contrary to before, our own time zone is not relevant here.
To calculate the value for 10:00, we aggregate measurements from when the clocks in the different sensor locations show a time between 10:00 and 11:00. As you can see from the green boxes, these measurements were not recorded at the same point in time!
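Here is a comparable sketch for the data's time zone. It assumes that every measurement carries the time zone of its sensor; the function name `aggregate_in_data_zone` and the readings are again made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean
from zoneinfo import ZoneInfo


def aggregate_in_data_zone(measurements):
    """Average the values per hour on each sensor's local clock."""
    buckets = defaultdict(list)
    for timestamp, sensor_zone, value in measurements:
        local = timestamp.astimezone(sensor_zone)
        # Drop the zone so that equal wall-clock hours share one bucket.
        wall_clock_hour = local.replace(
            minute=0, second=0, microsecond=0, tzinfo=None
        )
        buckets[wall_clock_hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


madrid = ZoneInfo("Europe/Madrid")
athens = ZoneInfo("Europe/Athens")

# Made-up readings, stored in UTC and recorded one hour apart.
measurements = [
    (datetime(2023, 7, 1, 8, 30, tzinfo=timezone.utc), madrid, 400.0),
    (datetime(2023, 7, 1, 7, 30, tzinfo=timezone.utc), athens, 440.0),
]

result = aggregate_in_data_zone(measurements)
for hour, average in result.items():
    print(hour.strftime("%H:%M"), average)
```

Both readings end up in the 10:00 bucket, because the clocks in Madrid and Athens each showed 10:30 when their sensor took its measurement.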
It's important to clarify if data is aggregated in the user's time zone or the data's time zone. Even though both of the above charts show an aggregation for 10:00, they yield completely different results.
How to choose
For real-time applications, aggregating in the user's time zone is usually the better choice. I'm building a system to manage sensors remotely and in real time, called Tenta (open source). We want to see if sensors are healthy at the moment, or how many measurements they collected in the last 24 hours or over the last month. The user's time zone is more important than the data's time zone in this case.
For many analytical questions, it's better to aggregate in the data's time zone. Patterns that recur every 24 hours — like the outside temperature rising in the morning and falling in the evening — are called diurnal cycles. These cycles depend on the position of the sun, and start and end at different points in time in different locations. We have to aggregate in the data's time zone to observe them when multiple time zones are involved.
Aggregations are expensive operations. That's a problem for real-time applications (like Tenta). For them, aggregating naively is too slow.
Time series data has some interesting properties, namely that data is predominantly appended and only rarely updated or deleted. If we run the same aggregation repeatedly, we thus mostly get the same result, especially for older data.
We can take advantage of this fact by precomputing aggregations in regular intervals. This way, we already have the result when a request comes in. Some values might be outdated between precomputation runs, but this is often a worthwhile trade-off.
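The idea can be sketched in a few lines. This is a toy in-memory version under my own assumptions, not how Tenta implements it: the class name `PrecomputedHourlyAverage` is hypothetical, and a real system would run `precompute` on a schedule inside the database.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


class PrecomputedHourlyAverage:
    """Toy store that answers queries from a precomputed result."""

    def __init__(self):
        self.measurements = []
        self.result = {}

    def append(self, timestamp, value):
        self.measurements.append((timestamp, value))

    def precompute(self):
        # The expensive part; meant to run in regular intervals.
        buckets = defaultdict(list)
        for timestamp, value in self.measurements:
            hour = timestamp.replace(minute=0, second=0, microsecond=0)
            buckets[hour].append(value)
        self.result = {hour: mean(values) for hour, values in buckets.items()}

    def query(self):
        # The cheap part; runs on every request.
        return self.result


store = PrecomputedHourlyAverage()
nine = datetime(2023, 7, 1, 9, tzinfo=timezone.utc)

store.append(nine.replace(minute=10), 400.0)
store.precompute()

store.append(nine.replace(minute=40), 420.0)
stale = store.query()[nine]  # the new reading is not included yet

store.precompute()
fresh = store.query()[nine]  # now it is
```

Between the two precomputation runs, the query still returns the old average. That is the trade-off mentioned above.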
Precomputations work great to optimize aggregations in the data's time zone. However, an aggregation in the user's time zone depends on the user's time zone (because we pass the time zone to the query). To serve correct results around the world, we would need to continuously precompute the aggregation in every possible time zone.
The Internet Assigned Numbers Authority (IANA) defines around 500 time zones. Many time zones are the same or similar, but even with some trickery that's too expensive.
So, we can't precompute an aggregation in the user's time zone. However, we can precompute an aggregation in UTC and then display the timestamps shifted by the user's current offset! Note the difference to before: Instead of the user's time zone, we use their current offset.
Let's see an example with one user in London and another one in Kathmandu. We use the same aggregation result around the world and only adapt the axis labels:
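As a sketch, the label shifting could look like this. The function name `shift_labels` and the aggregated values are made up; the offsets are the summer offsets of London and Kathmandu, and I assume the client tells us its current offset.

```python
from datetime import datetime, timedelta, timezone


def shift_labels(utc_result, current_offset):
    """Relabel a UTC aggregation with the user's current offset."""
    user_offset = timezone(current_offset)
    return {
        hour.astimezone(user_offset).strftime("%H:%M"): value
        for hour, value in utc_result.items()
    }


# One precomputed result in UTC, shared by all users (made-up values).
utc_result = {
    datetime(2023, 7, 1, 9, tzinfo=timezone.utc): 410.0,
    datetime(2023, 7, 1, 10, tzinfo=timezone.utc): 450.0,
}

london_labels = shift_labels(utc_result, timedelta(hours=1))
kathmandu_labels = shift_labels(utc_result, timedelta(hours=5, minutes=45))

print(london_labels)
print(kathmandu_labels)
```

The values are identical for both users. Only the labels differ: the first bucket reads 10:00 in London and 14:45 in Kathmandu.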
The reason we can't apply the same trick for time zones, but have to pass them to the query, is that a time zone's offset can change. The aggregation interval that contains the change is longer or shorter than usual. It's not possible to adapt the result of an aggregation to a different interval length retroactively.
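To see the changing interval length, we can measure how long a calendar day really lasts in a given time zone. This is a small sketch with a helper name of my choosing (`local_day_length`); the dates are the 2023 daylight saving time changes in London.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def local_day_length(year, month, day, zone):
    """Elapsed time between two consecutive local midnights."""
    start = datetime(year, month, day, tzinfo=zone)
    end = start + timedelta(days=1)  # next midnight on the wall clock
    # Subtract in UTC; otherwise Python compares wall-clock values only.
    return end.astimezone(timezone.utc) - start.astimezone(timezone.utc)


london = ZoneInfo("Europe/London")

spring = local_day_length(2023, 3, 26, london)  # clocks go forward
summer = local_day_length(2023, 7, 1, london)   # no change
autumn = local_day_length(2023, 10, 29, london)  # clocks go back

print(spring, summer, autumn)
```

A daily bucket in London covers 23 hours in spring and 25 hours in autumn, while the corresponding bucket in UTC always covers 24. No relabeling can turn one into the other.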
Trade-offs to using the current offset over the time zone
The user's current offset is only an approximation of their time zone. The axis tick labels are adjusted by the user's current offset, not by the user's offset when the measurement was made. If the time zone's offset changes inside the time span the aggregation covers, the labels that precede this change won't reflect that.
Furthermore, if we use the user's current offset instead of their time zone, aggregation intervals with logical start and end times, like days, weeks, or months, don't make much sense anymore. Their logical start and end times (e.g. midnight) may be different points in time in different time zones. For example, an aggregated day in UTC can't be mapped to a day in UTC+01, because the two days cover different 24-hour windows. We can use durations (e.g. 24 hours) instead.
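A quick sketch shows the mismatch for a single day. The date is arbitrary and the variable names are mine.

```python
from datetime import datetime, timedelta, timezone

# One aggregated day in UTC: from midnight to midnight.
utc_day_start = datetime(2023, 7, 1, tzinfo=timezone.utc)
utc_day_end = utc_day_start + timedelta(days=1)

# The same interval as seen by a user with an offset of UTC+01.
plus_one = timezone(timedelta(hours=1))
local_start = utc_day_start.astimezone(plus_one)
local_end = utc_day_end.astimezone(plus_one)

# The interval runs from 01:00 to 01:00 and touches two local dates.
last_moment = local_end - timedelta(microseconds=1)
spans_two_local_dates = local_start.date() != last_moment.date()

print(local_start.isoformat(), local_end.isoformat())
print(spans_two_local_dates)
```

The shifted interval is still exactly 24 hours long, which is why a duration works where a calendar day doesn't.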
I had a lot of fun with this problem! For Tenta, we aggregate most things in the user's time zone and optimize with precomputations as described. If you're working with sensors, don't hesitate to check out Tenta on GitHub!
If you want to dive further into the subject of time, I really liked this article about Unix time and its flaws. I also liked this HN discussion and this one about tracking time on the moon.
Thanks to Paul Moosbrugger, Patrick Aigner, Simon Böhm, Moritz Makowski, and Julia Godart for reading drafts of this.
- It's rare, but a location's time zone can change: China once had five time zones with five different offsets but switched to its current single time zone with an offset of UTC+08 in 1949.
- If you're surprised by Nepal's UTC+05:45 offset, note that offsets do not have to be full hours. Up until 1920, Nepal even used an offset of UTC+05:41:16.