Modern applications have to deal with new constraints:
- memory is still incredibly limited compared to the size of data we’re dealing with
- the data sets are actually not set anymore, but streams, with no start and no end
- the inputs are realtime, but the outputs also need to be realtime
Sampling and Windowing
Of course, the immediate and common answer to these problems would be windowing (only storing data between 2 events) and sampling (only taking random data). Windowing brings latency: you only get the data from the ‘previous’ window, while sampling may lead to loss of data. Sometimes, you want to avoid these two pitfals and that’s when you have to re-invent the maths and algorithms you’re dealing with.
Warning: if you’re reading this from a feed reader, you may want to hop on here, because you’re missing on great illustrations and beautiful math formulaes :)
What’s the average?
Below is a series of points. Let’s assume that this series is a coming from a datastream. We want to compute the average. We only ever show 20 items, but we want to compute the averages over the whole data set (even the points which are not visible anymore…)
The first approach, which we all learnt in school is the following: for each new point added to the series, sum all points and divide by the total number of points:
The problem with this is that if the frequency at which new items are added is higher than time it takes to count and sum all items, we will be unable to compute the average as the dataset becomes bigger and bigger. We are also quickly unable to store all of these items.
That first ambush can be quickly avoided by keeping memory of 2 extra variables: a running count of all items and a running sum of all the items.
For each new item, we increment the running count by 1 and the running sum by its value. This way, the average is trivial to calculate for each new point:
And we can easily add it to our graph:
This looks pretty, but if we wait a little bit, an obvious problem quickly arises: the average will converge toward 0.5 (which is indeed the expected average for our dataset and won’t ever evolve from there). If after years of being around 0.5 our data-stream suddenly starts to increase, it will take a long time for the average to reflect the change.
The problem of running counts actually from our equation: the earliest dots have the same weight as the latest ones. This is probably a situation we want to avoid, because when considering a stream, the most recent data probably has more impact than the oldest data.
Another way to view the rolling average is that each new item will affect the previous average. We can then approximate the average as a weighted average of the previous average with the latest item. These are called exponential moving averages:
The value of is what determines the weight of the historical data. The greater it is, the longer any given value will have an impact on the average.
Here’s now our running recursive weighted average with 2 weights: 10 and 100. You can see how they both follow the trends differently. The will not follow short-term changes, but it will also take a long time to recover from earlier biases.
At Superfeedr, we use recursive weighted averages a lot: they’re the basis of our popularity ranking for example.