Recently, some folks at my company Agero started discussing strategies for pre-aggregating data that could be used for statistical computation based on business metrics. Specifically, the question came up as to how we could maintain event counts.
Counting has a long history in the field of probability and statistics and is in many ways the most foundational form of data required by data science teams. A histogram is nothing more than bucketed counts, and a probability distribution is just the continuous version of the same thing. Maintaining counts is often a prerequisite for useful statistical computation. Thus, there is value in considering how one might count lots of things.
Having drunk the serverless Kool-Aid a while ago, I’ve been using Lambda and DynamoDB heavily in recent years. I present the following as a possible solution for maintaining counts in this type of environment:
First a rough architecture:
The infrastructure is pretty simple. We have a table that receives our event data. This could be purchase data for a customer, ad impressions for a visitor, or any other monotonic event source.
When an event arrives in our event table, DynamoDB will automatically invoke our Lambda function with the operation type “INSERT” and a payload representing the data that was placed into the event table. Based on this event data we construct a “PutItem” operation to upsert into our count table. For our use case we wanted to know, “how many jobs have been placed for a specific customer in the last 24 hours?”
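To make this concrete, here is a minimal sketch of what that Lambda function might look like in Python. The attribute names (`customer_id`, `created_at`, `event_count`), the key format, and the `COUNT_TABLE` environment variable are all assumptions for illustration, not the exact schema from my implementation. I’m also sketching the upsert with `UpdateItem` and an `ADD` expression, which creates the bucket if it doesn’t exist and increments it atomically:

```python
import os
from datetime import datetime

def bucket_keys(customer_id, ts):
    """Hour- and minute-level bucket keys for one event (key format assumed)."""
    return [
        f"{customer_id}#{ts:%Y-%m-%dT%H}",     # hour bucket
        f"{customer_id}#{ts:%Y-%m-%dT%H:%M}",  # minute bucket
    ]

def handler(event, context, dynamodb=None):
    """Stream handler: turn each INSERT into increments of two count buckets."""
    if dynamodb is None:
        import boto3  # available in the Lambda runtime
        dynamodb = boto3.client("dynamodb")
    table = os.environ.get("COUNT_TABLE", "counts")
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue
        image = record["dynamodb"]["NewImage"]
        customer_id = image["customer_id"]["S"]
        ts = datetime.fromisoformat(image["created_at"]["S"])
        for key in bucket_keys(customer_id, ts):
            # ADD creates the item if absent and increments it atomically,
            # which gives us the upsert behavior described above.
            dynamodb.update_item(
                TableName=table,
                Key={"pk": {"S": key}},
                UpdateExpression="ADD event_count :one",
                ExpressionAttributeValues={":one": {"N": "1"}},
            )
```

Injecting the DynamoDB client as a parameter also makes the handler easy to unit test with a stub.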
For this demo I decided that minute level precision was close enough, but one could extend this solution to any level of precision. Really it comes down to how much extra one is willing to pay for additional decimal places of precision. There are trade-offs with any architecture and this architecture trades off a small bit of precision for a huge speed improvement on large datasets. Similarly there are tradeoffs in precision and cost in the real world. For example, one might use a tape measure for carpentry but a micrometer for machining.
Given we require minute level precision, we bucket all counts by both hour and minute. To illustrate why, consider the following:
Given this diagram, the reason for keeping both levels of granularity becomes clear. When we move to the next hour in our series, we don’t yet have a full hour of data, so we need to look back at the historical minute buckets and fill in the partial hour. This design means our code has to fetch between 24 and 24 + 59 = 83 DynamoDB keys for every count we wish to calculate. That fixed upper bound of 83 keys makes our system’s performance more stable than if we had to compute over an unbounded number (0..n) of raw events. 83 is also a convenient number, since it fits within BatchGetItem’s 100-key limit and can therefore be fetched without pagination.
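Here’s a sketch of that key enumeration, assuming bucket keys of the form `<customer>#YYYY-MM-DDTHH` for hour buckets and `<customer>#YYYY-MM-DDTHH:MM` for minute buckets (the same illustrative format as above):

```python
from datetime import datetime, timedelta

def keys_for_last_24h(customer_id, now):
    """Bucket keys covering the trailing 24-hour window (key format assumed)."""
    keys = []
    top = now.replace(minute=0, second=0, microsecond=0)
    # 24 hour buckets: the current partial hour plus the previous 23 hours.
    for h in range(24):
        hour = top - timedelta(hours=h)
        keys.append(f"{customer_id}#{hour:%Y-%m-%dT%H}")
    # Back-fill the partial hour at the far edge of the window with minute
    # buckets: up to 59 of them.
    oldest_hour = top - timedelta(hours=24)
    if now.minute:
        for m in range(now.minute, 60):
            minute = oldest_hour + timedelta(minutes=m)
            keys.append(f"{customer_id}#{minute:%Y-%m-%dT%H:%M}")
    else:
        # Exactly on the hour: the oldest hour is complete, so its hour
        # bucket covers the whole edge.
        keys.append(f"{customer_id}#{oldest_hour:%Y-%m-%dT%H}")
    return keys
```

For example, at 03:01 this yields the worst case of 24 hour keys plus 59 minute keys, i.e. 83 keys.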
The requirement to query up to 83 keys means that if we have very few events per 24-hour period, the advantage of pre-aggregating goes away, and we likely wouldn’t need to pre-aggregate at all. Thus, this type of counting is best suited to a relatively high volume of incoming event data.
This article is not about counting and aggregating in general; it’s intended to help folks aggregate and count things on DynamoDB specifically. In this section I’ll dive into some of the techy details. I’m using DynamoDB here, but the same system could be built on any key-value datastore that supports similar operations (Cassandra, perhaps?).
The first thing we need to do is set up the infrastructure as illustrated in the architecture diagram above. We use CloudFormation to do that. The template creates two tables and one Lambda function; the Lambda function holds the code for translating events into counts.
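For illustration, a stripped-down version of such a template might look like the following. The resource names, key schemas, and placeholder IAM role ARN are all illustrative, not the exact template from my stack:

```yaml
Resources:
  EventTable:
    Type: AWS::DynamoDB::Table
    Properties:
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - {AttributeName: pk, AttributeType: S}
      KeySchema:
        - {AttributeName: pk, KeyType: HASH}
      StreamSpecification:
        StreamViewType: NEW_IMAGE   # the Lambda needs the inserted item
  CountTable:
    Type: AWS::DynamoDB::Table
    Properties:
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - {AttributeName: pk, AttributeType: S}
      KeySchema:
        - {AttributeName: pk, KeyType: HASH}
  CounterFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.12
      Handler: index.handler
      Role: arn:aws:iam::123456789012:role/counter-role  # placeholder role
      Code: {ZipFile: "..."}  # the stream-handling code sketched earlier
  StreamMapping:
    Type: AWS::Lambda::EventSourceMapping
    Properties:
      EventSourceArn: !GetAtt EventTable.StreamArn
      FunctionName: !Ref CounterFunction
      StartingPosition: LATEST
```

The `EventSourceMapping` is what wires the event table’s stream to the function so inserts invoke it automatically.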
Now that you have the code and infrastructure in place, you can generate some event data and see the result.
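A quick way to do that is a small script that builds event items and writes them to the event table. The table and attribute names here are the same illustrative ones used above:

```python
import uuid
from datetime import datetime, timezone

def make_event(customer_id, created_at=None):
    """Build an item for the event table (attribute names are illustrative)."""
    created_at = created_at or datetime.now(timezone.utc)
    return {
        "pk": str(uuid.uuid4()),        # unique event id
        "customer_id": customer_id,
        "created_at": created_at.isoformat(),
    }

# Writing the events requires AWS credentials and the deployed table:
# import boto3
# table = boto3.resource("dynamodb").Table("events")
# for _ in range(100):
#     table.put_item(Item=make_event("cust-1"))
```

Each insert should then show up, via the stream, as increments in the count table within a few seconds.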
Lastly, you can query the counts table for a specific time range, aggregate the counts, and do as you wish with that information.
That’s all you really need for a basic DynamoDB-based counting system. There are many other ways you might construct a counting system; this article aims to offer just one approach. Questions and comments are always welcome.