First Steps of an Analytics Platform With MongoDB

MongoDB excels at analytics. Every week, we have customers asking for insight on building an analytics and metrics platform. We have seen outstanding performance from good practices and have seen issues with common bad practices.

Customers have different types of analytics engines on our hosted MongoDB platform ranging from usage metrics, business domain specific metrics, to financial platforms. The most generic type of metrics that most clients start tracking are events (e.g. “how many people walked into my stores” or “how many people opened an iPhone application”).

A proper schema is the first step to getting off the ground quickly on a platform that will scale.

The Naive Approach

With MongoDB, the first urge is to begin inserting documents quickly. The first documents typically have the following schema:

{
  store_id: ObjectId(), // Object id of a store
  event: "door open", // will be one of "door opened", "sale made", or "phone calls"
  created_at: new Date("2013-01-29T08:43:00Z")
}

To run a query on the eventstore_id, and created_at, you run the following query:

db.events.find({store_id: ObjectId("aaa"), created_at: {$gte: new Date("2013-01-29T00:00:00Z"), $lte: new Date("2013-01-30T00:00:00Z")}})  

These types of queries are deceptive. When you build the query on your local environment, it is fast. When scaling to 10 GB, they become slow. Typically, to increase speed, compound indexes are added for the following:

db.events.ensureIndex({store_id: 1, created_at: 1})  
db.events.ensureIndex({event: 1, created_at: 1})  
db.events.ensureIndex({store_id: 1, event: 1, created_at: 1} )  

Each of these indexes must entirely fit in RAM.  Any new document will have a seemingly randomly chosen "store_id".  An insert command will have a high probability of inserting the document record to the middle of an index.  To minimize RAM usage, it is best to insert sequentially: termed "writing to the right side of the index".  Any new key is greater than or equal to the previous index key.

An Optimized Document Schema

To optimize your document schema, create a time_bucket attribute that breaks down acceptable date ranges to hour, day, month, week, quarter, and/or year.

{
  store_id: ObjectId(), // Object id of a store
  event: "door open",
  created_at: new Date("2013-01-29T08:43:00Z"),
  time_bucket: [
    "2013-01-29 08-hour",
    "2013-01-29-day",
    "2013-04-week",
    "2013-01-month",
    "2013-01-quarter",
    "2013-year"
  ]
}

With the optimized schema, you would create the following indexes:

db.events.ensureIndex({time_bucket: 1, store_id: 1, event: 1})  
db.events.ensureIndex({time_bucket: 1, event: 1})  

With this document schema, we use a practice called “bucketing”. Instead of building a query on a range, we run the query:

db.events.find({store_id: ObjectId("aaa"), "time_bucket": "2013-01-29-day"})  

The draw back of this optimized document schema is every query must include atime_bucket attribute for every non-_id query. However, when querying most reporting systems, a date specification is required

Better Use of RAM

Using the optimized time_bucket, new documents are added to the right side of the index. Any inserted document will have a greater time_bucket value than the previous documents. By adding to the right side of the index and using time_bucket to query, MongoDB will swap to disk any rarely older documents. MongoDB runs with minimal RAM usage. Your “hot data” size will be the most recently accessed (typically 1 - 3 months with most analytics applications), and the older data will settle nicely to disk.

Neither queries nor inserts will access the middle of the index, and older index chunks can swap to disk.

Bonus Points: Using the Aggregation Framework

Using aggregation to find the number of specific events per day, we can run:

db.events.aggregate( {$match: {time_bucket: "2013-01-month"}}, {$unwind: "$time_bucket"}, {$project: {time_bucket_event: {$concat: ["$time_bucket", "/", "$event"]}}}, {$group: {_id: "$time_bucket_event", event_count: {"$sum": 1}}} )  

Disclaimer: I used the $concat operator that will be available in MongoDB 2.4. To use a hosted MongoDB 2.4 Development Branch, take a look at our experimental databases.

When using the aggregation framework, cache a final run for any particular day to a summated collection. Use this summated collection to present data via any application or reporting server.

Follow Up Materials

The premier presentation for MongoDB Analytics is by John Nunemaker of Github. He regularly makes the rounds at MongoDB conferences covering how to do MongoDB analytics properly. 10gen makes these presentations available. MongoDB for Analytics (at Github)

When building a scaling analytics system, look for logical “buckets” for data. Avoid using “$in”, “$gte”, “$lte” operators when possible. MongoDB is fun because it rewards creativity for good schema design.