Posts Tagged big data

MongoDB script to check the status of background index builds

Here is a simple script I’ve found to be quite helpful for monitoring the status of background index builds across shards on a system:

var currentOps = db.currentOp();

if(!currentOps.inprog || currentOps.inprog.length < 1) {
    print("No operations in progress");
} else {
    for(o in currentOps.inprog) {
        var op = currentOps.inprog[o];
        if(op.msg && op.msg.match(/bg index build/)) {
            print(op.opid+' - '+op.msg);
        }
    }
}

Here's the output:

$ mongo mycluster:30000/mydb bgIndexBuildStatus.js 
MongoDB shell version: 1.8.1
connecting to: mycluster:30000/mydb
shard0000:343812263 - bg index build 122042652/165365928 73%
shard0001:355224633 - bg index build 111732254/165568168 67%
Share/Save

Tags: , , , ,

Fun with heatmaps

Recently I’ve been playing with a few new technologies. Some are new to me while most are simply new. My base project to use these technologies is a heatmap visualization of churches in Georgia. While heatmaps in themselves aren’t exactly exciting, having the ability to map more than 10,000 data points at a time in real-time is.

So here are the various technologies I used in the creation of my app.

YQL

To get the data for this project I used a simple Python script and YQL to get a list of locations in each Zip code in Georgia with the term “Church” in them. This yielded approximately 10,000 results, including a few from South Carolina.

MongoDB

I stored the data I got from YQL in MongoDB. Specifically, I used MongoLab’s to host the data because they have a generous free storage limit (256Meg) and can be accessed from a RackSpace server without incurring an additional bandwidth charge.

MongoDB is a JavaScript-based NoSQL solution used by sites like Foursquare. It has built-in support for Geo queries and is built to scale.

node.js

For the application layer I decided to try out node.js. Node is also JavaScript-based, based on WebKit, the engine behind Google’s Chrome and Apple’s Safari browsers. Node is event-based which means it has the potential to be lightning fast.

Canvas

The biggest factor in how well a heatmap solution performs is the graphics package that’s used. After searching around I found a pretty decent PHP heatmap solution, but it used GD and was very slow. I also found a JavaScript solution that used the new HTML5 Canvas element, but it choked when given a significant amount of datapoints to render all at once. So I decided to refactor some of the utility functions from the PHP solution and combine them with the Canvas function.

The great thing about the resulting solution is that it has the ability to run on either the client-side or the server-side. And in the end the heatmap application I built uses both. If the number of data points in a tile is less than a preset threshold defined per browser1 the raw data is sent back to the browser which renders the tiles client-side rather than consuming precious server resources.

Web Sockets

The usual way to serve tiles is by serving images to overlay the map. And while the heatmap solution I developed does serve static PNG files from cache, I decided to use the new HTML5 Web Sockets to make things a bit more interesting. What is great about web sockets is that it allows me to pass events between the server and client very easily. Socket.io made it easy to forget where the server ended and where the client began.

ZeroMQ

As applications scale to multiple threads, processes, and eventually across servers and networks, they need to have a way for each component to communicate with other components in an efficient manner. So add a bit of future-proofing to my solution I decided to use ZeroMQ to pass messages between the front-end web server component and the back-end tile generator component. This allows me to tune the performance of the application in both directions, up or down2.

Raphaël

To add some extra pizzazz to the app I decided to add in the ability to display each individual data point along with some additional detailed information. I found that Google’s native Marker system was a bit slow when it came to displaying over 2,000 markers at a time so I decided to give the Raphaël graphics library a try. The results were impressive. Raphael was not only able to draw thousands of data points on the map seamlessly, but was able to do it with smooth animations. Look for gRaphaël to be employed in future renditions of this heatmap solution.

Conclusion

Every now and then I run across a programming challenge that reminds me why I love doing what I do. These technologies and this project have done that for me. Being able to throw together a large, complex project like this in a relatively quick manner reminded me of Fred Brooks’s comment on why programming is fun.

  1. 1000 for WebKit-based browsers, 250 for Mozilla-based browsers, and 0 for IE because IE still sucks. []
  2. Tuning performance down comes in handy when you are on a shared server with limited resources. []

Tags: , , , , , , , , , , ,

MySQL at Facebook with Mark Callaghan

[HT NoSQL]

Tags: , ,