Posts Tagged hadoop

Simple Scala Map/Reduce Job

I was recently tasked with writing a Hadoop map/reduce job that takes a list of regular expressions and scours hundreds of gigs worth of log files for matches. Since I’ve been leaning more and more towards Scala, I wanted to use it for the job, but I also wanted to use Maven for dependency management to make the job easy to set up and extend. Finally, I wanted unit tests for my mapper and reducer as well as an overall job test. The result is this project I posted to GitHub as a template for future projects. I hope it proves as helpful for others as I’m sure it’ll be for me.

Jeopardy and Hadoop

[HT Alex Popescu]

Simple HBase query bridge

I’ve recently released a simple JSON-RPC query bridge for HBase (built on our own simple JSON-RPC framework) at http://code.google.com/p/hbasebridge/

You can use this bridge to query HBase for either the current version of a record or its last few versions.

To see the available methods, request the debug endpoint:

http://localhost:8080/hbasebridge/rpc?debug=true

Which returns a list of usable RPC methods:

{
  "jsonrpc": "2.0",
  "result": {"method": [
    {
      "class": "com.werxltd.hbasebridge.HBaseInfo",
      "name": "listtables",
      "params": [],
      "returns": "org.json.JSONObject",
      "static": false
    },
    {
      "class": "com.werxltd.hbasebridge.HadoopInfo",
      "name": "clusterstatus",
      "params": [],
      "returns": "org.json.JSONObject",
      "static": false
    },
    {
      "class": "com.werxltd.hbasebridge.HadoopInfo",
      "name": "jobstatus",
      "params": [],
      "returns": "org.json.JSONObject",
      "static": false
    },
    {
      "class": "com.werxltd.jsonrpc.RPC",
      "name": "listrpcmethods",
      "params": [],
      "returns": "org.json.JSONObject",
      "static": false
    },
    {
      "class": "com.werxltd.hbasebridge.TableLookup",
      "name": "lookup",
      "params": ["org.json.JSONObject"],
      "returns": "org.json.JSONObject",
      "static": false
    }
  ]}
}

To list tables:

http://localhost:8080/hbasebridge/rpc?debug=true&method=listtables

Which returns:

{
  "jsonrpc": "2.0",
  "result": {"tables": [
    "mytable"
  ]}
}

To get the current status of the cluster:

http://localhost:8080/hbasebridge/rpc?debug=true&method=clusterstatus

Which returns:

{
  "jsonrpc": "2.0",
  "result": {
    "activetrackernames": [
      "trackernode1:localhost/127.0.0.1:33455",
      "trackernode2:localhost/127.0.0.1:54616"
    ],
    "blacklistedtrackernames": [],
    "blacklistedtrackers": 0,
    "jobqueues": {"queues": [{
      "jobs": [
        {
          "cleanuptasks": [{"state": ""}],
          "complete": false,
          "filename": "hdfs://hadoophdfsnode:9000/data/hadoop/mapred/system/job_201003191557_0442/job.xml",
          "jobpriority": "normal",
          "mapprogress": 1,
          "name": "My mapreduce job",
          "reduceprogress": 0.9819000363349915,
          "runstate": "running",
          "schedulinginfo": "NA",
          "setupprogress": 1,
          "starttime": 1269024863960,
          "username": "hadoop-admin"
        }
      ],
      "name": "default"
    }]},
    "jobtrackerstate": "running",
    "maptasks": 1,
    "maxmaptasks": 116,
    "maxmemory": 2079719424,
    "maxreducetasks": 58,
    "reducetasks": 16,
    "tasktrackers": 34,
    "ttyexpiryinterval": 600000,
    "usedmemory": 969170944
  }
}
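
All three of the calls above follow the same pattern: an HTTP GET with a method query parameter, which returns a JSON-RPC 2.0 envelope. As a rough illustration (this client is not part of the bridge; the host, port, and path are simply the ones from the examples above), a Python caller might look like:

import json
import urllib.request

BASE = "http://localhost:8080/hbasebridge/rpc"

def call(method):
    # Parameterless methods are invoked with a plain method=...
    # query parameter, exactly as in the URLs above.
    url = "%s?debug=true&method=%s" % (BASE, method)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

print(call("listtables")["result"]["tables"])
print(call("clusterstatus")["result"]["jobtrackerstate"])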

Key/Value Query:

http://localhost:8080/hbasebridge/rpc?debug=true&data={"method":"lookup","params":{"table":"tablename","keys":["mykey"]}}

Results:

{
  "jsonrpc": "2.0",
  "result": {"rows": [{"mykey": {
    "family:col": "myvalue"
  }}]}
}

Key/Value query with versions:

http://localhost:8080/hbasebridge/rpc?debug=true&data={"method":"lookup","params":{"table":"tablename","keys":["mykey"],"versions":2}}

Results:

{
  "jsonrpc": "2.0",
  "result": {"rows": [{"mykey": {
    "family:col": [
      {
        "value": "myval",
        "version": 123456789
      },
      {
        "value": "myoldval",
        "version": 123456788
      }
    ]
  }}]}
}
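
Because lookup takes its parameters as a JSON object in the data query parameter, a client should URL-encode that object to keep the request well-formed. Here is a rough Python sketch of the versioned call above (drop the "versions" entry for the current-value form; the table and key names are the same placeholders used in the examples):

import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8080/hbasebridge/rpc"

payload = {
    "method": "lookup",
    "params": {"table": "tablename", "keys": ["mykey"], "versions": 2},
}
url = "%s?debug=true&data=%s" % (BASE, urllib.parse.quote(json.dumps(payload)))
with urllib.request.urlopen(url) as resp:
    print(json.load(resp)["result"]["rows"])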

The code should also provide a handy reference for anyone who wants to learn how to query HBase and scrape Result objects for values without knowing family or column names in advance.

Using Python with Hadoop

First, some review

Hadoop is a very powerful MapReduce framework based on a white paper released by Google documenting how they successfully tackled the problem of processing large amounts of data (on the scale of petabytes in many cases) using their proprietary distributed filesystem, GFS. Hadoop is the open source version of this distributed file system1, heavily supported by companies like Yahoo, Google, Amazon, Adobe, Facebook, Hulu, IBM, and Rackspace, and it has a growing number of related projects hosted by the Apache Foundation.

Why we need to learn “yet another language”

Yet, even with all of the buzz and hoopla, many people find it difficult to set up Hadoop and start writing applications capable of leveraging the awesome power of a Hadoop cluster; many find the learning curve of Java and the Hadoop APIs very steep.

Fortunately, one of the features available in Hadoop is HadoopStreaming, which allows programmers to specify any program (or script) as a mapper and/or reducer. Consequently, one of the most popular scripting languages to use alongside Hadoop is Python2.

One of the reasons Python is well suited to this type of work is its support for a functional style, provided you are careful about how you write your code. This makes chopping well-written Python map/reduce scripts up into distributable units much easier, as in the sketch below.
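
To make that concrete, here is a minimal word-count pair for HadoopStreaming. This is an illustrative sketch rather than production code: each script reads records on stdin and writes tab-separated key/value lines on stdout, which is all the streaming contract requires.

#!/usr/bin/env python
# mapper.py -- emit a (word, 1) pair for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py -- sum the counts for each word. Streaming sorts the
# mapper's output by key, so all lines for a word arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The pair is then launched with the streaming jar (its exact path varies by installation), roughly: hadoop jar hadoop-streaming.jar -input /logs -output /counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py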

There’s a framework for that

While it is possible to write plain Python scripts, the folks at last.fm have created an excellent Python framework for Hadoop called Dumbo to streamline the process of writing MapReduce jobs in Python. Dumbo seems to be a fairly simple framework with plenty of examples you can adapt to your particular needs.
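
To give a sense of the style, Dumbo’s canonical word-count example looks roughly like this (adapted from the project’s documentation; check the project itself for the current API):

def mapper(key, value):
    # Dumbo hands the mapper each input record as a key/value pair.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # All of the counts emitted for a given word are summed here.
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)

Scripts like this are submitted through Dumbo’s command-line launcher rather than invoked directly, which is what frees you from writing the streaming plumbing yourself.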

There’s a framework for that too

Hadoop has many sub-projects, and one that is fairly popular is HBase, which allows a more structured, database-like approach to storing and retrieving data. An excellent Python framework for quickly parsing data into HBase tables is Zohmg. This framework allows programmers to define tables in a YAML configuration file and corresponding mappers as simple Python scripts.

Bringing it back home

One of the biggest drawbacks to using HadoopStreaming is that it is inherently less efficient than writing MapReduce jobs in Java: the target script or application has to be initialized, and the data then has to be serialized, sent to that application or script, processed, and sent back (if there are any reducers). All this context switching adds overhead that wouldn’t exist if the MapReduce job were kept inside the JVM where Hadoop runs.

Jython is a viable answer for converting existing Python applications into Java bytecode so they don’t incur as much of a performance penalty. This utility can come in handy if you decide that your “quick and dirty” Python script needs to be moved into a production environment.

  1. Technically, Hadoop is an umbrella name, whereas HDFS is the technical name for the GFS alternative.
  2. If you aren’t familiar with Python and want to learn, here is an excellent site for diving into the language and here is an excellent video series walking you through the basics.

Getting started with Hadoop and MapReduce

Recently I’ve been studying several technologies that appear to form the core of cloud computing. In short, these are the technologies behind such technological marvels as Amazon, Google, Facebook, Yahoo, Netflix, Pixar, etc.1

Since each of these technologies is by itself worthy of a book, and since even those familiar with their common implementation languages (like Java and Python) can find them daunting to approach, I decided to put together all the resources I’ve found on these technologies in hopes that they will help someone else get started in this fascinating world of distributed or “cloud” computing.

Introduction to cloud computing

One might wonder why they should take the time to learn these technologies and concepts. That is a fair question, considering the amount of time and energy that will potentially be required to put any of this knowledge to functional use. With that in mind, I found the following videos particularly helpful in answering the question “why should I care?”:

Hadoop

Hadoop2 is essentially a collection of a number of different projects that make distributed computing a lot less painful. The best sources of beginner’s information on Hadoop I’ve found have been these Google lectures as well as Cloudera’s training pages:

MapReduce

MapReduce is more of a paradigm than a language: it is a way to write algorithms that can be run in parallel in order to harness the computing power of a number of computers across a large data set; a toy sketch of the idea follows below. There are a number of software frameworks that make writing MapReduce jobs a lot easier, and in the following videos you will learn how to use some of the most common.
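
Here is a toy, single-machine Python sketch of the paradigm’s three phases (map, shuffle, reduce); a framework like Hadoop performs the same steps, only distributed across a cluster:

from collections import defaultdict

def map_phase(documents):
    # Map: emit (key, value) pairs independently for each input
    # record, which is what lets a framework run mappers in parallel.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(grouped):
    # Reduce: combine all of the values that share a key.
    for word, counts in grouped.items():
        yield word, sum(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group the intermediate pairs by key (the framework's job).
grouped = defaultdict(list)
for word, count in map_phase(docs):
    grouped[word].append(count)

print(dict(reduce_phase(grouped)))  # {'the': 3, 'fox': 2, ...}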

Quickstart packages

As with many complex technologies, just setting up a working environment can be a challenge in itself, and one that is enough to discourage the casual learner. To help alleviate the stress of setting up a general Hadoop environment, to help you start working with Hadoop and the related cloud technologies, and to help you gain some useful hands-on experience, here are a few resources to help you get a working Hadoop environment going fairly quickly.

A helpful hint regarding videos: if you are like me and prefer to watch or listen to long lectures in your car or otherwise on the go on your netbook, iPod, or other mobile device, try looking for the above-mentioned videos on Google Video instead of YouTube. Google Video includes a helpful download link that allows you to take a copy of the movie with you.

  1. This article is a continuation of a recent article I wrote on the different approaches to cloud computing taken by Google and Microsoft.
  2. Hadoop was actually inspired by Google; more history and background here.
