Here are some helpful resources if you are looking to learn Python.
Google I/O 2008 – Painless Python Part 1 of 2
Google I/O 2008 – Painless Python Part 2 of 2
Dive Into Python (excellent reference)
Hadoop is a very powerful MapReduce framework based on white papers released by Google documenting how they successfully tackled the problem of processing large amounts of data (on the scale of petabytes in many cases) using their proprietary distributed filesystem, GFS. Hadoop is the open source implementation of this design (its HDFS filesystem plays the role of GFS); it is heavily supported by companies like Yahoo, Google, Amazon, Adobe, Facebook, Hulu, IBM, and Rackspace, and has a growing number of related projects hosted by the Apache Foundation.
Yet, even with all of the buzz and hoopla, many people find it difficult to set up and start writing applications capable of leveraging the power of a Hadoop cluster; many find the learning curve of Java and the Hadoop APIs very steep.
Fortunately, one of the features available in Hadoop is Hadoop Streaming, which allows programmers to specify any program (or script) as a mapper and/or reducer. Consequently, one of the most popular scripting languages to use alongside Hadoop is Python.
One of the reasons Python is well suited to this type of work is its support for a functional style, provided you are careful how you write it. This makes chopping well-written Python map/reduce scripts up into distributable units much easier.
While it is possible to write plain Python scripts, the folks at last.fm have helped create an excellent Python framework for Hadoop called Dumbo to help streamline the process of writing MapReduce jobs in Python. Dumbo seems to be a fairly simple framework with plenty of examples you can adapt to your particular needs.
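For flavor, here is roughly what the canonical word count looks like in Dumbo's style: the mapper and reducer are plain generator functions (consult Dumbo's own examples for the authoritative version; this is a sketch from memory).

```python
def mapper(key, value):
    # key is the byte offset of the input line; value is the line itself.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values iterates over every count emitted for this word.
    yield key, sum(values)
```

The module is then wired into a job by calling `dumbo.run(mapper, reducer)` under a `__main__` guard and launched with Dumbo's runner, e.g. `dumbo start wordcount.py -hadoop /path/to/hadoop` (script name and Hadoop path are placeholders).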
Hadoop has many sub-projects, and one that is fairly popular is called HBase, which allows a more structured, database-like approach to storing and retrieving data. An excellent Python framework for quickly parsing data into HBase tables is Zohmg. This framework allows programmers to define tables in a YAML configuration file and corresponding mappers as simple Python scripts.
One of the biggest drawbacks to using Hadoop Streaming is that it is inherently less efficient than writing MapReduce jobs in Java: the target script or application has to be initialized, and the data has to be serialized, sent to the target application/script, processed, and then sent back (if there are any reducers). All this context switching adds overhead that wouldn't exist if the MapReduce job were kept in the JVM where Hadoop runs.
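The per-record cost is easy to see if you write out the round trip: every key/value pair leaves the JVM as a line of text and must be parsed back on the other side (the helper names here are mine, for illustration; the tab-separated framing is the Streaming default).

```python
def serialize(key, value):
    # JVM -> script: each record crosses the pipe as tab-separated text.
    return "%s\t%s\n" % (key, value)

def parse(line):
    # Script side: every record must be split and, if needed, re-typed.
    key, value = line.rstrip("\n").split("\t", 1)
    return key, value

# One full hop per record; a native Java job passes objects in-process.
record = parse(serialize("word", 7))
```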
Jython is a viable answer for converting existing Python applications into Java bytecode, avoiding much of that performance penalty. This utility can come in handy if you decide that your "quick and dirty" Python script needs to be moved into a production environment.
Recently my wife and I decided to try to wrangle our images into some sort of logical order for easy accessibility. After some thought we settled on a simple system of image-directory/year/month, and since our old images were spread out across several folders in no particular order, I decided to write a script to copy everything into the right folders.
Here is the Python script I wrote to sort all of our images by date (it uses each file's modification time) into properly ordered folders. Usage:
./picturesorter.py /images/source /images/dest
```python
#!/usr/bin/python
import sys, os, time, tempfile, fnmatch
from stat import ST_MTIME

if len(sys.argv) != 3:
    print "Usage: " + sys.argv[0] + " [source] [target]"
    sys.exit(-1)
if not os.path.isdir(sys.argv[1]):
    print "'" + sys.argv[1] + "' is not a valid source directory"
    sys.exit(-1)
if not os.path.isdir(sys.argv[2]):
    print "'" + sys.argv[2] + "' is not a valid destination directory"
    sys.exit(-1)

# Define your system's copy command here
copyCmd = "cp -f"

def walk(root, recurse=0, pattern='*', return_folders=0):
    """Return the files under root matching any of the ;-separated patterns."""
    result = []
    try:
        names = os.listdir(root)
    except os.error:
        return result
    pattern = pattern or '*'
    pat_list = pattern.split(';')
    for name in names:
        fullname = os.path.normpath(os.path.join(root, name))
        for pat in pat_list:
            if fnmatch.fnmatch(name, pat.upper()) or fnmatch.fnmatch(name, pat.lower()):
                if os.path.isfile(fullname) or (return_folders and os.path.isdir(fullname)):
                    result.append(fullname)
                continue
        if recurse and os.path.isdir(fullname) and not os.path.islink(fullname):
            result = result + walk(fullname, recurse, pattern, return_folders)
    return result

def getTime(file):
    """Return the file's modification time, or None if it can't be read."""
    result = None
    try:
        st = os.stat(file)
    except OSError:
        print "failed to get information about", file
    else:
        result = time.localtime(st[ST_MTIME])
    return result

if __name__ == '__main__':
    log_fd, logfilename = tempfile.mkstemp(".log", "psort_")
    logfile = os.fdopen(log_fd, 'w+')
    print "Scanning '%s' for images..." % sys.argv[1]
    files = walk(sys.argv[1], 1, '*.jpg;*.gif;*.png;*.psd;*.tif', 0)
    logfile.write("Found %d images in '%s'...\n" % (len(files), sys.argv[1]))
    print "Copying %d images to '%s'" % (len(files), sys.argv[2])
    for file in files:
        fileTime = getTime(file)
        if fileTime is None:
            continue  # unreadable file; already reported by getTime
        destination = os.path.join(sys.argv[2], "%s" % fileTime.tm_year)
        if not os.path.isdir(destination):
            os.mkdir(destination)
            logfile.write("Created directory '%s'\n" % destination)
        destination = os.path.join(destination, time.strftime("%m", fileTime))
        if not os.path.isdir(destination):
            os.mkdir(destination)
            logfile.write("Created directory '%s'\n" % destination)
        os.system("%s \"%s\" \"%s\"" % (copyCmd, file, destination))
        if os.path.isfile(os.path.join(destination, os.path.basename(file))):
            print ".",
            logfile.write("'%s' => '%s'\n" % (file, destination))
        else:
            logfile.write("[FAIL] '%s' => '%s'\n" % (file, destination))
    print "Finished copying files, log file available at %s" % logfilename
    logfile.close()
```
Recently I’ve been studying several technologies that appear to form the core of cloud computing. In short, these are the technologies behind such marvels as Amazon, Google, Facebook, Yahoo, Netflix, Pixar, etc.
Since each of these technologies is by itself worthy of a new book, and since even those familiar with the common implementation languages of these technologies (like Java and Python) can find them daunting, I decided to put together all the resources I’ve found on them, in hopes that they will help someone else get started in this fascinating world of distributed or “cloud” computing.
One might wonder why one should take the time to learn these technologies and concepts. That is a fair question, considering the amount of time and energy potentially required to put any of this knowledge to functional use. With that in mind, I found the following videos particularly helpful in answering the question “why should I care?”:
Hadoop is essentially a collection of different projects that make distributed computing a lot less painful. The best sources of beginner’s information on Hadoop I’ve found have been these Google lectures, along with Cloudera‘s training pages:
MapReduce is more of a paradigm than a language. It is a way to write algorithms that can be run in parallel in order to utilize the computing power of a number of computers across a large data set. There are a number of software frameworks that make writing MapReduce jobs a lot easier and in the following videos you will learn how to use some of the most common.
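The paradigm itself fits in a few lines of plain Python (a toy word count, no framework involved): the map step does independent per-document work that could run on any machine, and the reduce step merges partial results, in any grouping, into a final answer.

```python
from functools import reduce
from collections import Counter

def map_doc(doc):
    # Map step: per-document work with no shared state, safe to parallelize.
    return Counter(doc.split())

def merge(a, b):
    # Reduce step: combine two partial results. Because merging is
    # associative, a framework is free to schedule it however it likes.
    a.update(b)
    return a

def word_count(docs):
    return reduce(merge, map(map_doc, docs), Counter())
```

Real frameworks add the parts this toy skips: partitioning the input across machines, shuffling intermediate results by key, and retrying failed workers.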
As with many complex technologies, just setting up a working environment can be a challenge in itself, one big enough to discourage the casual learner. To ease the stress of setting up a general Hadoop environment, and to help you gain some useful hands-on experience with Hadoop and the related cloud technologies, here are a few resources for getting a working Hadoop environment going fairly quickly.
Helpful hint regarding videos: if you are like me and prefer to watch or listen to long lectures in your car, or otherwise on the go on your netbook, iPod, or other mobile device, try looking for the above-mentioned videos on Google Video instead of YouTube. Google Video includes a helpful download link that lets you take a copy of the video with you.