Posts Tagged distributed computing

Diskless computing vs distributed computing

A friend of mine recently asked me about cloud computing: what it is, and how it will shape technology in the coming years. His question demonstrated a confusion common among most people: the difference between cloud computing and diskless computing.

Both of these are interesting areas of computer science, they do sometimes overlap, and both are going to change computing in significant ways as time rolls on, but they are not the same.

Here are the differences to help you tell them apart.

Diskless computing

Diskless computing is best demonstrated by the Linux Terminal Server Project (an excellent project; I’ve used it to deploy over 150 diskless workstations at a company) and Microsoft’s pathetic rival, Windows Terminal Services. Sun has its own solution as well, and there are countless 3rd-party utilities, but the basic idea behind them all is that you have one big computer (or series of computers) that all these “headless” computers connect to in order to retrieve an operating system, store files, etc. For large networks this network model is absolutely amazing.
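The boot sequence behind a diskless workstation is roughly: the machine network-boots (PXE), gets an address and a boot-file name from DHCP, pulls a boot loader and kernel over TFTP, then mounts its root filesystem from the server. A minimal dhcpd.conf fragment for an LTSP-style setup might look like the following; the addresses, subnet, and paths here are illustrative examples, not a tested configuration:

```
# Illustrative dhcpd.conf fragment for PXE-booting diskless clients.
# All addresses and paths are examples only.
subnet 192.168.0.0 netmask 255.255.255.0 {
    range 192.168.0.100 192.168.0.200;
    option root-path "/opt/ltsp/i386";   # NFS root the clients mount
    next-server 192.168.0.1;             # TFTP server holding the kernel
    filename "/ltsp/i386/pxelinux.0";    # network boot loader
}
```

Every client boots the same image from the server, which is why adding workstation number 151 is no harder than adding number 2.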

Cloud Computing

Cloud computing, however, is the concept that you have a large problem that requires a lot of computing power to solve. Rather than buy bigger and bigger hardware, what we’ve found out (going back to Cray supercomputers) is that it is far better to break the problem down into smaller chunks and push those through multiple processors all at once than to try to get a single processor to process everything. This is called distributed computing.
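The chunk-and-combine idea can be sketched in a few lines of Python. This is a toy illustration (summing a list across worker processes on one machine), not a real cluster, but the shape is the same: split the job, process the pieces in parallel, combine the partial results.

```python
# Toy illustration of distributed computing's core idea:
# split one big job into chunks and process them in parallel.
from concurrent.futures import ProcessPoolExecutor

def chunks(data, n):
    """Split `data` into `n` roughly equal slices."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum(chunk):
    # Each worker handles one chunk independently.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Combine the partial results into the final answer.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks(data, workers)))

if __name__ == "__main__":
    numbers = list(range(1_000_000))
    print(parallel_sum(numbers))  # same answer as sum(numbers)
```

On a real cluster the chunks would be shipped to other machines instead of other processes, but the split/process/combine structure is identical.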

You might have heard of one of the major platforms for this type of computing, Beowulf, from the popular internet meme “imagine a beowulf cluster of…” Another very popular distributed computing platform (popular because it is far easier to install, operate, and write code for than Beowulf) is Hadoop. Hadoop is a project inspired by Google’s implementation of the MapReduce design paradigm, and it is written in Java, which makes it a lot more portable.
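The MapReduce paradigm itself fits in a few lines. Here is a minimal in-memory sketch in Python of the classic word-count example: a map step emits (key, value) pairs, a shuffle step groups them by key, and a reduce step folds each group. Hadoop does the same three steps, just spread across many machines.

```python
# Minimal in-memory sketch of the MapReduce paradigm (word count).
from collections import defaultdict

def map_phase(docs):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in docs:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each group of values into one result per key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
```

Because each map call and each reduce call is independent, Hadoop can run thousands of them at once on different nodes.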

Projects using Cloud Computing

Parallel processing is done today in a wide variety of settings including:

  • 3D rendering farms at companies such as Disney’s Pixar
  • indexing the web at Google, Yahoo, Microsoft, etc.
  • data mining of all sorts at companies like Wal-Mart

Join in!

There are some very popular projects using distributed computing technologies that regular people with CPU cycles to spare are encouraged to join, like:

  • SETI@home, where you can help process data that might help us identify extraterrestrial signals
  • Folding@home, where you can help search for cures to various diseases
  • Genome@home, where you can help map the human genome (again); this is tied closely to the Folding@home project above
  • Shrek@home, a pioneering project that a few of us got to participate in
  • others, including fightaids@home to help fight AIDS and lhc@home to process the massive amounts of data coming from CERN’s Large Hadron Collider

So while diskless computing and cloud computing can have some areas of overlap (I configured the LTSP network I mentioned earlier to assist with the genome@home project when the systems were idle) they aren’t necessarily tied together.



An introduction to statistics and data mining

Following my recent post on Hadoop and MapReduce, I want to share a few helpful resources I’ve found in the areas of data mining and statistical analysis. I’ll look into helpful ways of visualizing data later on (including new and improved charting libraries from Google); this post deals almost exclusively with how to go about understanding and acquiring helpful sets of data.


Here is a fairly helpful broad introduction to data mining and its applications.

Crash course

The best introduction to these subjects I’ve found is a series of “Stats 202” videos done by Stanford professor David Mease:
Statistical Aspects of Data Mining (Stats 202): Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9, Lecture 10, Lecture 11, Lecture 12, Lecture 13


It may surprise you, but the easiest and fastest tools to use when starting out are generally spreadsheet applications like Microsoft Excel and OpenOffice’s Calc, which will help you quickly import and visualize your data.

However, another popular tool for statistics and data mining is the R Project for Statistical Computing, which is free and has binaries for Windows, Mac, and Linux. R also includes a helpful “sample” function to help you extract meaningful results from a subset of your data without having to process it all at once.
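The same sampling idea works in Python, if you’d rather not pick up R yet: draw a random subset and compute your statistic on that instead of on the whole data set. A small sketch (the data here is a stand-in, and the 1% sample size is just an example):

```python
# Sampling a subset instead of processing the whole data set,
# the same idea as R's sample() function.
import random

random.seed(42)  # fixed seed so the sketch is repeatable
population = list(range(1_000_000))         # stand-in for a big data set
subset = random.sample(population, 10_000)  # 1% sample, without replacement

true_mean = sum(population) / len(population)
sample_mean = sum(subset) / len(subset)
print(round(sample_mean))  # lands very close to the true mean of 499999.5
```

With a sample this size the estimate is typically within a fraction of a percent of the true value, at 1% of the processing cost.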

Know of any other helpful sites or statistical tools? Post them below!

Helpful hint regarding videos: If you are like me and prefer to watch/listen to long lectures in your car or otherwise on the go on your netbook, iPod, or other mobile device, try looking for the above-mentioned videos on Google Video instead of YouTube. Google Video includes a helpful download link that allows you to take a copy of the movie with you.


Getting started with Hadoop and MapReduce

Recently I’ve been studying several technologies that appear to form the core of cloud computing. In short, these are the technologies behind such technological marvels as Amazon, Google, Facebook, Yahoo, NetFlix, Pixar, etc.[1]

Since each of these technologies by itself is worthy of a new book, and since even those familiar with the common implementation languages of these technologies (like Java and Python) can struggle to know where to begin, I decided to put together all the resources I’ve found on these technologies in hopes that they will help someone else get started in this fascinating world of distributed or “cloud” computing.

Introduction to cloud computing

One might wonder why they should take the time to learn these technologies and concepts. It’s a fair question, considering the amount of time and energy potentially required to put any of this knowledge to functional use. With that in mind, I found the following videos particularly helpful in answering the question “why should I care?”:


Hadoop[2] is essentially a compilation of a number of different projects that make distributed computing a lot less painful. The best source of beginner’s information on Hadoop I’ve found has come from these Google lectures as well as from Cloudera’s training pages:


MapReduce is more of a paradigm than a language: a way to write algorithms that can be run in parallel in order to utilize the computing power of a number of computers across a large data set. There are a number of software frameworks that make writing MapReduce jobs a lot easier, and in the following videos you will learn how to use some of the most common.
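One of the simplest of these frameworks is Hadoop Streaming, which runs any program that reads lines on stdin and writes "key<TAB>value" lines on stdout, so a whole MapReduce job can be two small Python scripts. The sketch below shows both halves as functions over line iterators; in a real job each would be its own script, wired together with something like `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py` (the script names are illustrative).

```python
# Word count in the Hadoop Streaming style:
# mapper emits "word<TAB>1" lines; reducer sums runs of equal keys.
import sys

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    # Hadoop sorts mapper output by key before the reducer sees it,
    # so equal keys arrive in one contiguous run.
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # As the mapper script; the reducer script would call reducer() instead.
    for out in mapper(sys.stdin):
        print(out)
```

The appeal is that you never touch Java or the Hadoop APIs directly; any language that can read stdin and write stdout will do.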

Quickstart packages

As with many complex technologies, just setting up a working environment can be a challenge in itself, and one that is enough to discourage the casual learner. To ease the stress of setting up a general Hadoop environment, and to help you gain some useful hands-on experience with Hadoop and the related cloud technologies, here are a few resources to help you get a working Hadoop environment going fairly quickly.


  1. This article is a continuation of a recent article I wrote on the different approaches to cloud computing taken by Google and Microsoft.
  2. Hadoop was actually inspired by Google; more history and background here.
