Recently I’ve been studying several technologies that appear to form the core of cloud computing. In short, these are the technologies behind such technological marvels as Amazon, Google, Facebook, Yahoo, Netflix, Pixar, etc. [1]
Since each of these technologies is worthy of a book in its own right, and since even those familiar with their common implementation languages (like Java and Python) can find it hard to know where to begin, I decided to put together all the resources I’ve found on these technologies in hopes that they will help someone else get started in this fascinating world of distributed or “cloud” computing.
Introduction to cloud computing
One might wonder why they should take the time to learn these technologies and concepts. It is a fair question, considering the time and energy potentially required to put any of this knowledge to practical use. With that in mind, I found the following videos particularly helpful in answering the question “why should I care?”:
- Hadoop, Map Reduce, and Big Data Sets Part 1, Part 2
- O’Reilly Webcast: An Introduction to Hadoop
- Computing in the Cloud – Introduction
Hadoop
Hadoop [2] is essentially a collection of different projects that make distributed computing a lot less painful. The best sources of beginner’s information on Hadoop I’ve found have been these Google lectures and Cloudera’s training pages:
MapReduce
MapReduce is more of a paradigm than a language: it is a way of writing algorithms that can be run in parallel across a number of computers in order to process a large data set. There are a number of software frameworks that make writing MapReduce jobs a lot easier, and in the following videos you will learn how to use some of the most common ones (a minimal code sketch follows the list below).
- Cluster Computing and MapReduce Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5
- MapReduce and HDFS
- Introduction to Hadoop at Brandeis University
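To make the paradigm concrete, here is a minimal sketch of the canonical word-count job, written against Hadoop’s newer Java MapReduce API (org.apache.hadoop.mapreduce); the class names and command-line paths are only illustrative. The mapper emits a (word, 1) pair for every token it reads, and the reducer sums those counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for each input line, emit (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : values) {
        sum += count.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this would typically be launched with something along the lines of `hadoop jar wordcount.jar WordCount <input dir> <output dir>`, for example inside one of the ready-made environments described in the next section.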
Quickstart packages
As with many complex technologies, just setting up a working environment can be a challenge in itself, and one discouraging enough to turn away the casual learner. To take some of the stress out of setting up a general Hadoop environment, and to help you gain some hands-on experience with Hadoop and the related cloud technologies, here are a few resources that will get a working Hadoop environment going fairly quickly.
- Introduction to Cloudera’s distribution of Hadoop
- Cloudera’s VMWare training image, perfect for quick access to hands-on examples preconfigured as Eclipse projects. Requires the free VMWare Player, which works great on Linux and Windows.
- OpenSolaris Hadoop LiveCD, which works great in VirtualBox; you can also install the distribution to disk for a more permanent, dedicated development environment.
Helpful hint regarding videos: if you are like me and prefer to watch or listen to long lectures in the car or otherwise on the go on a netbook, iPod, or other mobile device, try looking for the videos mentioned above on Google Video instead of YouTube. Google Video includes a helpful download link that lets you take a copy of the video with you.
1. This article is a continuation of a recent article I wrote on the different approaches to cloud computing taken by Google and Microsoft.
2. Hadoop was actually inspired by Google; more history and background here.