Archive for category it industry

Governments calling citizens to ditch Internet Explorer

Google was recently hit by an exploit McAfee has named “Aurora”. This exploit involves all versions of Internet Explorer (though version 6 is getting most of the attention) which has prompted the governments of France and Germany to warn it’s citizens not to use Internet Explorer at all.

Microsoft initially tried to claim that this exploit was trivial but has since issued an out-of-cycle emergency patch for all versions of Internet Explorer.

Looks like now is the perfect time to switch to one of the more superior browsers like Chrome or Firefox.

Here’s a video detailing how this hack works in action in case you are like me and interested in the juicy technical details:

  • Share/Bookmark

Tags: , , , ,

Passwords revisited

An analysis of 32million leaked passwords provided some interesting insights into the password selection practices of users. Among the key findings are:

  • The shortness and simplicity of passwords means many users select credentials that will make them susceptible to basic forms of cyber attacks known as “brute force attacks.”
  • Nearly 50% of users used names, slang words, dictionary words or trivial passwords (consecutive digits, adjacent keyboard keys, and so on). The most common password is “123456”.
  • Recommendations for users and administrators for choosing strong passwords.

Also, here are the top 10 most commonly used passwords they found:

1. 123456
2. 12345
3. 123456789
4. Password
5. iloveyou
6. princess
7. rockyou
8. 1234567
9. 12345678
10. abc123

I’ve said it before, the first step in computer security is having a strong password policy.

  • Share/Bookmark

Tags: ,

Taming the blogosphere with Google Reader

What are blogs?

Many of you are wondering what the big deal is with blogs. Well here is a short video on blogs and why they are important/useful:

What’s so great about blogs?

Aside from being able to access specialized information put out on a regular basis, there is one other reason I enjoy reading blogs and consider them to be an essential element in our modern forms of communication.

Blogs help you connect with people.

You learn a lot about someone’s character, thoughts, and passions if you follow what they say on their blog. The trouble is that since blogs are generally authored by one person on individual website it can become time consuming and cumbersome to visit each blog you’re interested in to check for and read any new posts.

How can I keep up with blogs?

The easiest tool I’ve found to help bring a variety of different blogs together into one place is by utilizing the RSS feed offered by most blogs.

Google Reader is a web-based RSS reader which requires a Google account and a little bit of setup, but once you get it going its pretty much automated and will allow you to check a number of blogs without having to spend time visiting each and every website to get updates.

Here is a short video to help you get started with Google Reader:

  • Share/Bookmark

Tags: , ,

Hacking your router for effective internet monitoring

The Why: Preamble

Working in the information technology sector, one of the most common questions I get asked by parents is about monitoring internet access of their children.1

Most parents want to know what their children are doing online but also recognize that most off-the-shelf products are just as easy to disable or circumvent (or are far more restrictive/bloated than they want) as they are to install or operate. And sadly, enterprise solutions that capture and control network traffic at the most basic level (making circumvention next to impossible) is still very expensive and therefore out of reach for the average family.

What I needed was a cheap and hackable router that I could modify to send captured URLs to a central source for storage and processing.

The What: WRT54G

Linksys-WRT54G-Ultimate-HackingAfter studying my options I remembered reading a lot about the Linksys WRT54G-series routers and how they were originally based on a heavily modified version of Linux and how Linksys made headlines when it lost a court case regarding the GPLed code it used in their router’s firmware.

So I did a little digging.

What I found was a whole router-hacking subculture built around the WRT54G. While it seems that much of the initial fervor has subsided, many of the packages show a last update time of 2007 or so, the documentation is still valid for the most part. The most popular projects which provide custom firmware are the OpenWRT and DD-WRT. While OpenWRT is the original, I found DD-WRT to be a lot more polished and (as we’ll see later) configurable without much headache.

It’s important to note here that the WRT54G has many variants and its easy to fall into the trap of thinking that any old WRT54G will do but a little diligence and study of the differences between the hardware revisions will certainly save you time and money.

After buying a few different routers and bricking one (a Buffalo AirStation WHR-HP-54G2 ) and a false start with a newer WRT54G v7 (anyone need a highly configurable, albeit not-very-hackable router?) I discovered that the best router for hacking is the WRT54GL (which was designed by Linksys to allow for user modifications).

The How: URLSnarf and custom shell scripts

Space on a router is very limited. On the WRT54GL model I eventually ended up using I had 4Megs of space to work with.

The first order of business was to find a package that could monitor all of the network connections (wired and wireless) on the router and capture requested URLs. For this task I discovered  that URLSnarf, part of the dsniff OpenWrt package, worked quite well.

To install packages I used DD-WRT’s firmware modification kit which allowed me to simply add the scripts and packages I wanted without having to recompile everything.

Next I needed to transform the captured URL into a URLencoded string in order to send it to my monitoring service via a simple wget request. Initially I tried using several variations of user-generated Python and PHP packages but they both took up far more space than I could afford so, instead, I searched for a pure command-line based solution.

After some more digging I found a handy sed substitution script that worked like a charm. The script worked in two parts, the first one being the substitution script (/usr/bin/urlencode.sed):

s/%/%25/g
s/ /%20/g
s/ /%09/g
s/!/%21/g
s/"/%22/g
s/#/%23/g
s/\$/%24/g
s/\&/%26/g
s/'\''/%27/g
s/(/%28/g
s/)/%29/g
s/\*/%2a/g
s/+/%2b/g
s/,/%2c/g
s/-/%2d/g
s/\./%2e/g
s/\//%2f/g
s/:/%3a/g
s/;/%3b/g
s//%3e/g
s/?/%3f/g
s/@/%40/g
s/\[/%5b/g
s/\\/%5c/g
s/\]/%5d/g
s/\^/%5e/g
s/_/%5f/g
s/`/%60/g
s/{/%7b/g
s/|/%7c/g
s/}/%7d/g
s/~/%7e/g
s/	/%09/g

and the command line to use it:

sed -f urlencode.sed

To tie it all together, we can pass captured URLs to it via pipes from the command-line with:

urlsnarf | sed -f urlencode.sed

At this point, the only missing link of the capture chain is a script to continually read from the command line and send the urlencoded capture data to our storage application (described in the next part). For this task I used the following script (/usr/bin/urlmon.sh):

HOSTNAME=`hostname`
while read url; do
    DATE=`date +%s`
    echo $(wget -q -O- "http://myapp.appspot.com/log?l=$url&h=$HOSTNAME&t=$DATE")
done

exit 0

Finally, we need to have the router start listening for URLs as soon as it is booted. In a Linux environment this is generally done by init scripts. Since our router has limited capabilities, we don’t need to write a full init script. Here is the slimmed down init script I used (/etc/init.d/S50urlmon):

#!/bin/sh

/usr/sbin/urlsnarf -v "/(192.168.1.1|https\://myapp\.appspot\.com)/" | sed -f /usr/bin/urlencode.sed | /usr/bin/urlmon.sh

The Where: Google App Engine

I’ve been itching to try out Google’s App Engine for a while now and this project seemed to be a great fit since I didn’t know how much data to expect and I needed my receiving/processing/display application to be highly available and scalable. Especially if this works well enough that others might want to use it.

Since my initial phase is to merely capture the URLs requested from devices behind the router, and since the capture process should be as efficient and lean as possible (I don’t want the router to take very long logging a URL when it’s primary job is to retrieve that URL for the initial requester) I decided to make a simple Java servlet which simply takes the URLencoded log line generated by URLSnarf.

Google App Engine uses Java Data Objects enhanced by DataNucleus to store data in Google’s massive cluster. Here is the annotated JDO (LogLine.java) I used to store the captured URL:

import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.IdentityType;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;

@PersistenceCapable(identityType = IdentityType.APPLICATION)
public class LogLine {
	@PrimaryKey
	@Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
	private Long id;

	@Persistent
	private String host;

	@Persistent
	private Long time;

	@Persistent
	private String line;

	public void setId(Long id) {
		this.id = id;
	}

	public Long getId() {
		return id;
	}

	public void setLine(String line) {
		this.line = line;
	}

	public String getLine() {
		return line;
	}

	public String getHost() {
		return host;
	}

	public void setHost(String host) {
		this.host = host;
	}

	public Long getTime() {
		return time;
	}

	public void setTime(Long time) {
		this.time = time;
	}
}

And here is the servlet that processes the GET request (containing the captured URL in Apache Common Log format)

import java.io.IOException;
import java.net.URLDecoder;

import javax.jdo.PersistenceManager;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.werxltd.webmon.data.LogLine;

public class Log extends HttpServlet {
	private final static long serialVersionUID = 3;

	public void doGet(HttpServletRequest req, HttpServletResponse resp)
    	throws IOException {
			try {
				resp.setContentType("text/plain");

				LogLine logline = new LogLine();
				String logStr = URLDecoder.decode(req.getParameter("l"));
				logline.setLine(logStr);
				logline.setHost(req.getParameter("h"));
				logline.setTime(Long.parseLong(req.getParameter("t")));

				PersistenceManager pm = PMF.get().getPersistenceManager();
				pm.makePersistent(logline);
				pm.close();	

				resp.getWriter().println("OK");
			} catch (Exception e) {
				e.printStackTrace();
				resp.getWriter().println("FAIL");
			} finally {

			}

	}
}

The future

This project is still in it’s early stages. There is no real way to view the captured data just yet, though I plan on incorporating Polliwog, and the router software hasn’t been tested as much as I would like. I’m also leery of any security holes I may have introduced.

So if you have any suggestions or would like to know more, feel free to leave a comment below!

  1. Most actually ask about “controlling what their kids see online” but I generally argue for a observe-only approach as it helps open lines of communication with your child whereas silently blocking “bad” sites will only start a silent war which will only frustrate you once they do find a suitable workaround, such as a proxy. []
  2. I might have had better luck had I seen this helpful guide. Oh well, this gives me a future project in figuring out how to de-brick my WHR-HP-54G []
  • Share/Bookmark

Tags: , , , , , , ,

Beginner’s guide to load testing

Recently I got tasked with load testing an internal system and producing statistics for the team to show how well it will scale once it is put into production.

After some intense research I decided to go with “The Grinder” which allows multiple tests to be run by multiple machines which can all funnel their collected statistics back up to a central “console”. Tests are written in Python which, in turn, gets fed through Jython and converted into native Java bytecode to be run by participating Grinder agent instances. Grinder works on a single, user-definable port, for both pushing scripts to listening agents as well as gathering statistics from tests.

Initially I decided to try and capture the results of each individual test in a MySQL database but abandoned that idea when the tests ended up overloading the MySQL database server before the web app we were primarily testing. Logging results also proved to be an interesting feat since it swampped the agent’s filesystems after less than an hour (we were running multiple processes and threads) as well.

We eventually settled on simply capturing the combined statistical data at the root console level (the way Grinder is designed) and displaying it via a plugin in Hudson.

Overall Grinder worked great for testing the load of our web app (which passed with flying colors). And since Grinder works natively in Java we are also planning on testing specific Java classes directly in the future as well as their overall performance through a web based front-end such as a servlet.

  • Share/Bookmark

Tags: , , ,

Social justice and technology

Recently I attended the Atlanta Linux Fest where a keynote (deceptively entitled “Standing Out in the Crowd“) was given, and I still don’t understand why they felt the need to make this a keynote when there was already a workshop scheduled on the same subject. I, and others at the conference, tweeted and tried to make known that we didn’t appreciate the importation of social justice concerns into an area that, by all rights and purposes, ought to be coldly logical.

At first I thought this was merely a fluke. That some random feminist organization managed to persuade the organizers of the event into allowing them to come and try and persuade us that we are somehow all intolerant and misogynistic simply because fewer1 women are members. However, after seeing an article to the same effect on Slashdot recently I started looking into the so-called feminist open-source movement and found it to be neither new nor isolated.

Now while I have no problem with people pursuing their own agendas per se I do have a big problem when it comes to pursuing an agenda that is purely fabricated and detrimental to what it is purporting to help. Feminism, for example, has been a corrosive cancer to every single thing it has touched and I hold no illusions to what sort of effects it will have in the realm of science and technology. I also have a big problem when, in the name of tolerance and equality, feminism manages to wield a giant club beating down any dissenters in an effort to make everyone get in line. Get in line with what, though? The isolated incidents of sexism and questionable jokes/comments are not enough to convince me that the issues that need to be address fall primarily within the domain of gender-equality. As one attendee of the Atlanta Linux Fest so poignantly put it: “Aren’t these problems really human problems as opposed to problems that only women face?”

Social justice seems to be a pretty effective means of “turning the tables” by being just as biased as those you are decrying without fear of reprisal. After all, who’s going to dare speak out against the supposed victims? Interestingly enough, this whole movement has also found it’s way into the current US administration’s agenda.

When they said they were going to champion science and technology they failed to mention they were only interested in the social aspect, as if that is what is harming the competitiveness of the US tech industry.

  1. Which is still more than none. []
  • Share/Bookmark

Tags: , , , , ,

Diskless computing vs distributed computing

A friend of mine recently asked me about cloud computing, what it was, and the ramifications of it on where we will see technology in the coming years. In his question he demonstrated a common confusion among most people between the difference between cloud computing and diskless computing.

Both of these are interesting areas of computer science, they do sometimes overlap, and they are both going to change computing in general in significant ways as time rolls on, but they are not the same.

Here’s are the differences to help  you can tell them apart.

Diskless computing

Diskless computing is best demonstrated in the Linux Terminal Server Project (excellent project, I’ve use it before to deploy over 150 diskless workstations in a company before) and Microsoft’s pathetic rival, Windows Terminal Services. Sun has their own solution as well and there are countless 3rd party utilities, but the basic idea behind them all is that you have one big computer (or series of computers) that all these “headless” computers connect to in order to retrieve an operating system, store files, etc. For large networks this network model is absolutely amazing.

Cloud Computing

Cloud computing, however, is the concept that you have a large problem that requires a lot of computing power to solve. Rather than buy bigger and bigger hardware, what we’ve found out (going back to Cray supercomputers) is that it is far better to split the problem down into iterative chunks and push those through multiple processors all at once rather than try to get a single processor to process everything. This is called distributed computing.

You might have heard of one of the major platforms for this type of computing, Beowulf, from the popular internet meme “imagine a beowulf cluster of…” Another very popular distributed computing platform (popular because it is far easier to install, operate, and write code for than the Beowulf project) is Hadoop. Hadoop is a project inspired by Google’s implementation of the MapReduce design paradigm written in Java which makes it a lot more portable.

Projects using Cloud Computing

Parallel processing is done today in a wide variety of settings including:

  • 3D rendering farms for companies such as Disney’s Pixar
  • indexing the web with Google, Yahoo, Microsoft, etc.
  • data mining of all sorts with companies like Wal-Mart, etc.

Join in!

There are some very popular projects using distributed computing technologies that regular people with CPU cycles to spare are encouraged to join in on like:

  • SETI@home where you can help process data that might help us identify extraterrestrial signals
  • Folding@home where you can help search for cures to various diseases
  • Genome@home where you can help map the human genome (again), this is tied closely to the folding@home project above
  • Shrek@home which was a pioneer project that a few of us got to participate in
  • others, including fightaids@home to help fight AIDS and lhc@home to process the massive amounts of data coming from the CERN’s Large Hadron Collider

So while diskless computing and cloud computing can have some areas of overlap (I configured the LTSP network I mentioned earlier to assist with the genome@home project when the systems were idle) they aren’t necessarily tied together.

  • Share/Bookmark

Tags: , , , , , , , ,

An introduction to statistics and data mining

Following my recent post on Hadoop and MapReduce, I want to share a few helpful resources I’ve found in the areas of data mining and statistical analysis. I’ll look into helpful ways of visualizing data later on (including new/improved helpful charting libraries from Google), however this post will deal almost exclusively with the question of how to go about understanding and acquiring helpful sets of data.

Introduction

Here is a fairly helpful broad introduction to data mining and it’s applications.

Crash course

The best introduction to these subjects I’ve found are a series of “Stats 202″ videos done by Stanford professor David Mease:
Statistical Aspects of Data Mining (Stats 202): Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9, Lecture 10, Lecture 11, Lecture 12, Lecture 13

Tools

It may surprise you to find this out, but the easiest and fastest tools to use when starting out are generally spreadsheet applications like Microsoft Excel and OpenOffice’s Calc which will help you quickly import and visualize your data.

However, another popular tool for statistics and data mining is the R Project for Statistical Computing which is free and has binaries for Windows, Mac, and Linux. R also includes a helpful “sample” function to help you extract meaningful results from a subset of your data without having to process it all at once.

Know of any other helpful sites or statistical tools? Post them below!

Helpful hint regarding videos: If you are like me and prefer to watch/listen to long lectures in your car or otherwise on the go on your netbook, iPod or other mobile device.  Try looking for the above mentioned videos on Google Video instead of YouTube. Google Video includes a helpful download link that allows you to take a copy of the movie with you.

  • Share/Bookmark

Tags: , , ,

Getting started with Hadoop and MapReduce

Recently I’ve been studying several technologies that appear to form the core of cloud computing. In short, these are the technologies behind such technological marvels as Amazon, Google, Facebook, Yahoo, NetFlix, Pixar, etc.1

Since each of these technologies by themselves is worthy of a new book, and since even those familiar with the common implementation languages of these technologies (like Java and Python), I decided to put together all the resources I’ve found on these technologies in hopes that they will help someone else get started in this fascinating world of distributed or “cloud computing”.

Introduction to cloud computing

One might wonder why they should take the time to learn these technologies and concepts. A fair question to ask considering the amount of time and energy that will potentially be required in order to put any of this knowledge to any functional use. With that in mind I found the following videos particularly helpful in answering the question “why should I care?”:

Hadoop

Hadoop2 is essentially a compilation of a number of different projects  that make distributed computing a lot less painful. The best source of beginner’s information on Hadoop I’ve found has come from these Google lectures as well as from Cloudera’s training pages:

MapReduce

MapReduce is more of a paradigm than a language. It is a way to write algorithms that can be run in parallel in order to utilize the computing power of a number of computers across a large data set. There are a number of software frameworks that make writing MapReduce jobs a lot easier and in the following videos you will learn how to use some of the most common.

Quickstart packages

As with many complex technologies, just setting up a working environment can be a challenge in itself. One that is enough to discourage the causal learner. To help alleviate the stress of setting up a general Hadoop environment to help you start working with Hadoop and the related cloud technologies, as well to help you gain some useful hands-on experience, here are a few resources to help you get a working Hadoop environment going fairly quickly.

Helpful hint regarding videos: If you are like me and prefer to watch/listen to long lectures in your car or otherwise on the go on your netbook, iPod or other mobile device.  Try looking for the above mentioned videos on Google Video instead of YouTube. Google Video includes a helpful download link that allows you to take a copy of the movie with you.

  1. This article is a continuation of a recent article I wrote on the different approaches to cloud computing taken by Google and Microsoft []
  2. Hadoop was actually inspired by Google, more history and background here. []
  • Share/Bookmark

Tags: , , , , , ,

Ubuntu netbook remix review

Spurred on by a friend of mine, I recently converted my Acer Aspire One netbook over to Ubuntu Linux, specifically the Netbook Remix version.

Before I made the switch permanent, I was able to try out a “live” version of the Ubuntu system and was pleased to find that it not only picked up all the hardware in my netbook properly but that it also picked up the Bluetooth nano device I use to connect my mouse, phone, headphones, etc. I was also pleasantly surprised to find that proprietary things like mp3  playback and video codecs weren’t hard to find and install.

I was so impressed that I decided to switch my laptop over, even though the contract I am currently on requires Windows-specific plugins. What I found, though, is that even installing the Linux version of Paralells, Sun’s VirtualBox, was relatively painless.

I’ve always loved Debain but have shied away from it as a primary operating system (apart from servers, that is) because of my memories of poor hardware support in Linux distributions of the past. Not any more, though! I think Ubuntu (as well as other distros) are proving they have what it takes to compete in a real way.

The only thing left to figure out, however, is what I’m going to do to sync my phone (Treo 755p) and figure out how to use my phone as a modem via bluetooth. Oh well, what’s life without a challenge?

  • Share/Bookmark

Tags: , , , , ,