Posts Tagged data mining
The setup from Intelligence Squared:
On Christmas Day, 2009, twenty-three-year-old Umar Farouk Abdulmutallab attempted to blow up Northwest Airlines Flight 253 using explosives hidden in his underwear. A string of missed opportunities and errors by government security agencies culminated in what President Obama would declare a “systemic failure.” Is scanning everyone with expensive, high-tech equipment the best use of limited resources? Or should we use the information that we have—the knowledge that, while all Muslims are not terrorists, most terrorists are Muslim.
I think this debate should be re-framed: Should law enforcement use every tool at their disposal, which includes profiling, or should they refrain from using tools that may offend some people?
In the beginning, the moderator concedes the main point: that the majority of recent (within the past decade) terrorist attacks have been committed or attempted by men who have a common tie to Islam. If this is true (and it was never disputed), then including it as a metric is a foregone conclusion.
In fact, the only objections given by the opposition were:
- Judging people based on nationality is not sufficient to determine whether someone is likely to be a terrorist
- Not all terrorists are Muslims
- Not all Muslims are terrorists
- It’s a violation of civil liberties to question a certain group more than others
To these, the following responses were given:
- Religion and race are not the only metrics used and the agents involved aren’t the only ones doing the analysis
- The majority of terrorists in the past decade (or more) have been Muslims
- The size of the overall population is irrelevant; what matters is the statistical likelihood that a terrorist will match the overall profile
- Civil liberties aren’t violated by mere suspicion. They aren’t even violated by extra law enforcement attention (interrogation, scans, etc.)
- It makes us less safe to waste law enforcement resources on “random” searches.
What this debate really highlights is how most people, even supposed “experts”, either don’t understand how statistical analysis works or deliberately choose to misconstrue the facts. It also highlights how our culture’s myopic drive towards political correctness makes us less secure as a result.
Following my recent post on Hadoop and MapReduce, I want to share a few resources I’ve found in the areas of data mining and statistical analysis. I’ll look into ways of visualizing data later on (including new and improved charting libraries from Google); this post, however, deals almost exclusively with the question of how to go about understanding and acquiring useful sets of data.
Here is a fairly helpful broad introduction to data mining and its applications.
The best introduction to these subjects I’ve found is a series of “Stats 202” videos done by Stanford professor David Mease:
Statistical Aspects of Data Mining (Stats 202): Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9, Lecture 10, Lecture 11, Lecture 12, Lecture 13
It may surprise you to find this out, but the easiest and fastest tools to use when starting out are generally spreadsheet applications like Microsoft Excel and OpenOffice’s Calc, which will help you quickly import and visualize your data.
Another popular tool for statistics and data mining is the R Project for Statistical Computing, which is free and has binaries for Windows, Mac, and Linux. R also includes a “sample” function that lets you draw a random subset of your data, so you can get meaningful results without having to process it all at once.
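To give a rough idea of why sampling is useful, here is a minimal sketch in Python rather than R (in R itself the call is `sample(x, size)`). The data set, the `sample_mean` helper, and the sample size are all made up for illustration; the point is only that a statistic computed on a random subset tends to land close to the one computed on the full data.

```python
import random
import statistics


def sample_mean(values, k, seed=None):
    """Estimate the mean of `values` from k items drawn at random
    without replacement. Hypothetical helper, for illustration only."""
    rng = random.Random(seed)        # seeded so the run is repeatable
    subset = rng.sample(values, k)   # k distinct items from the list
    return statistics.mean(subset), subset


# Made-up data set: the integers 1 through 100,000.
data = list(range(1, 100001))

estimate, subset = sample_mean(data, 1000, seed=42)
print(f"full mean:   {statistics.mean(data):.1f}")
print(f"sample mean: {estimate:.1f} (from {len(subset)} of {len(data)} values)")
```

How close the estimate gets depends on the sample size and on how spread out the data is, so treat the subset as a quick first look rather than a replacement for a full pass.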
Know of any other helpful sites or statistical tools? Post them below!
Helpful hint regarding videos: If you are like me and prefer to watch/listen to long lectures in your car or otherwise on the go on your netbook, iPod or other mobile device, try looking for the above-mentioned videos on Google Video instead of YouTube. Google Video includes a helpful download link that allows you to take a copy of the movie with you.