Following my recent post on Hadoop and MapReduce, I want to share a few resources I’ve found in the areas of data mining and statistical analysis. I’ll look into ways of visualizing data later on (including new and improved charting libraries from Google), but this post deals almost exclusively with the question of how to go about understanding and acquiring useful sets of data.
Here is a fairly broad introduction to data mining and its applications.
The best introduction to these subjects I’ve found is a series of “Stats 202” videos done by Stanford professor David Mease:
Statistical Aspects of Data Mining (Stats 202): Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5, Lecture 6, Lecture 7, Lecture 8, Lecture 9, Lecture 10, Lecture 11, Lecture 12, Lecture 13
It may surprise you to find this out, but the easiest and fastest tools to use when starting out are generally spreadsheet applications like Microsoft Excel and OpenOffice’s Calc, which will help you quickly import and visualize your data.
Another popular tool for statistics and data mining is the R Project for Statistical Computing, which is free and has binaries for Windows, Mac, and Linux. R also includes a “sample” function that draws a random subset of your data, so you can extract meaningful results without having to process it all at once.
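To make the sampling idea concrete, here is a rough sketch in Python rather than R. It uses the standard library's `random.sample`, which is similar in spirit to R's “sample” but is not the same function. The helper name `sample_rows` and the made-up data are my own assumptions for illustration only.

```python
import random
import statistics


def sample_rows(rows, k, seed=None):
    """Return k rows drawn at random, without replacement.

    Hypothetical helper: roughly what R's sample() does for a vector,
    but built on Python's standard library.
    """
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(rows, k)


# Made-up data set: 100,000 values cycling through 0..99.
population = [i % 100 for i in range(100_000)]

# Work with 1,000 randomly chosen values instead of all 100,000.
subset = sample_rows(population, 1_000, seed=42)

print("mean of the sample:   ", statistics.mean(subset))
print("mean of the full data:", statistics.mean(population))
```

If the sample is drawn at random, its summary statistics should usually land close to those of the full data set, which is what makes this a cheap first look at a large file.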
Know of any other helpful sites or statistical tools? Post them below!
Helpful hint regarding videos: If you are like me and prefer to watch or listen to long lectures in your car or otherwise on the go on your netbook, iPod, or other mobile device, try looking for the above-mentioned videos on Google Video instead of YouTube. Google Video includes a download link that allows you to take a copy of the movie with you.