"The future belongs to the companies and people that turn data into products," says Mike Loukides on O’Reilly Radar. His article "What is data science?" is an interesting read and can be found at http://oreil.ly/dknxJV.
I’ve used some of the tools he mentions, such as the Python programming language and the Beautiful Soup library, to clean up HTML. They have allowed me to deliver analytics that combine data from multiple sources on the internet to make projections about the future. In one customer assignment, I combined Australian Federal Government data with State-based population projections to build a rich data set about future market shares. This was all done with Python and Beautiful Soup on my MacBook Pro. I didn’t need my own database or data warehouse, as I was working with thousands of pieces of summary data that were readily available over the internet.
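As a rough illustration of that kind of workflow (not the code from the customer assignment), the sketch below pulls figures out of an HTML table with Beautiful Soup and combines them with a second source to estimate future shares. The HTML fragment, the region names, the growth factors and the helper names `parse_table` and `project_shares` are all made up for the example, and it assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a published statistics page.
HTML = """
<table>
  <tr><th>Region</th><th>Population</th></tr>
  <tr><td>North</td><td>1,200</td></tr>
  <tr><td>South</td><td>800</td></tr>
</table>
"""

# Hypothetical growth factors standing in for a second data source.
GROWTH = {"North": 1.10, "South": 1.25}


def parse_table(html):
    """Return {region: population} from a two-column HTML table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = {}
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 2:  # the header row has no <td> cells, so it is skipped
            rows[cells[0]] = int(cells[1].replace(",", ""))
    return rows


def project_shares(current, growth):
    """Apply growth factors, then express each region as a share of the total."""
    projected = {r: current[r] * growth[r] for r in current if r in growth}
    total = sum(projected.values())
    return {r: value / total for r, value in projected.items()}


current = parse_table(HTML)
shares = project_shares(current, GROWTH)
for region, share in shares.items():
    print(f"{region}: {share:.1%}")
```

With a few thousand summary figures, plain dictionaries like these are enough, which is why no database was needed for this sort of job.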
In other work, I’ve used Apache Hadoop and its MapReduce framework to process Australian financial market statistics covering the full trading-day history of all 2,000+ companies listed on the ASX. I’ve also recently investigated Apache Mahout and its machine learning capabilities, and I am in the process of learning Apache Pig and Apache Hive to store and process data on top of Apache Hadoop.
All of this software is free and open source, and it scales to process large volumes of data on commodity infrastructure.
However, strong analysis and programming skills are required. I’m also working on improving my knowledge of the statistics that are pertinent to these endeavours. In the past I’ve found the O’Reilly book Programming Collective Intelligence to be excellent.
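To give a feel for the MapReduce style of processing mentioned above, here is a minimal sketch in plain Python that mimics the map, shuffle/sort and reduce phases in a single process. It does not use Hadoop itself, and the record layout (ticker, date, close, volume), the sample values and the function names are assumptions made up for the illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical daily trading records: "ticker,date,close,volume".
LINES = [
    "AAA,2011-01-04,10.00,500",
    "BBB,2011-01-04,2.50,1200",
    "AAA,2011-01-05,10.40,700",
    "BBB,2011-01-05,2.45,300",
]


def mapper(line):
    """Map phase: emit (ticker, (close, volume)) for one input record."""
    ticker, _date, close, volume = line.split(",")
    yield ticker, (float(close), int(volume))


def reducer(ticker, values):
    """Reduce phase: summarise every value seen for one ticker."""
    values = list(values)
    return ticker, {
        "total_volume": sum(volume for _close, volume in values),
        "max_close": max(close for close, _volume in values),
    }


def run_job(lines):
    """Run map, then sort by key (the shuffle), then reduce each key group."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))
    return dict(
        reducer(key, (value for _key, value in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    )


summary = run_job(LINES)
for ticker, stats in summary.items():
    print(ticker, stats)
```

On a real cluster the mapper and reducer would run in parallel across many machines, with the framework handling the sort and the grouping by key; the shape of the two functions stays much the same.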
I agree with the Hal Varian quote also mentioned in Mike’s post "The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades."