If you are just starting to work with Hadoop and Big Data, you may be at a loss for data to experiment with. Luckily, there is an abundant supply of freely available data sets on the Internet. Here I will highlight a few of the sources I have found, and I’ll add more as I find them.
InfoChimps is a company of data scientists, cloud computing experts, and open source experts who help their customers build Big Data platforms. They provide over 11,000 freely available data sets for you to download. Everything from an Excel-readable list of crossword puzzle words to UFO sighting data sets is here.
Interested in movie information and movie review data sets? GroupLens (a research lab in the Department of Computer Science and Engineering at the University of Minnesota, Twin Cities) has compiled movie review data sets of varying sizes, drawn from a large number of reviews. Other recommender data sets on different topics are also available. Check it out here.
Another interesting website is theinfo.org, where you can download large numbers of public records. Organization is almost non-existent, and there is no search function. You click on a couple of dots and are presented with court documents. It is interesting if you need random data for a project.
Finally, one of the largest repositories of freely available data sets is provided by the US Government. It encompasses over 100,000 data sets, on subjects ranging from Real-time 911 Fire Calls in Seattle to a cross reference of domestic and foreign companies doing business with the US Government. It’s a treasure trove of haystacks with numerous needles ready for you to discover.
Updated 8/22/2014 – I just found out about a website offering over 15,000 data sets of public information to help people learn how the UK government works. Available at data.gov.uk, the data sets span several broad categories and come in various formats. A pretty cool feature is that the site also offers links to apps built using these data sets.
If you have any other sites you’ve found, drop me a line on the contact page, and I’ll include them in a future post!