Datasets

  • UCI Machine Learning Repository – A repository of more than 200 data sets for machine learning and data mining
  • Movie Ratings Data – Real movie ratings data from the www.movielens.org Web site. Contains ratings of 1600+ movies by 1000 users.
  • Kaggle.com Competition Data Sets – Data sets from a variety of competitions. Also a good source for class project ideas.
  • Stanford Large Network Dataset Collection – A variety of network data sets, including data from social networks, product reviews, online communities, etc.
  • Yelp Data Set Challenge – Reviews and check-in data on thousands of businesses.
  • Million Song Dataset – Freely available collection of audio features and metadata for a million contemporary popular music tracks.
  • Public Data sets on Amazon Web Services – Large public data sets (including data sets for US Census, Wikipedia, Freebase, human genome project), ready for big data analytics on the cloud.
  • Data.gov – Publicly available data sets from federal, state, and local governments, including economic, geological, demographic, and many other types of data. The site also lists other Open Data Sites with similar publicly available data from various cities, states, and countries.
  • KDnuggets’ list of data sets for data mining
  • Infochimps Data Market – Thousands of data sets, including data from various social networks and collaborative tagging sites such as Twitter, Delicious, Last.fm, MusicBrainz, as well as data sets from many other domains.
  • Preprocessed DePaul CTI Web Usage Data – Cleaned, filtered, and sessionized data of visits to the main CTI site during a two-week period in April 2002. The data also includes basic statistics on users and sessions.
  • Cleaned DePaul CTI Web Usage Data – The full cleaned CTI Web usage data for April 2002. This data set has been cleaned (including spider removal) and converted into tab-delimited format. However, no user identification, sessionization, or other data preparation steps have been performed.
  • Non-Preprocessed DePaul CTI Web Usage Data – The full CTI Web usage data for April 2002. The only cleaning step performed on this data was the removal of references to auxiliary files (e.g., image files). No other cleaning or preprocessing has been performed. The data is in the original log format used by Microsoft IIS.
  • Data Portal – City of Chicago – Various datasets about the city, maintained by the City of Chicago administration.
  • US Census Bureau – Data portal of the US Census Bureau.
  • mldata.org Machine Learning Dataset Repository – Data portal for an open-source dataset repository. mldata.org’s purpose is to make machine learning methods, datasets, and software reproducible and available to the public.
  • Auton Lab research datasets – Datasets used in papers published by Carnegie Mellon Auton Lab researchers.
  • University of Munich, Department of Statistics datasets – Data portal for datasets used and maintained by the University of Munich. All files have dataset descriptions, variable descriptions, and download links.
  • FAOStat – Data portal of the Food and Agriculture Organization of the United Nations.
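
The DePaul CTI Web usage entries above differ in whether sessionization has been performed. As a rough illustration of what that step involves, the sketch below groups page hits into sessions using an inactivity timeout. It is not the procedure used to prepare the CTI data: the 30-minute threshold, the (visitor_key, timestamp, url) record layout, and the name sessionize are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumed inactivity threshold; a common heuristic, not a value taken from the CTI data.
SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (visitor_key, timestamp, url) hits into sessions.

    visitor_key is a hypothetical stand-in for whatever identifies a visitor
    (for example, IP address combined with user agent).
    """
    sessions = []
    last_seen = {}  # visitor_key -> (time of last hit, index of open session)

    # Process hits per visitor in chronological order.
    for visitor, ts, url in sorted(hits, key=lambda h: (h[0], h[1])):
        prev = last_seen.get(visitor)
        if prev is None or ts - prev[0] > timeout:
            # First hit for this visitor, or the gap exceeded the timeout:
            # start a new session.
            sessions.append({"visitor": visitor, "pages": [url]})
            index = len(sessions) - 1
        else:
            # Still within the timeout: extend the visitor's open session.
            index = prev[1]
            sessions[index]["pages"].append(url)
        last_seen[visitor] = (ts, index)
    return sessions


if __name__ == "__main__":
    # Made-up hits for illustration only.
    sample_hits = [
        ("10.0.0.1", datetime(2002, 4, 1, 9, 0), "/"),
        ("10.0.0.1", datetime(2002, 4, 1, 9, 10), "/courses"),
        ("10.0.0.1", datetime(2002, 4, 1, 11, 0), "/"),
        ("10.0.0.2", datetime(2002, 4, 1, 9, 5), "/news"),
    ]
    for session in sessionize(sample_hits):
        print(session["visitor"], session["pages"])
```

With the cleaned (but not sessionized) data set, a step along these lines would come after user identification; the preprocessed data set already has it applied.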