Resources

  • Tools

    • Stanford Network Analysis Platform (SNAP): a general purpose network analysis toolkit.

    • NetworkX: a toolkit for the creation, manipulation, and study of the structure, dynamics, and functions of networks.

    • PyTorch Transformers: a library of pre-trained language models for text representation.

    • Vowpal Wabbit (VW) toolkit: a great toolkit for machine learning.

    • Twitter Scraper: a simple script to crawl tweets; it uses BeautifulSoup to parse the retrieved content.

    • Twitter API: the nuts and bolts of Twitter API.

    • Tweet-User Crawler (streaming API): returns tweets matching a list of keywords “or” user IDs. Requires keys and access tokens from the Twitter API. Here are sample keywords and users files. Can easily be updated to obtain follower/followee relations and tweets; see the tweepy documentation.

    • SPAM: a toolkit for sequential pattern mining.
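The “or” matching described for the Tweet-User Crawler can be sketched offline as below. This is a minimal illustration, not the crawler's actual code: the tweet dictionaries, the field names `user_id` and `text`, and the helper names `matches` and `filter_tweets` are all hypothetical, and no Twitter API or tweepy calls are made.

```python
from typing import Dict, Iterable, List, Set


def matches(tweet: Dict, keywords: Set[str], user_ids: Set[str]) -> bool:
    """Return True if the tweet contains any keyword OR comes from a listed user."""
    text = str(tweet.get("text", "")).lower()
    by_user = str(tweet.get("user_id", "")) in user_ids
    has_keyword = any(keyword in text for keyword in keywords)
    return by_user or has_keyword


def filter_tweets(tweets: Iterable[Dict],
                  keywords: Iterable[str],
                  user_ids: Iterable) -> List[Dict]:
    """Keep tweets that satisfy the keyword-or-user rule (case-insensitive keywords)."""
    keyword_set = {k.lower() for k in keywords}
    id_set = {str(u) for u in user_ids}
    return [t for t in tweets if matches(t, keyword_set, id_set)]


# Hypothetical sample data standing in for a stream of tweets.
sample = [
    {"user_id": 1, "text": "Graph mining with SNAP"},
    {"user_id": 2, "text": "Lunch time"},
    {"user_id": 3, "text": "Nothing relevant"},
]
selected = filter_tweets(sample, keywords=["graph"], user_ids=[2])
```

Here the first tweet is kept because of the keyword and the second because of its author; a real crawler would apply the same rule on the server side through the streaming API's filter parameters.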

  • Datasets

    • Open Graph Benchmark (OGB) datasets: a collection of large-scale and diverse benchmark datasets for machine learning on graphs.

    • Stanford Large Network Dataset Collection: great collection of various network datasets.

    • Stanford Drone Dataset: When humans navigate a crowded space such as a university campus or the sidewalks of a busy street, they follow common sense rules based on social etiquette. This large-scale dataset contains images and videos of various types of agents (pedestrians, bicyclists, skateboarders, cars, buses, and golf carts) that navigate in a real-world outdoor environment such as a university campus.

    • Clickstream: a dataset of Wikipedia clickstreams - showing how people read Wikipedia by tracking the links they click on.

    • The Twitter Stream Grab: large collection of tweets in JSON format crawled from the general twitter stream.

    • Amazon Reviews: large collection of Amazon reviews and metadata from Amazon.

    • Reddit Submissions and Comments: large collection of Reddit posts from Reddit.

    • Churn Dataset: a dataset of tweets about several telco brands as well as the social graph and discussion threads.

    • Movie Corpus: complete scripts of 1068 movies scraped from imsdb.com.

    • SEMEVAL Sentiment Datasets: datasets for twitter sentiment classification.

    • 20 Newsgroups: The text from 20000 messages taken from 20 Usenet newsgroups for text analysis, classification, etc.

    • USDA Branded Food Products Dataset: Contains food description, USDA standardized nutrition facts, and ingredients for a large number of foods.

    • County Health Rankings and Roadmaps data: Aggregates of county-level health factors from a wide range of sources, including the Behavioral Risk Factor Surveillance System, American Community Survey, and the National Center for Health Statistics.

    • U.S. Census State-Based Counties Gazetteer: List of all geographic areas for selected geographic area types - include geographic identifier codes, names, area measurements, and representative latitude and longitude coordinates.

    • OMIM: Online Mendelian Inheritance in Man dataset, a catalog of human genes and genetic disorders containing phenotype-gene relations.

    • FIRE: Fundus Image Registration Dataset

    • Frames: a new human-generated dataset consisting of 19,986 turns that can be used to help train deep-learning algorithms on natural conversations. These text-based conversations were recorded between two humans, simulating the conversation between a vacation seeker and a travel agent.

    • M2M: a dataset of Machine to Machine simulated dialogue.

    • DSTC2&3: A large number of dialogs related to restaurant search and a small amount of labelled data in the tourist information domain.

    • MultiWOZ Corpus: Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics.

    • Geographical Analysis Spatial Data: Contains 3,107 observations on U.S. county votes cast in the 1980 presidential election.

    • German Traffic Signs: German Traffic Sign Detection Benchmark (GTSDB), IJCNN 2011.

    • Google Books Ngrams: n-grams found in sources printed between 1500 and 2008.

    • Hacker News: the comment dump for Hacker News.

    • Hate Speech Identification: A sample of Twitter posts that have been judged based on whether they are offensive or contain hate speech.

    • Hidden Beauty of Flickr Pictures: 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis.

    • High-Resolution Settlement Layer: The High Resolution Settlement Layer (HRSL) provides estimates of human population distribution at a resolution of 1 arc-second (approximately 30m) for the year 2015

    • HMDB51 dataset: a large human motion database, 5.6k videos

    • Human Activity Recognition with Smartphones: Sensor data for recognizing human activity - walking, sitting, etc.

    • ImageNet: The ImageNet project is a large visual database designed for use in visual object recognition software research.

    • IMDB dataset: 50K movie reviews with their sentiment labels.

    • Kepler Data Products: see Automatic Classification of Kepler Planetary Transit Candidates. McCauliff, Sean D., et al. The Astrophysical Journal 806.1 (2015)

    • KITTI Vision Benchmark Suite: Computer vision benchmarks: stereo, flow, odometry, object detection or tracking.

    • Labeled Faces in the Wild: 13,000 named faces for facial recognition.

    • LAKH MIDI Dataset: large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files).

    • LaMem: Large-scale Image Memorability

    • LASIESTA: Labeled and Annotated Sequences for Integral Evaluation of SegmenTation Algorithms

    • LibriSpeech ASR corpus: LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.

    • cancer/tumor image dataset: Tumor and nontumor samples, used to recognize cancer.

    • Militarized Interstate Disputes: Nearly 200 years of international threats, conflicts, etc. for modelling or prediction. Includes action taken, level of hostility, fatalities, and outcomes.

    • MS MARCO: A dataset with 100K questions from real users, passages from web pages that could answer the question, and human generated natural language answers

    • MSCOCO: Image segmentation and object recognition

    • Mushroom Identification: For hypothetically classifying mushrooms as edible or poisonous based on their characteristics.

    • MusicNet: A collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; we estimate a labeling error rate of 4%.

    • National Survey on Drug Use and Health: prevalence and correlates of illicit drugs, alcohol, and tobacco in the United States.

    • COCO-Stuff dataset: COCO-Stuff semantic segmentation dataset

    • NBA & MLB Stats: Current and past season stats for teams and players for fantasy sports predictions.

    • NewsQA: Maluuba's News QA is a new machine reading comprehension dataset for developing algorithms capable of answering questions requiring human-level comprehension and reasoning skills. This dataset of CNN news articles has over 110K Q&A pairs. Questions are written by humans in natural language. Questions may not have answers and answers may be multiword passages.

    • NORB 3D Object Recognition: Binocular images of 50 toy figurines for 3D object recognition from image.

    • North American Bat Ranges: Portrays the current understanding of the distributions of United States and Canadian bat species during the past 100-150 years.

    • Numenta Anomaly Benchmark (NAB): This repository contains the data and scripts comprising the Numenta Anomaly Benchmark (NAB) - a benchmark for evaluating algorithms for anomaly detection in streaming, real-time applications.

    • NYU Depth Dataset: Indoor Segmentation and Support Inference from RGBD Images, ECCV’12.

    • Enron Email Corpus: The Enron Corpus is a large database of over 600,000 emails generated by 158 employees of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse.

    • One Million Songs: Audio features and metadata for a subset (10,000) of the one million popular songs dataset for recognition/classification.

    • Online News Popularity: Statistics associated with articles published by Mashable.

    • Elektra: over 20 different autonomous driving datasets: pedestrians, semantic segmentation, stereo…

    • Cornell Movie Dialogs Corpus: This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts.

    • Oxford Robotcar Dataset: 1 year and approximately 1000km of recorded driving with over 20 million images collected from 6 cameras mounted to the vehicle, along with LIDAR, GPS and INS ground truth. Data was collected in all weather conditions.

    • PoseNet: PoseNet was trained with the Cambridge Landmarks Dataset. This is a large urban relocalisation dataset with 6 scenes from around Cambridge University containing over 12,000 images labelled with their full 6-DOF camera pose.

    • Cityscapes Dataset: Targets semantic understanding of urban street scenes. Great for visual perception applications in automotive industry (ADAS, self-driving).

    • Pratheepan dataset: Human Skin Detection dataset

    • ProductHunt: This dataset contains all Product Hunt users, topics, and a datadump of all posts from 11-24-2014 to 11-23-2016 (as of 5pm 11-23-2016).

    • Record of Heart Sound: Recordings of normal and abnormal heartbeats, used to recognize heart murmur, etc.

    • Residential Fire Fatalities in the News: Between January 1, 2016 and December 20, 2016, 2,158 civilian home fire fatalities were reported by U.S. news media.

    • Broad Bioimage Benchmark Collection (BBBC): Collection of freely downloadable microscopy image sets. In addition to the images themselves, each set includes a description of the biological application and some type of “ground truth” (expected results).

    • Sign Language: A sign language dataset - contrary to popular belief, sign language is not international and these languages are not completely based on the spoken language in the country of origin.

    • Caltech Pedestrian Detection Benchmark: The dataset consists of approximately 10 hours of 640x480 30Hz video taken from a vehicle driving through regular traffic in an urban environment. About 250,000 frames (in 137 approximately minute long segments) with a total of 350,000 bounding boxes and 2300 unique pedestrians were annotated (temporal correspondence between bounding boxes and detailed occlusion labels).

    • SMS Spam Collection: A collection of 5,574 SMS (text) messages, some spam, some normal, for spam filtering.

    • SYNTHIA: 500,000 frames of annotated video from a virtual city. Labels for stereo, optical flow, etc.

    • TED-LIUM: English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM)

    • Air Quality: Air Quality in New York City

    • Endangered Species Act Critical Habitat: Fisheries Data: Critical Habitat for each species.

    • UCF101 dataset: a trimmed video dataset for human action recognition, 13k videos.

    • UFO Reports: 80,000 historic reports for classification or regression. This dataset has been standardized from the source data at nuforc.org.

    • VGG Face Dataset: 2.6 million “in the wild” facial images from 2600 labelled subjects. Only URLs to publicly available images and face bounding boxes provided.

    • 4D Light Field Dataset: A synthetic light field dataset with 24 scenes. Data provided for each scene: 9x9x512x512x3 light fields as individual PNGs; config files with camera settings and disparity ranges. (HCI Heidelberg & CVIA Konstanz).

    • Volcanoes on Venus: Images of small volcanoes from the large set of Venus images collected by the Magellan spacecraft from 1990 to 1994.

    • VQA: visual question answering dataset.

    • Wind: Daily average wind speeds for 1961-1978 at 12 synoptic meteorological stations in the Republic of Ireland. See a short description of the data here.

    • Yahoo Instant Messenger Friends Connectivity Graph: Connections between Yahoo users who communicate with each other using Yahoo messenger, can be used to identify key social contacts/influencers.

    • YouTube Bounding Boxes: A manually annotated video dataset consisting of 5 million bounding boxes spanning 23 object categories, densely labeling segments from 210,000 YouTube videos. The human-labelled annotations contain objects as they appear in the real world with partial occlusions, motion blur and natural lighting.

    • Dataset of Object Scans: Over 10,000 objects densely scanned and reconstructed. Data captured from the real world by non-technical operators.

    • AAN Anthology Network Corpus: Contains more than 20k papers along with venue and author information as well as citation and collaboration networks.

    • Machine Learning Datasets: links to many interesting datasets.

    • UCI ML Repo: UC Irvine Machine Learning Repository, containing more than 400 datasets.

    • Allen Institute for Artificial Intelligence Datasets: Datasets for computer vision, reasoning and inference, question answering, and natural language understanding.

    • Yahoo! WebScope: great collection of graph and social datasets.

    • Bureau of Labor Statistics: Dozens of longitudinal datasets provided by the US Department of Labor (CPI, PPI, employment, population, pay, etc.)

    • Element List Scientific Data Directory: An online repository of links to free, publicly available scientific datasets, mostly from university, industry, and government research programs.

    • Use Google's dataset search tool to discover more datasets!