Machine Learning Datasets: Where to Find Them and What to Look For
Table of Contents
- Free or Not? Sort Machine Learning Datasets by Availability
- Datasets for Machine Learning Categorized by the Types of Data
- Datasets for Machine Learning Based on the Types of ML and Annotation Tasks
- Large and Small Specific Types of Datasets for Machine Learning
- A Few More Words About Datasets for Machine Learning: Let’s Recap
In a recent article, we talked about ML datasets: what they are and why they are a crucial component of any AI project. We also covered the major features of great datasets and why both quality and quantity matter.
In a nutshell, machine learning datasets are required to properly train an ML algorithm for accurate prediction tasks of different difficulty levels. There are a variety of purposes for collecting (or finding) a good ML dataset. Yet it’s worth remembering that such datasets are usually quite expensive and hard to come by. For supervised and semi-supervised learning, a dataset should additionally be annotated. Unsupervised learning does not require data labeling, but creating high-quality datasets for it may still be costly and labor-intensive.
In this article, we’ll take a step further and give a short (yet hopefully exhaustive) summary of some popular and interesting datasets for machine learning complete with links and a feature overview for your convenience.
Free or Not? Sort Machine Learning Datasets by Availability
When talking about the lists of machine learning datasets, the first option that comes up is usually public datasets. These are open-source datasets for machine learning that are free to use by anyone for the purposes of ML research or training the algorithms for AI projects.
Despite the initial instinct, such sample datasets for machine learning are not rare. There are a variety of datasets available for the public, and there are resources that specialize in collecting such datasets and offering them for free download and use. It’s a simple and easy way to get the data you need to train and test your ML model, which is perfect for beginners.
However, while these datasets are free and require little to no effort to collect, they are usually either too basic or, on the contrary, quite unique, because they were initially collected for other purposes. This means that they can rarely suit a specific AI project as is. Your project may benefit greatly from using a basic free dataset to practice machine learning, but the dataset might still need certain tweaks and changes.
Pro tip: the major concern here is to make sure that preparing a free dataset for machine learning will not take up more time and resources than collecting a fitting one from scratch.
Still, looking through the databases of such sets is always a good place to start your search. Here are a few suggestions on where you could find online ML datasets:
- The US 1, US 2, EU, and UK data projects are official sources that offer a wide variety of data from US, EU, and UK institutions. The fields include but are not limited to economics, education, health care, science, and the environment. However, while the data is official, it may be incomplete and require further research.
- For a wider range of general datasets, CMU offers quite a long list of datasets, big and small. On the other hand, if you need an interesting dataset for your machine learning project, try looking at Kaggle as it offers a wide range of niche datasets that will surprise you with their variety. But take care to check how clean your dataset of choice is: these datasets are commonly user-contributed.
- For flexible, comprehensive, and convenient research, go to well-known resources like Google Dataset Search, AWS Registry of Open Data, Microsoft Azure Public Datasets, Reddit, or GitHub. These repositories offer large collections that arguably cover a wide range of machine learning purposes, including quite unusual ones.
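Checking how clean a user-contributed dataset is can start with something as simple as counting duplicate rows and empty cells. The sketch below is only an illustration: the CSV content, the column names, and the `audit_rows` helper are made up for the example, and it assumes every row has the same number of fields as the header.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded, user-contributed CSV file.
SAMPLE_CSV = """id,age,city
1,34,Kyiv
2,,London
3,29,
2,,London
4,41,Chicago
"""

def audit_rows(rows):
    """Count total rows, exact duplicate rows, and empty cells per column."""
    seen = set()
    duplicates = 0
    missing = {}
    total = 0
    for row in rows:
        total += 1
        # A row counts as a duplicate if every field matches an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
        for column, value in row.items():
            if value is None or value.strip() == "":
                missing[column] = missing.get(column, 0) + 1
    return {"rows": total, "duplicates": duplicates, "missing": missing}

report = audit_rows(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(report)
```

For a real file, the same function could be fed `csv.DictReader(open(path, newline=""))`. If the share of duplicates or empty cells is high, that is a hint that cleaning the dataset may cost more than expected.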
Now that we’ve covered a few of the benchmark dataset repositories, let’s dig a little deeper and find out how you can narrow your search for a dataset depending on (1) the type of data that you need and (2) the type of ML or labeling task that you want the dataset to support.
Datasets for Machine Learning Categorized by the Types of Data
There are different datasets that cater to your needs, depending on the type of data that you require (image, video, text, audio, or other). We’ll list only a few of the best-known datasets and those we like best, since there are far too many to cover them all.
Pro tip: most of these datasets can also be used for deep learning tasks, which makes them invaluable for modern AI projects.
Computer Vision: Image and Video ML Datasets
- ImageNet is kind of a fall-back option for a machine learning dataset if you’re starting to work on a computer vision task.
- Open Images from Google offers a generous choice of annotated images with over 6k labels.
- COIL100 is an interesting dataset of 100 objects captured from different angles, comprising together a 360-degree view of each object.
- xView, a massive dataset for overhead images.
- Color detection dataset for over 860 color names.
- MNIST, a famous dataset of handwritten digits that can be separated into its own category as it boasts extensive popularity for the purposes of training and testing ML algorithms.
- Kinetics-700, a video dataset of roughly 650k YouTube clips covering 700 human action classes, and YouTube-8M, which offers over 6 million videos.
Natural Language Processing: Text and Audio Datasets
- The Big Bad NLP database from Quantum Stat covers quite a few popular NLP tasks and is regularly updated, which is crucial for keeping up with the advances in deep learning.
- Linguistic datasets for different languages like Chinese, Hindi, Dutch, Vietnamese, Russian, Turkish, Portuguese, etc. There are also speech recognition datasets that cover several languages at once.
- Sound datasets to recognize urban noises, animal sounds, and music.
- Platform and brand-based NLP datasets from Google, Gutenberg, Wikipedia, Amazon, Jeopardy, etc.
You might also be looking for a specific format to download, like JSON datasets, which is also a great way to sort through the available and relevant datasets.
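Once you have a JSON dataset, a quick look at its fields and label distribution shows whether it fits your task. The following is a minimal sketch using only the standard library. The JSON Lines sample and the `text` and `label` field names are assumptions made for illustration, not the layout of any particular dataset.

```python
import json

# Hypothetical JSON Lines sample: one labeled text record per line.
SAMPLE_JSONL = """\
{"text": "Great battery life", "label": "positive"}
{"text": "Stopped working in a week", "label": "negative"}
{"text": "Does the job", "label": "neutral"}
{"text": "Would buy again", "label": "positive"}
"""

def load_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def summarize(records):
    """Return the sorted field names and a count of records per label."""
    fields = set()
    label_counts = {}
    for record in records:
        fields.update(record)
        label = record.get("label")
        label_counts[label] = label_counts.get(label, 0) + 1
    return sorted(fields), label_counts

records = load_jsonl(SAMPLE_JSONL)
fields, label_counts = summarize(records)
print(fields)
print(label_counts)
```

A heavily skewed label count at this stage is worth noting early, since it affects how the data should be split and evaluated later.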
Datasets for Machine Learning Based on the Types of ML and Annotation Tasks
Similar to the categorization of datasets based on the types of data, they are also often collected based on the types of tasks that need to be performed.
Image data serves as a great example to illustrate this fact as it is very popular, flexible, and widespread.
- Facial recognition: IMDB-Wiki, Labeled Faces in the Wild, FERET.
- Bounding boxes: our favorite cats and dogs dataset, celebrities, specialized medical datasets like the malaria one, vehicle license plates, house numbers, manga, etc.
- COCO for object recognition and recognition in context.
- Image classification for medicine, agriculture, architecture, and even something as specific as people eating food or cracks in concrete.
- Linear regression: from New York stock prices to cancer stats to fish market assortment.
- OCR and handwriting recognition: aside from the aforementioned MNIST, there are also Devanagari, Street View Text, Natural Environment OCR, Chars74K, and others.
- Speech recognition datasets for command, accent, gender, and even celebrity recognition. These also include platform-specific datasets like the ones from TED or YouTube. Free Spoken Digit Dataset is a good example of a dataset used to train deep learning algorithms.
- VoxCeleb for speaker recognition and speech generation.
- Sentiment analysis: Lexicoder, Stanford Sentiment Treebank, Sentiment140.
- Datasets for creating chatbots: WikiQA, Twitter customer support, Cornell’s movie dialogs, etc.
- Recommendation systems for movies, music, and data collected from popular websites.
- Named Entity Recognition: CoNLL 2003, Annotated Corpus for NER, Resume Entities for NER, MIT Movie Corpus, etc.
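To make the linear regression entry in the list above more concrete, here is a small sketch of fitting a straight line by ordinary least squares. The numbers are invented for illustration and are not taken from any of the datasets mentioned. The `fit_line` helper is likewise a hypothetical name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up numbers for illustration only (not from a real dataset):
# fish length in cm vs. weight in g.
lengths = [20.0, 25.0, 30.0, 35.0]
weights = [150.0, 250.0, 350.0, 450.0]

slope, intercept = fit_line(lengths, weights)
predicted = slope * 28.0 + intercept  # estimated weight of a 28 cm fish
print(slope, intercept, predicted)
```

Real regression datasets are noisier and usually have many features, so in practice you would likely reach for a dedicated library. The point here is only to show what kind of input and output such a dataset needs to provide.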
Large and Small Specific Types of Datasets for Machine Learning
There are also highly specialized datasets and samples that wouldn’t fit just any machine learning task. For a sweet finale, we’ll link you to a few of these specific datasets that just might suit your AI project’s purposes:
- Riding the wave of relevance, coronavirus datasets from general and regularly updated to more specific like COVID-related tweets and CT scans.
- Environmental datasets for climate change, greenhouse gas emissions, and sea ice extent.
- Indoor scene recognition with over 60 categories of indoor features.
- Dog breed recognition from Stanford for 120 breeds.
- Stock market datasets, from general historical data to very niche datasets of a single company’s history of stock prices.
- Robotics datasets from GitHub, Berkeley, and the one known as Radish.
- Wine quality dataset.
- Fake news detection dataset.
- Supply chain datasets that vary from clothing brands to automotive logistics to shipping and more.
- Autonomous driving datasets with info on vehicles and street-related objects, urban scenes, driving videos, traffic signs, etc.
- Geographic datasets to create maps, analyze the stats, and observe the relevant data by country.
- Life, deposit, medical, and home insurance or gross claims payments datasets.
- Crime datasets that are commonly region-specific, like London, UK, Chicago, IL, or Vancouver, BC.
- Film datasets like the one from IMDB, the one for film subtitles, or something extremely fun and specific, like cats in movies, movie body counts, or Indian cinemas.
- A very niche but very cool Titanic dataset.
- A variety of financial datasets from the World Bank, IMF, AEA, and Google Trends.
A Few More Words About Datasets for Machine Learning: Let’s Recap
As you can see, there are a lot of machine learning datasets collected for a variety of purposes and based on very different categories, from the types of data to the types of annotation and machine learning tasks. So, if you are thinking about where to get the dataset for your AI project, it’d be wise first to take a look around: maybe the perfect dataset is already waiting for you.
A few things that you need to pay attention to are the following:
- Is it free or not? Open source datasets for machine learning are not rare. However, they may be too general or too specific to fit your project’s goal. Make sure you’re ready to make the necessary changes if you decide in favor of a ready-to-download dataset instead of collecting one yourself.
- What type of data do you need? Think about the purposes of your algorithm and what type of data you require. Then you can look through the available datasets to find the best fit.
- What type of machine learning and annotation tasks are you looking for? Facial recognition is very different from text classification or autonomous driving. The good news is that there are datasets for all of these, and many more, from general to specific tasks.
- Is your task very specific? Usually, the more specific the task, the harder it is to find an appropriate dataset that is ready to be used. Besides, the design of an ML algorithm might depend on the dataset to be used, so it’s best to work on the algorithm and on finding and processing the data in parallel.
Also, consider if you need an annotated dataset (for supervised or semi-supervised machine learning tasks) or if you’ll need no labeling (unsupervised learning tasks). It may be so that you’ll find a suitable dataset that doesn’t have the necessary labels. In that case, you’ll need to spend time and resources on annotation.
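A quick way to see how much labeling a candidate dataset would still need is to measure its label coverage. This is a minimal sketch under stated assumptions: the records, the `label` field name, and the `label_coverage` helper are all hypothetical.

```python
def label_coverage(records, label_key="label"):
    """Split records into labeled and unlabeled and report the labeled share."""
    labeled = [r for r in records if r.get(label_key) not in (None, "")]
    unlabeled = [r for r in records if r.get(label_key) in (None, "")]
    share = len(labeled) / len(records) if records else 0.0
    return labeled, unlabeled, share

# Hypothetical records; the "label" field name is an assumption.
records = [
    {"text": "cat on a sofa", "label": "cat"},
    {"text": "dog in a park", "label": None},
    {"text": "two kittens", "label": "cat"},
    {"text": "unknown animal"},
]

labeled, unlabeled, share = label_coverage(records)
print(f"{len(labeled)} labeled, {len(unlabeled)} unlabeled, {share:.0%} coverage")
```

The number of unlabeled records gives a rough sense of how much annotation remains.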
Annotation can be quite a fussy and labor-intensive task, however. If you wish to avoid the effort, don’t hesitate to contact us, and we’ll help you with both collecting and annotating the data that you need.
Written by
Iryna is one of the dedicated members of the Label Your Data content team who has put all her efforts into developing our knowledge base. Iryna is a seasoned technical writer with wide-ranging experience in artificial intelligence, machine learning, and deep learning. She has been studying the basics of data annotation for many years and is now sharing her expertise on our blog. The technical realm is a true passion of hers, so make sure to check out other articles written by our talented Iryna!