“How much can anyone really care about sepal length?” my friend complained to me over coffee a few days ago. She was referring to the built-in `iris` dataset in R, which dates all the way back to 1936. “Why do college professors try to teach us data science with crappy, boring, pointless data when there’s so much great data out there for data science projects?”
She’s right. It’s really tough to motivate yourself to learn data science or do data science projects when your data is boring or meaningless to you. I know I struggled to motivate myself to learn data science until I found some good, crunchy data that interested me.
In this article, I’m going to break down 10 amazing websites where you can grab some really awesome data for data science projects. The purpose will be to showcase a variety of data that might appeal to you. Ultimately, these websites should help you find data you care about, do a cool data science project, and use that to get a job.
How Did I Vet These Data Sources?
If you see a website in this article, it’s because the data it contains is:
- Freely available. You won’t have to pay for it.
- Community-oriented. It’s not just going to be a file; there will be some commentary and explanation around it.
- Cool. It’s something that someone, somewhere will care about. Maybe you!
- Clean-ish. You’ll get to practice the fun part of data science – analyzing, visualizing, sharing, and so on.
- Language-agnostic. You can dig into these with Python, R, SQL, or any other language you like.
10 Websites to Get Awesome Data for Your Data Science Projects
Let’s dig into the best websites to find data that you’ll actually care about and want to explore using data science.
| Website | What you’ll find |
|---|---|
| Google Dataset Search | Super broad, varying quality |
| Kaggle | More limited, but lots of context and community |
| KDNuggets | Specific for AI, ML, data science |
| Government websites | Wide variety, resources to learn |
| Pudding.cool | Pop culture, essays |
| 538 | Sports, politics, clean data |
| Tidy Tuesdays | Messy data, great community |
| GitHub | Huge amount of searchable data with commentary, variable quality |
| Buzzfeed | Pop culture, essays, rigorous science |
| Awesome Public Datasets | Wide variety, only datasets, no commentary |
1. Google’s Dataset Search
I’m cheating a little bit, because this isn’t really a website for datasets, but rather a search engine for datasets. But it’s too good not to include.
Google’s Dataset Search is just like Google, but for datasets. You type in your query, and Google returns as many datasets as it has on that subject.
For example, searching “cats” brings up over one hundred datasets, including one containing over 9,000 images of cats.
Source: Google Dataset Search
What I love about this website:
- It’s super versatile. You will almost certainly find something you care about.
- It’s instantly applicable. The results list papers that have used each dataset, so you can see what interesting things other people have already done with the data.
- You can toggle to only include free datasets.
- It pulls out the context for you, so you get a bit of an explanation of what this dataset is and why it was collected.
It’s a great place to start.
2. Kaggle
Kaggle’s Datasets is also a search engine, but it’s both more limited and more focused.
It’s more limited because it only contains datasets that people have published with Kaggle. But it’s more focused because the datasets aren’t just whatever random set of numbers Google scraped. Kaggle is a home for data science competitions, so the datasets it collects are extremely relevant to data science.
This allows you to filter by your specific interest. For example, I can stumble across that same cat dataset if I search “cat” with the “computer vision” filter on.
Source: Kaggle Datasets
What I love about this website:
- The community aspect is so strong. Clicking on that cat dataset shows six other folks asking questions about the dataset – and getting answers.
- Lots of example projects. You can also see what other people have built or coded around that data.
- You can go the other way around, too – check out their competitions and see if anything interests you, then use the accompanying dataset.
3. KDNuggets
This may come as a surprise to you, but KDNuggets curates a great set of datasets. These datasets are specifically for Data Science, Machine Learning, AI & Analytics.
Many of these aren’t KDNuggets exclusives, but it’s a good list to poke around in. It’s worth noting that when you sign up to be a KDNuggets email subscriber, you also get access to World Data AI, which itself contains 3.5 billion datasets.
Source: KDnuggets Datasets
What I love about this website:
- Data specific to data science. Many datasets elsewhere are curated for other purposes, but these are all here specifically because they’re good for AI, machine learning, and data science.
- Quick description of each set. Just a little bit of context to help you decide if it’s the right dataset for you.
4. Government websites
I could easily expand this list to about a million entries simply by individually listing each of the government websites I like to use to get data. I won’t. Instead, I’ll offer a small list here:
- http://datasf.org/
- http://data.gov.uk
- https://www.usa.gov/About/developer-resources/1usagov.shtml
- https://www.census.gov/data/datasets.html
Governments are constantly collecting data to do studies, and many of them publish that data online.
Source: The US Census Bureau
What I love about these websites:
- The data is used for studies, so it’s typically pretty clean and well-organized.
- The data has a real use case. Someone collected it for a real, government-related reason.
- It’s typically very current data.
- There are often some cool stories around the data.
- Many governments have invested resources into showing you how to access or use the data, like the Census Bureau.
5. Pudding.cool
If you like your data to come with a heady dose of pop culture, look no further than Pudding.cool. This website looks at topics as varied as repetitive pop lyrics, women’s pockets, and how The Big Bang Theory gets censored by the Chinese government.
This is more of a digital magazine that publishes longform essays about culture, with a lot of data shown alongside. I’m including it here because they tell awesome stories and share their data.
Source: The Pudding
What I love about this website:
- Awesome, interesting data.
- Shares data and scripts.
- Lots of things you might care about IRL.
6. 538
FiveThirtyEight is another essay-driven pop culture website with freely available data you can purloin, though it focuses more on sports and politics. It’s less data-driven, but I’m giving it a spot on this list because it still curates and shares datasets.
Source: FiveThirtyEight Data
What I love about this website:
- Intelligent stories, backed up with data you can dig into.
- The data is in clean CSV format.
- The data sources are highly reliable.
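Because the data arrives as plain CSV, you can start poking at it with nothing more than Python’s standard library. Here’s a minimal sketch, assuming you’ve already downloaded a file. The inline sample, its column names, and the helpers `load_rows` and `count_by` are all made up for illustration; they don’t come from any real FiveThirtyEight dataset.

```python
import csv
import io
from collections import Counter

# Tiny inline sample standing in for a downloaded CSV file.
# The columns and values are invented purely for illustration.
SAMPLE_CSV = """team,sport,wins
Hawks,basketball,41
Bears,football,8
Owls,basketball,35
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def count_by(rows, column):
    """Count how many rows share each value in the given column."""
    return Counter(row[column] for row in rows)


rows = load_rows(SAMPLE_CSV)
print(len(rows), "rows; columns:", list(rows[0].keys()))
print(count_by(rows, "sport"))
```

To try this on a real file, you could replace `SAMPLE_CSV` with the text of whatever CSV you downloaded, for example by reading it with `open(path, newline="").read()`.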
7. Tidy Tuesdays
Now, the reality of the matter is that data often isn’t tidy at all. Tidy Tuesdays isn’t exactly a website with datasets per se, but it’s a weekly event and community with an emphasis on using data science to explore untidy data.
Every week, a new dataset drops. Participants are encouraged to share their cleaning techniques and visualizations with each other on GitHub and Twitter.
Source: TidyTuesday GitHub
What I love about this website:
- The community is incredible. Every week you’ll learn something new.
- It’s so convenient. Don’t go hunting for datasets. Get the weekly drop.
- Challenging, untidy data. The data you get IRL will rarely be as sanitized as the other data on this list. Tidy Tuesdays helps you learn how to handle messy data.
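To give a feel for what handling messy data can look like, here’s a small Python sketch of two common first steps: tidying column names and turning assorted “missing” markers into a single value. The sample rows, the `MISSING` set, and the helper names are my own assumptions for illustration, not part of any Tidy Tuesdays dataset or tooling.

```python
# Markers treated as "missing" in this sketch; real datasets may use others.
MISSING = {"", "na", "n/a", "null", "-"}


def clean_header(name):
    """Trim and lowercase a column name, using underscores as separators."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


def clean_value(value):
    """Strip whitespace and map common missing-value markers to None."""
    text = value.strip()
    return None if text.lower() in MISSING else text


def clean_rows(rows):
    """Return new dicts with tidy keys and values; the input is not modified."""
    return [
        {clean_header(key): clean_value(value) for key, value in row.items()}
        for row in rows
    ]


# Invented example rows with untidy headers, padding, and missing markers.
messy = [
    {" Artist Name": " Prince ", "Peak-Position": "1"},
    {" Artist Name": "N/A", "Peak-Position": " "},
]
tidy = clean_rows(messy)
print(tidy)
```

Values are left as strings on purpose; converting types (numbers, dates) is usually a separate step once you know what each column holds.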
8. GitHub
GitHub is the home of a lot of data. You can easily search, filter, and download data to play around with on your own. However, the data quality is highly variable. Because anyone can upload data, it’s not always in great condition.
Still, I feel the benefits make up for that.
Source: GitHub Cat Data
What I love about this website:
- You can filter by language, such as Python, JavaScript, or others.
- There’s a ton of data.
- Usually the data comes with some kind of commentary or code you can check out.
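If you’d rather script your search than click through the site, the language filter can also be expressed as a `language:` qualifier in a query to GitHub’s repository search API. The sketch below only builds the request URL and doesn’t send anything; the helper name `build_search_url` and the example keywords are assumptions for illustration.

```python
from urllib.parse import urlencode

# GitHub's REST endpoint for searching repositories.
SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(keywords, language=None, per_page=10):
    """Build a repository search URL, optionally filtered by language."""
    query = keywords
    if language:
        # Qualifiers such as language:python narrow the search results.
        query += f" language:{language}"
    params = {"q": query, "per_page": per_page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url("cat dataset", language="python")
print(url)
```

Opening the printed URL in a browser returns JSON describing matching repositories. Unauthenticated requests are rate-limited, so this is best treated as a starting point for exploration rather than a bulk download tool.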
9. Buzzfeed
Buzzfeed doesn’t just do quizzes that comment on the human condition by asking you to build a salad. It may not be as well known for this, but Buzzfeed does a lot of quality data journalism.
It’s all open source, too.
Source: BuzzFeed News GitHub
What I love about this website:
- Interesting data, pre-cleaned, and with well-written commentary in the form of articles attached.
- Heavier topics. There’s an emphasis on more complex topics such as politics and health, but there’s a lot more, too.
10. Awesome Public Datasets
I’m ending this list with a pretty self-explanatory title: Awesome Public Datasets. This repo lives on GitHub and contains (mostly) free datasets to explore. They come from online datasets, user suggestions, and research papers.
Source: Awesome Public Datasets GitHub
What I love about this website:
- There’s a Slack group you can join!
- Huge variety in topics. Agriculture, finance, museums. You’re bound to find something that takes your fancy.
- Well-curated. The datasets are high quality.
These Websites Offer Amazing Data Science Datasets
Dig in. You’ll find not just data you can get your feet wet with, but also community, inspiration, and code you can use to learn and grow as a data scientist.
With such a huge variety of data available to you, you should never feel like you’re settling for less interesting data. Always look for data that inspires you or makes you excited to investigate it. Hopefully this list gives you a few starting points to do just that.
Nate Rosidi is a data scientist who works in product strategy. He’s also an adjunct professor teaching analytics and the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.