Publicly Available Dataset Sources

A concise overview of where to find public datasets: general-purpose search engines and repositories (Google Dataset Search, Kaggle, UCI, AWS), well-known datasets for computer vision, NLP, healthcare, and finance, and the main ways to load them in Colab.

  1. Google Dataset Search: A search engine to find datasets across the web.

  2. Kaggle: A platform for data science competitions with a vast collection of datasets.

  3. UCI Machine Learning Repository: A collection of datasets for machine learning research.

  4. AWS Public Datasets (now the Registry of Open Data on AWS): A repository of datasets hosted on Amazon Web Services.

Datasets for Specific Domains:

  • Computer Vision: ImageNet, CIFAR-10, MNIST

  • Natural Language Processing: Wikipedia, Common Crawl, Gutenberg Corpus

  • Healthcare: MIMIC-III (hosted on PhysioNet; requires credentialed access), other PhysioNet datasets

  • Finance: Yahoo Finance, Quandl (now Nasdaq Data Link)

Other Resources:

  • Papers With Code: A website that links research papers with their corresponding code and datasets.

  • Awesome Public Datasets: A curated list of datasets on GitHub.

Accessing Datasets in Colab:

You can load these datasets into a Colab notebook in several ways:

  1. Downloading: Download the dataset from its source and upload it to the Colab runtime. Files stored on the runtime's local disk are lost when the runtime is recycled.

  2. Mounting Google Drive: Mount your Google Drive to Colab and access datasets stored there.

  3. Using APIs: Many platforms provide APIs to access their datasets directly within Colab.

  4. Using Libraries: Some libraries, like TensorFlow Datasets, provide pre-built functions to load popular datasets.
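The four methods above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive recipe: the helper names (`in_colab`, `dataset_root`, `download`, `load_mnist_with_tfds`, `read_csv_rows`) are hypothetical, the URL passed to `download` is whatever direct file link the data source gives you, and `/content/drive/MyDrive` is assumed to be the default Drive location after mounting. The TensorFlow Datasets call assumes the `tensorflow-datasets` package is installed and needs network access on first use.

```python
import csv
import urllib.request
from pathlib import Path


def in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
        return True
    except ImportError:
        return False


def dataset_root(local_fallback: str = "data") -> Path:
    """Mounting Google Drive: use Drive in Colab, a local folder elsewhere."""
    if in_colab():
        from google.colab import drive
        drive.mount("/content/drive")  # prompts for authorization
        return Path("/content/drive/MyDrive")  # assumed default Drive folder
    root = Path(local_fallback)
    root.mkdir(parents=True, exist_ok=True)
    return root


def download(url: str, dest: Path) -> Path:
    """Downloading / using APIs: fetch a file by URL onto the runtime disk."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest


def load_mnist_with_tfds():
    """Using libraries: load MNIST through TensorFlow Datasets."""
    import tensorflow_datasets as tfds  # imported lazily; optional dependency
    return tfds.load("mnist", split="train", as_supervised=True)


def read_csv_rows(path):
    """Read a CSV file into a list of dictionaries keyed by column name."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```

A typical notebook cell would call `dataset_root()` once, then `download(some_url, dataset_root() / "file.csv")` and `read_csv_rows(...)` on the result. Writing into the Drive folder rather than the runtime's local disk keeps the file available across sessions.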
