Data Access

You will learn: ways of accessing data and common data sources.

In all ML scenarios, data plays a crucial role in the overall solution, since it can be seen as the "fuel" of the training process. In other words, we can have access to a really effective and promising training algorithm, but if we feed it inconsistent or impractical data, the outcome will not be as good as expected.

In this section, we will describe ways of getting access to data, either via public sources or by collecting custom data, as well as best practices to curate it before passing it to the training algorithm.

Availability and challenges

When searching for a dataset to work with for a product that will be publicly deployed, there are some considerations to take into account before deployment.

Firstly, in many ML problems, data can be a limitation in terms of size and domain representation. We should guarantee access to sets that are large enough for the model to understand the whole problem, and that are representative of the domain we want to become experts at. Note that in some problems these requirements are hard to achieve, especially those for which labeled data can be expensive, as in medical fields. As a solution, we will propose data collection workarounds like labeling tools or generating our own synthetic data to fight the imbalance of classes.

Secondly, this data needs to be clean, organized, and structured, since most of the ML algorithms that succeed rely on supervised learning methods, and we therefore need this data to be properly labeled. We will dive deeper into this second challenge and propose solutions to overcome it in Data Curation.

Finally, and really importantly, the data that we use in our product should always come with privacy consent from its original sources, so that we remain compliant with data protection regulations such as the GDPR.
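To make the first two considerations concrete, here is a minimal sketch of how labeling coverage and class balance could be checked with pandas before training. The file name labels.csv, the label column, and the 5% threshold are hypothetical placeholders, not part of the course material; adapt them to your own dataset.

```python
import pandas as pd

# Hypothetical file and column names; adjust them to your own dataset.
df = pd.read_csv("labels.csv")

# How much of the dataset is actually labeled?
unlabeled = df["label"].isna().mean()
print(f"Unlabeled examples: {unlabeled:.1%}")

# Is any class severely under-represented?
class_share = df["label"].value_counts(normalize=True)
print(class_share)
if class_share.min() < 0.05:
    print("Warning: at least one class represents <5% of the data; "
          "consider collecting more samples or generating synthetic data.")
```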

Sources

The best option to guarantee that our dataset fulfills all these previous considerations is to use data marketplaces and dataset searching tools that provide public, structured and licensed data.

Some well-known options include Google's Dataset Search, Harvard Dataverse, and Kaggle's Datasets. These repositories provide access to datasets in a wide variety of subjects, ranging from common ML-applicable domains such as business, finance, healthcare, earth and climate sciences, natural language, and image processing, to less common ones like art, education, law, or social sciences.

Kaggle's Datasets alone offers access to more than 40,000 structured and public datasets, and it supports collaborative dataset publication so that datasets can be extended by other users.
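As an illustration, the snippet below sketches how a public dataset might be pulled from Kaggle's Datasets with the official kaggle Python package. It assumes you have created an API token on kaggle.com and stored it in ~/.kaggle/kaggle.json; the "owner/dataset-name" identifier is only a placeholder.

```python
from kaggle.api.kaggle_api_extended import KaggleApi

# Requires a Kaggle API token stored in ~/.kaggle/kaggle.json
api = KaggleApi()
api.authenticate()

# "owner/dataset-name" is a placeholder slug; replace it with a real dataset
# identifier found on https://www.kaggle.com/datasets
api.dataset_download_files("owner/dataset-name", path="data/", unzip=True)
```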

Readings

  • The world's most valuable resource is no longer oil, but data (The Economist)
  • Datasets — ML Glossary documentation