ProductizeML

Data Collection

You will learn: ways of collecting data.

Last updated 2 years ago

Yet a public dataset rarely provides a business advantage, because competitors can train on the same data. To differentiate your approach, you can build your own dataset by collecting and labeling data from your users, an approach known as the data flywheel, or generate synthetic data from existing datasets.

Labeling tools

Datasets sometimes contain labeling errors or lack labels altogether. For example, data gathered from your users is valuable because it belongs to the domain you want to solve, but it comes with only your ML model's predictions rather than verified labels. Predictions alone are not enough, so you may want human labelers who are experts in the task to label the samples properly.

To address this problem, a marketplace of data labeling services has emerged, covering tasks such as image classification, object detection and segmentation in images, video frame tagging, object tracking in videos, and text classification. For some tasks, the expertise required to label correctly makes the process expensive; in those cases, a hybrid approach that combines human and ML-based labeling can reduce the cost.
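One common way to combine human and ML-based labeling is to accept the model's label when its confidence is high and send everything else to human labelers. The sketch below illustrates that idea only; the `Prediction` class, the `route_for_labeling` function, the 0.9 threshold, and the per-label price are all hypothetical and would need tuning for a real task.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    sample_id: str
    label: str         # label proposed by the ML model
    confidence: float  # model's probability for `label`, in [0, 1]


def route_for_labeling(predictions, threshold=0.9):
    """Split predictions into auto-accepted labels and a human review queue."""
    auto_labeled, needs_human = [], []
    for p in predictions:
        if p.confidence >= threshold:
            auto_labeled.append(p)   # trust the model's label
        else:
            needs_human.append(p)    # pay a human expert to label it
    return auto_labeled, needs_human


# Hypothetical model outputs for four unlabeled user samples.
preds = [
    Prediction("img-001", "cat", 0.98),
    Prediction("img-002", "dog", 0.55),
    Prediction("img-003", "cat", 0.91),
    Prediction("img-004", "dog", 0.72),
]

auto, queue = route_for_labeling(preds, threshold=0.9)

# Assumed price of 8 cents per human label, kept in integer cents.
COST_PER_LABEL_CENTS = 8
full_cost = len(preds) * COST_PER_LABEL_CENTS
hybrid_cost = len(queue) * COST_PER_LABEL_CENTS
print(f"human-only: {full_cost} cents, hybrid: {hybrid_cost} cents")
```

Raising the threshold sends more samples to humans (higher cost, fewer model mistakes accepted as labels); lowering it does the opposite. In practice a random subset of the auto-accepted labels is usually audited as well, since a confident model can still be wrong.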

Some existing collecting and labeling tools include Amazon Mechanical Turk, Scale AI, Vertex AI (Google Cloud), Appen, and Snorkel AI.

Synthetic creation

Labeling your own data takes time, logistics, and money. A useful way to keep extending your datasets is to generate realistic data samples with other ML approaches, such as Generative Adversarial Networks (GANs).

Some of the benefits of synthetic data are:

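  1. Anonymized data: synthetic samples can ease the usage constraints imposed by privacy rules or other regulations.

  2. Class imbalance: it can mitigate imbalance problems by generating more samples of the underrepresented classes.

  3. Not-yet-encountered cases: it can cover scenarios that have not appeared in the collected data.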
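The class-balancing benefit can be illustrated without training a GAN. The sketch below swaps in a much simpler technique: it creates synthetic minority samples by interpolating between two real samples of the same class, which is a simplified version of the idea behind SMOTE (real SMOTE interpolates toward nearest neighbors rather than random pairs). The function name `oversample_minority` and the toy data are hypothetical.

```python
import random
from collections import Counter


def oversample_minority(features, labels, seed=0):
    """Balance classes by adding synthetic points interpolated between
    two randomly chosen real samples of the same class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    new_features, new_labels = list(features), list(labels)
    for cls, count in counts.items():
        pool = [x for x, y in zip(features, labels) if y == cls]
        for _ in range(target - count):
            a, b = rng.choice(pool), rng.choice(pool)
            t = rng.random()  # interpolation weight in [0, 1)
            synthetic = [ai + t * (bi - ai) for ai, bi in zip(a, b)]
            new_features.append(synthetic)
            new_labels.append(cls)
    return new_features, new_labels


# Toy dataset: four "majority" points and two "minority" points.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [5.0, 5.1], [5.2, 4.9]]
y = ["majority"] * 4 + ["minority"] * 2

X_bal, y_bal = oversample_minority(X, y)
print(Counter(y_bal))
```

Interpolation only produces points on the line segments between existing samples, so it adds no genuinely new information. Generative models such as GANs aim to produce more varied and realistic samples, at the price of a much more involved training process.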

Readings

The Data Flywheel: Building momentum by putting your data to work
Amazon Mechanical Turk
Scale AI: The Data Platform for AI (scaleapi)
Synthetic data generation — a must-have skill for new data scientists (Medium)
Vertex AI | Google Cloud
Appen
Snorkel AI
The AI Hierarchy of Needs (HackerNoon)

Data collection is defined as the first need in the Data Science hierarchy of needs by Monica Rogati.