Data Curation

You will learn: recommended methods to curate your data.

To build a robust data pipeline that can evolve over time in both volume and data quality, it is crucial to standardize how the data is managed and labeled, and to curate it and create the proper splits before running your entire system.


For supervised learning approaches, it is important to keep your data well organized so that it can be accessed through query systems and extended easily over time. One way to standardize it is to use a relational database whose records are arranged as hierarchical trees of concepts.

One example is the lexical database WordNet, whose concept hierarchy is used to organize the ImageNet image database.

In the medical domain, ontologies such as SNOMED or ICD can be used to organize your medical data into standardized disease codes.
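As a rough illustration of a relational database arranged as a tree of concepts, the sketch below stores each concept with a pointer to its parent and retrieves every sample labeled with a concept or any of its descendants. The table names, the tiny animal hierarchy, and the file names are all made up for this example; a real project would load its concepts from an ontology such as WordNet, SNOMED, or ICD.

```python
import sqlite3


def build_concept_db():
    """Create an in-memory database holding a small, made-up concept tree."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE concepts (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE,
            parent_id INTEGER REFERENCES concepts(id)
        );
        CREATE TABLE samples (
            path TEXT PRIMARY KEY,
            concept_id INTEGER NOT NULL REFERENCES concepts(id)
        );
        """
    )
    # Hypothetical hierarchy: animal -> {dog -> {husky, beagle}, cat}
    conn.executemany(
        "INSERT INTO concepts (id, name, parent_id) VALUES (?, ?, ?)",
        [(1, "animal", None), (2, "dog", 1), (3, "cat", 1),
         (4, "husky", 2), (5, "beagle", 2)],
    )
    # Hypothetical labeled samples.
    conn.executemany(
        "INSERT INTO samples (path, concept_id) VALUES (?, ?)",
        [("img_001.jpg", 4), ("img_002.jpg", 5),
         ("img_003.jpg", 3), ("img_004.jpg", 2)],
    )
    return conn


def samples_under(conn, concept_name):
    """Return paths of samples labeled with the concept or any descendant."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM concepts WHERE name = ?
            UNION ALL
            SELECT c.id FROM concepts c JOIN subtree s ON c.parent_id = s.id
        )
        SELECT path FROM samples
        WHERE concept_id IN (SELECT id FROM subtree)
        ORDER BY path
        """,
        (concept_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

Querying for "dog" with this layout also returns the husky and beagle samples, so new, finer-grained labels can be added under an existing concept without changing the queries that already depend on it.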

Version control

An important idea to keep in mind is that datasets evolve over time: data is never fixed once an AI/ML solution is shipped to production. Changes to a dataset can include the addition of new data or modifications of existing data. Regardless of the kind of change, you should always checkpoint your data to facilitate reproducibility and to provide backup options in your production pipeline.

A potential solution to this problem is to version your dataset, so that you can access the right data at any point in time by requesting a specific state of it.

Some open-source frameworks, such as DVC, provide checkpointing not only for your dataset but also for the models trained on the different versions of your data.
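To make the idea of requesting "a specific state" of a dataset concrete, here is a minimal sketch of content-based checkpointing using only the Python standard library. It is not how DVC works internally and does not use its API; the function names are hypothetical. It assumes the dataset is a directory of files and derives a version identifier from the contents of every file.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def dataset_manifest(root):
    """Map each file's path (relative to root) to its content digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def dataset_version(manifest):
    """Derive a single version identifier from a manifest.

    Any added, removed, renamed, or modified file changes the identifier.
    """
    h = hashlib.sha256()
    for rel_path in sorted(manifest):
        h.update(rel_path.encode("utf-8"))
        h.update(b"\0")
        h.update(manifest[rel_path].encode("ascii"))
        h.update(b"\n")
    return h.hexdigest()
```

Storing the manifest and its identifier next to each trained model records exactly which files that model saw, which is the property a dedicated tool provides with far more convenience (storage, remote backups, and checkout of earlier states).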

Data Split

In almost all machine learning scenarios, datasets are split into three buckets with different objectives:

  • Train set: a group of samples dedicated to training your model.

  • Validation set: a group of samples dedicated to iterating on your model during development, for example when tuning hyperparameters.

  • Evaluation set (or test set): a group of samples dedicated to evaluating your model.

Some considerations when performing the validation and test splits are:

  1. They must be large enough to yield statistically meaningful insights. Usually, the ratios are 75-80% for the train set and 15-20% for the validation set.

  2. They must share the same data distribution and domain as the train set.
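The three-way split described above can be sketched with a seeded shuffle, so that the same samples always land in the same bucket. The function name and the default fractions (75% train, 15% validation, remainder test) are illustrative assumptions, not prescribed values. Shuffling before slicing is a simple way to give all three buckets roughly the same distribution, assuming the samples are independent of one another.

```python
import random


def split_dataset(samples, train_frac=0.75, val_frac=0.15, seed=0):
    """Shuffle samples deterministically and cut them into three buckets.

    The test set receives whatever remains after train and validation.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")

    shuffled = list(samples)
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

For example, `split_dataset(range(100))` yields buckets of 75, 15, and 10 samples. When samples are correlated (several images of the same patient, say), it is safer to split by group rather than by individual sample, which this sketch does not do.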

Data Quality

It is crucial never to show any test sample to your model at training time. Test data leakage is common, especially when your solution includes a data flywheel (discussed in Data Collection). If you are obtaining better-than-expected results when evaluating your model, it is recommended to set up a retrieval system to check whether test samples have leaked into the training set; for example, image retrieval systems based on deep learning architectures have shown excellent performance at spotting test samples in training subsets.
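A retrieval-based leakage check can be sketched as a nearest-neighbor search over embeddings. The sketch below assumes that each sample has already been turned into an embedding vector by some model of your choice (that step is not shown), and the function names and the 0.98 similarity threshold are hypothetical values to tune on your own data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_suspected_leaks(train_embeddings, test_embeddings, threshold=0.98):
    """Flag test samples whose closest training sample is nearly identical.

    Both arguments map a sample id to its embedding vector. Returns a list
    of (test_id, train_id, similarity) tuples for pairs at or above the
    threshold.
    """
    leaks = []
    for test_id, test_vec in test_embeddings.items():
        best_id, best_sim = None, -1.0
        for train_id, train_vec in train_embeddings.items():
            sim = cosine_similarity(test_vec, train_vec)
            if sim > best_sim:
                best_id, best_sim = train_id, sim
        if best_id is not None and best_sim >= threshold:
            leaks.append((test_id, best_id, best_sim))
    return leaks
```

This brute-force search compares every test sample with every training sample, which is fine for a spot check on small sets; larger datasets would typically call for an approximate nearest-neighbor index. Flagged pairs are only candidates and should be reviewed before removing anything.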
