Dataset curation
Machine learning models learn from the examples they are shown, by optimizing a defined loss. A good model is one that generalizes well to the given task, and to achieve this we must make the model internalize the variability of the data. How well it does so depends on the quantity and quality of the dataset presented, so to have good models we fundamentally need a good dataset.
Interestingly, Andrew Ng points out that there are two mindsets we can work with:
- The model-centric mindset
  - Hold the data fixed and improve the code/the model.
  - When we find problems, we ask: how do I change the model?
- The data-centric mindset
  - Hold the code fixed and iteratively improve the data.
  - When we find problems, we ask: how do I improve the data?
A good dataset
According to Andrew Ng, we can call a dataset good when:
- The labels are consistent (the same situations receive the same label);
- The variability of scenes and objects covers all important cases;
- The dataset is updated regularly with production data to account for data drift;
- The size is appropriate for the given task.
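As a quick illustration of the label-consistency point, here is a minimal sketch that flags duplicate images labeled differently. It assumes the dataset is a list of `(image_path, label)` pairs; the function name and the content-hash approach are just one illustrative choice.

```python
import hashlib
from collections import defaultdict

def find_inconsistent_labels(samples):
    """Group samples by image content and report groups whose labels disagree.

    `samples` is an iterable of (image_path, label) pairs (assumed format).
    """
    labels_by_content = defaultdict(set)
    paths_by_content = defaultdict(list)
    for path, label in samples:
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()  # content hash of the image bytes
        labels_by_content[digest].add(label)
        paths_by_content[digest].append(path)

    # Keep only the groups where the same content received more than one label.
    return {
        digest: (paths_by_content[digest], labels)
        for digest, labels in labels_by_content.items()
        if len(labels) > 1
    }
```

Exact duplicates are only the simplest case; near-duplicates need a perceptual hash or embedding distance instead, but the auditing idea is the same.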
Dataset collection
To acquire a good dataset, we first need to define the task:
ML lifecycle:
Scope the project -> Define/collect data -> Train the model -> Deploy the model to production
After we have defined the task, we should write a manual with labeling instructions. The idea is to formalize how labeling will be done so that ambiguous cases are handled consistently.
Next we collect data from different scenarios so that we get good coverage of the conditions in which the model will be used. The collected data should resemble the variability of the distribution that will generate inputs for the model in production.
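A minimal sketch of a coverage check, assuming each sample carries a scenario tag (the tags `day`/`night`/`rain` and the target shares are made up for illustration):

```python
from collections import Counter

def coverage_report(scenario_tags, target_shares):
    """Compare the scenario mix of the collected data against the mix expected in production."""
    counts = Counter(scenario_tags)
    total = sum(counts.values())
    report = {}
    for scenario, target in target_shares.items():
        actual = counts.get(scenario, 0) / total if total else 0.0
        report[scenario] = {"actual": actual, "target": target, "gap": target - actual}
    return report

# Hypothetical usage: tags gathered during data collection.
tags = ["day", "day", "night", "rain", "day", "night"]
print(coverage_report(tags, {"day": 0.5, "night": 0.3, "rain": 0.2}))
```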
A common starting size for deep learning models seems to be between 1k and 10k samples; enlarging the dataset beyond that is normally beneficial to the task.
When the dataset is small, label consistency is especially important.
Data curation
Once we have a dataset, we may still find problems with it, and we need to improve it gradually for the task at hand. We can follow these guidelines:
- Find many examples where the network does not do what you need it to do, then label these situations with the correct label.
- Put these examples in the training set, so the new training set becomes the old dataset plus the new samples (note: you can also remove uninformative images so training goes faster).
- Then restart the process; the faster we can spin this loop, the faster we get better models (a minimal sketch of the loop follows this list).
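Here is that loop as a toy sketch, using a scikit-learn classifier and synthetic 2-D data standing in for a real deep learning model and real production images; all names and data are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins for the real data: features are 2-D points, labels are 0/1.
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_prod = rng.normal(size=(500, 2))                            # production stream
y_prod_true = (X_prod[:, 0] + X_prod[:, 1] > 0).astype(int)   # labels an annotator would assign

for iteration in range(3):
    model = LogisticRegression().fit(X_train, y_train)

    # 1) Find production examples where the model fails.
    preds = model.predict(X_prod)
    hard = preds != y_prod_true

    # 2) Label those examples correctly and add them to the training set.
    X_train = np.concatenate([X_train, X_prod[hard]])
    y_train = np.concatenate([y_train, y_prod_true[hard]])

    print(f"iteration {iteration}: added {hard.sum()} hard examples, "
          f"train size is now {len(X_train)}")
    # 3) Restart the loop. In a real pipeline, fresh production data would
    #    arrive on each iteration instead of re-mining the same batch.
```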
Note: Andrej Karpathy observes that we are in a nice position right now: if we scale up compute and data, we can keep improving deep learning models.
Important questions to get the loop working well
- How do I define and collect my data?
- How do I modify my data to improve performance?
- What metrics do I need to track concept/data drift? (a drift check is sketched after this list)
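One possible drift metric, sketched minimally here under the assumption that the inputs can be summarized as tabular feature vectors: a two-sample Kolmogorov-Smirnov test per feature between a reference set and recent production data (just one of many reasonable choices).

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drift(reference, production, alpha=0.05):
    """Flag per-feature data drift with a two-sample Kolmogorov-Smirnov test.

    `reference` and `production` are arrays of shape (n_samples, n_features).
    """
    drifted = []
    for i in range(reference.shape[1]):
        stat, p_value = ks_2samp(reference[:, i], production[:, i])
        if p_value < alpha:  # distributions differ significantly -> possible drift
            drifted.append((i, stat, p_value))
    return drifted

# Hypothetical usage: the second feature is shifted in production.
rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 3))
prod = rng.normal(size=(1000, 3))
prod[:, 1] += 0.5
print(feature_drift(ref, prod))
```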
A list of triggers where we can find good samples to use:
- Detection flickering (a sketch follows this list)
- User warnings (normally caused by a bad prediction)
- Label oscillation
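A minimal sketch of the first trigger, detection flickering, over a sequence of per-frame detection flags; the window size and toggle threshold are arbitrary illustrative choices.

```python
def flicker_windows(frame_detections, window=10, min_toggles=4):
    """Return window start indices where the detection flag toggles a lot.

    `frame_detections` is a per-frame list of booleans saying whether the object
    was detected in that frame (assumed format). Frames inside heavily toggling
    windows are good candidates to send for labeling.
    """
    flagged = []
    for start in range(max(1, len(frame_detections) - window + 1)):
        chunk = frame_detections[start:start + window]
        toggles = sum(1 for a, b in zip(chunk, chunk[1:]) if a != b)
        if toggles >= min_toggles:
            flagged.append(start)
    return flagged

# Hypothetical usage: the detector flickers in the middle of the clip.
print(flicker_windows([True] * 5 + [True, False, True, False, True, False] + [False] * 5))
```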
Self-supervised learning
When we are constrained in the resources and time we can spend labeling images, we can consider self-supervised learning. It is a technique that exploits similarities in the data to learn latent information about the task at hand, without the need to create manual labels.
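A minimal sketch of one common self-supervised recipe, a contrastive (SimCLR-style) loss in PyTorch: two augmentations of the same image act as each other's positive pair, and everything else in the batch is a negative. The encoder and augmentations are omitted; random tensors stand in for the embeddings.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.5):
    """Contrastive (InfoNCE / NT-Xent) loss between two augmented views of a batch.

    z1, z2: (batch, dim) embeddings of two random augmentations of the same images.
    Positive pairs are (z1[i], z2[i]); every other embedding in the batch is a negative.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)            # (2B, dim)
    sim = z @ z.t() / temperature             # cosine similarities as logits
    sim.fill_diagonal_(float("-inf"))         # a view is never its own positive

    batch = z1.shape[0]
    # The positive for row i is row i + batch (and vice versa).
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)

# Hypothetical usage with random "embeddings" standing in for an encoder's output.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce_loss(z1, z2))
```

Once the encoder is pretrained this way on unlabeled data, it can be fine-tuned on the (much smaller) labeled set for the actual task.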