Data augmentation is the process of transforming existing images to create new ones for training machine learning models. This is an important step when building datasets because modern machine learning models are very powerful; given a dataset that is too small, they can start to ‘overfit’, a problem where the model simply memorizes the mappings between its inputs and expected outputs.
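To make that concrete, here's a minimal sketch of what "transforming images to create new ones" can look like in practice, using the Pillow library. The file names and the specific transformations (a mirror, a small rotation, a crop) are purely illustrative choices, not a prescribed recipe:

```python
from PIL import Image, ImageOps

# Load an existing training image (the file name here is hypothetical).
original = Image.open("cat.jpg")

# Create new training examples by applying simple transformations.
mirrored = ImageOps.mirror(original)        # horizontal flip
rotated = original.rotate(15, expand=True)  # small rotation
cropped = original.crop((10, 10, original.width - 10, original.height - 10))

# Each transformed copy can be added to the training set alongside the original.
mirrored.save("cat_mirrored.jpg")
rotated.save("cat_rotated.jpg")
cropped.save("cat_cropped.jpg")
```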
You can learn more about how a model learns from images in my previous article on data labelling.
Back to overfitting. For example, given an image of a cat in the training set with the expectation that the model outputs the label 'cat', a model that overfits will memorize that this particular arrangement of pixels equals a cat instead of learning general patterns (e.g. a fluffy coat, small paws, whiskers). Such a model performs very well on the training set but poorly on images it hasn't seen before, because it never learned those general patterns.
Data augmentation increases the number of examples in the training set while also introducing more variety into what the model sees and learns from. Both of these make it harder for the model to simply memorize mappings and encourage it to learn general patterns instead. It is of course possible to collect more real-world data, but that is far more expensive and time consuming than applying data augmentation techniques. So while growing the real-world dataset is still the better option, data augmentation is a good substitute when resources are constrained.
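As a sketch of how this typically plays out in a training pipeline, the snippet below uses torchvision (an assumption on our part; no particular library is implied above) to apply random transformations on the fly, so the model sees a slightly different version of each image every epoch:

```python
import torchvision.transforms as T

# An illustrative augmentation pipeline; the transforms and their parameters
# are example choices, not tuned recommendations.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                    # mirror half the time
    T.RandomRotation(degrees=10),                     # small random rotations
    T.ColorJitter(brightness=0.2, contrast=0.2),      # lighting variation
    T.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # random crop and resize
    T.ToTensor(),
])

# Passed to a dataset, e.g. ImageFolder("train/", transform=augment),
# this pipeline re-randomizes every time an image is loaded, so the model
# rarely sees the exact same pixels twice.
```

Because the transformations are sampled randomly at load time, the effective number of distinct training examples grows without collecting a single new image.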
In the rest of this blog we'll walk through some common data augmentation techniques.
Additionally, while data augmentation is applicable to a variety of data types like images, text, and audio, we'll focus on techniques for images since those are the most relevant to our product.
If you want to learn the basics of machine learning, take a look at our free ebook, "Machine Learning Explained":