MLOps: From Model-Centric to Data-Centric
A machine learning or AI system consists of two basic ingredients: code and data. By code, I mean everything from the architecture you choose to the algorithms you employ and the way you implement them. Data refers to the entire dataset that the model ends up training on.
Ever since the boom of machine learning in the 2010s, we have witnessed significant advances on the code side: new algorithms and architectures keep coming out with improved performance and results. However, the same can't be said about the data. As a matter of fact, less than 1% of AI research goes into data preparation.
So, today we'll take a detailed look at data-centric AI: how data is just as crucial in MLOps as code, and how ML projects become far more effective when we shift our mentality toward improving the data at hand rather than putting all our focus on the code.
So, let’s start without any further ado.
Model-Centric vs. Data-Centric: What Does It Mean?
Before diving in further and analyzing both the model-centric and data-centric approaches, let's take a quick look at the core differences between the two terms:
Model-centric AI emphasizes the quality of the code written to deal with the data rather than the quality of the data itself. The idea is to collect as much data as possible, even if it carries significant noise, since the noise can be handled by writing quality code that deals with it. Hence, data quantity is preferred over quality.
Approach: Fixed data, iterative improvement in model quality.
Data-centric AI, on the other hand, builds on the idea that noisy data does a model no good. Worse, it can backfire and hurt performance. Hence, a huge amount of data isn't necessary; instead, the data collected should be highly consistent and have minimal noise. Essentially, this approach prefers data quality over quantity.
Approach: Fixed model, iterative improvement in data quality.
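The fixed-model, improving-data cycle can be contrasted with a toy sketch in Python. This is a minimal illustration with made-up helper names, not a real labeling tool; one common source of noise is the same input receiving conflicting labels from different annotators, and a real pipeline would resolve that through annotator review rather than a bare majority vote:

```python
from collections import Counter, defaultdict

def consistency_pass(dataset):
    """One data-centric iteration (a toy sketch): where identical inputs
    received conflicting labels, relabel every copy with the majority
    label so the dataset becomes self-consistent."""
    votes = defaultdict(Counter)
    for x, y in dataset:
        votes[x][y] += 1
    return [(x, votes[x].most_common(1)[0][0]) for x, _ in dataset]

# Toy dataset: three annotators disagreed on the same "scratch" image.
raw = [("scratch", "defect"), ("scratch", "ok"), ("scratch", "defect"),
       ("smooth", "ok"), ("smooth", "ok")]

cleaned = consistency_pass(raw)
# The model stays fixed; only the labels change between training runs.
```

Each pass leaves the model untouched and hands back a more consistent dataset for the next training run, which is exactly the inversion of the model-centric loop.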
Why Make the Shift from Model-Centric to Data-Centric?
Now that we know precisely what both terms refer to and where they differ, let's see why today's AI world requires us to make a paradigm shift from model-centric to data-centric.
First off, it's important to understand that getting consistent, high-quality data has always been a top priority for data scientists. Just as everyone wants to build their house out of the best materials available, data scientists want to build their models on the best data.
However, if we go back ten years, it's easy to see why data scientists didn't spend more time on data acquisition than on model building – they couldn't! There was often an extreme shortage of data in certain domains, and data scientists had to work with whatever they could get.
Fortunately, we have seen a massive rise in big data, especially in the past few years. Not only has this made big data tools abundantly available, but it has also made them accessible to a general audience who couldn't get their hands on such privileged, commercial tools a few years back.
Another humongous change is the rise in the number of internet-connected devices. Stats show that 90% of the data we have today was created in the last two years alone! So we have more data today than ever, and this exponential trend is only expected to continue.
As a result, data scientists can be picky, and data is no longer the limiting factor. In fact, it has become feasible for MLOps teams to run systematic processes on data to make it consistent and fit the model's requirements.
How Going Data-Centric Could Be a Game Changer – The Steel Sheets Example
To see how going data-centric can make a huge difference in the world of AI, let's quickly walk through an example Andrew Ng brought up in a recent session.
So, imagine we are building a machine learning model that detects defects in steel sheets. There are 39 different kinds of defects that our model should be able to distinguish. The model we've built is performing well, with an accuracy of 76.2%; however, the aim is to get above 90%.
The accuracy is already decent, but we want to push it past 90% – how could we achieve this? Since the number is already respectable, you can't argue that the hyperparameters of the current model are poorly tuned. Moreover, the model already uses a state-of-the-art architecture, so any improvement seems close to impossible – from the perspective of the code, at least.
However, when we take the data-centric road – making the data more consistent, cleaning out the noise, and then feeding the filtered, cleaner data to the same model – the results seem almost unreal. Hard as it is to believe, the accuracy went up by a massive 16.9 percentage points, landing at 93.1%!
So, you get the idea. Sometimes you cannot tune your hyperparameters past a certain point, and there comes a time when putting more effort into tweaking the model, again and again, becomes useless. That's where data quality helps: the more relevant the data and the less noise it contains, the better the results you can achieve.
Ensuring Data Quality – The Role of MLOps
Going data-centric is easier said than done. While it's easy to say that cleaner data with less noise and better labeling improves the performance of machine learning models, actually controlling data quality is not nearly as easy.
Since it all boils down to consistency in data labeling, MLOps plays the pivotal role here. MLOps teams are the uncaped heroes controlling all the data processing and management in a machine learning project, ensuring quality and the fulfillment of requirements. Simply put, MLOps = ModelOps + DataOps + DevOps.
There are three crucial questions that the MLOps team has to answer when ensuring high data quality with consistent labeling:
How to define the perfect data and collect it.
How to modify the data to improve the model performance.
How to track data drift.
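The third question, tracking data drift, can be illustrated with a minimal sketch. The metric below is an assumed toy signal, not a production drift detector (real systems use fuller distributional tests such as PSI or a Kolmogorov–Smirnov test): it measures how far a live feature's mean sits from the training mean, in units of the training standard deviation.

```python
import statistics

def drift_score(train_values, live_values):
    """Toy drift signal: distance of the live mean from the training
    mean, in training standard deviations. Illustrative only."""
    mu = statistics.mean(train_values)
    sigma = statistics.pstdev(train_values) or 1e-9  # guard against zero spread
    return abs(statistics.mean(live_values) - mu) / sigma

train = [1.0, 1.1, 0.9, 1.05, 0.95]   # feature values seen at training time
stable = [1.02, 0.98, 1.00]           # production window, no drift
shifted = [2.0, 2.1, 1.9]             # production window after drift
```

A threshold on this score (say, a few "sigmas") would trigger a data-quality review; the point is that drift monitoring is a continuous, automatable process, not a one-off check.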
This is pretty much the essence of the MLOps team's work reduced to a simple set of points, and only if these problems are taken care of will the model's performance improve significantly.
Good Data > Big Data
Big data has been the talk of the town for quite some time now. Companies have been using advanced tools to acquire huge volumes of data and improve their AI systems. While collecting more data can benefit AI systems, it's time the industry realized that big data doesn't always mean good data.
The whole idea of the data-centric approach revolves around the notion of good data instead of big data. Conventionally, companies preferred to have as much data as possible, even if it came with noise, on the view that the noise could be handled by model quality at the end of the day.
However, the data-centric approach urges us to look at it the other way around: good, consistent data, even in smaller volume, can be as effective as twice as much noisy data.
Another complexity in such a scenario is deciding what good data precisely means. Since the term is quite relative, it's important to define it in crystal-clear terms. So, here are some major characteristics of good data:
A sound balance of input classes.
Consistent target labels.
Incorporates feedback from production data.
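The first of these characteristics, class balance, is straightforward to check programmatically. Here is a minimal sketch; the tolerance threshold is an arbitrary illustrative choice, not a standard value:

```python
from collections import Counter

def class_balance(labels, tolerance=0.2):
    """Flag whether each class's share of the dataset is within
    `tolerance` of a uniform split. The 0.2 default is illustrative."""
    counts = Counter(labels)
    ideal = 1 / len(counts)
    total = len(labels)
    return {c: abs(n / total - ideal) <= tolerance for c, n in counts.items()}

balanced = class_balance(["ok"] * 50 + ["defect"] * 50)   # every class passes
skewed = class_balance(["ok"] * 95 + ["defect"] * 5)      # both classes flagged
```

Running checks like this on every data refresh is one concrete way an MLOps team turns the checklist above into an automated quality gate.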
Once you ensure your data covers these requirements, you don't have to chase tremendous volumes of data to make your projects effective and efficient. Again, the MLOps team has a huge part to play here: it has to continuously monitor the models and keep iteratively improving the data quality.
For most, machine learning has always been about the quality of the code we write and the way we train our models. However, there is growing evidence that high-quality data is just as important as a high-quality model in making an ML project effective.
Data-centric AI is a notion that stresses a systematic process for improving the quality of data instead of focusing on the code alone. Continuously monitoring model performance and iteratively improving data quality has been shown to significantly reduce the failure rates of ML projects.
However, following the data-centric approach isn't a piece of cake; it requires full-on collaboration with the MLOps team to ensure consistency in data quality and make the approach efficient.