In the era of foundation models, designing a good data pipeline is crucial to success, and often matters more than tweaking the model architecture itself. A well-designed pipeline that effectively prepares data for input to the model has a significant impact on the model's performance and accuracy.
However, building scalable, distributed pipelines that dynamically optimise for a target task remains a challenge.
Time to Shift Focus to the Data Pipeline
Manu Joseph, the creator of PyTorch Tabular, said, “No matter what awesome state-of-the-art model you have, you will still need a data pipeline to really use it in production.”
Collecting data from different resources is a crucial step in machine learning (ML) projects. To ensure accuracy and reliability, it is necessary to clean and pre-process the data before using it for training. However, whenever a new dataset is included, the same pre-processing steps are performed again. To avoid repeating this process every time the model is retrained or deployed in production, a data pipeline must be set up and standardised.
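As an illustration, a standardised pre-processing step can be captured once in a reusable pipeline object instead of being rewritten for every retraining run. The sketch below uses scikit-learn's Pipeline and ColumnTransformer; the column names and imputation choices are hypothetical rather than drawn from any specific project.

```python
# A minimal sketch of a reusable pre-processing pipeline with scikit-learn.
# Column names ("age", "income", "city") are hypothetical placeholders.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Fit once on training data, then reuse the same fitted transformer for
# every retraining run or in production, so the steps never drift apart.
train_df = pd.DataFrame({
    "age": [25, 32, None],
    "income": [40_000, 55_000, 61_000],
    "city": ["Pune", "Delhi", "Pune"],
})
X_train = preprocess.fit_transform(train_df)
```

Because the same fitted object is applied at training and serving time, new datasets go through identical steps without anyone re-implementing them by hand.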
Model architecture, on the other hand, defines the layers and components of the model itself. It is only one stage in the broader ML cycle that turns raw data into training datasets and, ultimately, into a system that can make decisions.
Joseph further added, “Focusing on the data pipeline is probably more important than the strategy or model itself. At the end of the day, the purpose of any machine learning model is to serve the customer or internal stakeholders by fulfilling a specific need. For the model to be useful, it needs to be deployed and work in a stable way.”
Data Pipelines Enable ChatGPT
OpenAI, for example, released GPT-3.5 Turbo, the model behind ChatGPT. But for that model to become ChatGPT, a data pipeline is essential.
For example, when a user types something into the UI, that input flows through a data pipeline to the model, which returns the output that gets displayed. In the case of ChatGPT, users can also provide feedback to improve the model's performance; this feedback loop is another data pipeline the system has set up to collect data, retrain the model, and fine-tune its performance. Without these data pipelines, GPT-3.5 Turbo is just a model that doesn't deliver any meaningful results.
Setting up a data pipeline can vary depending on the type of data being used. For example, if the data is tabular, a pandas pipeline or some other data management pipeline may be suitable for small datasets, while Spark-based processing may be more appropriate for larger datasets. In the tabular data space, a data pipeline typically involves consolidating data from different sources into a single table for training or analysis. This is often referred to as an ETL or ELT pipeline.
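As a rough illustration, the consolidation step of such a tabular ETL pipeline could look like the pandas sketch below; the file names, columns, and join key are hypothetical, and a larger dataset would use the equivalent Spark operations instead.

```python
# A minimal ETL-style sketch in pandas: extract from multiple sources,
# transform, and load into a single training table.
# File names, columns, and the join key ("customer_id") are hypothetical.
import pandas as pd

# Extract: pull raw data from different sources.
orders = pd.read_csv("orders.csv")          # e.g. a transactional system
customers = pd.read_csv("customers.csv")    # e.g. a CRM export

# Transform: clean types and aggregate before joining.
orders["order_date"] = pd.to_datetime(orders["order_date"])
spend = orders.groupby("customer_id", as_index=False)["amount"].sum()

# Load: consolidate into one table ready for training or analysis.
training_table = customers.merge(spend, on="customer_id", how="left")
training_table.to_parquet("training_table.parquet", index=False)
```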
However, the specific tools and techniques used for each type of data pipeline can vary. For example, image data and streaming data require different pipelines, and different tools may be used for each pipeline.
Don’t Get Deterred by the Challenges
Joseph explained that designing a data pipeline can be challenging, especially when dealing with large amounts of data. Data quality problems add to the difficulty: if the tables being pulled from don't meet the expected standards, columns can arrive with unexpected data types, such as integers being returned as floats. Such inconsistencies can break the pipeline and compromise data quality downstream.
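One common safeguard against this is a schema check that fails fast when a column arrives with an unexpected type. The sketch below is a minimal, hypothetical example of such a check in pandas; the expected schema and file name are illustrative only.

```python
# A minimal sketch of a schema check that flags unexpected dtype changes
# (e.g. an integer column suddenly arriving as float). The expected
# schema and input file below are hypothetical examples.
import pandas as pd

EXPECTED_DTYPES = {"customer_id": "int64", "amount": "float64", "city": "object"}

def validate_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable schema violations."""
    problems = []
    for col, expected in EXPECTED_DTYPES.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != expected:
            problems.append(f"{col}: expected {expected}, got {df[col].dtype}")
    return problems

df = pd.read_parquet("training_table.parquet")
issues = validate_schema(df)
if issues:
    # Fail fast instead of letting a broken table flow downstream.
    raise ValueError("Schema check failed: " + "; ".join(issues))
```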
Another challenge arises when serving the model, where phenomena such as data drift can make the model stale. One has to make sure the data being fed in is of high quality, so it is important to implement checks and balances that monitor the data and ensure its distribution remains consistent.
Distribution changes happen frequently in most data because the world around us keeps changing. Typically, this is handled by tracking the properties of the data over time and then working out what triggers the changes.
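One simple way to track those properties is to compare the distribution of incoming data against a reference sample kept from training time, for instance with a two-sample Kolmogorov-Smirnov test. The sketch below is illustrative; the synthetic data and the 0.05 threshold are arbitrary choices, not a recommendation.

```python
# A minimal sketch of drift monitoring: compare the distribution of an
# incoming feature against a reference (training-time) sample using a
# two-sample Kolmogorov-Smirnov test. The 0.05 threshold is illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # snapshot at training time
incoming = rng.normal(loc=0.4, scale=1.0, size=5_000)    # fresh production data

stat, p_value = ks_2samp(reference, incoming)
if p_value < 0.05:
    print(f"Possible drift detected (KS statistic={stat:.3f}, p={p_value:.4f})")
else:
    print("No significant distribution shift detected")
```

In practice, a check like this would run on a schedule over each monitored feature, with alerts routed to whoever owns the pipeline.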
“Data scientists will say that okay, this data is getting changed or this particular thing is getting changed, and then somebody will come in and replicate it. But that’s a whole ML ops cycle. The data pipeline is essential in that aspect because you need to ensure that any deviation in the data type of data can be highlighted,” said Joseph.
He maintains that it is basically about enabling strong engineering practices. For example, the stability of a data pipeline can be ensured by setting up a process: whenever you make a change, it needs to go through a change-management cycle. You raise a change request so that everything is tracked and nothing creeps in unnoticed. This provides one level of stability.