By Michael Gruner

Developing Machine Learning Projects with Data Version Control (DVC)

Updated: Apr 16

Part I: The Need for Data Version Control for Machine Learning Projects


 

"DVC is a swiss army knife for machine learning projects."

- Me, to the party guests

 

Welcome!

This is the first of a series of articles dedicated to Data Version Control (DVC) for Machine Learning projects and the Iterative ecosystem. If you have no idea what I'm talking about, then you're in the right place! Make sure you visit the full series:


Part I: The need for DVC (you are here).

Part III: Experiment tracking (coming soon!)

Part IV: Continuous Machine Learning (coming soon!)

Part V: Advanced DVC (coming soon!)


This first article is structured as follows:


  • What is Data Version Control (DVC)?

  • Lessons From the Software World

  • DevOps, MLOps and DVC

  • Beyond DVC: Iterative

  • Final Remarks


The article is written in a fairly linear fashion, but feel free to jump to the section that most interests you.


 

What is Data Version Control (DVC)?


One of the best ways to understand a tool is to first understand the problem it solves. My background is in software engineering, and I believe software development is a discipline a lot of readers may relate to. In a way, you may think of this article as a DVC primer for developers; however, I'll do my best to minimize assumptions about prior knowledge. I do assume the reader is familiar with at least:

  • Using Git to version software

  • Running stuff from the command line

  • ML programming in something like Python

The TL;DR, if you're in a hurry (or maybe just lazy, no judgment here), is that DVC is an MLOps Swiss Army knife: a version control system for machine learning projects that encourages structured development, collaboration, reproducibility and deployment.


Lessons From the Software World


Software development is a mature discipline (or at least that's what we like telling ourselves). Since its origins in the mid-20th century, it has reached a point where the available tools and methodologies allow thousands of developers around the world to work on the same project simultaneously. Take the Linux kernel, for example: since 2005 it has had over 14,000 different contributors, and it runs on more than 1.6 billion devices every day. Now, THAT'S HOW YOU SCALE A PROJECT!


Version Control Systems


Version Control Systems (VCS) allow developers to record and document progress on software projects. There are many benefits to this, the most direct one being that at any point in time you may revert your system to a previous state if, for example, a fatal flaw is introduced (it happens to the best of us). Beyond that, having an up-to-date backup of your project encourages bravery in developers, and brave developers progress faster. Does the following awful scene resonate with you?


Chaos crawling over a project's progress.


The other obvious benefit of a VCS is that it enables collaboration. A team of developers may all work on the same project without stomping on each other's changes. Going back to the Linux example, it may not come as a surprise that Linus Torvalds, the creator of the project, also authored Git to handle its development.

 

A correct software versioning methodology helps teams make well-defined, constant progress, with every team member collaborating simultaneously.

 

Sadly, when it comes to machine learning, I've seen the story repeat itself. Developers sharing unversioned Jupyter Notebooks via Dropbox, hundreds of experiments lost in chaos, and no history or ability to recover. While Git could (and should) be applied here, there's more to the problem than meets the eye.


Unlike software development, the main input to machine learning is not source code but data. And data can be heavy: an image-based dataset, for example, can measure in the terabytes. Attempting to version such a dataset with Git will simply render your repository useless; it's not the tool for the job. A similar problem occurs with the models that these machine learning scripts produce.

 

Datasets and models are way too big to upload to a Git repository.

 

The obvious go-to is to use an external storage provider (such as Google Drive, AWS S3, GCP Storage, etc…) and download the data when you bootstrap the project. This, however, loses all the versioning benefits we just discussed. If you were to add new samples to your dataset, it would be prudent to have a way to recover the previous dataset. If you find that your latest model introduced some undesirable bias, it would be responsible to have a way to revert your system to a previous state. And yes, some storage providers do offer version-like capabilities, but do you really want to couple your project to a single infrastructure stack?

 

There's no easy way to track changes in external storage.

 

This is precisely what DVC solves! DVC is a tool that builds on top of Git to provide versioning capabilities for datasets, models, images and any other artifact that would otherwise bloat a repository. It uses external storage (or local storage, if you want) to save incremental changes and ties each of them to a specific Git commit. Reverting to a different Git commit signals DVC to revert the data to that state as well.


We, essentially, have the best of both worlds!
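To make this concrete, here's a minimal sketch of the everyday workflow. The remote name, bucket URL and dataset path are made up for illustration; the commands themselves are standard DVC:

```
# One-time setup: initialize DVC inside an existing Git repository
dvc init
git commit -m "Initialize DVC"

# Point DVC at an external storage backend (the bucket URL is hypothetical)
dvc remote add -d storage s3://my-bucket/dvc-store

# Track a large dataset: the data goes into DVC's cache, while a small
# data/monsters.dvc pointer file is created for Git to version
dvc add data/monsters
git add data/monsters.dvc data/.gitignore
git commit -m "Add monsters dataset"

# Push the heavy data to the remote and the pointers to Git
dvc push
git push

# Later, travel back in time: Git restores the pointer file and
# dvc checkout restores the matching version of the data
git checkout <old-commit>
dvc checkout
```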


Image generated with the help of Stable Diffusion, taken from Iterative.


Build Systems and Reproducibility


As a project scales, build times grow quickly. The software world has taught us that shortening these times makes the difference between a project that progresses at a fast, constant and agile pace, and one that doesn't. Take the Linux kernel project again: nowadays a full build can take somewhere between 5 and 10 minutes. This may not sound like much, but imagine waiting that long every time you build a change (and I hope you're building every change). It very quickly starts to add up.


But of course, it does not take 10 minutes to compile each change you make to the Linux kernel. In a very astute fashion, the build system detects the exact files that were modified, along with those that depend on them, and rebuilds only those. This can be done within a few seconds.

A build system is a set of tools whose purpose is to build software in an efficient, reproducible and portable way. In the case of the Linux kernel, the build system is Make; in the case of OpenCV, it is CMake; in the case of GLib, it is Meson. Here's an interesting fact for you: in 2003 the author of Make was awarded the ACM Software System Award for developing such an influential piece of software, a prize that today amounts to around $35,000.

 

A proper build system speeds up development, encourages progress and improves reproducibility/portability.

 

Machine learning projects suffer from comparable problems. The difference is that the inputs are generally not source files and the outputs are not binaries, but data and models, respectively. Imagine you need to build a model that detects monsters in images. The build process may be loosely modeled as in the image below:

A machine learning pipeline to detect monsters under children's beds.


Initially, you have a bunch of images of monsters with different sizes and aspect ratios. We need to preprocess these to scale them to the size accepted by the network, and normalize them so that their pixel values have the mean and standard deviation we need. Preprocessing our extensive dataset takes around 20 minutes. Next, we can proceed to train our model; with the current hyperparameters, the training process lasts around 2 hours. Finally, we can test our model, which takes 3 minutes. If we are happy with the results, we can deploy the model and monitor it, restarting the re-training cycle once the production model starts drifting.

Does this mean every test will last 2h 23min? Of course not! Say you wanted to experiment by changing the model's learning rate. In that case, we could reuse the preprocessing we already did, saving those precious 20 minutes. In another test we decide to add some images to the test dataset. In that case we want the system to automatically understand that it doesn't need to train again; it only needs to preprocess the new samples and re-test, which can all be done in under 10 minutes! Depending on the scenario, we might be able to save processing time here and there.

On the other hand, say we treated ourselves to a new, more powerful server. In that case, we want to be able to reproduce the exact same results without having to manually run each step independently.

 

It would be very helpful to have build system-like capabilities in machine learning projects.

 

Again, that's what DVC brings to the table! Besides providing artifact versioning, DVC allows its users to build reproducible machine learning pipelines. These pipelines are backed by a powerful dependency tracking and caching mechanism, such that only what needs to be built gets built. In fact, if you ever return your system to a configuration that was previously built, DVC will recognize it and recover those previous results!
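To sketch what this looks like for the monster detector above, the stages could be registered as follows. The script names, data paths and parameter names are hypothetical; `dvc stage add` and `dvc repro` are the real commands:

```
# Each stage declares its command, dependencies (-d), parameters (-p)
# and outputs (-o/-M). DVC records all of this in a dvc.yaml file.
dvc stage add -n preprocess \
    -d src/preprocess.py -d data/monsters \
    -o data/prepared \
    python src/preprocess.py

dvc stage add -n train \
    -d src/train.py -d data/prepared \
    -p train.learning_rate,train.epochs \
    -o models/detector.pt \
    python src/train.py

dvc stage add -n test \
    -d src/test.py -d data/prepared -d models/detector.pt \
    -M metrics.json \
    python src/test.py

# Run the pipeline. On subsequent runs, DVC hashes every dependency
# and re-executes only the stages whose inputs actually changed.
dvc repro
```

Tweak the learning rate in params.yaml and `dvc repro` skips the 20-minute preprocessing; add new test images and it skips the 2-hour training. And on that shiny new server, `dvc pull` followed by `dvc repro` recovers cached results instead of recomputing them.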



Image generated by Dall-E2, taken from Iterative.


Experiment Tracking


One aspect of machine learning that is not shared by software development is the highly experimental essence of its projects. This is not to be confused with being highly iterative (which both software and ML projects are). By their very nature, machine learning projects are optimization problems that operate on high-dimensional spaces. Simply put, this means the engineer needs to perform a lot of informed guesswork to find the architecture and configuration that best model the data. We have a fancy word for this: experiments.

Experimentation can very easily run amok if it is not handled properly. You have tens of hyperparameters to tune, and you need to choose between several architectures, loss functions, optimizers, regularization techniques, etc… I guarantee that at some point you will:

  1. Lose track of which combinations you've tried before.

  2. Need a history of how your model's metrics evolved as you tweaked your parameters.

Again, DVC comes to the rescue. Since everything is properly versioned and the pipeline is well defined, experiments are very easy to launch. You just need to specify the parameters for each experiment, wait for the runs to finish and compare. DVC makes it easy to discard failed experiments and persist the configurations that actually improve your model.
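A minimal sketch of that loop, assuming the pipeline above exposes a train.learning_rate parameter in params.yaml (the parameter and experiment names are placeholders; the `dvc exp` subcommands are real):

```
# Launch experiments, each overriding a parameter on the fly
dvc exp run --set-param train.learning_rate=0.01
dvc exp run --set-param train.learning_rate=0.001

# Compare parameters and metrics across experiments in a single table
dvc exp show

# Persist the winner into the workspace as regular Git-visible changes
dvc exp apply exp-c4f12    # DVC generates its own experiment names

# Discard an experiment that didn't pan out
dvc exp remove exp-9a3b7
```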

 

Experiment tracking allows you to follow your model's progress in a visual, structured way.

 

Image generated by Dall-E2, taken from Iterative.


DevOps, MLOps and DVC


The points discussed above are a few concrete examples of how adopting agile, automation and continuous integration/delivery practices can help us implement responsive and robust production-ready systems. If you come from the software world you may recognize these practices as DevOps. This methodology maps the lifecycle of a product so that your company can deliver software releases to your customers quickly and constantly, based on real feedback and well-known best practices.

As we've seen in the previous sections, the concepts behind these best practices can be extrapolated to the machine learning world. Unsurprisingly, the resulting methodology is known as MLOps. It aims to map the lifecycle of models in production so that they can be developed, deployed and maintained reliably, constantly and efficiently. Just as with DevOps, the important bits are the concepts behind the practices, but you'll find that there are tools designed specifically for MLOps which will make your project run significantly more smoothly.



Image taken from Wikipedia's MLOps article.


DVC is a tool designed to help you implement MLOps in your project. Not only does it provide artifact versioning capabilities for robust and collaborative machine learning development; it also provides mechanisms to define reproducible and efficient builds and, with those, the ability to reliably track and share machine learning experiments.

 

DVC is a Swiss Army knife for MLOps.

 
 

Beyond DVC: Iterative


DVC is just one piece of the full MLOps spectrum. Iterative, the company behind DVC, aims to provide solutions for the complete MLOps pipeline. The diagram below shows the Iterative ecosystem:


Iterative ecosystem diagram.


We can coarsely divide the MLOps pipeline into three stages; below are the tools Iterative provides for each of them:


Development


This phase is all about creating a model that is capable of generalizing according to our data and needs. This process is highly iterative and experimental.

  1. DVC: The protagonist of this article. An essential development tool that enables versioning capabilities, experiment tracking and reproducible builds.

  2. Studio: A cloud front end for DVC. A dashboard that allows you to visualize your model's progress, plots and metrics in a single place.

  3. VS Code Extension: An extension for VS Code that exposes DVC capabilities in the IDE.

Infrastructure


In the software world, we would call this GitOps. Every production-ready system requires infrastructure to run on. In the machine learning world these resources may be GPU-equipped training servers, cloud storage, model serving machines, etc… GitOps is the practice of provisioning infrastructure from versioned configuration files in a Git repository. This encourages not only reproducibility, but also scalability, flexibility and continuous improvement.

  1. GTO: Converts a Git repository into a model registry (or a registry for any other artifact, really). The different artifacts can be assigned a version number and a stage: production, staging, development, etc… In line with GitOps, GTO keeps this registry in a configuration file that may be versioned, and provides a command line tool to interact with it (see the sketch after this list).

  2. CML: Continuous Integration and Deployment for machine learning. CI/CD is an essential practice in DevOps and MLOps, where automation is key. CML integrates with all the major repository hosting services (GitHub, GitLab, Bitbucket, etc…) so that different Git events (commits, merges, tags, etc…) trigger a machine learning workflow. For example, if one of my experiments was successful and effectively improved the F1-score of a model, I could push my branch and have CML create a report with the new metrics and plots for my colleagues. If I'm ready to push a new model to production, CML can listen for that event and automatically deploy the new version to the model serving machine.
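As a taste of GTO, here is a minimal sketch. The artifact name and version are made up, and flags may vary between GTO releases, so treat the exact invocations as an assumption; `gto register`, `gto assign` and `gto show` are the documented subcommands:

```
# Register a new version of an artifact at the current commit.
# Under the hood, GTO encodes this event as an annotated Git tag.
gto register monster-detector --version v1.0.0

# Promote that version to a stage (again, just a Git tag)
gto assign monster-detector --version v1.0.0 --stage prod

# Inspect the registry state derived from the tags
gto show
```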

Deployment


The production-ready model needs to be served somewhere. This may be a server in the cloud, an embedded computer inside a robot, or a new version of a mobile app in the store. Either way, a proper deployment infrastructure is essential to MLOps.

  1. MLEM: A standard interface to store and serve machine learning models. Regardless of the framework you use (PyTorch, ONNX, TensorFlow, etc…), MLEM provides a unified way to store your model along with the relevant metadata, such as dependencies, versions, inputs, outputs and required pre/post-processing. This allows all your models to speak a common language and be served with the same tooling.

Once the model is deployed, it needs to be continuously monitored and improved. It's normal for machine learning solutions to drift. An MLOps pipeline is a robust framework for this continuous improvement.


 

Final Remarks


There are a lot of machine learning tools available, but DVC and its friends fill a gap that machine learning projects have long had. If you really want to boost the development of your models and fast-track the scaling of your AI product, I highly recommend giving MLOps a try. DVC and Iterative are a great way to do so without coupling your entire environment to proprietary clouds such as AWS, GCP or Azure. Check them out; they have great documentation and a very welcoming community.


Oh, and best of all, these tools are all open source!


Fulfilled owl, image taken from Iterative.

