Part II: Hello DVC World: A Data Version Control Tutorial
"DVC is a swiss army knife for machine learning projects."
- Me, to the party guests
Welcome!
This is the second of a series of articles dedicated to DVC and the Iterative ecosystem. If you have no idea what I'm talking about, then you're in the right place! Make sure you visit the full series:
Part II: Hello DVC world (you are here!)
Part III: Experiment tracking (coming soon!)
Part IV: Continuous Machine Learning (coming soon!)
Part V: Advanced DVC (coming soon!)
This post provides a hands-on tutorial on how to use DVC to manage and version reproducible machine learning projects. For that, we will build a skin disease classification model step by step¹. We'll start by defining the machine learning pipeline and go all the way to mapping it to DVC and improving the model performance systematically. By the end of this read, you should get the feel of how to develop machine learning projects using DVC and, hopefully, start your new project with it!
¹ The resulting model is for academic purposes only and must never, under any circumstance, be used as medical guidance.
While not strictly necessary, I recommend you start by reading the first part of this series, The Need for DVC, as it justifies the concepts applied in this walkthrough. Regardless, I hope you enjoy the read!
Skin Disease Classification ML Pipeline
A data processing pipeline (which ML pipelines are) refers to a series of data processing stages. Data moves through the pipeline from its raw form, undergoing transformations in each step until the desired outcome is obtained. To better explain this concept, let's build a skin disease classification model as shown in the following figure:
The development of the model can be expressed in the form of the following simple ML Pipeline. We'll be using it throughout the document.
This diagram clearly shows the data being transformed in several stages. Specifically:
Skin features: Not really a processing stage. This block represents our dataset which, in our case, is a CSV file containing the skin features collected by a dermatologist.
Preprocess: In this stage you'd typically clean up your data and adjust it to be consumed by your training script. In our system, we are going to fill missing values in our tabular data and scale the values in some columns. We're also going to split it into train, validation and test sets.
Train: Once your data is ready, we start the training process. The output of this stage is a new model and, typically, a set of learning curves and resulting model performance metrics. In our system we are going to use gradient boosting via XGBoost².
Test: Finally, you want to evaluate your system against unseen data. In this stage you use the recently trained model and evaluate its prediction capabilities. Did our model simply memorize the training data or is it capable of generalizing?
² Gradient boosting is a machine learning technique that gives a prediction model in the form of an ensemble of weaker prediction models. When these predictors are decision trees, the algorithm is called gradient-boosted trees. XGBoost is a powerful framework to train and run gradient-boosted trees.
Inputs and Outputs
Our simple ML Pipeline looks very linear. It would be a mistake, however, to assume that each stage consists only of a single input and output for the data to flow. In practice, each stage not only receives data from the previous one, but also needs parameters from the user to operate. Similarly, it may also produce additional outputs such as metrics and plots. A more realistic pipeline would look like:
The following sections delve into more detail about the different inputs and outputs that are common to encounter in ML pipelines.
Inputs
An input of an ML pipeline stage is whatever it requires to operate and produce its outputs. For example, let's focus on the preprocess stage: its purpose is to prepare data for training our model. In our specific use case, as said above, we want to fill missing values in our dataset and scale the values in some columns. In this case, defining the input is fairly straightforward: we receive raw data and produce curated data splits.
The training step is not that trivial. In the simplest case it receives curated data (from the preprocess stage) and produces a trained model. In reality, this model will likely have hyperparameters you want to tune in order to find the best fit you can. You are developing a model after all! In our specific example, we can tune our gradient boosting algorithm by tweaking:
Learning rate (eta): step size shrinkage after each boosting stage.
Max depth: maximum depth that a tree can achieve during training.
Boost rounds: the number of weak learners to iteratively add to the ensemble.
All these parameters can be used as regularization methods. This means that using an adequate balance can favor generalization and improve learning speed, while poor choices may end up in overfitting.
The following figure shows our ML pipeline with the parameters specified throughout the steps. Besides the training parameters, we've added a threshold to the test stage, which will filter out inconclusive predictions.
It may be tricky to identify the different inputs in your ML pipeline. Here's my rule of thumb:
If a step needs to be re-executed after changing "something", that "something" is an input to that step.
For example, if I change the learning rate, I expect the training to occur again. However, if I change the threshold, it is not necessary to train again, just test.
To summarize: inputs can be anything, but you'll typically find them in the form of:
Files
Directories
Source code
Parameters
It might surprise you to see source files in the list above. However, following the rule of thumb above, if you modify the training script (to fix a bug, for example) you want to retrain the model using the corrected code.
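The rule of thumb can be sketched in a few lines of Python. This is only an illustration of the idea, not how DVC is implemented: the stage_fingerprint and needs_rerun helpers, the file contents and the parameter values are all made up for the example.

```python
import hashlib
import json


def stage_fingerprint(files: dict[str, bytes], params: dict[str, object]) -> str:
    """Hash everything a stage declares as an input into a single digest."""
    digest = hashlib.sha256()
    # Sort the names so the digest does not depend on declaration order.
    for name in sorted(files):
        digest.update(name.encode())
        digest.update(hashlib.sha256(files[name]).digest())
    # Serialize the parameters deterministically before hashing them.
    digest.update(json.dumps(params, sort_keys=True).encode())
    return digest.hexdigest()


def needs_rerun(previous: str, files: dict[str, bytes], params: dict[str, object]) -> bool:
    """A stage must run again when the fingerprint of its inputs changed."""
    return stage_fingerprint(files, params) != previous


# Hypothetical inputs for the train and test stages (files, source code, parameters).
train_files = {"src/train.py": b"print('train')", "dataset/train.csv": b"a,b\n1,2\n"}
train_params = {"learning_rate": 0.3, "max_depth": 3}
test_files = {"src/test.py": b"print('test')", "dataset/test.csv": b"a,b\n3,4\n"}
test_params = {"threshold": 0.5}

train_before = stage_fingerprint(train_files, train_params)
test_before = stage_fingerprint(test_files, test_params)

# Changing the threshold only invalidates the test stage.
test_params["threshold"] = 0.3
print(needs_rerun(train_before, train_files, train_params))  # False
print(needs_rerun(test_before, test_files, test_params))     # True
```

Note how the source file is hashed just like the data: editing src/train.py would change the train fingerprint in exactly the same way as editing the dataset.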
Outputs
Outputs are everything that an ML pipeline step produces as part of its processing. They can be further used as inputs for downstream stages, or simply be end-of-line artifacts.
Again, the training step is not that simple. While at first glance it may seem that the stage produces a trained model from the curated dataset and the hyperparameters, in reality there are other valuable outputs to this step:
Learning curves: a progression of the training and validation error through the boosting iterations.
Metrics: accuracy, precision, recall, F1, and other typical classification quality metrics.
All this information is tightly coupled to the current system configuration and is essential for model development.
You can't improve what you can't measure.
The following figure shows the ML pipeline, but now with the outputs we've chosen also specified.
Notice how we've also added some outputs to the test step: a confusion matrix to visualize per class performance and a set of classification metrics. These will help us evaluate the quality of our model against unseen data.
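To make these outputs concrete, here is a minimal sketch of how a confusion matrix, the accuracy and a macro-averaged F1 score can be derived from predictions. The labels are toy values invented for the example, and the helper names are hypothetical; a real script would more likely rely on a library such as scikit-learn.

```python
def confusion_matrix(truth, predicted, n_classes):
    """Rows are the true classes, columns are the predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(truth, predicted):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    total = sum(map(sum, matrix))
    return sum(matrix[k][k] for k in range(len(matrix))) / total


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fp = sum(row[k] for row in matrix) - tp  # predicted k, but wrong
        fn = sum(matrix[k]) - tp                 # true k, but missed
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy labels for three classes, purely illustrative.
truth = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(truth, predicted, n_classes=3)
print(cm)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(round(accuracy(cm), 3), round(macro_f1(cm), 3))
```

The matrix shows per-class behavior (class 0 is confused with class 1 once, for instance), while the two numbers summarize it in a form that is easy to compare between runs.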
It is more intuitive to identify stage outputs but, for completeness, here's my rule of thumb:
If you remove "something" and you must run the stage to re-generate it, then that "something" is an output of that stage.
To summarize: outputs can be anything, but you'll find them in the form of:
Files
Directories
Models
Metric files
Plots
Images
Note that your scripts may generate other temporary artifacts or stuff that is really not important for the model development per se. I usually don't count those as outputs.
The DVC Pipeline
It's time to see how our ML Pipeline is represented in code. For that, we will use DVC. First, let's see the bird's eye view of such a project and then we'll build it step by step. The following figure shows our skin disease classification system implemented as a DVC pipeline.
Similarly, the following listing shows one possible project structure for our DVC pipeline above.
The last section of this article will guide you through the process of implementing this project step by step. Let's first review what these files represent.
DVC Files
dvc.yaml
The main file in every DVC project is dvc.yaml. This configuration file holds the definition of our machine learning pipeline. In its most basic form, this file looks like:
The stages object holds all the pipeline steps. In the example above, the stages are: stage1 and stage2. Each stage is formed by cmd and optionally: desc, outs, deps, plots and metrics.
cmd: The command to execute in order to perform the pipeline step. This is typically a script or an executable that lives within your project (a Python script, for example). The ${stage1.param1} syntax reads the parameter from the params.yaml file. More on that later.
desc: A description of the stage. This is only for display purposes.
deps: The inputs to the stage. According to our rule of thumb above, if anything in this list changes, the step needs to be executed again. Note how the script used in the command is typically a dependency.
plots: Plot outputs for the stage. These will be regenerated if the step runs again. More on plots later.
metrics: Metric files for the stage. These will be regenerated if the step runs again. More on metrics later.
outs: Other outputs of the stage (models, modified data, etc…). These will be regenerated if the step runs again.
With the description above, our skin disease classification pipeline would look something like:
Take a moment to correlate our diagram with the textual description of the pipeline. There is an exact 1:1 correspondence between them. Make sure it is completely clear:
What the inputs and outputs of our three stages are.
How the outputs of one stage are the inputs of the next one.
Params are not marked as dependencies explicitly, but are used in the command.
Because preprocess produces dataset/train.csv and dataset/val.csv and train consumes them, train depends on preprocess.
Because train produces models/model.bin and test consumes it, test depends on train.
Steps that don't depend on each other could be run in parallel!
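The way these dependencies fall out of the declared inputs and outputs can be sketched as follows. The stage table below is a hypothetical Python mirror of the relationships just described (it is not parsed from a real dvc.yaml), and the ordering uses the standard library rather than anything from DVC.

```python
from graphlib import TopologicalSorter

# Hypothetical stage table: each stage lists what it consumes and what it produces.
stages = {
    "preprocess": {
        "deps": ["dataset/features.csv"],
        "outs": ["dataset/train.csv", "dataset/val.csv", "dataset/test.csv"],
    },
    "train": {
        "deps": ["dataset/train.csv", "dataset/val.csv"],
        "outs": ["models/model.bin"],
    },
    "test": {
        "deps": ["dataset/test.csv", "models/model.bin"],
        "outs": [],
    },
}


def stage_graph(stages):
    """Map every stage to the set of stages that produce one of its inputs."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    return {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }


graph = stage_graph(stages)
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['preprocess', 'train', 'test']
```

No stage names another stage explicitly: the graph emerges purely from matching one stage's outputs against another stage's inputs.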
The YAML file format is extremely sensitive to indentation problems. Most DVC execution complaints are related to YAML formatting issues. Validate your file!
params.yaml
The params.yaml file holds the different parameters used by each pipeline stage. Once defined in this file, they can be used in the dvc.yaml as seen above. The parameters are typically structured in a hierarchical fashion. For example, the parameters for our skin disease classification pipeline would look like:
And, the test cmd in the dvc.yaml (as seen above) would, for example, use them as:
test:
  cmd: src/test.py ${test.threshold}
Again, this is a YAML file so be very careful about the indentation!
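On the script side, a value interpolated into cmd simply arrives as a command-line argument. A minimal sketch of how a script such as src/test.py might receive it is shown below; the parse_args helper and the help text are assumptions for illustration, not the project's actual code.

```python
import argparse


def parse_args(argv=None):
    """Parse the positional threshold that the stage command passes in."""
    parser = argparse.ArgumentParser(description="Evaluate the trained model")
    parser.add_argument(
        "threshold",
        type=float,
        help="minimum confidence required to accept a prediction",
    )
    # argv=None makes argparse read sys.argv, which is what a real script wants.
    return parser.parse_args(argv)


# Simulate the command "src/test.py 0.5" without touching sys.argv.
args = parse_args(["0.5"])
print(args.threshold)  # 0.5
```

The script itself knows nothing about params.yaml, which keeps it usable outside of DVC as well.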
An astute reader could ask themselves, why have parameters in a different file instead of hardcoding them in the scripts or the dvc.yaml? That is certainly a good question! By decoupling the parameters from the implementation we achieve a much more flexible design where we could experiment with different hyperparameters and even find the best configuration automatically!
A word of caution: don't add unnecessary complexity to your project by over-exposing parameters to DVC. Most of the time you won't tweak them and they'll only pollute your reports. Keep your development under control by exposing the ones you are most likely to tune.
You can expose new hyperparameters at any point in time without problems!
dvc.lock, .dvc
The dvc.lock file and .dvc directory are automatically generated and maintained by DVC. You do not need to create or modify these files. Their purpose is to hold information about the project's state and ensure proper execution.
The dvc.lock and .dvc are generated and maintained by DVC. Do not create or modify them manually!
To Commit or Not to Commit, That's the Question
Before we continue exploring the rest of the files that make up our pipeline, it's important to clarify whether (and where) these are kept under version control. The short answer is yes, all these files are meant to be versioned, even if they are generated by running the pipeline, like plots, models and metrics. This may be hard to grasp for those of us with a software background, where tracking build outputs is discouraged. However, the project should be understood differently:
The primary purpose of the repository is to represent the model and its current state, not the source code in it.
Sure, the source code is fundamental to build the model, but when someone clones your repository they are really downloading the model in its current state.
Now, the question that remains to be answered is where we version each file. Both DVC and Git are versioning mechanisms. Here is my rule of thumb:
Text files (YAML, JSON, CSV, etc…) are versioned using Git.
Binary files (PNG, JPEG, models, etc…) are versioned using DVC.
The reason is simple: since Git tracks changes, it can work very efficiently with text files but not so much with binary files. Git LFS may solve this problem in a way, but is not as flexible as DVC which is designed specifically for machine learning projects. By default, DVC will automatically track all the outputs, plots and metrics. In the following sections I'll show you how to change this default behavior.
Metrics and Plots
Metrics and plots are interesting files to talk about. If handled properly, DVC can generate reports from them. In this section we will talk briefly about them.
Metrics
Metric files contain information about the current state of the project. This information is typically numerical and is used as a summary to track the improvement of the model over time. For example, it can be the model's F1 score, precision, recall, accuracy, etc…, but can also be the number of instances in each class of your dataset, the throughput of your inference or, in general, anything you consider useful as a baseline to improve your model.
Metrics are defined as special stage outputs in the dvc.yaml file, as seen before. For example, the following snippet defines that our training stage outputs a metric file:
stages:
  …
  train:
    desc: "Train the XGB model"
    cmd: src/train.py ${train.boost_rounds} ${train.learning_rate} ${train.max_depth}
    …
    metrics:
      - metrics/train_metrics.json
  …
Metric files can be placed anywhere, but the metrics directory is standard. These, being text files, can be tracked by either DVC or Git. I personally like tracking them using Git. The following snippet shows how to tell DVC that you don't want it tracking this file, since you're going to track it with Git.
stages:
  …
  train:
    desc: "Train the XGB model"
    cmd: src/train.py ${train.boost_rounds} ${train.learning_rate} ${train.max_depth}
    …
    metrics:
      - metrics/train_metrics.json:
          cache: false
  …
Of course, you'll need to add it to the repository and commit the file.
Use cache: false in your plots, metrics or outs to indicate you don't want DVC tracking them
At the time of this writing, metrics are stored in a hierarchical fashion and any of the following formats are supported:
JSON
TOML 1.0
YAML 1.2
For example, the following files would be equivalent for DVC:
metrics/train_metrics.yaml:

train:
  mIoU: 89.5
  mAcc: 94.52

metrics/train_metrics.json:

{ "train": { "mIoU": 89.5, "mAcc": 94.52 } }
Using these files, DVC can generate convenient improvement summary reports. We'll see that in action in the upcoming sections.
Plots
Plots, similar to metrics, provide insightful information about the model's performance. They display information about the current state in a graphical manner. For example, plots can be confusion matrices, ROC curves, learning curves, GPU utilization, etc… DVC does not impose any restriction about the information you wish to plot.
DVC supports two different plot formats: raster and Vega-Lite. While both serve the same purpose, my experience is that the latter provides a nicer experience when generating reports. If possible, I recommend using it.
Raster Plots
These refer to PNG, JPEG, GIF or any other image format files. These are binary files and, hence, should be tracked by DVC and not Git. The syntax to define a raster plot output in the dvc.yaml is simply:
stages:
  …
  train:
    desc: "Train the XGB model"
    cmd: src/train.py ${train.boost_rounds} ${train.learning_rate} ${train.max_depth}
    …
    plots:
      - plots/learning_curve.png
  …
Vega-Lite Plots
Vega-Lite is a specification and tool for creating interactive plots from declarative JSON files. Instead of saving a graph in a raster format like PNG, DVC allows you to generate a plot from information in text files. At the time of this writing, the following formats are supported:
JSON
YAML 1.2
CSV
TSV
For example, these two files would be equivalent for DVC:
plots/train.json:

{ "train": [ { "step" : 10, "acc_seg" : 0.60 }, { "step" : 20, "acc_seg" : 0.56 }, … ] }

plots/train.csv:

step,acc_seg
10,0.60
20,0.56
…
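To see why the two layouts carry the same information, here is a small sketch that turns the JSON form into the CSV form. The json_points_to_csv helper is hypothetical and only covers this flat list-of-points shape; it says nothing about how DVC parses these files internally.

```python
import csv
import io
import json

# The JSON layout from the example, truncated to its first two points.
json_text = '{"train": [{"step": 10, "acc_seg": 0.60}, {"step": 20, "acc_seg": 0.56}]}'


def json_points_to_csv(text, key):
    """Render the list of points stored under `key` as CSV text."""
    points = json.loads(text)[key]
    buffer = io.StringIO()
    # The keys of the first point become the CSV header.
    writer = csv.DictWriter(buffer, fieldnames=list(points[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(points)
    return buffer.getvalue()


csv_text = json_points_to_csv(json_text, "train")
print(csv_text)
```

Each JSON object becomes one CSV row, and the object keys become the column names that a plot definition can refer to.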
Internally DVC will parse this data and generate a nice plot like the following:
Being text files, I prefer to track them using Git instead of DVC. As such, the way you define the plot output in the dvc.yaml file is:
stages:
  …
  train:
    desc: "Train the XGB model"
    cmd: src/train.py ${train.boost_rounds} ${train.learning_rate} ${train.max_depth}
    …
    plots:
      - plots/learning_curve.csv:
          x: step
          y: loss_ce
          title: Train Loss per Epoch
          x_label: Epoch
          y_label: Cross Entropy
          template: linear
          cache: false
  …
Note the cache: false, which tells DVC that we are tracking this file with Git! Most of the fields are fairly self-explanatory, but you can check the specification for more details. The "step" and "loss_ce" names are columns from the plots/learning_curve.csv file.
The template field indicates the type of plot to generate. At the time of this writing, the following templates are supported: linear, simple, scatter, smooth, confusion, confusion_normalized, bar_horizontal and bar_horizontal_sorted. In my experience, linear, confusion and bar_horizontal are the ones I use most.
Finally, there is a way to monitor the progress of plots in real-time. This technique is out of the scope of this tutorial, but you can check it out by yourself by reading about DVCLive.
Other Artifacts
Besides plots, metrics, and params, DVC allows you to define any input/output. Let's take the model file, models/model.bin, as an example. As said before, you need to indicate that the model is an output of the training stage and an input to the test stage. The way to do this is:
stages:
  …
  train:
    desc: "Train the XGB model"
    cmd: src/train.py ${train.boost_rounds} ${train.learning_rate} ${train.max_depth}
    …
    outs:
      - models/model.bin
    …
  test:
    desc: "Test the trained XGB model"
    cmd: src/test.py ${test.threshold}
    deps:
      - models/model.bin
      - src/test.py
    …
If defined correctly, DVC will generate a dependency graph and re-build the model efficiently.
Hands-on DVC Tutorial
It's finally time to get our hands dirty and build the actual project. Open a session in your favorite terminal and follow along!
All the code developed in this tutorial can be found at GitHub
The reports generated by this tutorial can be found at Iterative Studio
The Basics
Start by initializing a Git repository in an empty directory.
Now, initialize the DVC project in this repository by running dvc init.
This has created a set of configuration files which are ready to be committed to Git.
Now let's do ourselves a favor and install the Git hooks. For this, we run dvc install. These will make sure we don't forget any important DVC actions after interacting with Git. For example, if we switch to another Git branch, we want to perform the switch also with DVC. The hooks will perform this switch automatically.
There's nothing to commit here. You'll need to run this with every new clone of your project. We'll talk about how to clone a Git+DVC repository in the last section.
The Dataset
Let's start building the pipeline. Create a directory for the dataset and download it by running:
Now let's tell DVC that we want it to track the dataset. It is true that, in this example, the dataset is a text file and we could've tracked it with Git as well. However, to get the full DVC experience, let's not do that.
Let's understand what just happened. First, DVC automatically created two files:
dataset/features.csv.dvc
dataset/.gitignore
Feel free to explore the contents of these files. The first one serves as a "handle" to the actual file. It contains the path, size and current checksum. This way DVC will detect changes to the dataset and react accordingly. The .gitignore simply tells Git to ignore the dataset, since it's tracked by DVC. You need to add and commit these two files:
Congrats, you just started tracking the dataset! DVC will now detect changes to it.
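To build some intuition for how a small "handle" file is enough to detect changes, here is a minimal sketch that records the path, size and an MD5 checksum of a file. The describe_file helper and the temporary file are made up for the example, and the digest DVC actually stores may be computed differently.

```python
import hashlib
import os
import tempfile


def describe_file(path, chunk_size=1 << 20):
    """Collect the path, size and MD5 checksum of a file."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return {
        "path": os.path.basename(path),
        "size": os.path.getsize(path),
        "md5": md5.hexdigest(),
    }


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "features.csv")
    with open(target, "w", newline="") as handle:
        handle.write("a,b\n1,2\n")
    before = describe_file(target)

    # Appending a single row is enough to change both the size and the checksum.
    with open(target, "a", newline="") as handle:
        handle.write("3,4\n")
    after = describe_file(target)

print(before["size"], after["size"])   # 8 12
print(before["md5"] != after["md5"])   # True
```

Because only this small description needs to live in Git, the dataset itself can be arbitrarily large without bloating the repository.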
The Pipeline: Dataset Sanitation
Nice! We are ready to define our first pipeline stage. This stage is the dataset sanitation. We are going to perform two fixes:
The 33rd column (patient age) is missing data. Replace the age with a 0 if an age is provided or 1 otherwise.
Decrease the disease class by 1 (so that it is zero based).
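The two fixes can be sketched on plain rows as shown below. This is not the tutorial's src/preprocess.py: it follows the wording of the list above literally, and it assumes that missing ages are marked with "?" and that the class is the last column. The sanitize_row helper and the toy rows are hypothetical.

```python
def sanitize_row(row, age_index, class_index=-1, missing_marker="?"):
    """Apply both fixes to one row of string values and return a new row."""
    fixed = list(row)
    # Fix 1 (literal reading): 0 if an age is provided, 1 otherwise.
    age = fixed[age_index].strip()
    fixed[age_index] = "1" if age in ("", missing_marker) else "0"
    # Fix 2: shift the disease class so that it is zero based.
    fixed[class_index] = str(int(fixed[class_index]) - 1)
    return fixed


# Toy rows with three columns: a feature, the age and the disease class.
rows = [["2", "55", "1"], ["3", "?", "6"]]
sanitized = [sanitize_row(row, age_index=1) for row in rows]
print(sanitized)  # [['2', '0', '0'], ['3', '1', '5']]
```

In the real dataset the age column index would be passed in instead of 1, and the rows would be read from and written back to CSV files.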
The result will be written into sanitized.csv. Create a src directory and create a src/preprocess.py file with the following contents inside. Also, let's make sure the pandas module is installed.
Now run the following to create the first stage:
Take a moment to analyze the command above. You can always run dvc stage add --help to find more details. This should've created the dvc.yaml file with the following contents in it:
Note that the files in outs are not required to exist initially. They will be generated by the command in cmd.
You can use the DVC command to create the stage, or you can simply edit the dvc.yaml file manually. I, personally, create the file by hand most of the time.
Now, as the message suggests, we need to add the newly created files to the Git repository. Don't forget to add the script you just created as well!
Before committing the changes, let's run the pipeline!
Again, take a moment to understand the output above. Inspect the dataset directory: you should see a dataset/sanitized.csv file! Add the remainder of the files to Git and commit your changes.
Go ahead and try to run the pipeline again:
See what just happened? DVC recognized that none of the dependencies changed, so it didn't re-run the stage.
Before moving to the next stage it is important to mention that the new artifacts, dataset/{train,val,test}.csv, are already being tracked by DVC automatically. We did not need to dvc add them manually as we did with the original dataset. This is because these artifacts are outputs of the preprocess stage. We also didn't need to run dvc commit because dvc repro does this automatically.
The Pipeline: Model Training
Time to train our model! I suggest you try to add the stage by yourself or, if you are just here for the show, continue reading!
Create a src/train.py script and fill it with the following contents. As usual, let's extend the pipeline stage for the training process. Modify your dvc.yaml so that it looks like the following:
Does this look similar to what you had in mind? Great! Take a moment to explore the script. Worth mentioning, this snippet creates the metrics file:
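# Excerpt from src/train.py: pred and val_Y hold the validation predictions and labels.
metrics = {
    "train" : {
        'f1' : f1_score(val_Y, pred, average='macro'),
        'accuracy' : np.sum(pred == val_Y) / val_Y.shape[0]
    }
}
# Write the metrics file that the train stage declares in dvc.yaml.
with open('metrics/train_metrics.json', 'w') as outfile:
    outfile.write(json.dumps(metrics, indent=2) + '\n')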
and this other one writes the plot file:
# Write one row per boosting round so DVC can render the learning curve.
with open('plots/learning_curve.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['round', 'mlogloss'])
    writer.writerows(enumerate(results['val']['mlogloss']))
That's right! Plain old Python.
We are using parameters for the first time. Create a params.yaml with the following contents:
We're almost done with the training stage. Lastly, make sure the models, metrics and plots directories exist, and that we have all dependencies installed:
And that should be it! Let's run the pipeline so far:
Congrats! You now have an ML pipeline capable of training a skin disease classifier model. Three interesting things to notice here:
If you didn't touch anything related to the preprocess stage, DVC should've skipped it, like in the box above. If it was different for you, don't worry! Maybe you slightly modified the stage in the dvc.yaml, forcing it to run again.
The stage ran successfully and produced the desired model, metrics and plot!
If you ran dvc repro again, DVC would be smart about it and understand that nothing needs to be done (since nothing has changed since our last repro).
Now, let's not forget to add everything to Git! Remember, we asked DVC not to track the metrics and plots, so we need to add those to the Git repo. The model, on the other hand, is going to be automatically tracked by DVC (we want it that way because it's likely a binary file).
Reporting
We've been responsible ML model developers and DVC will reward us for that. Let's print the metric summary. Go ahead and type the following in your terminal:
Look at that nice summary! Our model is pretty good already, but let's not jump to conclusions until we see results using the test set in the next sections.
Similarly, run the following to generate the plot from our plots/learning_curve.csv file.
The command above has taken all of our plot files, and consolidated them into a single HTML file. Go ahead and open it with your favorite browser. It should display something like the following:
By now, I hope you are starting to get the gist of it. This plot is just a simple example; you're free to render whatever you find useful for your model development. The important part is: we now have a baseline to compare to and improve our model upon!
The Pipeline: Testing
Finally, we are ready to test our model with unseen data. Now is your time to shine! Try to implement it yourself. As a refresher, these are the requirements of the testing stage:
The test script src/test.py should have these contents in it.
Depends on the test split.
Produces a confusion matrix plot and a metrics file.
Receives a threshold parameter from the user.
Spoiler alert! Solution below...
The dvc.yaml file should now look something like:
Similarly, the params.yaml should now look something like:
Now go ahead and perform the dvc repro with the new stage!
Again, note how DVC only processed what it needed to. Time to check the metrics and the plots!
Finally, don't forget to commit all the changes to Git as well! This includes the script you just created and the metrics and plot files (which we explicitly asked DVC not to track by using cache: false).
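For readers wondering how a threshold can filter out inconclusive predictions, here is a minimal sketch. It assumes the model outputs one probability per class and that a prediction is kept only when its top probability reaches the threshold; the confident_predictions helper, the use of None for rejected samples and the probabilities are all assumptions for illustration, not the tutorial's src/test.py.

```python
def confident_predictions(probabilities, threshold):
    """Return the most likely class per sample, or None when it is not confident enough."""
    predictions = []
    for row in probabilities:
        # Index of the largest probability in this row.
        best = max(range(len(row)), key=row.__getitem__)
        predictions.append(best if row[best] >= threshold else None)
    return predictions


# Made-up class probabilities for two samples and three classes.
probs = [
    [0.70, 0.20, 0.10],  # clear winner
    [0.40, 0.35, 0.25],  # ambiguous
]

print(confident_predictions(probs, threshold=0.5))  # [0, None]
print(confident_predictions(probs, threshold=0.3))  # [0, 0]
```

Since the threshold is only applied to the model's outputs, changing it affects the test stage and nothing upstream.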
Improving the Model
The time has come to start improving the model. This, in my opinion, is where DVC shines the most!
DVC allows you to improve your model in a structured, reproducible way. At any point you can revert to a previous version.
Let's start by focusing on the training metrics:
Our data is simple enough, and gradient boosting so powerful, that we should be able to improve those scores. Let's do an experiment by increasing the model's max_depth.
Let's create a Git branch for this experiment. If it turns out unsuccessful, we can just discard it. It's not necessary, but I recommend it.
Go ahead and increase the max_depth to 6 and reduce the learning_rate to 0.1, for example. Since our params.yaml is already committed to our Git repo, we can always check the changes we've made:
Now run the pipeline again. Question for you: which stages do you expect to be executed again?
And now a neat trick: compare the results of your current experiments with your baseline.
In the summary above, HEAD refers to your baseline (whatever you have in Git at the time), workspace refers to the experiment you are running and Change refers to the variation the metric had from running your experiment. We can see that our model did improve on the validation set. We did not see any improvement on the test set, though. Regardless, I think this improvement is worth keeping.
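Conceptually, such a comparison boils down to flattening two hierarchical metric files and subtracting them. The sketch below illustrates the idea with hypothetical helpers and made-up numbers; it is not DVC's implementation and the values are not results from this project.

```python
def flatten(metrics, prefix=""):
    """Turn nested metrics such as {"train": {"f1": 0.9}} into {"train.f1": 0.9}."""
    flat = {}
    for key, value in metrics.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def metrics_diff(head, workspace):
    """Compare the committed baseline against the current workspace."""
    old, new = flatten(head), flatten(workspace)
    return {
        name: {
            "HEAD": old[name],
            "workspace": new[name],
            "Change": round(new[name] - old[name], 4),
        }
        # Only metrics present on both sides can be compared.
        for name in sorted(old.keys() & new.keys())
    }


# Made-up numbers, purely for illustration.
head = {"train": {"f1": 0.90, "accuracy": 0.92}}
workspace = {"train": {"f1": 0.95, "accuracy": 0.92}}

diff = metrics_diff(head, workspace)
for name, row in diff.items():
    print(name, row)
```

A positive Change on a score such as F1 means the experiment improved on the baseline, while a value of zero usually means the stage producing that metric did not need to run again or was unaffected.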
Plots can be compared too:
Commit your changes and merge your branch into main. We now have a new baseline to improve upon!
Note that you can change parameters several times before committing. However, dvc metrics diff and dvc plots diff will only compare the current changes against the last commit. Don't worry about pushing yet, we'll do that later.
Now let's try to improve the test results. It seems the threshold might be too high. With six diseases to choose from, a completely ambiguous prediction assigns each class a probability of around 1/6 ≈ 0.167. Let's lower our threshold to 0.3. As usual, we'll create a new branch for this:
Now you know what to do! Run the pipeline and compare against our baseline:
Much better! We're over 95% on our test set and the confusion matrix is looking nicer. Can you figure out why the learning curve shows no difference? That's right: no re-training took place. Thanks, DVC!
This is good enough for our purposes. Commit everything, merge your branch and impress your boss.
Sharing your Work
Finally, similar to how Git has GitHub or GitLab to back your code, you can configure a remote for DVC. There are a lot of options available. For this example I'm using a local directory, but I encourage you to use a service like AWS, GCP or Azure instead.
I'm going to configure a remote for Git and a remote for DVC. For simplicity, I'm naming both origin. Note how on DVC, unlike Git, we need to explicitly specify that origin will be our default remote.
Note that the DVC config file was modified with the new remote configuration. Don't forget to commit that change.
Now we're ready to push to both remotes. Since we installed the Git hooks, the git push will automatically push DVC stuff as well, but I like to do a more exhaustive push afterwards.
Finally, how does your co-worker use your ML pipeline? Easy! Just clone the repository and pull the DVC cache.
Now, the final surprise. Run dvc repro in the newly cloned project. Nothing needs to be built! DVC has cached everything and our repo is in the very last state we pushed.
Final Remarks
DVC is a fundamental part of my machine learning workflow. It has given me and my team the same level of structured development that Git provides to our software projects. While this is a toy project, prepared for didactic purposes, the workflow is not that different from a real one. Here are some closing ideas:
Remember to commit small and often.
Get your hands dirty! It's the best way to learn.
The biggest challenge is remembering to keep Git and DVC synchronized.
There is much more to DVC and Iterative; check out their docs.
Don't miss the next parts of this series.
Most importantly, have fun!
Finally, explore the project in both GitHub and Iterative Studio: