Get Started: Data Pipelines
Versioning large data files and directories for data science is powerful, but often not enough. Data needs to be filtered, cleaned, and transformed before training ML models. For that purpose, DVC introduces a build system to define, execute, and track data pipelines: a series of data processing stages that produce a final result.
DVC is a "Makefile" system for machine learning projects!
DVC pipelines are versioned using Git, and allow you to better organize projects and reproduce complete workflows and results at will. You could capture a simple ETL workflow, organize your project, or build a complex DAG (Directed Acyclic Graph) pipeline.
Later, we will see that DVC lets you manage machine learning experiments on top of these pipelines: controlling their execution, injecting parameters, and more.
Setup
Working inside an initialized DVC project, let's get some sample code for the next steps:
$ wget https://code.dvc.org/get-started/code.zip
$ unzip code.zip && rm -f code.zip
Your workspace should now look like this:
$ tree
.
├── params.yaml
└── src
    ├── evaluate.py
    ├── featurization.py
    ├── prepare.py
    ├── requirements.txt
    └── train.py
The data needed to run this example can be downloaded using dvc get and tracked with dvc add (if you are following from Data Versioning, you may already have this data):
$ dvc get https://github.com/iterative/dataset-registry \
get-started/data.xml -o data/data.xml
$ dvc add data/data.xml
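If you're coming from the Data Versioning page, this step will be familiar: dvc add writes a small data/data.xml.dvc placeholder file for Git to track in place of the data itself. As a rough sketch (the exact fields and hash values depend on your DVC version and the data you downloaded), it looks something like:

outs:
  - md5: 22a1a2931c8370d3aeedd7183606fd7f
    size: 14445097
    path: data.xml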
Now, let's go through some usual project setup steps (virtualenv, requirements, Git).
First, create and use a virtual environment (it's not a must, but we strongly recommend it):
$ virtualenv venv && echo "venv" > .gitignore
$ source venv/bin/activate
Next, install the Python requirements:
$ pip install -r src/requirements.txt
Finally, this is a good time to commit our code to Git:
$ git add .github/ data/ params.yaml src .gitignore
$ git commit -m "Initial commit"
Pipeline stages
Use dvc stage add
to create stages. These represent processing steps
(usually scripts/code tracked with Git) and combine to form the pipeline.
Stages allow connecting code to its corresponding data input and output.
Let's transform a Python script into a stage:
$ dvc stage add -n prepare \
-p prepare.seed,prepare.split \
-d src/prepare.py -d data/data.xml \
-o data/prepared \
python src/prepare.py data/data.xml
A dvc.yaml file is generated. It includes information about the command we want to run (python src/prepare.py data/data.xml), its dependencies, and outputs.
DVC uses the pipeline definition to automatically track the data used and produced by any stage, so there's no need to manually run dvc add for data/prepared!
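As a quick sanity check, you can look at data/.gitignore: when DVC starts tracking an output it also tells Git to ignore it. Assuming you added data/data.xml earlier, the file should contain something like:

$ cat data/.gitignore
/data.xml
/prepared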
Details on the command options used above:
- -n prepare specifies a name for the stage. If you open the dvc.yaml file you will see a section named prepare.
- -p prepare.seed,prepare.split defines special types of dependencies: parameters. Any stage can depend on parameter values from a parameters file (params.yaml by default). We'll discuss those more in the Metrics, Parameters, and Plots page.

prepare:
  split: 0.20
  seed: 20170428

- -d src/prepare.py and -d data/data.xml mean that the stage depends on these files (dependencies) to work. Notice that the source code itself is marked as a dependency as well. If any of these files change, DVC will know that this stage needs to be reproduced when the pipeline is executed.
- -o data/prepared specifies an output directory for this script, which writes two files into it. This is how the workspace looks after the run:

.
├── data
│   ├── data.xml
│   ├── data.xml.dvc
+│   └── prepared
+│       ├── test.tsv
+│       └── train.tsv
+├── dvc.yaml
+├── dvc.lock
├── params.yaml
└── src
    └── ...

- The last line, python src/prepare.py data/data.xml, is the command to run in this stage, and it's saved to dvc.yaml, as shown below.
The resulting prepare
stage contains all of the information above:
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - src/prepare.py
      - data/data.xml
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
DVC can help simplify your workflow by keeping all your data inside your
project, but this isn't always practical if you already have a large dataset
stored elsewhere that you don't want to copy, or your stage writes data directly
to cloud storage. DVC can still detect when these external datasets change. Your
pipeline dependencies can point anywhere, not only local paths inside your
project. Same with outputs, except that you need to set cache: false
to tell
DVC not to make a local copy of these external outputs. See the example below or
read more in
External Dependencies and Outputs.
stages:
  prepare:
    cmd:
      - wget https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip -O bank-additional.zip
      - python sm_prepare.py --bucket mybucket --prefix project-data
    deps:
      - sm_prepare.py
      - https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip
    outs:
      - s3://mybucket/project-data/input_data:
          cache: false
Once you've added a stage, you can run the pipeline with dvc repro.
Dependency graphs
By using dvc stage add
multiple times, defining outputs of a
stage as dependencies of another, we can describe a sequence of
dependent commands which gets to some desired result. This is what we call a
dependency graph which forms a full cohesive pipeline.
Let's create a 2nd stage chained to the outputs of prepare
, to perform feature
extraction:
$ dvc stage add -n featurize \
-p featurize.max_features,featurize.ngrams \
-d src/featurization.py -d data/prepared \
-o data/features \
python src/featurization.py data/prepared data/features
The dvc.yaml
file will now be updated to include the two stages.
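If you open dvc.yaml at this point, the new featurize entry (nested under the top-level stages: key) should look roughly like the sketch below; it is generated from the options we just passed, though the exact field order may differ:

featurize:
  cmd: python src/featurization.py data/prepared data/features
  deps:
    - data/prepared
    - src/featurization.py
  params:
    - featurize.max_features
    - featurize.ngrams
  outs:
    - data/features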
And finally, let's add a 3rd train
stage:
$ dvc stage add -n train \
-p train.seed,train.n_est,train.min_split \
-d src/train.py -d data/features \
-o model.pkl \
python src/train.py data/features model.pkl
Our dvc.yaml should now have all 3 stages.
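Likewise, the train entry generated from the last command should look roughly like this under stages: (again a sketch; the exact ordering may vary):

train:
  cmd: python src/train.py data/features model.pkl
  deps:
    - data/features
    - src/train.py
  params:
    - train.min_split
    - train.n_est
    - train.seed
  outs:
    - model.pkl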
This would be a good time to commit the changes with Git. These include .gitignore(s) and dvc.yaml, which describes our pipeline.
$ git add .gitignore data/.gitignore dvc.yaml
$ git commit -m "pipeline defined"
Great! Now we're ready to run the pipeline.
Reproducing
The pipeline definition in dvc.yaml allows us to easily reproduce the pipeline:
$ dvc repro
You'll notice a dvc.lock
(a "state file") was created to capture the
reproduction's results.
dvc repro
relies on the dependency graph of stages defined in dvc.yaml
, and
uses dvc.lock
to determine what exactly needs to be run.
The dvc.lock file is similar to a .dvc file: it captures hashes (in most cases md5s) of the dependencies and the values of the parameters that were used. It can be considered a state of the pipeline:
schema: '2.0'
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - path: data/data.xml
        md5: 22a1a2931c8370d3aeedd7183606fd7f
        size: 14445097
      - path: src/prepare.py
        md5: f09ea0c15980b43010257ccb9f0055e2
        size: 1576
    params:
      params.yaml:
        prepare.seed: 20170428
        prepare.split: 0.2
    outs:
      - path: data/prepared
        md5: 153aad06d376b6595932470e459ef42a.dir
        size: 8437363
        nfiles: 2
The dvc status command can be used to compare this state with the actual state of the workspace.
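For example, right after a successful dvc repro and with nothing modified, dvc status should report that there is nothing left to do (the exact message can vary between DVC versions):

$ dvc status
Data and pipelines are up to date.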
It's good practice to immediately commit dvc.lock
to Git after its creation or
modification, to record the current state & results:
$ git add dvc.lock && git commit -m "first pipeline repro"
Let's try to have a little bit of fun with it. First, change one of the parameters for the training stage:
- Open params.yaml and change n_est to 100 (see the sketch after this list), and
- (re)run dvc repro.
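For reference, only the n_est value needs to change. Here is a sketch of the edited train section of params.yaml; the other train keys keep whatever values shipped with the sample code:

train:
  n_est: 100 # changed from the original value of 50
  # leave train.seed and train.min_split as they were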
You will see:
$ dvc repro
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Running stage 'train' with command: ...
DVC detected that only train
should be run, and skipped everything else! All
the intermediate results are being reused.
Now, let's change it back to 50
and run dvc repro
again:
$ dvc repro
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
As before, there was no need to rerun prepare
, featurize
, etc. But this time
it also doesn't rerun train
! The previous run with the same set of inputs
(parameters & data) was saved in DVC's run cache, and was reused.
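If you ever need to bypass this behavior and force stages to actually re-execute, dvc repro has flags for it; for example, the following combination forces reproduction and skips the run cache (see dvc repro --help for the exact behavior in your DVC version):

$ dvc repro --force --no-run-cache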
Visualizing
Having built our pipeline, we need a good way to understand its structure. Visualizing it as a graph of connected stages helps with that. DVC lets you do so without leaving the terminal!
$ dvc dag
+---------+
| prepare |
+---------+
      *
      *
      *
+-----------+
| featurize |
+-----------+
      *
      *
      *
  +-------+
  | train |
  +-------+
Refer to dvc dag
to explore other ways this command can visualize a pipeline.
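For instance, you can render the graph in terms of output files rather than stage names, or export it as a DOT file to draw with Graphviz (flag availability may depend on your DVC version):

$ dvc dag --outs
$ dvc dag --dot > pipeline.dot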
Summary
DVC pipelines (dvc.yaml
file, dvc stage add
, and dvc repro
commands) solve
a few important problems:
- Automation: run a sequence of steps in a "smart" way which makes iterating on your project faster. DVC automatically determines which parts of a project need to be run, and it caches "runs" and their results to avoid unnecessary reruns.
- Reproducibility: dvc.yaml and dvc.lock files describe what data to use and which commands will generate the pipeline results (such as an ML model). Storing these files in Git makes it easy to version and share.
- Continuous Delivery and Continuous Integration (CI/CD) for ML: describing projects in a way that can be built and reproduced is the first necessary step before introducing CI/CD systems. See our sister project CML for some examples.