Defining Pipelines
Pipelines represent data workflows that you want to reproduce reliably — so the results are consistent. The typical pipelining process involves:
- Obtain and dvc add or dvc import the project's initial data requirements (see Data Versioning). This caches the data and generates .dvc files.
- Define the pipeline stages in dvc.yaml files (more on this later). Example structure:

  stages:
    prepare: ... # stage 1 definition
    train: ... # stage 2 definition
    evaluate: ... # stage 3 definition

- Capture other useful metadata such as runtime parameters, performance metrics, and plots to visualize. DVC supports multiple file formats for these.
We call this file-based definition codification (YAML format in our case). It has the added benefit of allowing you to develop pipelines on standard Git workflows (and GitOps).
Stages usually take some data and run some code, producing an output (e.g. an ML model). The pipeline is formed by making them interdependent, meaning that the output of a stage becomes the input of another, and so on. Technically, this is called a dependency graph (DAG).
Note that while each pipeline is a graph, this doesn't mean a single dvc.yaml
file. DVC checks the entire project tree and validates all such
files to find stages, rebuilding all the pipelines that these may define.
DVC represents a pipeline internally as a graph where the nodes are stages and the edges are directed dependencies (e.g. A before B). And in order for DVC to run a pipeline, its topology should be acyclic — because executing cycles (e.g. A -> B -> C -> A …) would continue indefinitely. More about DAGs.
Use dvc dag
to visualize (or export) them.
Stages
See the full specification of stage entries.
Each stage wraps around an executable shell command and specifies any file-based dependencies as well as outputs. Let's look at a sample stage: it depends on a script file it runs as well as on a raw data input (ideally tracked by DVC already):
stages:
  prepare:
    cmd: source src/cleanup.sh
    deps:
      - src/cleanup.sh
      - data/raw
    outs:
      - data/clean.csv
We use GNU/Linux shell commands in these examples, but Windows or other shells can be used too.
Besides writing dvc.yaml
files manually (recommended), you can also create
stages with dvc stage add
— a limited command-line interface to set up
pipelines. Let's add another stage this way and look at the resulting
dvc.yaml
:
$ dvc stage add --name train \
--deps src/model.py \
--deps data/clean.csv \
--outs data/predict.dat \
python src/model.py data/clean.csv
stages:
  prepare:
    ...
    outs:
      - data/clean.csv
  train:
    cmd: python src/model.py data/clean.csv
    deps:
      - src/model.py
      - data/clean.csv
    outs:
      - data/predict.dat
One advantage of using dvc stage add
is that it will verify the validity of
the arguments provided (otherwise the stage definition won't be checked until
execution). A disadvantage is that some advanced features such as templating
are not available this way.
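For reference, templating lets dvc.yaml interpolate values into stage definitions with ${} placeholders resolved from params.yaml (or a vars section). A minimal sketch, assuming a hypothetical featurize stage and a max_features key in params.yaml:

stages:
  featurize:
    cmd: python src/featurize.py --max-features ${max_features}
    deps:
      - src/featurize.py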
Notice that the new train
stage depends on the output from stage prepare
(data/clean.csv
), forming the pipeline (DAG).
The order in which stages execute is determined entirely by the DAG, not by the
order in which they appear in dvc.yaml.
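With the prepare and train stages above, dvc dag renders the graph as text; the output looks roughly like this (the exact drawing may vary by DVC version):

$ dvc dag
+---------+
| prepare |
+---------+
     *
     *
     *
+-------+
| train |
+-------+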
Simple dependencies
There's more than one type of stage dependency. A simple dependency is a file or
directory used as input by the stage command. When its contents have changed,
DVC "invalidates" the stage — it knows that it needs to run again (see
dvc status
). This in turn may cause a chain reaction in which subsequent
stages of the pipeline are also reproduced.
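For example, if data/clean.csv is modified after the pipeline above has been run, dvc status would report something along these lines (output is illustrative):

$ dvc status
train:
    changed deps:
        modified:           data/clean.csv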
DVC calculates a hash of the file/directory contents to compare against previous
versions. This mechanism is distinctive compared with traditional build tools
like make, which rely on file timestamps.
File system-level dependencies are defined in the deps
field of dvc.yaml
stages; alternatively, use the --deps
(-d
) option of dvc stage add
(see
the previous section's example).
Parameter dependencies
A more granular type of dependency is the parameter (params
field of
dvc.yaml
), known in machine learning as a hyperparameter. These are any values used
inside your code to tune data processing, or that affect stage execution in any
other way. For example, training a neural network usually requires batch size
and epoch values.
Instead of hard-coding param values, your code can read them from a structured
file (e.g. YAML format). DVC can track any key/value pair in a supported
parameters file (params.yaml
by default). Params are granular dependencies because
DVC only invalidates stages when the corresponding part of the params file has changed.
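For example, a params.yaml file backing the stage below might look like this (the values are purely illustrative):

learning_rate: 0.01
nn:
  epochs: 70
  batch_size: 128

The stage definition then lists only the keys it actually depends on: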
stages:
  train:
    cmd: ...
    deps: ...
    params: # from params.yaml
      - learning_rate
      - nn.epochs
      - nn.batch_size
    outs: ...
See more details about this syntax.
Use dvc params diff
to compare parameters across project versions.
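Its output is a small table comparing the last committed values against the workspace, roughly like this (values are illustrative and the column layout may vary by DVC version):

$ dvc params diff
Path         Param          HEAD    workspace
params.yaml  learning_rate  0.01    0.02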
Outputs
Stage outputs are files (or directories) written by pipelines, for
example machine learning models and intermediate artifacts. These files are
cached by DVC automatically, and tracked with the help of
dvc.lock
files (or .dvc
files, see dvc add
).
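As a rough sketch, the dvc.lock entry recorded for the train stage above would look something like the following (hash and size values are omitted, and the exact fields depend on the DVC version):

schema: '2.0'
stages:
  train:
    cmd: python src/model.py data/clean.csv
    deps:
      - path: src/model.py
        md5: ...
      - path: data/clean.csv
        md5: ...
    outs:
      - path: data/predict.dat
        md5: ...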
Outputs can be dependencies of subsequent stages (as explained earlier). So when they change, DVC may need to reproduce downstream stages as well (handled automatically).
DVC can also track metrics and plots files, which can optionally be added as
stage outputs, or even added with cache: false
in dvc.yaml
since they are
often small enough to store in Git.
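For instance, a hypothetical evaluate stage could declare a small metrics file and a plots file as uncached outputs, so their contents are stored in Git rather than the DVC cache (the stage and file names here are illustrative):

stages:
  evaluate:
    cmd: python src/evaluate.py data/predict.dat
    deps:
      - data/predict.dat
    metrics:
      - eval/metrics.json:
          cache: false
    plots:
      - eval/confusion.csv:
          cache: false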
Outputs are produced by stage commands. DVC does not make any
assumption regarding this process; they should just match the path specified in
dvc.yaml
.