Running pipelines
To run a pipeline, you can use either dvc repro or dvc exp run. Either will run the pipeline, and dvc exp run will save the results as an experiment (and has other experiment-related features, like modifying parameters from the command line):
$ dvc exp run --set-param featurize.ngrams=3
Reproducing experiment 'funny-dado'
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'featurize':
> python src/featurization.py data/prepared data/features
Updating lock file 'dvc.lock'
Running stage 'train':
> python src/train.py data/features model.pkl
Updating lock file 'dvc.lock'
Running stage 'evaluate':
> python src/evaluate.py model.pkl data/features
Updating lock file 'dvc.lock'
Ran experiment(s): funny-dado
Experiment results have been applied to your workspace.
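dvc repro has no --set-param option, so to get the same run without recording an experiment you would change the value in params.yaml yourself and then reproduce. A minimal sketch (the parameter name is taken from the example above):
$ # edit params.yaml so that featurize.ngrams is 3, then:
$ dvc repro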
Stage outputs are deleted from the workspace before executing the stage commands that produce them (unless persist: true is used in dvc.yaml).
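For example, a minimal sketch of marking an output as persistent, using the train stage from the run above (paths as in that example):
stages:
  train:
    cmd: python src/train.py data/features model.pkl
    deps:
      - data/features
      - src/train.py
    outs:
      - model.pkl:
          persist: true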
DAG
DVC runs the DAG stages sequentially, in the order defined by their dependencies and outputs. Consider this example dvc.yaml:
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
  featurize:
    cmd: python src/featurization.py data/prepared data/features
    deps:
      - data/prepared
      - src/featurization.py
    params:
      - featurize.max_features
      - featurize.ngrams
    outs:
      - data/features
The prepare stage will always precede the featurize stage because data/prepared is an output of prepare and a dependency of featurize.
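You can check this ordering with dvc dag, which for this two-stage example would print something like the following sketch:
$ dvc dag
+---------+
| prepare |
+---------+
      *
      *
      *
+-----------+
| featurize |
+-----------+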
Caching Stages
DVC will try to avoid recomputing stages that have been run before. If you run a stage without changing its commands, dependencies, or parameters, DVC will skip that stage:
Stage 'prepare' didn't change, skipping
DVC will also recover the outputs from previous runs using the run cache.
Stage 'prepare' is cached - skipping run, checking out outputs
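Conversely, to force a stage to re-execute even when DVC would skip it or restore it from the run cache, dvc repro accepts the --force and --no-run-cache flags. A sketch, assuming your DVC version provides both flags and using the train stage from the earlier example:
$ dvc repro --force --no-run-cache train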
If you want a stage to run every time, you can use always_changed in dvc.yaml:
stages:
  pull_latest:
    cmd: python pull_latest.py
    deps:
      - pull_latest.py
    outs:
      - latest_results.csv
    always_changed: true
Pull Missing Data
By default, DVC expects that all the data needed to run the pipeline is available locally. Any missing data will be considered deleted and may cause the pipeline to fail. To avoid this, use the following flags:
- --pull will download missing data as needed, so you don't need to pull all data beforehand.
- --allow-missing will skip stages with no changes other than missing data, so you don't need to download unnecessary data.
You can combine the --pull and --allow-missing flags to run a pipeline while only pulling the data that is actually needed to run the changed stages.
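Both flags are also accepted by dvc repro, so a plain reproduction that pulls only what it needs would look like this sketch (assuming a recent DVC 3.x release):
$ dvc repro --pull --allow-missing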
In DVC>=3.0, --allow-missing will not skip data saved with DVC<3.0 because the hash type changed in DVC 3.0, which DVC considers a change to the data. To migrate data to the new hash type, run dvc cache migrate --dvc-files. See more information about upgrading from DVC 2.x to 3.0.
Given the pipeline used in example-get-started-experiments:
$ dvc dag
+--------------------+
| data/pool_data.dvc |
+--------------------+
*
*
*
+------------+
| data_split |
+------------+
** **
** **
* **
+-------+ *
| train |* *
+-------+ **** *
* *** *
* **** *
* ** *
+-----------+ +----------+
| sagemaker | | evaluate |
+-----------+ +----------+
If we are on a machine where all the data is missing:
$ dvc status
data_split:
changed deps:
deleted: data/pool_data
changed outs:
not in cache: data/test_data
not in cache: data/train_data
train:
changed deps:
deleted: data/train_data
changed outs:
not in cache: models/model.pkl
not in cache: models/model.pth
not in cache: results/train
evaluate:
changed deps:
deleted: data/test_data
deleted: models/model.pkl
changed outs:
not in cache: results/evaluate
sagemaker:
changed deps:
deleted: models/model.pth
changed outs:
not in cache: model.tar.gz
data/pool_data.dvc:
changed outs:
not in cache: data/pool_data
We can modify the evaluate stage, and DVC will only pull the data necessary to run that stage (models/model.pkl and data/test_data/) while skipping the rest of the stages:
$ dvc exp run --pull --allow-missing --set-param evaluate.n_samples_to_save=20
Reproducing experiment 'hefty-tils'
'data/pool_data.dvc' didn't change, skipping
Stage 'data_split' didn't change, skipping
Stage 'train' didn't change, skipping
Running stage 'evaluate':
...
After the pipeline completes, the evaluate stage is updated, but all other stages still have missing data:
$ dvc status
data_split:
changed deps:
deleted: data/pool_data
changed outs:
not in cache: data/train_data
train:
changed deps:
deleted: data/train_data
changed outs:
not in cache: models/model.pth
not in cache: results/train
sagemaker:
changed deps:
deleted: models/model.pth
changed outs:
not in cache: model.tar.gz
data/pool_data.dvc:
changed outs:
not in cache: data/pool_data
We can run again with --pull but not --allow-missing to download the data for the unchanged stages in the pipeline:
$ dvc exp run --pull
After the pipeline completes, all stages are up to date:
$ dvc status
Data and pipelines are up to date.
Verify Pipeline Status
In scenarios like CI jobs, you may want to check that the pipeline is up to date without pulling or running anything. dvc repro --dry will check which pipeline stages would run without actually running them. However, if data is missing, --dry will fail because DVC does not know whether that data simply needs to be pulled or is missing for some other reason. To check which stages would run while ignoring any missing data, use dvc repro --dry --allow-missing.
This command will succeed if nothing has changed:
In the example below, data is missing because nothing has been pulled, but otherwise the pipeline is up to date.
$ dvc status
data_split:
changed deps:
deleted: data/pool_data
changed outs:
not in cache: data/test_data
not in cache: data/train_data
train:
changed deps:
deleted: data/train_data
changed outs:
not in cache: models/model.pkl
evaluate:
changed deps:
deleted: data/test_data
deleted: models/model.pkl
data/pool_data.dvc:
changed outs:
not in cache: data/pool_data
$ dvc repro --allow-missing --dry
'data/pool_data.dvc' didn't change, skipping
Stage 'data_split' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
If anything is not up to date, the command will fail:
In the example below, the data_split parameter in params.yaml was modified, so the pipeline is not up to date.
$ dvc status
data_split:
changed deps:
deleted: data/pool_data
params.yaml:
modified: data_split
changed outs:
not in cache: data/test_data
not in cache: data/train_data
train:
changed deps:
deleted: data/train_data
changed outs:
not in cache: models/model.pkl
evaluate:
changed deps:
deleted: data/test_data
deleted: models/model.pkl
data/pool_data.dvc:
changed outs:
not in cache: data/pool_data
$ dvc repro --allow-missing --dry
'data/pool_data.dvc' didn't change, skipping
ERROR: failed to reproduce 'data_split': [Errno 2] No such file or directory: '.../example-get-started-experiments/data/pool_data'
To ensure that data missing locally is not lost, you can also check that all data exists on the remote. The command below will succeed (exit with code 0) if all data is found in the remote; otherwise, it will fail (exit with code 1).
$ dvc data status --not-in-remote --json | grep -v not_in_remote
true
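In a CI script, you might wire this exit code into a check like the following sketch (bash; the messages are illustrative):
dvc data status --not-in-remote --json | grep -v not_in_remote || {
  echo "some data is missing from the remote"
  exit 1
}
echo "all data found on the remote"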
Debugging Stages
If you are using advanced features to interpolate values for your pipeline, like templating or Hydra composition, you can get the interpolated values by running dvc repro -vv or dvc exp run -vv, which will include information like:
2023-05-18 07:38:43,955 TRACE: Hydra composition enabled.
Contents dumped to params.yaml: {'model': {'batch_size':
512, 'latent_dim': 8, 'lr': 0.01, 'duration': '00:00:30:00',
'max_epochs': 2}, 'data_path': 'fra.txt', 'num_samples':
100000, 'seed': 423}
2023-05-18 07:38:44,027 TRACE: Context during resolution of
stage download: {'model': {'batch_size': 512, 'latent_dim':
8, 'lr': 0.01, 'duration': '00:00:30:00', 'max_epochs': 2},
'data_path': 'fra.txt', 'num_samples': 100000, 'seed': 423}
2023-05-18 07:38:44,073 TRACE: Context during resolution of
stage train: {'model': {'batch_size': 512, 'latent_dim': 8,
'lr': 0.01, 'duration': '00:00:30:00', 'max_epochs': 2},
'data_path': 'fra.txt', 'num_samples': 100000, 'seed': 423}
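For reference, a sketch of the kind of templated dvc.yaml these TRACE lines come from; the stage and parameter names mirror the log above, but the commands and output paths are hypothetical:
stages:
  download:
    cmd: python download.py --path ${data_path} --num-samples ${num_samples} --seed ${seed}
    outs:
      - data/raw
  train:
    cmd: >-
      python train.py --batch-size ${model.batch_size}
      --lr ${model.lr} --max-epochs ${model.max_epochs}
    deps:
      - data/raw
    outs:
      - model.pt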