How to Resolve Merge Conflicts in DVC Metafiles
Sometimes multiple team members work on the same DVC-tracked data. When the time comes to combine their changes, merge conflicts can occur in Git-tracked DVC files, which need to be resolved.
dvc.yaml
Conflicts here are no different from what we would see in source code. See Git Merging.
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
<<<<<<< HEAD
      - data/big.xml
=======
      - data/small.xml
>>>>>>> branch
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
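After editing dvc.yaml by hand, it is easy to leave a stray marker behind. The following is a minimal sketch of a check, assuming a POSIX shell with grep. The function name has_conflict_markers is hypothetical and not part of DVC or Git.

```shell
# Hypothetical helper (not a DVC or Git command): succeeds if FILE still
# contains a Git conflict marker at the start of a line.
has_conflict_markers() {
    grep -Eq '^(<<<<<<< |=======$|>>>>>>> )' "$1"
}

# Example use, from the project root:
#   if has_conflict_markers dvc.yaml; then
#       echo "dvc.yaml still has unresolved conflicts" >&2
#   fi
```

The check only looks for the default marker style shown above; it does not validate that the resulting YAML is well-formed.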
dvc.lock
There's no need to resolve lock file merge conflicts manually. You can safely remove this file. After merging dvc.yaml, you can reproduce a clean dvc.lock with dvc repro. dvc commit can also be a good option, but only for the specific case where the HEAD version is chosen.
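The lock file steps could be scripted roughly as follows. This is only a sketch: it assumes you run it from the project root after the conflict in dvc.yaml has been resolved, and the function name regenerate_dvc_lock is hypothetical.

```shell
# Sketch only: regenerate dvc.lock instead of resolving its conflicts.
# Assumes dvc.yaml has already been merged by hand.
regenerate_dvc_lock() {
    # Discard the conflicted lock file.
    rm -f dvc.lock || return 1
    # Reproduce the pipeline, which writes a fresh dvc.lock.
    dvc repro || return 1
    # Stage both files so Git treats the conflict as resolved.
    git add dvc.yaml dvc.lock
}

# Example use:
#   regenerate_dvc_lock && git commit
```

Keep in mind that dvc repro runs the pipeline stages, so this may take as long as a full reproduction.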
.dvc files
There are three main variations in the structure of these files, which differ by the command that generated them:
Simple tracking (add)
In .dvc files generated by dvc add, you'll get something that looks like:
outs:
<<<<<<< HEAD
- md5: a304afb96060aad90176268345e10355
  size: 12
=======
- md5: 35dd1fda9cfb4b645ae431f4621fa324
  size: 100
>>>>>>> branch
  path: data.xml
You can pick one of the versions:
outs:
- md5: 35dd1fda9cfb4b645ae431f4621fa324
  size: 100
  path: data.xml
But if you want to actually merge the data files (or directories) of both versions, then you can follow this process:
- Run dvc checkout data.xml on both HEAD and branch;
- Copy the data into temporary locations (e.g. data.xml.head and data.xml.branch);
- Merge it by hand;
- Finally, run dvc add data.xml to overwrite the conflicted .dvc file.
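One possible way to carry out the first two steps while a git merge is still in progress is sketched below. It rests on several assumptions: the metafile is named data.xml.dvc, both versions of the data are already in the DVC cache (fetch them first if not), and the function name export_both_versions is hypothetical.

```shell
# Sketch only: save each side's data under a temporary name so the two
# versions can be merged by hand. Run from the directory holding data.xml.
export_both_versions() {
    # Restore our side (HEAD) of the metafile and check out its data.
    git checkout --ours -- data.xml.dvc || return 1
    dvc checkout data.xml.dvc || return 1
    cp data.xml data.xml.head || return 1

    # Do the same for their side (the branch being merged in).
    git checkout --theirs -- data.xml.dvc || return 1
    dvc checkout data.xml.dvc || return 1
    cp data.xml data.xml.branch
}
```

After combining data.xml.head and data.xml.branch into data.xml by hand, running dvc add data.xml followed by git add data.xml.dvc would complete the process described above. For a directory, cp would need the -r option.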
Directories
If you have a directory, DVC provides a Git merge driver that can automatically resolve many merge conflicts for you. To use it, first set it up in your Git repo:
$ git config merge.dvc.name 'DVC merge driver'
$ git config merge.dvc.driver \
'dvc git-hook merge-driver --ancestor %O --our %A --their %B'
And add this line to your .gitattributes (in the root of your Git repo):
*.dvc merge=dvc
Now, when a merge conflict occurs, DVC will simply combine data from both branches.
If the same file was added or changed in both branches, the merge driver will fail unless the changes are the same. If the same file was deleted in both branches, the merge driver will fail.
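To confirm that the setup took effect, a quick check along these lines may help. It is a sketch that relies only on git config and git check-attr; the function name dvc_merge_driver_ready and the path example.dvc are hypothetical placeholders.

```shell
# Hypothetical helper: succeeds only if the DVC merge driver is configured
# and .dvc files are routed to it. Run it inside the Git repository.
dvc_merge_driver_ready() {
    # The driver command must be present in the Git config.
    git config --get merge.dvc.driver >/dev/null || return 1
    # .gitattributes must assign the "dvc" merge driver to .dvc files.
    # example.dvc is a placeholder path; the file does not need to exist.
    [ "$(git check-attr merge -- example.dvc)" = "example.dvc: merge: dvc" ]
}

# Example use:
#   dvc_merge_driver_ready || echo "DVC merge driver is not set up" >&2
```

Because git config settings are local to each clone, every collaborator would need to run the configuration commands themselves, whereas .gitattributes can be committed and shared.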
Imported data
To resolve merge conflicts in .dvc files generated by dvc import or dvc import-url, remove the conflicted values altogether:
<<<<<<< HEAD
md5: 263395583f35403c8e0b1b94b30bea32
=======
md5: 520d2602f440d13372435d91d3bfa176
>>>>>>> branch
frozen: true
deps:
- path: get-started/data.xml
  repo:
    url: https://github.com/iterative/dataset-registry
<<<<<<< HEAD
    rev_lock: f31f5c4cdae787b4bdeb97a717687d44667d9e62
=======
    rev_lock: 06be1104741f8a7c65449322a1fcc8c5f1070a1e
>>>>>>> branch
outs:
<<<<<<< HEAD
- md5: a304afb96060aad90176268345e10355
  size: 12
=======
- md5: 35dd1fda9cfb4b645ae431f4621fa324
  size: 100
>>>>>>> branch
  path: data.xml
So you get something like this:
frozen: true
deps:
- path: get-started/data.xml
  repo:
    url: https://github.com/iterative/dataset-registry
outs:
- path: data.xml
Then run dvc update on the .dvc file to download the latest data from its original source.
Note that updating will bring in the latest version of the data from its source, which may not correspond to either of the hashes that were removed.