Upgrading from DVC 2.x to 3.0
DVC 3.0 introduced changes to how DVC hashes files and to where DVC-tracked data is stored in the cache. DVC 3.0 remains compatible with pre-existing data tracked by DVC 2.0, but there are a few important points that users should note when upgrading to DVC 3.0.
For a full list of breaking changes in DVC 3.0, please refer to the release notes.
File hashing changes
Previously, DVC would attempt to identify whether a DVC-tracked file contained text content, and would convert Windows-style CRLF line endings to Unix-style LF line endings before hashing the file content (i.e. a dos2unix conversion). This behavior was intended to simplify usage in cross-platform scenarios (where a DVC repository was used on both Unix and Windows machines). However, even though DVC would convert line endings when computing hashes, DVC would still store the original native content in both local DVC cache and in remote storage. This would lead to unintended side effects in situations where a given file was a binary file misidentified as text by DVC or where a text file was not intended to be cross platform (and CRLF should not have been considered equivalent to LF).
In DVC 3.0, the line ending conversion behavior has been removed, and DVC treats all files as if they contain binary data. This means that a text file with CRLF line endings will always be identified as completely separate from a file containing LF line endings, even if all other text content in the two files is identical.
When upgrading to DVC 3.0, users with pipelines that may be run in both Unix and
Windows environments should ensure that any pipeline stages with text outputs
(such as .csv
or .tsv
files) generate files with consistent line endings,
regardless of the platform where a stage is run.
For example, Python stages should explicitly generate files with either
Unix-style \n
or Windows-style \r\n
line endings, rather than relying on the
default platform specific os.linesep
behavior.
Optional local cache migration
In order to avoid hash collisions between files tracked in DVC 3.0 and older releases, files tracked in DVC 3.0 are stored separately from files tracked in older releases. By default, DVC does not automatically de-duplicate any data between files tracked in DVC 3.0 and files tracked in older releases. DVC will still read cached files from DVC 2.0 and will only duplicate for new or modified data.
Users can manually migrate existing local DVC cache data to the DVC 3.0 location
by running the dvc cache migrate
command. On most local filesystems,
dvc cache migrate
is equivalent to forcing the de-duplication of files tracked
in DVC 3.0 and files tracked in older releases. Files from the old cache
location will be re-hashed using the DVC 3.0 hash algorithm, atomically moved to
the new cache location, and then a link will be created from the old location to
the new one. This may take a long time.
On filesystems that do not support any type of linking, data will be copied from the old cache location into the DVC 3.0 location (resulting in no de-duplication).
By default, dvc cache migrate
only migrates cache data and does not modify
DVC files in the DVC repository.
dvc cache migrate --dvc-files
will migrate entries in all DVC files in the
repository so that DVC will only use data from the DVC 3.0 cache location.
Note that when using --dvc-files
option, DVC will only migrate DVC files in
workspace (and Git history will not be re-written).
For DVC remotes, there is no
equivalent migration command since it is not possible to link between old and
new locations on many remote filesystems. Instead, once you have migrated data
locally and pushed to the remote, you may use dvc gc -c
commands to remove
outdated data from the remote.