Edit on GitHub

How to Share a DVC Cache

Configuring the project's cache in a location shared by other projects helps avoid file duplication and can reduce data transfers between different data locations. Some scenarios:

  • Multiple users have copies of the same project on a single machine.
  • Different projects use the same data, so it makes sense to centralize its local storage.
  • There's a shared NAS drive or storage server mounted over the network.
  • You need to seamlessly share data between your working machine and a shared GPU server (for training ML models).

Preparation

Create a directory external to your DVC projects to be used as a shared cache location for everyone's projects:

$ mkdir -p /home/shared/dvc-cache

Make sure that the directory has proper permissions, so that all your colleagues can write to it, and can read cached files written by others. The most straightforward way to do this is to make all users members of the same group, and have the shared cache directory owned by that group.

Transfer existing cache (optional)

You can skip this part if you are setting up a new DVC project where the local cache directory (.dvc/cache by default), hasn't been used.

If you did work on the DVC projects previously and wish to transfer its existing cache to the shared cache directory, you will simply need to move its contents from the old location to the new one:

$ mv .dvc/cache/* /home/shared/dvc-cache

Now, ensure that the cached directories and files have appropriate permissions, so that they can be accessed by your colleagues (assuming their users are members of the same group):

$ sudo find /home/shared/dvc-cache -type d -exec chmod 0775 {} \;
$ sudo find /home/shared/dvc-cache -type f -exec chmod 0444 {} \;
$ sudo chown -R myuser:ourgroup /home/shared/dvc-cache/

Configure the shared cache

A cache directory outside the workspace is called an external cache. Set it to the directory we created earlier with dvc cache dir and configure it with dvc config cache:

$ dvc cache dir /home/shared/dvc-cache

$ dvc config cache.shared group
$ dvc config cache.type symlink

Above, we first tell DVC to set group permissions on new cache files. Then we enable symlinks to avoid having copies from the external cache to the workspace.

See dvc config cache and File link types for more info.

Note that enabling soft/hard links causes DVC to protect the linked data because editing them in-place would corrupt the cache. See dvc unprotect.

If you're using Git, commit the changes to your project's config file (usually .dvc/config):

$ git add .dvc/config
$ git commit -m "config external/shared DVC cache"

Using dvc gc with a shared cache may delete data needed in another project! See more info. about cleaning a shared cache safely.

Content

🐛 Found an issue? Let us know! Or fix it:

Edit on GitHub

Have a question? Join our chat, we will help you:

Discord Chat