Modifying Large Datasets
For large datasets made up of many files, it can be painfully slow to operate on the entire dataset at once. Instead, you can operate on only the files you want to modify.
Granular modifications
Let's say you have a DVC-tracked dataset with many individual files:
$ tree
.
├── images
│   ├── test
│   │   ├── 0
│   │   │   ├── 00004.png
│   │   │   ├── 00011.png
│   │   │   ├── 00014.png
│   │   │   ├── 00026.png
│   │   │   ├── 00029.png
│   │   │   ├── 00056.png
│   │   │   ├── 00070.png
...
└── images.dvc

23 directories, 70001 files
You can dvc add one or more new files or subdirectories to this dataset without re-adding the entire dataset. Let's assume we have one new file in the dataset:
$ cp ~/Downloads/new.png images/test/0/70001.png
$ dvc data status --granular
DVC uncommitted changes:
(use "dvc commit <file>..." to track changes)
(use "dvc checkout <file>..." to discard changes)
modified: images/
added: images/test/0/70001.png
Run dvc add with the new file as the target:
$ dvc add images/test/0/70001.png
100% Adding...|████████████████████████████████████████|1/1 [00:00, 1.69file/s]
$ dvc data status --granular
DVC committed changes:
(git commit the corresponding dvc files to update the repo)
modified: images/
added: images/test/0/70001.png
(there are other changes not tracked by dvc, use "git status" to see)
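As the status output notes, the corresponding .dvc file still needs to be committed to Git to record the change in the repo. Since the dataset here is tracked by images.dvc at the repository root, that step might look like this (the commit message is only illustrative):
$ git add images.dvc
$ git commit -m "Add images/test/0/70001.png to the dataset"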
You can also modify one or more existing files or subdirectories. Let's assume we have overwritten one file in the dataset:
$ cp ~/Downloads/updated.png images/test/0/00004.png
$ dvc data status --granular
DVC uncommitted changes:
(use "dvc commit <file>..." to track changes)
(use "dvc checkout <file>..." to discard changes)
modified: images/
modified: images/test/0/00004.png
$ dvc add images/test/0/00004.png
100% Adding...|████████████████████████████████████████|1/1 [00:00, 1.70file/s]
$ dvc data status --granular
DVC committed changes:
(git commit the corresponding dvc files to update the repo)
modified: images/
modified: images/test/0/00004.png
(there are other changes not tracked by dvc, use "git status" to see)
Finally, you can delete one or more files or subdirectories by removing them from the workspace and then passing them as targets to dvc add. Let's assume we have deleted one file in the dataset:
$ rm images/test/0/00011.png
$ dvc data status --granular
DVC uncommitted changes:
(use "dvc commit <file>..." to track changes)
(use "dvc checkout <file>..." to discard changes)
modified: images/
deleted: images/test/0/00011.png
$ dvc add images/test/0/00011.png
100% Adding...|████████████████████████████████████████|1/1 [00:00, 1.73file/s]
$ dvc data status --granular
DVC committed changes:
(git commit the corresponding dvc files to update the repo)
modified: images/
deleted: images/test/0/00011.png
(there are other changes not tracked by dvc, use "git status" to see)
This has the same effect as dvc add images/test/0 (or targeting any other parent directory of the deleted file). The more granular the target, the faster the operation.
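For example, a less granular (and typically slower) equivalent of the deletion above would be to target the parent directory instead of the removed file:
$ dvc add images/test/0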
Modifying remote datasets
If your dataset is in remote storage but not downloaded to your workspace, it's inconvenient to dvc pull the entire dataset to update only one or a few files. Instead, you can pull only the files you want to update:
$ tree
.
└── images.dvc
0 directories, 1 file
$ dvc pull images/test/0
See dvc ls to list the available files to pull for the project.
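For example, assuming the project is the Git repository you are currently in, an invocation along these lines enumerates the tracked files under one subdirectory (the path shown is illustrative):
$ dvc ls . images/test/0
Here . points dvc ls at the local repository; a Git URL works the same way for a project you haven't cloned, and the -R and --dvc-only options can be added to list everything DVC tracks recursively.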
Then you can modify them as needed and track those changes:
$ cp ~/Downloads/new.png images/test/0/70001.png
$ dvc add images/test/0/70001.png
100% Adding...|████████████████████████████████████████|1/1 [00:00, 1.73file/s]
Finally, you can push the changes back to your remote without ever having to download the full dataset:
$ dvc push
2 files pushed
Two files were pushed: the new file and the updated directory listing. You can add, modify, and delete files in a remote dataset this way.
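For instance, deleting a file from a remote dataset combines the targeted pull with the granular deletion flow shown earlier. A minimal sketch, reusing the same illustrative file names and assuming the parent directory is pulled first so the change is visible in the workspace:
$ dvc pull images/test/0
$ rm images/test/0/00011.png
$ dvc add images/test/0/00011.png
$ dvc push
As before, the more granular the pull target, the less data you have to transfer.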