Glossary
Artifact: A model or other file or directory for which structured metadata (name, type, description, and one or more labels) may be stored in dvc.yaml files. Model artifact metadata is used in the model registry.
Dependency: A file (e.g. data, code), directory (e.g. datasets), or parameter used as input for a stage in a DVC pipeline. Dependencies are specified as paths in the deps field of dvc.yaml or .dvc files. A stage is invalidated (considered outdated) when any of its dependencies changes. See dvc stage add, dvc params, dvc repro.
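To illustrate invalidation, the sketch below compares a recorded content hash for each dependency with a freshly computed one. The names file_hash and is_outdated, the use of SHA-256, and the dictionary of recorded hashes are all assumptions made for this example. It is not DVC's implementation, only the general idea that a stage becomes outdated when a dependency's content changes.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_outdated(recorded):
    """Return True if any dependency is missing or its hash has changed.

    `recorded` maps dependency paths to the hashes saved at the last run
    (a hypothetical stand-in for what a lock file would hold).
    """
    for path, old_hash in recorded.items():
        if not Path(path).exists() or file_hash(path) != old_hash:
            return True
    return False


def demo():
    """Record a dependency hash, modify the file, and check both states."""
    with tempfile.TemporaryDirectory() as tmp:
        dep = Path(tmp) / "data.csv"
        dep.write_text("a,b\n1,2\n")
        recorded = {str(dep): file_hash(dep)}
        before = is_outdated(recorded)  # content unchanged since recording
        dep.write_text("a,b\n1,3\n")
        after = is_outdated(recorded)  # content changed
        return before, after
```

Calling demo() reports the stage as up to date before the file is edited and as outdated afterwards.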
DVC Cache: Hidden storage (by default in .dvc/cache) for files and directories tracked by DVC, along with their different versions. For efficiency, it uses a content-addressable structure.
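A content-addressable structure can be sketched as follows: each blob is stored under a path derived from a hash of its content, so identical content is stored only once. The function names, the SHA-256 hash, and the two-character subdirectory layout are assumptions for this example and should not be read as the actual layout of .dvc/cache.

```python
import hashlib
import tempfile
from pathlib import Path


def store(cache_dir, data):
    """Save bytes under a path derived from their hash and return the hash."""
    digest = hashlib.sha256(data).hexdigest()
    # The first two hex characters form a subdirectory (assumed layout).
    target = Path(cache_dir) / digest[:2] / digest[2:]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return digest


def load(cache_dir, digest):
    """Read content back using only its hash."""
    return (Path(cache_dir) / digest[:2] / digest[2:]).read_bytes()


def demo():
    """Store two versions, one of them twice, and count the stored files."""
    with tempfile.TemporaryDirectory() as tmp:
        h1 = store(tmp, b"version 1")
        h2 = store(tmp, b"version 1")  # identical content maps to the same entry
        h3 = store(tmp, b"version 2")
        n_files = sum(1 for p in Path(tmp).rglob("*") if p.is_file())
        return h1 == h2, h1 != h3, n_files, load(tmp, h1)
```

Storing the same content twice adds no new file, which is where the efficiency comes from.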
DVC File: Any dvc.yaml, dvc.lock, or .dvc file. DVC commands create these in the workspace to codify pipelines and/or to track data for versioning. See also dvc repro, dvc add.
DVC Project: Initialized by running dvc init in the workspace (typically a Git repository). It will contain the .dvc/ directory, as well as dvc.yaml and .dvc files created with commands such as dvc add or dvc stage add.
Experiment: A versioned iteration of ML model development. DVC tracks experiments as Git commits that DVC can find but that don't clutter your Git history or branches. Experiments may include code, metrics, parameters, plots, and data and model artifacts.
External Dependency: A stage dependency (deps field in dvc.yaml or in an import stage .dvc file) with origin in an external source, for example HTTP, SSH, Amazon S3, Google Cloud Storage remote locations, or even other DVC repositories. See External Dependencies.
File Linking: A way to have a file appear in multiple different folders without occupying more physical space on the storage disk. This is both fast and economical. See large dataset optimization and dvc config cache for more on file linking.
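As a small illustration, the sketch below uses a hard link, one common kind of file link, to make a single file visible in two folders. The directory and file names are hypothetical, and the example assumes a filesystem that supports hard links; it does not show how DVC selects or creates its links.

```python
import os
import tempfile
from pathlib import Path


def demo():
    """Link one file into a second folder and inspect the result."""
    with tempfile.TemporaryDirectory() as tmp:
        cached = Path(tmp) / "cache" / "blob"
        cached.parent.mkdir()
        cached.write_bytes(b"x" * 1024)

        workdir = Path(tmp) / "workspace"
        workdir.mkdir()
        linked = workdir / "data.bin"
        os.link(cached, linked)  # second name for the same data on disk

        same_inode = os.stat(cached).st_ino == os.stat(linked).st_ino
        link_count = os.stat(cached).st_nlink
        same_content = linked.read_bytes() == cached.read_bytes()
        return same_inode, link_count, same_content
```

Both paths refer to the same data on disk, so the second copy costs no extra space and is created without copying any bytes.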
Import Stage: A .dvc file created with dvc import or dvc import-url, which represents a file or directory from an external source. It has an external dependency (the data source), an implicit download command, and the imported data itself as output.
Metrics: Key/value pairs saved in structured files (JSON, TOML 1.0, or YAML 1.2) that map a metric name (AUC, ROC, etc.) to a numeric value. By specifying metrics files in dvc.yaml, DVC can compare them among machine learning experiments to evaluate machine learning performance. See dvc metrics.
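Comparing metrics between two experiments amounts to a key-by-key difference. The sketch below shows this for two JSON documents. The function name metrics_diff and the shape of the returned dictionary are assumptions for illustration and do not reproduce the output of dvc metrics.

```python
import json


def metrics_diff(old_json, new_json):
    """Compare two flat JSON metric documents key by key.

    Returns a dict mapping each metric name to its old value, new value,
    and numeric change (None when the metric is missing on one side).
    """
    old = json.loads(old_json)
    new = json.loads(new_json)
    diff = {}
    for name in sorted(set(old) | set(new)):
        a, b = old.get(name), new.get(name)
        change = b - a if a is not None and b is not None else None
        diff[name] = {"old": a, "new": b, "change": change}
    return diff
```

For example, metrics_diff('{"auc": 0.5}', '{"auc": 0.75}') reports a change of 0.25 for auc.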
Model Registry: The model registry connects all of your team's models through Git. Find, organize, and manage all of them, and use GitOps workflows to version, promote, demote, and deploy them.
Output: A file or directory tracked by DVC, recorded in the outs section of a stage (in dvc.yaml) or .dvc file. Outputs are usually the result of stages. See dvc add, dvc repro, dvc import, among others.
Parameters: Hyperparameters or other config values used by your code, loaded from a structured file (params.yaml by default). They can be tracked as granular dependencies for stages of DVC pipelines (defined in dvc.yaml). DVC can also compare them among machine learning experiments (useful for optimization). See dvc params.
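"Granular" here means a stage can depend on individual values rather than on the whole parameters file. The sketch below illustrates the idea with already-parsed parameter dictionaries and dotted key paths. The helper names and the dotted-path convention as written here are assumptions for this example, not DVC's code.

```python
def get_param(params, dotted_key):
    """Look up a nested value such as 'train.epochs' in a parsed params dict."""
    value = params
    for part in dotted_key.split("."):
        value = value[part]
    return value


def changed_params(old, new, keys):
    """Return the tracked keys whose values differ between two versions."""
    return [k for k in keys if get_param(old, k) != get_param(new, k)]
```

A stage that tracks only train.lr is unaffected when train.epochs changes, because only the tracked keys are compared.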
Pipeline: DVC pipelines describe data processing workflows in a standard declarative YAML format (dvc.yaml). This guarantees DVC can reproduce them consistently. DVC also helps automate their execution and caches their results. See Defining Pipelines for more details.
Plots: Either data series saved in structured files (JSON, YAML 1.2, CSV, or TSV) or images saved in JPEG, GIF, PNG, or SVG files. By specifying plots files and optional properties in dvc.yaml, DVC can compare them among machine learning experiments. See dvc plots.
Run cache: A log of stages that have been run in the project. It consists of dvc.lock file backups, identified as combinations of dependencies, commands, and outputs that correspond to each other. dvc repro and dvc exp run populate and reuse the run cache. See Run cache for more details.
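The idea of identifying a run by its combination of command and dependencies can be sketched as a small in-memory cache. The class RunCache, the function run_key, and the use of SHA-256 over a JSON payload are assumptions for this example. The real run cache is stored on disk and its format is not shown here.

```python
import hashlib
import json


def run_key(cmd, dep_hashes):
    """Derive a stable key from a command and its dependency hashes."""
    payload = json.dumps({"cmd": cmd, "deps": dep_hashes}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class RunCache:
    """Remember outputs per (command, dependencies) combination."""

    def __init__(self):
        self._entries = {}
        self.executions = 0  # how many times a stage actually ran

    def run(self, cmd, dep_hashes, execute):
        """Return cached outputs if this combination was seen, else execute."""
        key = run_key(cmd, dep_hashes)
        if key not in self._entries:
            self.executions += 1
            self._entries[key] = execute()
        return self._entries[key]
```

Running the same command with the same dependency hashes a second time returns the stored outputs without executing again, while a changed dependency hash produces a new entry.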
Stage: An individual command, script, or piece of source code that reaches some milestone in your project's workflow. For example, python train.py may generate a machine learning model. DVC stages include data input(s) and resulting output(s), if any.
Workspace: Directory containing all your DVC project files, e.g. raw data, source code, ML models. One project version at a time is visible in the workspace.