dvc.api.open()
Opens a tracked file.
def open(path: str,
repo: str = None,
rev: str = None,
remote: str = None,
remote_config: dict = None,
config: dict = None,
mode: str = "r",
encoding: str = None)
Usage
import dvc.api
with dvc.api.open(
'get-started/data.xml',
repo='https://github.com/iterative/dataset-registry'
) as f:
# ... f is a file-like object that can be processed normally.
Description
Open a data or model file tracked in a DVC project and generate a corresponding file object. The file can be tracked by DVC (as an output) or by Git.
The exact type of file object depends on the mode
used. For more details,
please refer to Python's open()
built-in, which is used under the hood.
This function makes a direct connection to remote storage, so the file contents can be streamed. Your code can process the data buffer as it's streamed, which optimizes memory usage.
dvc.api.open()
may only be used as a context manager (using the with
keyword,
as shown in the examples).
Use dvc.api.read()
to load the complete file contents in a single function
call – no context manager involved. Neither function utilizes disc space.
Parameters
-
path
(required) - location and file name of the target to open, relative to the root of the project (repo
). -
repo
- specifies the location of the DVC project. It can be a URL or a file system path. Both HTTP and SSH protocols are supported for online Git repos (e.g.[user@]server:project.git
). Default: The current project is used (the current working directory tree is walked up to find it). -
rev
- Git commit (any revision such as a branch or tag name, commit hash, or experiment name). Ifrepo
is not a Git repo, this option is ignored. Default:None
(current working tree will be used) -
remote
- name of the DVC remote to look for the target data. Default: The default remote ofrepo
is used if aremote
argument is not given. For local projects, the cache is tried before the default remote. -
remote_config
- dictionary of options to pass to the DVC remote. This can be used to, for example, provide credentials to theremote
. -
config
- config dictionary to pass to the DVC project. This is merged with the existing project config and can be used to, for example, add an entirely newremote
. -
mode
- specifies the mode in which the file is opened. Defaults to"r"
(read). Mirrors the namesake parameter in builtinopen()
. -
encoding
- codec used to decode the file contents to a string. This should only be used in text mode. Defaults to"utf-8"
. Mirrors the namesake parameter in builtinopen()
.
Exceptions
-
dvc.exceptions.FileMissingError
- file inpath
is missing fromrepo
. -
dvc.exceptions.PathMissingError
-path
cannot be found inrepo
. -
dvc.exceptions.NoRemoteError
- noremote
is found.
Example: Use data or models from a DVC repository
Any file tracked in a DVC project (and stored remotely) can be processed directly in your Python code with this API. For example, an XML file tracked in a public DVC repo on GitHub can be processed like this:
from xml.sax import parse
import dvc.api
from mymodule import mySAXHandler
with dvc.api.open(
'get-started/data.xml',
repo='https://github.com/iterative/dataset-registry'
) as f:
parse(f, mySAXHandler)
Notice that we use a SAX XML parser here because
dvc.api.open()
is able to stream the data from remote storage. (The
mySAXHandler
object should handle the event-driven parsing of the document in
this case.) This increases the performance of the code (minimizing memory
usage), and is typically faster than loading the whole data into memory.
If you just needed to load the complete file contents into memory, you can use
dvc.api.read()
instead:from xml.dom.minidom import parse import dvc.api url = 'https://github.com/iterative/dataset-registry' xmldata = dvc.api.read('get-started/data.xml', repo=url) xmldom = parse(xmldata)
Example: Accessing private repos
This is just a matter of using the right repo
argument, for example an SSH URL
(requires that the
credentials are configured
locally):
import dvc.api
with dvc.api.open(
'features.dat',
repo='git@server.com:path/to/repo.git'
) as f:
# ... Process 'features'
Example: Use different versions of data
The rev
argument lets you specify any Git commit to look for an artifact. This
way any previous version, or alternative experiment can be accessed
programmatically. For example, let's say your DVC repo has tagged releases of a
CSV dataset:
import csv
import dvc.api
with dvc.api.open('clean.csv', rev='v1.1.0') as f:
reader = csv.reader(f)
# ... Process 'clean' data from version 1.1.0
Also, notice that we didn't supply a repo
argument in this example. DVC will
attempt to find a DVC project to use in the current working
directory tree, and look for the file contents of clean.csv
in its local
cache; no download will happen if found. See the
Parameters section for more info.
Example: Choose a specific remote as the data source
Sometimes we may want to choose a specific remote storage as source, for example
if the repo
has no default remote set. This can be done by providing a remote
argument:
import dvc.api
with dvc.api.open('activity.log', remote='my-s3-bucket') as f:
for line in f:
match = re.search(r'user=(\w+)', line)
# ... Process users activity log
Example: Specify credentials for remote
See remote modify for full list of remote-specific config options.
import dvc.api
remote_config = {
'access_key_id': 'mykey',
'secret_access_key': 'mysecretkey',
'session_token': 'mytoken',
}
with dvc.api.open('data', remote_config=remote_config) as f:
# ... Process data
Example: Change default remote and specify credentials for it
See remote modify for full list of remote-specific config options.
import dvc.api
config = {
'core': {'remote': 'myremote'},
'remote': {
'myremote': {
'access_key_id': 'mykey',
'secret_access_key': 'mysecretkey',
'session_token': 'mytoken',
},
},
}
with dvc.api.open('data', config=config) as f:
# ... Process data
Example: Specify the text encoding
To chose which codec to open a text file with, send an encoding
argument:
import dvc.api
with dvc.api.open('data/nlp/words_ru.txt', encoding='koi8_r') as f:
# ... Process Russian words