HDFS & WebHDFS
Start with dvc remote add to define the remote:

$ dvc remote add -d myremote hdfs://user@example.com/path
⚠️ Using HDFS with a Hadoop cluster might require additional setup. We assume
the client machine is already configured to use the cluster; specifically,
libhdfs should be installed.
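As a rough sketch of what that client setup can look like (a hedged example assuming DVC reaches HDFS through libhdfs/pyarrow; HADOOP_HOME and the native library path are placeholders for your own installation):

$ export HADOOP_HOME=/path/to/hadoop                        # your Hadoop installation
$ export CLASSPATH=$("$HADOOP_HOME"/bin/hadoop classpath --glob)
$ export ARROW_LIBHDFS_DIR="$HADOOP_HOME"/lib/native        # directory containing libhdfs.so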
HDFS configuration parameters
If any values given to the parameters below contain sensitive user info, add
them with the --local
option, so they're written to a Git-ignored config file.
- url - remote location:

  $ dvc remote modify myremote url hdfs://user@example.com/path

- user - user name to access the remote:

  $ dvc remote modify --local myremote user myuser

- kerb_ticket - path to the Kerberos ticket cache for Kerberos-secured HDFS clusters:

  $ dvc remote modify --local myremote \
        kerb_ticket /path/to/ticket/cache

- replication - replication factor for write operations on the HDFS cluster (default is 3):

  $ dvc remote modify myremote replication 2
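For reference, the commands above simply write to DVC's config files, so the result looks roughly like the sketch below (values given with --local land in the Git-ignored .dvc/config.local instead of .dvc/config):

['remote "myremote"']
    url = hdfs://user@example.com/path
    replication = 2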
WebHDFS
Using an HDFS cluster as remote storage is also supported via the WebHDFS API.
If your cluster is secured, then WebHDFS is commonly used with Kerberos and
HTTPS. To enable these for the DVC remote, set use_https and kerberos to true.
$ dvc remote add -d myremote webhdfs://example.com/path
$ dvc remote modify myremote use_https true
$ dvc remote modify myremote kerberos true
$ dvc remote modify --local myremote token SOME_BASE64_ENCODED_TOKEN
⚠️ Using WebHDFS requires enabling REST API access in the cluster: set the
config property dfs.webhdfs.enabled to true in hdfs-site.xml.
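The corresponding hdfs-site.xml entry uses the standard Hadoop property syntax:

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>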
💡 You may want to run kinit before using the remote to make sure you have an
active Kerberos session.
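For example, assuming a principal of myuser@EXAMPLE.COM (replace with your own):

$ kinit myuser@EXAMPLE.COM
$ klist    # confirm an active ticket exists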
WebHDFS configuration parameters
If any values given to the parameters below contain sensitive user info, add
them with the --local
option, so they're written to a Git-ignored config file.
- url - remote location:

  $ dvc remote modify myremote url webhdfs://user@example.com/path

  Do not provide a user in the URL with kerberos or token authentication.

- user - user name to access the remote. Do not set this with kerberos or token authentication:

  $ dvc remote modify --local myremote user myuser
- kerberos - enable Kerberos authentication (false by default):

  $ dvc remote modify myremote kerberos true
- kerberos_principal - Kerberos principal to use, in case you have multiple ones (for example service accounts). Only used if kerberos is true:

  $ dvc remote modify myremote kerberos_principal myprincipal
- proxy_to - Hadoop superuser to proxy as. The proxy user feature must be enabled on the cluster, and the user must have the correct access rights. If the cluster is secured, Kerberos must be enabled (set kerberos to true) for this to work. This parameter is incompatible with token:

  $ dvc remote modify myremote proxy_to myuser
- use_https - enables SWebHDFS (WebHDFS over HTTPS). Note that DVC still expects the protocol in url to be webhdfs://, and will fail if swebhdfs:// is used:

  $ dvc remote modify myremote use_https true
- ssl_verify - whether to verify SSL requests. Defaults to true when use_https is enabled, false otherwise:

  $ dvc remote modify myremote ssl_verify false
- token - Hadoop delegation token, as returned by the WebHDFS API (see the sketch after this list for one way to obtain it). If the cluster is secured, Kerberos must be enabled (set kerberos to true) for this to work. This parameter is incompatible with providing a user and with proxy_to:

  $ dvc remote modify myremote token "mysecret"
- password - password to use in combination with user for Basic Authentication. If you provide password you must also provide user. Since this is a password, it is recommended to store it in your local config (i.e. not in Git):

  $ dvc remote modify --local myremote password "mypassword"
- data_proxy_target - target mapping passed to the fsspec WebHDFS constructor (see https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=data_proxy#fsspec.implementations.webhdfs.WebHDFS.__init__). This enables access to a WebHDFS cluster that sits behind a High Availability proxy server, by rewriting the URL used for connecting.

  For example, if the url is webhdfs://host:port/ and you set data_proxy_target to https://host:port/gateway/cluster, then internally fsspec WebHDFS will rewrite every occurrence of https://host:port/webhdfs/v1 into https://host:port/gateway/cluster/webhdfs/v1:

  $ dvc remote modify myremote data_proxy_target "https://host:port/gateway/cluster"
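As a hedged sketch of obtaining a delegation token for the token parameter above: the standard WebHDFS REST operation GETDELEGATIONTOKEN returns one, assuming you have an active Kerberos session and curl is built with SPNEGO support (host, port and the renewer name are placeholders):

$ kinit myuser@EXAMPLE.COM
$ curl --negotiate -u : "https://host:port/webhdfs/v1/?op=GETDELEGATIONTOKEN&renewer=myuser"
# The JSON response contains {"Token": {"urlString": "..."}}; pass that string to DVC:
$ dvc remote modify --local myremote token "<urlString value>"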