
HDFS & WebHDFS

Start with dvc remote add to define the remote:

$ dvc remote add -d myremote hdfs://user@example.com/path

⚠️ Using HDFS with a Hadoop cluster might require additional setup. We assume that the client is already configured to use the cluster; specifically, libhdfs should be installed.
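
As a rough sketch of such a client setup (assuming a pip-based install; the paths below are illustrative and depend on your Hadoop distribution):

$ pip install 'dvc[hdfs]'
$ export HADOOP_HOME=/opt/hadoop                            # assumed Hadoop install location
$ export JAVA_HOME=/usr/lib/jvm/java-11-openjdk             # assumed JVM location
$ export CLASSPATH=$("$HADOOP_HOME"/bin/hadoop classpath --glob)
$ export ARROW_LIBHDFS_DIR="$HADOOP_HOME"/lib/native        # directory containing libhdfs.so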

HDFS configuration parameters

If any values given to the parameters below contain sensitive user info, add them with the --local option, so they're written to a Git-ignored config file.

  • url - remote location:

    $ dvc remote modify myremote url hdfs://user@example.com/path
  • user - user name to access the remote.

    $ dvc remote modify --local myremote user myuser
  • kerb_ticket - path to the Kerberos ticket cache for Kerberos-secured HDFS clusters:

    $ dvc remote modify --local myremote \
                                kerb_ticket /path/to/ticket/cache
  • replication - replication factor for write operations on the HDFS cluster. The default value is 3.

    $ dvc remote modify myremote replication 2
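
For reference, after commands like the ones above, the non-sensitive settings end up in .dvc/config roughly as shown below; values added with --local (such as user and kerb_ticket) are written to the Git-ignored .dvc/config.local instead. This is an illustrative sketch, not output copied from a real project:

['remote "myremote"']
    url = hdfs://user@example.com/path
    replication = 2
[core]
    remote = myremote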

WebHDFS

Using an HDFS cluster as remote storage is also supported via the WebHDFS API.

If your cluster is secured, then WebHDFS is commonly used with Kerberos and HTTPS. To enable these for the DVC remote, set use_https and kerberos to true.

$ dvc remote add -d myremote webhdfs://example.com/path
$ dvc remote modify myremote use_https true
$ dvc remote modify myremote kerberos true
$ dvc remote modify --local myremote token SOME_BASE64_ENCODED_TOKEN

⚠️ Using WebHDFS requires enabling REST API access in the cluster: set the config property dfs.webhdfs.enabled to true in hdfs-site.xml.
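
That setting is the standard Hadoop property block in hdfs-site.xml, shown here only to illustrate the option mentioned above:

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>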

💡 You may want to run kinit before using the remote to make sure you have an active Kerberos session.
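
For example, with a hypothetical principal (replace it with your own), that could look like this:

$ kinit myuser@EXAMPLE.COM    # obtain a ticket for the (hypothetical) principal
$ klist                       # verify that the ticket cache now holds a valid ticket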

WebHDFS configuration parameters

If any values given to the parameters below contain sensitive user info, add them with the --local option, so they're written to a Git-ignored config file.

  • url - remote location:

    $ dvc remote modify myremote url webhdfs://user@example.com/path

    Do not provide a user in the URL with kerberos or token authentication.

  • user - user name to access the remote. Do not set this with kerberos or token authentication.

    $ dvc remote modify --local myremote user myuser
  • kerberos - enable Kerberos authentication (false by default):

    $ dvc remote modify myremote kerberos true
  • kerberos_principal - Kerberos principal to use, in case you have multiple ones (for example service accounts). Only used if kerberos is true.

    $ dvc remote modify myremote kerberos_principal myprincipal
  • proxy_to - Hadoop superuser to proxy as. The proxy user feature must be enabled on the cluster, and the user must have the correct access rights. If the cluster is secured, Kerberos must be enabled (set kerberos to true) for this to work. This parameter is incompatible with token.

    $ dvc remote modify myremote proxy_to myuser
  • use_https - enables SWebHdfs. Note that DVC still expects the protocol in url to be webhdfs://, and will fail if swebhdfs:// is used.

    $ dvc remote modify myremote use_https true
  • ssl_verify - whether to verify SSL requests. Defaults to true when use_https is enabled, false otherwise.

    $ dvc remote modify myremote ssl_verify false
  • token - Hadoop delegation token (as returned by the WebHDFS API). If the cluster is secured, Kerberos must be enabled (set kerberos to true) for this to work. This parameter is incompatible with providing a user and with proxy_to.

    $ dvc remote modify myremote token "mysecret"
  • password - password to use in combination with user for Basic Authentication. If you provide password, you must also provide user. Since this is a secret, it is recommended to store it in your local config (i.e. not in Git):

    $ dvc remote modify --local myremote password "mypassword"
  • data_proxy_target - target mapping passed to the fsspec WebHDFS constructor (see https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=data_proxy#fsspec.implementations.webhdfs.WebHDFS.__init__ ). This enables access to a WebHDFS cluster that sits behind a High Availability proxy server by rewriting the URL used for connecting.

    For example, if you provide the url webhdfs://host:port/ and set data_proxy_target to https://host:port/gateway/cluster, then internally the fsspec WebHDFS implementation will rewrite every occurrence of https://host:port/webhdfs/v1 into https://host:port/gateway/cluster/webhdfs/v1.

    $ dvc remote modify myremote data_proxy_target "https://host:port/gateway/cluster"
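
    Putting the example above into commands (host, port, and the gateway path are placeholders for your own values), a remote behind such a proxy could be configured and used like this:

    $ dvc remote add -d myremote webhdfs://host:port/path
    $ dvc remote modify myremote use_https true
    $ dvc remote modify myremote data_proxy_target "https://host:port/gateway/cluster"
    $ dvc push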