
webHDFS via HTTP artifacts

webHDFS is a protocol that provides access to Hadoop or similar data storage through a unified REST API.

Input Artifacts

You can use HTTP artifacts to connect to webHDFS; the URL is the webHDFS endpoint, including the file path and any query parameters. Suppose your webHDFS endpoint is available at https://mywebhdfsprovider.com/webhdfs/v1/ and you want to use the file my-art.txt, located in a data folder, as an input artifact. To construct the URL, append the file path to the base webHDFS endpoint and set the OPEN operation via the op query parameter. The result is https://mywebhdfsprovider.com/webhdfs/v1/data/my-art.txt?op=OPEN. The following Workflow spec downloads the specified webHDFS artifact to the specified path:

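spec:
  # ...
  inputs:
    artifacts:
    - name: my-art
      # Path inside the container where the downloaded file is placed
      path: /my-artifact
      http:
        # Base endpoint + file path + OPEN operation
        url: "https://mywebhdfsprovider.com/webhdfs/v1/data/my-art.txt?op=OPEN"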

Additional fields can be set for HTTP artifacts (for example, headers). See usage in the full webHDFS example.
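
As a minimal sketch of what such an additional field can look like, the fragment below adds a custom header to the input artifact. The header name and value are hypothetical placeholders, and the name/value list layout of headers is an assumption that you should check against the field reference for your version:

```yaml
spec:
  # ...
  inputs:
    artifacts:
    - name: my-art
      path: /my-artifact
      http:
        url: "https://mywebhdfsprovider.com/webhdfs/v1/data/my-art.txt?op=OPEN"
        headers:
        # Hypothetical header; replace with whatever your provider expects
        - name: X-Custom-Header
          value: custom-value
```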

Output Artifacts

To declare a webHDFS output artifact, use the CREATE operation instead and set the file path to the desired location. In the example below, the artifact is stored at outputs/newfile.txt. To overwrite an existing file, add overwrite=true to the query parameters.

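spec:
  # ...
  outputs:
    artifacts:
    - name: my-art
      # Path inside the container of the file to upload
      path: /my-artifact
      http:
        # CREATE writes the file; overwrite=true replaces an existing one
        url: "https://mywebhdfsprovider.com/webhdfs/v1/outputs/newfile.txt?op=CREATE&overwrite=true"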

Authentication

The above examples show minimal use cases without authentication. In a real-world scenario, however, you will likely need authentication. The available mechanisms are limited to those supported by HTTP artifacts:

  • HTTP Basic Auth
  • OAuth2
  • Client Certificates

Examples for the latter two mechanisms can be found in the full webHDFS example.
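
For the first mechanism, HTTP Basic Auth, a configuration could look roughly like the sketch below. The Secret name my-webhdfs-credentials and its keys are hypothetical, and the auth.basicAuth field layout is an assumption to verify against the field reference for your version:

```yaml
spec:
  # ...
  inputs:
    artifacts:
    - name: my-art
      path: /my-artifact
      http:
        url: "https://mywebhdfsprovider.com/webhdfs/v1/data/my-art.txt?op=OPEN"
        auth:
          basicAuth:
            # Hypothetical Secret holding the credentials; it must exist
            # in the namespace where the Workflow runs
            usernameSecret:
              name: my-webhdfs-credentials
              key: username
            passwordSecret:
              name: my-webhdfs-credentials
              key: password
```

Referencing a Secret keeps the credentials out of the Workflow manifest itself.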

Provider dependent

While your webHDFS provider may support the above mechanisms, Hadoop itself only supports authentication via Kerberos SPNEGO and Hadoop delegation tokens. HTTP artifacts do not currently support SPNEGO, but a delegation token can be passed via the delegation query parameter.
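
A minimal sketch of the delegation token approach, reusing the input artifact from above. The value <your-delegation-token> is a placeholder that you need to replace with a token issued by your Hadoop cluster:

```yaml
spec:
  # ...
  inputs:
    artifacts:
    - name: my-art
      path: /my-artifact
      http:
        # The delegation token is appended as an additional query parameter
        url: "https://mywebhdfsprovider.com/webhdfs/v1/data/my-art.txt?op=OPEN&delegation=<your-delegation-token>"
```

Note that the token is part of the URL and therefore appears in plain text in the Workflow manifest, so handle the manifest accordingly.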
