Asynchronous Job Pattern¶

Introduction¶

If triggering an external job (e.g. an Amazon EMR job) from Argo that does not run to completion in a container, there are two options:

create a container that polls the external job completion status
combine a trigger step that starts the job with a suspend step that is resumed by an API call to Argo when the external job is complete.

This document describes the second option in more detail.

The pattern¶

The pattern involves two steps - the first step is a short-running step that triggers a long-running job outside Argo (e.g. an HTTP submission), and the second step is a suspend step that suspends workflow execution and is ultimately either resumed or stopped (i.e. failed) via a call to the Argo API when the job outside Argo succeeds or fails.

When implemented as a WorkflowTemplate it can look something like this:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: external-job-template
spec:
  entrypoint: run-external-job
  arguments:
    parameters:
      - name: "job-cmd"
  templates:
    - name: run-external-job
      inputs:
        parameters:
          - name: "job-cmd"
            value: "{{workflow.parameters.job-cmd}}"
      steps:
        - - name: trigger-job
            template: trigger-job
            arguments:
              parameters:
                - name: "job-cmd"
                  value: "{{inputs.parameters.job-cmd}}"
        - - name: wait-completion
            template: wait-completion
            arguments:
              parameters:
                - name: uuid
                  value: "{{steps.trigger-job.outputs.result}}"

    - name: trigger-job
      inputs:
        parameters:
          - name: "job-cmd"
      container:
        image: appropriate/curl:latest
        command: [ "/bin/sh", "-c" ]
        args: [ "{{inputs.parameters.job-cmd}}" ]

    - name: wait-completion
      inputs:
        parameters:
          - name: uuid
      suspend: { }

In this case the job-cmd parameter can be a command that makes an HTTP call via curl to an endpoint that returns a job UUID. More sophisticated submission and parsing of submission output could be done with something like a Python script step.

On job completion the external job would need to call either resume if successful:

You may need an access token.

curl --request PUT \
  --url https://localhost:2746/api/v1/workflows/<NAMESPACE>/<WORKFLOWNAME>/resume
  --header 'content-type: application/json' \
  --header "Authorization: $ARGO_TOKEN" \
  --data '{
      "namespace": "<NAMESPACE>",
      "name": "<WORKFLOWNAME>",
      "nodeFieldSelector": "inputs.parameters.uuid.value=<UUID>"
    }'

or stop if unsuccessful:

curl --request PUT \
  --url https://localhost:2746/api/v1/workflows/<NAMESPACE>/<WORKFLOWNAME>/stop
  --header 'content-type: application/json' \
  --header "Authorization: $ARGO_TOKEN" \
  --data '{
      "namespace": "<NAMESPACE>",
      "name": "<WORKFLOWNAME>",
      "nodeFieldSelector": "inputs.parameters.uuid.value=<UUID>",
      "message": "<FAILURE-MESSAGE>"
    }'

Retrying failed jobs¶

Using argo retry on failed jobs that follow this pattern will cause Argo to re-attempt the suspend step without re-triggering the job.

Instead you need to use the --restart-successful option, e.g. if using the template from above:

argo retry <WORKFLOWNAME> --restart-successful --node-field-selector templateRef.template=run-external-job,phase=Failed

Asynchronous Job Pattern¶

Introduction¶

The pattern¶

Retrying failed jobs¶

Comments