Running Pipelines on AWS
Run: The Docker container executor
Disdat provides a run command for executing Docker containers from images created with the dockerize command. run can launch containers on a local Docker server, or queue remote jobs to run on the AWS Batch service. Disdat assumes that images running in containers do not have access to persistent local storage, and thus reads input bundles from a remote and writes output bundles to a remote. The user should first configure a remote bundle repository with the remote command before using run to execute transforms.
Using run

Users execute transforms in Docker containers using the run command.
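The block below is a minimal sketch of the command's general shape, assuming the dsdt command-line entry point and reconstructed from the arguments and options described in this section; the exact spelling and ordering of flags may differ between Disdat versions, so consult dsdt run --help::

    $ dsdt run [--backend {Local,AWSBatch}] [--no-push-input] \
          [--use-aws-session-token AWS_SESSION_TOKEN_DURATION] \
          input_bundle output_bundle pipe_class

The positional arguments and options are described below.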
input_bundle: specifies the name of the bundle in the current context to send to the transform as input. run will push the input bundle to the remote before executing the transform; the user must first commit the bundle with the commit command if it is not already committed.
output_bundle: specifies the name of the bundle in the current context to receive from the transform as output. run will push the output bundle from inside the transform's Docker container to the remote, and then pull the bundle back to the local context. If the bundle does not already exist, run will create a new bundle; otherwise it will create a new version of the existing bundle.
pipe_class: specifies the fully-qualified class name of the transform. The module and class name of the transform should follow standard Python naming conventions, i.e., the transform named module.submodule.PipeClass should be defined as class PipeClass in the file module/submodule.py under pipe_root.
--backend: specifies the execution backend for the Docker containers. Local launches containers on a local Docker server; AWSBatch queues launch requests on an AWS Batch queue.
--no-push-input: instructs run not to push the input bundle to the remote. The user should ensure that the remote copy of the bundle contains the correct version of the input to use.
--use-aws-session-token AWS_SESSION_TOKEN_DURATION: creates and uses temporary AWS credentials so that a container running on AWS Batch can obtain read/write access to S3.
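As an illustration only, a local run that skips pushing the input bundle might look like the following; the bundle names and transform class are hypothetical, and flag spellings may vary between Disdat versions::

    $ dsdt run --backend Local --no-push-input \
          my.input.bundle my.output.bundle examples.pipelines.MyPipe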
To push and pull bundles to and from a remote, the user needs appropriate AWS credentials to access the S3 bucket holding the remote. We assume that the user has first followed the instructions in the AWS command-line documentation for setting up the AWS credentials and profiles required to access AWS services, as run uses the same mechanisms as the AWS command-line tool to access S3.
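For reference, credentials are typically supplied through the standard AWS shared credentials file (~/.aws/credentials), which Disdat reads through the same mechanism as the AWS CLI; the values below are the placeholder keys used in the AWS documentation::

    [default]
    aws_access_key_id = AKIAIOSFODNN7EXAMPLE
    aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY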
Configuring AWS Batch to run Disdat images
To queue remote jobs that run Disdat images on AWS Batch, the user needs access to a configured Batch compute environment and job queue. The user should refer to the AWS Batch documentation for instructions on how to configure Batch. Once the user has a configured Batch compute environment and job queue, they should set the following configuration option in the [run] stanza of the Disdat configuration file:
aws_batch_queue: The name of the AWS Batch queue to which the user will submit jobs. Disdat will automatically create a Batch "job definition" as necessary to queue jobs for a transform.
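For example, assuming a hypothetical Batch job queue named disdat-batch-queue, the [run] stanza would read::

    [run]
    aws_batch_queue = disdat-batch-queue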
Once the user has configured Batch, they can queue jobs using the --backend AWSBatch option to the run command.
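A sketch of such an invocation, again with hypothetical bundle names and transform class (check dsdt run --help for the exact argument order in your Disdat version)::

    $ dsdt run --backend AWSBatch \
          my.input.bundle my.output.bundle module.submodule.PipeClass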
Given a transform named module.submodule[.submodule].PipeClass, run will create a Batch job definition named disdat-module-submodule-job-definition that eventually launches a container from the ECR image with the suffix disdat-module-submodule[-submodule]:latest, sends input_bundle to the transform in the container as the input, and receives output_bundle at the remote as the output.