> For the complete documentation index, see [llms.txt](https://disdat.gitbook.io/disdat-documentation/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://disdat.gitbook.io/disdat-documentation/running-pipelines-on-aws.md).

# Running Pipelines on AWS

## `Run`: The Docker container executor

Disdat provides a `run` command for executing Docker containers from images created with the `dockerize` command. `run` can launch containers on a local Docker server, or queue remote jobs to run on the AWS Batch service.

Disdat assumes that images running in containers do not have access to persistent local storage, and thus reads input bundles from a remote and writes output bundles to a remote. The user should first configure a remote bundle repository with the `remote` command before using `run` to execute transforms.

::

```
$ dsdt remote local_distat_context  s3://s3-bucket/path/to/remote/s3/folder
$ dsdt run input_bundle output_bundle user.module.PipeClass
```

### Using `run`

Users execute transforms in Docker containers using the `run` command:

::

```
dsdt run [-h] [--backend {Local,AWSBatch}] [--force] [--no-push-input]
         [--use-aws-session-token AWS_SESSION_TOKEN_DURATION]
         [-it INPUT_TAG] [-ot OUTPUT_TAG]
         input_bundle output_bundle pipe_cls ...
```

`input_bundle` specifies the name of the bundle in the current context to send to the transform as input. `run` will push the input bundle to the remote before executing the transform; the user must first commit the bundle with the `commit` command if it is not already committed.

`output_bundle` specifies the name of the bundle in the current context to receive from the transform as output. `run` will push the output bundle from inside the Docker container of the transform to the remote, and then pull the bundle back to the local context. If the bundle does not already exist, `run` will create a new bundle, otherwise it will create a new version of the existing bundle.

`pipe_class` specifies the fully-qualified class name of the transform. The module and class name of the transform should follow standard Python naming conventions, i.e., the transform named `module.submodule.PipeClass` should be defined as `class PipeClass` in the file `module/submodule.py` under `pipe_root`.

`--backend` specifies the execution backend for the Docker containers.

* `Local` launches containers on a local Docker server.
* `AWSBatch` queues launch requests at an AWS Batch queue.

`--no-push-input` instructs `run` not to push the input bundle to the remote. The user should ensure that the remote copy of the bundle contains the correct version of the input to use.

`--use-aws-session-token AWS_SESSION_TOKEN_DURATION` creates and uses temporary AWS credentials within a container running on AWS Batch to obtain read/write access to S3 from within the container.

To push and pull bundles to and from a remote, the user needs to have appropriate AWS credentials to access the S3 bucket holding the remote. We assume that the user has first followed the instructions in the AWS command-line documentation for setting up AWS credentials and profiles required to access AWS services, as `run` uses the same mechanisms as the AWS command-line tool to access S3.

### Configuring AWS Batch to run Disdat images

To queue remote jobs to run Disdat images on AWS Batch, the user needs to have access to a configured Batch compute environment and job queue. The user should refer to the AWS Batch documentation for instruction on how to configure Batch. Once the user has a configured Batch compute environment and job queue, they should set the following configuration option in the Disdat configuration file in the `[run]` stanza:

* `aws_batch_queue`: The name of the AWS Batch queue to which the user will

  submit jobs. Disdat will automatically create a Batch "job definition" as

  necessary to queue jobs for a transform.

.. caution::

```
As stated before, Disdat reads input bundles from a remote and writes
output bundles to a remote when executing images in a container. The
EC2 instances in the Batch compute environment must therefore have read/
write access to the S3 bucket holding the remote. The easiest way to
accomplish this is the ensure that the AWS IAM role that owns the EC2
instances (the "Compute resources: Instance role", which by default is
``ecsInstanceRole``) has S3 read/write access. Alternatively, the user
can use the ``--use-aws-session-token`` to create and use temporary
AWS credentials within the container to obtain read/write access to S3.
```

Once the user has configured Batch, they can queue jobs using the `--backend AWSBatch` option to the `run` command:

::

```
dsdt run --backend AWSBatch input_bundle output_bundle pipe_cls
```

Given a transform named `module.submodule[.submodule].PipeClass`, `run` will create a Batch job definition named `disdat-module-submodule-job-definition` that eventually launches a container from the ECR images with the suffix `disdat-module-submodule[-submodule]:latest`, sends `input_bundle` to the transform in the container as the input, and receive `output_bundle` at the remote as the output.

.. note::

```
Because Batch is by design a non-interactive execution backend, ``run``
will not block to pull the output bundle back to the local context.
Instead, the user must monitor the job status using the Batch dashboard,
and pull the output bundle with ``dsdt pull``.
```


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://disdat.gitbook.io/disdat-documentation/running-pipelines-on-aws.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
