Disdat can package your pipeline into a Docker container, bundling its code dependencies so that it runs the same way almost anywhere. For data science, this can mean packaging feature creation, model training, and prediction into one or more containers and running those containers on a compute cluster (such as AWS Batch). You can run your pipeline in the container through the CLI with dsdt run or through the API with disdat.api.run.
In addition to standard Python support via setuptools, the dockerizer supports Disdat pipelines that use:
Custom Python packages (not available via pip)
Linux packages via either a list of .deb packages in a deb.txt file or custom yourpkg.deb files.
R packages
Create a setup.py file:
In Disdat, dockerization relies on a setup.py file that can be used to build a source distribution (an sdist) of your code. Here is the setup.py file from our disdat-examples git repo.
Sometimes you need to add auxiliary data to your source distribution. You can do so using a MANIFEST.in file. The disdat-examples git repo has a MANIFEST.in file you can use as an example of how to include a spaCy English-language model.
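As a rough illustration of the two packaging files described above, the following sketch writes a minimal setup.py and MANIFEST.in into a project directory. The package name `my_pipeline`, the version, the dependency list, and the `my_pipeline/data` path are hypothetical placeholders; this is not the content of the files in the disdat-examples repo.

```python
from pathlib import Path
from textwrap import dedent

# Hypothetical setup.py: the name, version, and requirements are placeholders.
SETUP_PY = dedent('''\
    from setuptools import setup, find_packages

    setup(
        name="my_pipeline",
        version="0.1.0",
        packages=find_packages(),
        include_package_data=True,  # ship package files matched by MANIFEST.in
        install_requires=["disdat"],
    )
''')

# Hypothetical MANIFEST.in: pull auxiliary data files into the sdist.
MANIFEST_IN = dedent('''\
    recursive-include my_pipeline/data *
''')


def write_packaging_files(project_home):
    """Write setup.py and MANIFEST.in under project_home and return their paths."""
    root = Path(project_home)
    root.mkdir(parents=True, exist_ok=True)
    setup_path = root / "setup.py"
    manifest_path = root / "MANIFEST.in"
    setup_path.write_text(SETUP_PY)
    manifest_path.write_text(MANIFEST_IN)
    return setup_path, manifest_path
```

After calling `write_packaging_files("my_project")`, running `python setup.py sdist` inside that directory should produce a source distribution, provided the package directory itself exists.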
Optional Installs via the Config Directory
You can add R, Linux, and other source Python packages (sdists) to your image by creating a configuration directory. The directory can have any name (here we call it config). Place the relevant directives under $PROJECT_HOME/config/.
Optional R installs
Create a file $PROJECT_HOME/config/python-3.6.8-slim/r.txt. For example, this file could list MASS, quantreg, forecast, dplyr, tidyr, data.table, timetk, and lubridate as packages to install.
Optional Linux installs
List your Debian packages in a `deb.txt` file such as $PROJECT_HOME/config/python-3.6.8-slim/deb.txt. Alternatively, you may place Debian package (.deb) files directly in $PROJECT_HOME/config/python-3.6.8-slim/.
Optional source Python packages
Place sdists under $PROJECT_HOME/config/python-sdist/ so that they are included in the Docker image's virtual environment.
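The configuration directory layout described in this section can be sketched with a small scaffolding helper. This is only an illustration under stated assumptions: the text does not specify the file format of r.txt or deb.txt, so the sketch assumes one package name per line, and the Debian package names are arbitrary examples.

```python
from pathlib import Path

# R packages named in the text above.
R_PACKAGES = ["MASS", "quantreg", "forecast", "dplyr",
              "tidyr", "data.table", "timetk", "lubridate"]

# Illustrative Debian package names (an assumption, not taken from Disdat).
DEB_PACKAGES = ["libxml2-dev", "libgomp1"]


def scaffold_config(project_home, os_dir="python-3.6.8-slim"):
    """Create config/<os_dir>/{r.txt,deb.txt} and config/python-sdist/."""
    config = Path(project_home) / "config"
    os_config = config / os_dir
    sdist_dir = config / "python-sdist"
    os_config.mkdir(parents=True, exist_ok=True)
    sdist_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: one package name per line in each listing file.
    (os_config / "r.txt").write_text("\n".join(R_PACKAGES) + "\n")
    (os_config / "deb.txt").write_text("\n".join(DEB_PACKAGES) + "\n")
    return config
```

Custom .deb files would then be copied next to deb.txt, and any sdist archives into the python-sdist directory.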
Usage
Options
Examples
Create a container without a config directory
Create a container with optional installs specified from a config directory
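The two cases above, with and without a config directory, can be sketched as a small helper that assembles the command line. The `--config-dir` flag and the `pipeline_root` positional argument come from the option listing on this page; the `dsdt dockerize` spelling of the command is an assumption based on the text's references to `dsdt run` and "the dockerize command", and the `disdat-examples` paths are placeholders.

```python
import shlex


def dockerize_command(pipeline_root, config_dir=None):
    """Build the argv list for a dockerize invocation (nothing is executed)."""
    argv = ["dsdt", "dockerize"]  # assumed command spelling
    if config_dir is not None:
        argv += ["--config-dir", str(config_dir)]
    argv.append(str(pipeline_root))
    return argv


# Without a config directory.
plain = dockerize_command("disdat-examples")
# With optional installs taken from a config directory.
with_config = dockerize_command("disdat-examples",
                                config_dir="disdat-examples/config")

print(shlex.join(plain))
print(shlex.join(with_config))
```

Building the argument list separately from running it keeps the sketch easy to inspect; the list can be passed to `subprocess.run` once the command looks right.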
Push a container to AWS ECR without building it
Users can push Disdat Docker images to AWS Elastic Container Registry (ECR) using the --push option to the dockerize command:
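A minimal sketch of invoking the push from Python follows. It assumes the command is spelled `dsdt dockerize`, that `dsdt` is on the PATH, and that AWS credentials and the Disdat configuration are already in place; the helper name `push_image` and its `dry_run` parameter are hypothetical.

```python
import subprocess


def push_image(pipeline_root, dry_run=True):
    """Return the push command; run it only when dry_run is False."""
    # --push is taken from the option listing; the subcommand name is assumed.
    argv = ["dsdt", "dockerize", "--push", str(pipeline_root)]
    if not dry_run:
        # Raises CalledProcessError if the command exits with a non-zero status.
        subprocess.run(argv, check=True)
    return argv
```

Calling `push_image("disdat-examples")` only returns the command, which makes it safe to check before running it for real with `dry_run=False`.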
To push images to a Docker registry, the user needs to specify the registry prefix in the Disdat configuration file. We assume that the user has first followed the instructions in the AWS ECR documentation for setting up the AWS credentials and profiles required to access AWS services; the dockerizer uses the current AWS profile to determine the ECR URL and obtain Docker server authentication tokens. The user then sets the following options in the [docker] stanza of the Disdat configuration file:
registry: Set to *ECR*.
repository_prefix: An optional prefix of the form a/b/[...] that dockerize will prepend to the image name. Given a name in the setup.py file like disdat-examples and a prefix a/b, dockerize will push
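For illustration, the two [docker] options listed above could be rendered with Python's configparser as in the sketch below. The option names and the stanza name come from the text; the value `*ECR*` is written exactly as it appears there, and whether the asterisks are literal should be checked against your own Disdat configuration file. The helper name `docker_stanza` is hypothetical.

```python
import configparser
import io


def docker_stanza(repository_prefix=None):
    """Render a [docker] stanza as text, with an optional repository prefix."""
    config = configparser.ConfigParser()
    # Value copied verbatim from the text; confirm the exact spelling locally.
    config["docker"] = {"registry": "*ECR*"}
    if repository_prefix:
        config["docker"]["repository_prefix"] = repository_prefix
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(docker_stanza("a/b"))
```

The rendered text can be merged by hand into the existing configuration file rather than overwriting it, since that file may contain other stanzas.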
positional arguments:
  pipeline_root         Root of the Python source tree containing the
                        user-defined transform; must have a setuptools-style
                        setup.py file

optional arguments:
  -h, --help            show this help message and exit
  --config-dir CONFIG_DIR
                        A directory containing configuration files for the
                        operating system within the Docker image
  --os-type OS_TYPE     The base operating system type for the Docker image
  --os-version OS_VERSION
                        The base operating system version for the Docker image
  --push                Push the image to a remote Docker registry (default is
                        to not push; must set 'docker_registry' in Disdat
                        config)
  --get-id              Do not build, only return latest container image ID
  --sagemaker           Create a Docker image executable as a SageMaker
                        container.
  --no-build            Do not build an image (only copy files into the Docker
                        build context)