Setup and Configuration

The majority of Disdat configuration has to do with using AWS resources (S3 for storing data contexts and bundles, and AWS Batch for running Disdat container pipelines). Disdat has its own configuration file as well, and that is covered below.

Setting up for AWS

Disdat uses AWS S3 as its "backing store" for each data context (and its bundles). If you want to create remotes for your local contexts, and then push and pull bundles to and from S3, you'll need to set up AWS. First, you need an AWS account; then set up your AWS credentials.

  1. Install the AWS CLI in your Python virtual environment via pip install awscli (optional, but useful for setting up your credentials in step 2).

  2. Place your AWS credentials in your ~/.aws/credentials file (see the AWS instructions); a minimal example is sketched below.
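
For reference (this is standard AWS setup, not specific to Disdat), a minimal ~/.aws/credentials file holds a [default] profile with your key pair; the values below are placeholders. You can also create it by running aws configure with the CLI installed in step 1.

[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>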

If you want to submit pipelines to the cloud, you will also need to stand up AWS Batch. That means creating 1.) a compute environment and 2.) a batch job submission queue.

Setting up Docker

Disdat can take your pipeline and create a Docker container in which it runs. To do so (and to be able to run dsdt dockerize), you need to install Docker on your system.

  • Mac: Install Docker Desktop

  • Unix (Ubuntu): Install via apt (instructions)
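
Once Docker is installed, a quick sanity check that the daemon is running (the exact output will vary with your Docker version):

$ docker --version
$ docker run hello-world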

Disdat Configuration File

Disdat stores its configuration in ~/.config/disdat/disdat.cfg. Running dsdt init creates this configuration file for you.
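
For example, to create the configuration file and take a look at it:

$ dsdt init
$ cat ~/.config/disdat/disdat.cfg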

You don't need to bother with the configuration file unless you're going to Dockerize your pipeline and then submit jobs to AWS Batch or SageMaker. If you do, you should:

  1. Set a repository_prefix. This will be a prefix for your ECR docker images and AWS Batch job descriptions.

  2. Does your pipeline need custom packages? If so, point it at a pip.conf file.

  3. Set aws_batch_queue to the name of your AWS Batch queue.

[core]
ignore_code_version=True

[docker]

# A Docker registry to which to push pipeline images. For example:
# registry = docker.io
# If using AWS ECR, you can specify "*ECR*", and disdat will determine the
# registry for you:
registry = *ECR*

# An optional Docker repository prefix to use before the (generated)
# pipeline image name when pushing images to a registry. Do *not* include
# the registry in the repository prefix. For example:
# repository_prefix = kyocum/deploy

# If specified, log into ECR before pushing an image; not necessary if the
# registry is "*ECR*"
# ecr_login =

[dockerize]
os_type = python
os_version = 3.6.8-slim
dot_pip_file = ~/.pip/pip.conf
#dot_odbc_ini_file = <some_path_to>/odbc.ini

[run]
# For AWS Batch: A job queue you set up
aws_batch_queue = disdat-batch-queue

# For AWS SageMaker: default instance type is smallest
# other valid types include:
#  'ml.m4.xlarge' | 'ml.m4.2xlarge' | 'ml.m4.4xlarge' | 'ml.m4.10xlarge' | 'ml.m4.16xlarge' | 'ml.m5.large'
#| 'ml.m5.xlarge' | 'ml.m5.2xlarge' | 'ml.m5.4xlarge' | 'ml.m5.12xlarge' | 'ml.m5.24xlarge' | 'ml.c4.xlarge'
#| 'ml.c4.2xlarge' | 'ml.c4.4xlarge' | 'ml.c4.8xlarge' | 'ml.p2.xlarge' | 'ml.p2.8xlarge' | 'ml.p2.16xlarge'
#| 'ml.p3.2xlarge' | 'ml.p3.8xlarge' | 'ml.p3.16xlarge' | 'ml.c5.xlarge' | 'ml.c5.2xlarge' | 'ml.c5.4xlarge'
#| 'ml.c5.9xlarge' | 'ml.c5.18xlarge'
aws_sagemaker_instance_type = ml.m4.xlarge

# Number of instances to run your training job
aws_sagemaker_instance_count = 1

# Note if you have a lot of inputs in your s3 input uri, they have to fit here. Since
# disdat uses other s3 paths for inputs, you can keep this small unless you're doing something special.
aws_sagemaker_volume_sizeGB = 128

# Max run time for training job -- 5 minutes default
aws_sagemaker_max_runtime_sec = 300

# An input prefix, all objects that share this prefix show up in the container
aws_sagemaker_s3_input_uri = s3://vpc-0971016e-ds-shared/dsdt/dsdt_sagemaker_input

# Disdat doesn't place models in the s3 destination bucket.   But SageMaker still needs one.
aws_sagemaker_s3_output_uri = s3://vpc-0971016e-ds-shared/dsdt/dsdt_sagemaker_output

# Role for SageMaker containers to assume to access S3, Cloud logs, etc.
aws_sagemaker_role_arn = arn:aws:iam::011152746114:role/service-role/AmazonSageMaker-ExecutionRole-20180119T100204

Developer Install

Note that in this case we create and activate a virtual environment, clone the repo, and do an *editable* install of the Disdat repo. Finally, we run a bash script that creates some files that are necessary if you wish to use the Disdat Dockerizer.

$ mkvirtualenv disdat
$ git clone git@github.com:kyocum/disdat.git
$ cd disdat
$ pip install -e .
$ ./build-dist.sh
