Overview

Last updated 2 years ago

Disdat is a Python (3.6.8+) package (on PyPI) for data versioning. It allows data scientists to create, share, and track data products. Disdat organizes data into bundles, collections of literal values and files produced by data science tasks, such as data cleaning, featurization, model training, or prediction. Bundles are the unit at which data is versioned and shared.

Disdat-Luigi is a version of Spotify's Luigi system that we instrumented with Disdat, so you can write pipelines that automatically consume and produce versioned data.

Disdat manages data produced by data science pipelines so you don't have to. Instead of maintaining a custom naming taxonomy for each project, such as "models/2-1-18/made-with-1-1-17-data.parquet", Disdat manages the outputs in your local file system or S3 for you.

With Disdat you can:

  • Use a Python API to version your own data products (see the examples).

  • Use remote data contexts to share those data products across your teams.

  • Optionally build pipelines with a form of Spotify's Luigi to automatically use versioned data.

  • Use popular packages in your pipelines, including scikit-learn, TensorFlow, and R.

  • Build Docker containers from your pipelines with the Disdat Dockerizer.

  • Easily debug your workflows. Run pipelines locally or remotely, trace lineage, and more.

  • Put it in production: At Intuit, we use Disdat to define pipelines and manage data products for forecasting. Disdat currently manages tens of thousands of versioned data objects.

Disdat might be a good fit for your project if:

  • You work in Python. Disdat supports scikit-learn, TensorFlow, and Spark-based jobs.

  • Your workflow operates in a batch fashion. Disdat bundles define data at a point in time.

  • You already use AWS (needed if you wish to share data or use our built-in job submission tool).

For an introduction to the concepts behind Disdat, please see our OpML19 paper, "Disdat: Bundle Data Management for Machine Learning Pipelines".

Quick Start

We assume you have installed virtualenvwrapper. Let's install Disdat into a new Python environment:

$ mkvirtualenv disdat
$ pip install disdat

Next, initialize Disdat. This sets up a local file directory to store bundles. After this point you can start using Disdat to author pipelines and create versioned data. Check that you're running the latest version available on PyPI:

$ dsdt init
$ dsdt --version
Running Disdat version 0.8.21

Going Further!

  • Tutorial: Create bundles, make pipelines, run containers!

  • Basic Concepts: Understand bundles, local and remote contexts, and pushing and pulling data!

  • More Examples: Check out a TensorFlow MNIST example.

  • Configuration: Yuck. Luckily most of it is about making sure you have AWS credentials.

  • Building Pipelines: Write tasks that create bundles, and hook them together into pipelines.

Authors

Disdat could not have come to be without the support of Human Longevity, Inc. and Intuit, Inc. It has benefited from numerous discussions, code contributions, and emotional support from Sean Rowan, Ted Wong, Jonathon Lunt, Jason Knight, and Axel Bernal.