Overview

Disdat is a Python (3.6.8+) package (on PyPI) for data versioning. It allows data scientists to create, share, and track data products. Disdat organizes data into bundles: collections of literal values and files produced by data science tasks such as data cleaning, featurization, model training, or prediction. Bundles are the unit at which data is versioned and shared.

Disdat-Luigi is a version of Spotify's Luigi system that we instrumented with Disdat, so you can write pipelines that automatically consume and produce versioned data.

Disdat manages data produced by data science pipelines so you don't have to. Instead of maintaining a custom naming taxonomy for each project, such as "models/2-1-18/made-with-1-1-17-data.parquet", you let Disdat manage the outputs in your local file system or S3.
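To make the contrast with hand-built paths concrete, here is a toy, in-memory sketch of the idea: data is stored under a human-readable name, every write gets its own identifier and creation order, and the latest version is looked up rather than encoded in a path. This is not Disdat's API. `ToyBundle`, `ToyContext`, `add`, `latest`, and `history` are hypothetical names invented for illustration only.

```python
import itertools
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Monotonic counter standing in for a creation timestamp (assumption for the sketch).
_clock = itertools.count()


@dataclass(frozen=True)
class ToyBundle:
    """Hypothetical stand-in for a bundle: a named, immutable snapshot of data."""
    name: str
    data: Any
    bundle_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created: int = field(default_factory=lambda: next(_clock))


class ToyContext:
    """Hypothetical stand-in for a context: keeps every version of every name."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[ToyBundle]] = {}

    def add(self, name: str, data: Any) -> ToyBundle:
        # Each call creates a new version; earlier versions are never overwritten.
        bundle = ToyBundle(name=name, data=data)
        self._versions.setdefault(name, []).append(bundle)
        return bundle

    def latest(self, name: str) -> ToyBundle:
        # The most recently added version under this name.
        return self._versions[name][-1]

    def history(self, name: str) -> List[ToyBundle]:
        # All versions under this name, oldest first.
        return list(self._versions[name])


ctx = ToyContext()
first = ctx.add("forecast.model", {"trained_on": "2017-01-01"})
second = ctx.add("forecast.model", {"trained_on": "2018-01-02"})
```

In this sketch the caller only ever asks for "forecast.model"; which version that resolves to is the store's job, not something baked into a directory name.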

With Disdat you can:

  • Use a Python API to version your own data products (see the examples).

  • Use remote data contexts to share those data products across your teams.

  • Optionally build pipelines with a form of Spotify's Luigi to automatically use versioned data.

  • Run pipelines that use popular packages and languages, including scikit-learn, TensorFlow, and R.

  • Build Docker containers from your pipelines with the Disdat Dockerizer.

  • Easily debug your workflows. Run pipelines locally or remotely, trace lineage, and more.

  • Put it in production: At Intuit, we use Disdat to define pipelines and manage data products for forecasting. Disdat currently manages tens of thousands of versioned data objects.
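As a rough illustration of what tracing lineage means, the sketch below walks a table that maps each data product to the inputs it was built from and returns all of its ancestors. The table contents, the `LINEAGE` name, and the `upstream` function are hypothetical and do not reflect how Disdat stores or exposes lineage.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical lineage table: output id -> ids of the inputs it was built from.
LINEAGE: Dict[str, Tuple[str, ...]] = {
    "raw-v1": (),
    "clean-v1": ("raw-v1",),
    "features-v1": ("clean-v1",),
    "model-v1": ("features-v1", "clean-v1"),
}


def upstream(bundle_id: str, lineage: Dict[str, Tuple[str, ...]]) -> List[str]:
    """Return every ancestor of bundle_id in breadth-first order, without duplicates."""
    seen: List[str] = []
    queue = deque(lineage.get(bundle_id, ()))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited through another path
        seen.append(current)
        queue.extend(lineage.get(current, ()))
    return seen


ancestors = upstream("model-v1", LINEAGE)
```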

For an introduction to the concepts behind Disdat, please see our OpML '19 paper, "Disdat: Bundle Data Management for Machine Learning Pipelines".

Disdat might be a good fit for your project if:

  • Python is the language you work in. Disdat supports scikit-learn, TensorFlow, and Spark-based jobs.

  • Your workflow operates in a batch fashion. Disdat bundles define data at a point in time.

  • You already use AWS (needed if you wish to share data or use our built-in job submission tool).

Quick Start

We assume you have installed virtualenvwrapper. Let's install Disdat into a new Python environment:

$ mkvirtualenv disdat
$ pip install disdat

Next, initialize Disdat. This sets up a local directory to store bundles. From this point on, you can use Disdat to author pipelines and create versioned data. Check that you're running the latest version available on PyPI:

$ dsdt init
$ dsdt --version
Running Disdat version 0.8.21
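If you prefer to check the installed version from Python rather than the command line, the standard library can report it. This is a minimal sketch that assumes Python 3.8 or later, where `importlib.metadata` is available; the helper name `installed_version` is ours, not part of Disdat.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed Disdat version, or None if Disdat is not in this environment.
print(installed_version("disdat"))
```

Compare the printed value with the latest release listed on PyPI to see whether an upgrade is needed.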

Going Further!

  • Tutorial: Create bundles, make pipelines, run containers!

  • Basic Concepts: Understand bundles, local and remote contexts, and pushing / pulling data!

  • More Examples: Check out a TensorFlow MNIST example.

  • Configuration: Yuck. Luckily most of it is about making sure you have AWS credentials.

  • Building Pipelines: Write tasks that create bundles, and hook them together into pipelines.

Authors
