# Overview

![](/files/-M9RvBZE_mMnCN353XhX)

<div align="left"><img src="https://badge.fury.io/py/disdat.svg" alt=""></div>

[Disdat](https://github.com/kyocum/disdat) is a Python (3.6.8+) package (on [PyPI](https://badge.fury.io/py/disdat)) for data versioning. It allows data scientists to create, share, and track data products. Disdat organizes data into [*bundles*](/disdat-documentation/basic-concepts/bundles.md), collections of literal values and files produced by data science tasks such as data cleaning, featurization, model training, or prediction. Bundles are the unit at which data is versioned and shared.

[Disdat-Luigi](https://github.com/kyocum/disdat-luigi) is a version of Spotify's Luigi system that we instrumented with Disdat, so you can write pipelines that automatically consume and produce versioned data.

Disdat manages data produced by data science pipelines so you don't have to. Instead of maintaining a custom naming taxonomy for each project, such as "models/2-1-18/made-with-1-1-17-data.parquet", you let Disdat manage the outputs in your local file system or S3.
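To make the idea concrete, here is a minimal, self-contained sketch of versioning by name instead of by path. It is not Disdat's API or implementation: `ToyBundle`, `ToyContext`, and the example values are hypothetical names invented for illustration. The point is only that facts you would otherwise encode in a file path can travel with each version as tags.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ToyBundle:
    """Hypothetical stand-in for a bundle: a named, versioned collection of data."""
    name: str
    data: Any
    tags: Dict[str, str] = field(default_factory=dict)
    # Every version gets its own identifier, so names never need to encode dates.
    version_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ToyContext:
    """Hypothetical in-memory store that keeps every version of every bundle."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[ToyBundle]] = {}

    def add(self, name: str, data: Any,
            tags: Optional[Dict[str, str]] = None) -> ToyBundle:
        # Adding never overwrites: each call appends a new version under the name.
        bundle = ToyBundle(name=name, data=data, tags=dict(tags or {}))
        self._versions.setdefault(name, []).append(bundle)
        return bundle

    def latest(self, name: str) -> Optional[ToyBundle]:
        # The most recently added version, or None if the name is unknown.
        versions = self._versions.get(name, [])
        return versions[-1] if versions else None

    def history(self, name: str) -> List[ToyBundle]:
        # All versions in the order they were added.
        return list(self._versions.get(name, []))


ctx = ToyContext()
# Instead of "models/2-1-18/made-with-1-1-17-data.parquet", record the facts as tags.
first = ctx.add("model", {"weights": [0.1, 0.2]}, tags={"training_data": "1-1-17"})
second = ctx.add("model", {"weights": [0.3, 0.4]}, tags={"training_data": "2-1-18"})

print(ctx.latest("model").tags["training_data"])
print(len(ctx.history("model")))
```

In this sketch the name `model` stays stable while the store tracks which version is current. Disdat itself additionally persists bundles to the local file system or S3, which this toy example does not attempt.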

With Disdat you can:

* Use a Python API to version your own data products ([examples](https://github.com/seanr15/disdat-examples)).
* Use remote data contexts to share those data products across your teams.
* Optionally build [pipelines](/disdat-documentation/building-pipelines.md) with a form of [Spotify's Luigi](https://github.com/spotify/luigi) to automatically use versioned data.
* Build pipelines with popular packages, including scikit-learn, TensorFlow, and R.
* Build Docker containers from your pipelines with the Disdat [Dockerizer](/disdat-documentation/reference/dsdt-the-cli/dsdt-dockerize.md).
* Easily debug your workflows. Run pipelines locally or remotely, trace lineage, and more.
* Put it in production: At Intuit, we use Disdat to define pipelines and manage data products for forecasting. Disdat currently manages tens of thousands of versioned data objects.

For an introduction to the concepts behind Disdat, please see our OpML19 paper, [*Disdat: Bundle Data Management for Machine Learning Pipelines*](https://www.usenix.org/conference/opml19/presentation/yocum).

Disdat might be a good fit for your project if:

* Python is the language you work in. Disdat supports scikit-learn, TensorFlow, and Spark-based jobs.
* Your workflow operates in a batch fashion. Disdat bundles define data at a point in time.
* You already use AWS (needed if you wish to share data or use our built-in job submission tool).

## Quick Start

We assume you have installed [virtualenvwrapper](https://virtualenvwrapper.readthedocs.io/en/latest/).  Let's install Disdat into a new Python environment:

```
$ mkvirtualenv disdat
$ pip install disdat
```

Next, initialize Disdat. This sets up a local directory to store bundles. After this point you can start using Disdat to author pipelines and create versioned data. Check that you're running the latest version available on PyPI:

```
$ dsdt init
$ dsdt --version
Running Disdat version 0.8.21
```
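If you would rather confirm the installation from Python than from the shell, the following sketch reads the installed package version using only the standard library. It assumes Python 3.8 or newer, where `importlib.metadata` is available, and `installed_version` is a helper name made up for this example, not part of Disdat.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("disdat")
    if version is None:
        print("disdat is not installed in this environment")
    else:
        print(f"disdat {version} is installed")
```

Run it inside the `disdat` virtual environment created above. The version it reports should match the output of `dsdt --version`.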

## Going Further!

* [**Tutorial**](/disdat-documentation/examples/short-test-drive.md)**:** Create bundles, make pipelines, run containers!
* [**Basic Concepts**](/disdat-documentation/basic-concepts/bundles.md)**:** Understand bundles, local and remote contexts, and pushing / pulling data!
* [**More Examples**](/disdat-documentation/examples/untitled-1.md)**:** Check out a TensorFlow MNIST example.
* [**Configuration**](/disdat-documentation/setup-and-configuration.md)**:** Yuck. Luckily most of it is about making sure you have AWS credentials.
* [**Building Pipelines**](/disdat-documentation/building-pipelines.md)**:** Write tasks that create bundles, and hook them together into pipelines.

## Authors

Disdat could not have come to be without the support of Human Longevity, Inc. <img src="/files/-Lxs8fV_BXap56p5Cekx" alt="" data-size="line"> and Intuit, Inc. <img src="/files/-Lxs8xFbhQ7w8z6lgXoV" alt="" data-size="line">. It has benefited from numerous discussions, code contributions, and emotional support from Sean Rowan, Ted Wong, Jonathon Lunt, Jason Knight, and Axel Bernal.

