Other Data Versioning Systems

You have many choices when it comes to data versioning, and we're glad you're considering Disdat.

Data versioning is a quickly evolving space. Many projects have arisen to help data scientists write and deploy pipelines, track and find data, and replay steps. Here we list the projects we know about and, where possible, give some guidance as to their strengths and weaknesses.

Disdat stands in contrast to existing systems for managing pipeline data artifacts. Many are closed, monolithic ecosystems that provide pipeline authoring, model versioning, deployment, feature storage, monitoring, and visualization. Examples include Palantir's Foundry, Facebook's FBLearner, and Uber's Michelangelo and PyML.

Closer in spirit to Disdat are MLFlow, Pachyderm, DVC, and Netflix's Metaflow, which aim to version pipeline experiments to enable reproducibility. The table below takes a first pass at comparing these systems. Note that these systems often mean slightly different things by the same feature: "replay" may mean "rerun particular steps" or it may mean "reproduce a single task's output." And where Pachyderm and DVC support git-like operations, Disdat eschews some version-control concepts, such as branching and merging, whose semantics for ML artifacts remain an open question (e.g., what does it mean to merge ML model weights between branches?). We go into detail about the Data Versioning and Data Naming features below the table.

| Feature | Disdat | Metaflow | MLFlow | Pachyderm | DVC |
| --- | --- | --- | --- | --- | --- |
| Pipeline Language | Python | Python | YAML config (many) | YAML config (many) | Python* (many) |
| Workflow | Luigi-based | Custom | User-defined | Custom | Custom |
| API Language | Python | Python* (R, Java, REST) | Python* (R, Java, REST) | Python (Go, Scala) | None |
| Data Versioning | Yes (bundles) | Yes (data artifacts) | Yes (runs) | Yes (datums) | Yes (files in git) |
| Managed Data Naming | Yes | Some (namespaces) | No | No | No |
| Data Lineage | Yes | Yes | For models | Yes | Yes |
| Replay | Some (manual) | Yes (runs) | Yes (runs) | Yes (runs) | Yes (dvc repro) |
| "Auto" Containerization | Yes | No | No* (experimental) | No | No |
| Cloud Execution | Yes (AWS) | Yes (AWS) | Databricks | No (Kubernetes) | No |

Data Versioning

Bundles: A Disdat Bundle is a logical collection of literals and files. Disdat pipelines create a Bundle each time a task runs, but users can also create them "by hand" (e.g., to publish business-process files, like Excel spreadsheets).
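
Creating a bundle by hand goes through Disdat's Python api module. Below is a minimal sketch, assuming an api.Bundle context manager and an api.context call; the context and bundle names are ours, and exact signatures may differ across Disdat versions:

```python
import disdat.api as api

api.context("examples")  # create or switch to a local context

# Assumed Bundle context-manager interface: a bundle holds literals and files.
with api.Bundle("examples", name="quarterly.report") as b:
    b.add_data({"quarter": "Q3", "source": "excel export"})  # literal values
    b.add_tags({"owner": "finance"})                         # searchable tags
```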

Many systems version data, but do so either at fixed points or with weaker naming semantics (see below).

  • Task-level: A Metaflow DataArtifact is a programming object that refers to a single output artifact. A Metaflow Task object can point to many DataArtifacts.

  • Git-like (via commits): A Pachyderm Commit refers to the set of files that a pipeline container stores in a Pachyderm repository after it runs. Pachyderm binds a task to a single container, so a container execution produces a commit. DVC relies on files in an existing git repository and git's commit mechanism.

  • Pipeline (via runs): An MLFlow Run can refer to many file outputs or artifacts; users add artifacts to a run via an API call (see the sketch below).
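
For instance, MLFlow attaches parameters and artifacts to the active run through its Python API. A minimal sketch (the file name and values are ours):

```python
import mlflow

with mlflow.start_run():
    with open("model_report.txt", "w") as f:
        f.write("accuracy: 0.93\n")
    mlflow.log_param("n_estimators", 100)    # parameter recorded on the run
    mlflow.log_artifact("model_report.txt")  # file "added" to the run
```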

Data Naming

Naming data and pipeline outputs is hard. While systems like the MLFlow API capture parameters and data outputs, users must still organize and manage their data.

A few of the ways Disdat makes this easier:

No fully qualified output paths: Disdat does not require users to determine fully qualified paths for data outputs. Unlike MLFlow or Metaflow, users never have to specify `s3://my-bucket/savin/tmp/s3demo/fruit` or `/Users/me/path/to/local/model` as outputs. Instead, users ask for "managed paths," letting Disdat manage local (or remote) storage for them. This has many benefits, among them reduced data sprawl and no worry about overwriting data.

Inside a Disdat task, users give only the basename of any file they want to include in the output:

```python
# Inside a Disdat PipeTask's pipe_run method
# (pandas would normally be imported at the top of the module):
import pandas as pd

df = pd.DataFrame({'Sharks': ['hammerhead', 'great white'],
                   'Region': ['SA', 'NA']})
my_output = self.create_output_file("marines.csv")  # basename only; Disdat picks the path
with my_output.open('w') as of:
    df.to_csv(of)
return my_output
```

Human names: Each Disdat bundle can be given a human name, such as best_model_ever. Users can then refer to this output in the CLI or the API by its human name and always get the latest version of best_model_ever. Disdat bundles have two other kinds of names, processing_name and uuid, which we discuss in more detail elsewhere in these docs.
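
As an illustration, fetching the latest version by human name through the Python API might look like the following sketch (api.get is part of Disdat's api module; treat the exact signature and the context name as assumptions):

```python
import disdat.api as api

# Returns the latest bundle named "best_model_ever" in the "examples" context.
b = api.get("examples", "best_model_ever")
print(b.name, b.uuid)
```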

Data contexts: Often we want to keep bundles organized by deployed pipeline and by whether they run in prod or in dev. In Disdat you do this with a context: simply a namespace under which you can find all the bundles placed there.
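
A sketch of using contexts as namespaces, assuming api.context and api.search from Disdat's api module (the context names are ours; verify signatures against your installed version):

```python
import disdat.api as api

api.context("prod.my_pipeline")  # namespace for production bundles
api.context("dev.my_pipeline")   # separate namespace for development work

# Find all bundles placed in the production context.
for b in api.search("prod.my_pipeline"):
    print(b.name)
```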
