Other Data Versioning Systems
You have many choices when it comes to data versioning, and we're glad you're considering Disdat.
Data versioning is a quickly evolving space: many projects have emerged to help data scientists write and deploy pipelines, track and find data, and replay steps. Here we list the projects we know about and, where possible, give some guidance on their strengths and weaknesses.
Disdat stands in contrast to existing systems for managing pipeline data artifacts. Many are closed, monolithic ecosystems that provide pipeline authoring, model versioning, deployment, feature storage, monitoring, and visualization. Examples include Palantir's Foundry, Facebook's FBLearner, and Uber's Michelangelo and PyML.
Closer in spirit to Disdat are MLFlow, Pachyderm, DVC, and Netflix's Metaflow, which aim to version pipeline experiments to enable reproducibility. The table below takes a first pass at a comparison among these systems. Note that these systems often mean slightly different things by the same feature: "Replay" may mean "rerun particular steps," or it may mean "reproduce a single task's output." Where Pachyderm and DVC support git-like operations, Disdat eschews some version-control concepts, such as branching and merging, whose semantics for ML artifacts remain an open question (e.g., merging ML model weights between branches). We go into detail about the Data Versioning and Data Naming features below the table.
| Feature | Disdat | Metaflow | MLFlow | Pachyderm | DVC |
| --- | --- | --- | --- | --- | --- |
| Pipeline Language | Python | Python | YAML config (many) | YAML config (many) | Python* (many) |
| Workflow | Luigi-based | Custom | User-defined | Custom | Custom |
| API Language | Python | Python* (R, Java, REST) | Python* (R, Java, REST) | Python (Go, Scala) | None |
| Data Versioning | Yes (bundles) | Yes (data artifacts) | Yes (runs) | Yes (datums) | Yes (files in git) |
| Managed Data Naming | Yes | Some (namespaces) | No | No | No |
| Data Lineage | Yes | Yes | For models | Yes | Yes |
| Replay | Some (manual) | Yes (runs) | Yes (runs) | Yes (runs) | `dvc repro` |
| "Auto" Containerization | Yes | No | No* (experimental) | No | No |
| Cloud Execution | Yes (AWS) | Yes (AWS) | Databricks | No (Kubernetes) | No |
Data Versioning
Bundles: A Disdat Bundle is a logical data collection of literals and files. Disdat pipelines create a Bundle each time a task runs, but users can also create them "by hand" (e.g., for publishing business-process files, like Excel spreadsheets).
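To make the idea concrete, here is a minimal illustrative sketch of a bundle as a logical collection of literals and file references, stamped per run. The class and field names are our own for illustration; this is not Disdat's actual implementation.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass
class Bundle:
    """Illustrative stand-in for a Disdat bundle: a logical collection
    of literal values and file references, with a unique id per run."""
    human_name: str
    literals: Dict[str, Any] = field(default_factory=dict)
    files: List[str] = field(default_factory=list)
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# A pipeline task would produce one bundle per run, mixing literals and files.
b = Bundle(
    human_name="daily_features",
    literals={"row_count": 10000, "auc": 0.91},
    files=["features.parquet"],
)
```

Because every run mints a fresh id, two runs of the same task yield two distinct, independently addressable bundles.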
Many systems version data, but do so at either fixed points or with weaker naming semantics (below).
Task-level: A Metaflow DataArtifact is a programming object that refers to a single output artifact. A Metaflow Task object can point to many DataArtifacts.
Git-like (via commits): A Pachyderm Commit refers to a set of files that a pipeline container, having run, stores in a Pachyderm repository. Pachyderm binds a task to a single container, which means a container execution produces a commit. DVC relies on files in an extant git repository and git's commit mechanism.
Pipeline (via runs): An MLFlow Run can refer to many file outputs or artifacts; users "add" artifacts to a run via an API call (e.g., `log_artifact`).
Data Naming
Naming data and their outputs is hard. While systems like MLFlow capture parameters and data outputs via their APIs, users must still organize and manage their data themselves.
A few of the ways Disdat makes it easier are:
No fully qualified output paths: Disdat does not require users to determine fully qualified paths for data outputs. Unlike MLFlow or Metaflow, users never have to specify `s3://my-bucket/savin/tmp/s3demo/fruit` or `/Users/me/path/to/local/model` as outputs. Instead, users ask for "managed paths," allowing Disdat to manage local (or remote) storage for them. This has several benefits, among them reduced data sprawl and no risk of overwriting existing data.
Inside a Disdat task, users give only the basename of a file they want to include in the output.
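The mechanics can be sketched in a few lines. This is a self-contained illustration of the managed-path idea, not Disdat's real API: the caller supplies only a basename, and the system chooses a unique, fully qualified location under the context's storage root.

```python
import os
import tempfile
import uuid


def managed_path(context_dir: str, basename: str) -> str:
    """Illustrative managed-path allocator (not Disdat's actual API):
    place the file in a fresh, uniquely named bundle directory so the
    caller never chooses -- or collides on -- a fully qualified path."""
    bundle_dir = os.path.join(context_dir, str(uuid.uuid4()))
    os.makedirs(bundle_dir, exist_ok=True)
    return os.path.join(bundle_dir, basename)


context = tempfile.mkdtemp()  # stand-in for a context's local storage root
p1 = managed_path(context, "model.pkl")
p2 = managed_path(context, "model.pkl")

# The same basename never collides, because each allocation gets its own
# bundle directory under the managed storage root.
assert p1 != p2
```

A real system would also record the allocation in the bundle's metadata so lineage can map the file back to the task that produced it.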
Human names: Each Disdat bundle can be given a human name, such as `best_model_ever`. Users can then refer to this output in the CLI or the API by its human name and always get the latest version of `best_model_ever`. Disdat bundles also have two other kinds of names, a `processing_name` and a `uuid`, which are covered elsewhere in the documentation.
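The key property of human names is "latest wins" resolution. Here is a minimal sketch of that behavior (illustrative only; the class and method names are ours, not Disdat's):

```python
from collections import defaultdict


class NameIndex:
    """Illustrative index mapping a human name to every version ever
    published under it, newest last (not Disdat's implementation)."""

    def __init__(self):
        self._versions = defaultdict(list)

    def publish(self, human_name: str, bundle_uuid: str) -> None:
        # Publishing never overwrites: it appends a new version.
        self._versions[human_name].append(bundle_uuid)

    def latest(self, human_name: str) -> str:
        # Referring to a bundle by human name resolves to its newest version.
        return self._versions[human_name][-1]


idx = NameIndex()
idx.publish("best_model_ever", "uuid-001")
idx.publish("best_model_ever", "uuid-002")
assert idx.latest("best_model_ever") == "uuid-002"
```

Older versions remain addressable by their `uuid`, so "latest by name" is a convenience, not a destructive update.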
Data contexts: Often we want to keep our bundles organized by deployed pipeline, and by whether they are deployed in prod or in dev. Disdat supports this with contexts: a context is simply a namespace under which you can find all the bundles that have been placed there.
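Conceptually, a context is just an isolated namespace of bundles. A toy sketch of that isolation (illustrative names, not Disdat's implementation):

```python
class ContextStore:
    """Illustrative sketch of contexts as namespaces: each context
    holds its own set of bundle names, isolated from the others."""

    def __init__(self):
        self._contexts = {}

    def context(self, name: str) -> set:
        # Looking up a context creates the namespace on first use.
        return self._contexts.setdefault(name, set())


store = ContextStore()
store.context("prod").add("best_model_ever")
store.context("dev").add("experiment_42")

# Bundles published in one context are invisible from another, so dev
# experiments never shadow or clobber production artifacts.
assert "best_model_ever" in store.context("prod")
assert "best_model_ever" not in store.context("dev")
```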