
Bundles

The unit of data versioning


The bundle is a named collection of files and literals, and is the unit at which data is produced, versioned, and consumed. The data context (described next) is a view abstraction that gathers together one or more bundles and assists with managing bundles across multiple locations.
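For example, here is a minimal command-line sketch of working with contexts (the context name is illustrative and the exact argument forms may differ; see the CLI Reference for dsdt context and dsdt switch):

$ dsdt context my-project   # create a local data context named my-project
$ dsdt switch my-project    # make my-project the current context; new bundles are created here
$ dsdt ls                   # list the bundles in the current context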

Data science tasks often take files as input, along with auxiliary information. They also output another set of files together with some literals. The figure below illustrates how a task performing model selection takes multiple files as input (three training sets) along with information about each. The task itself is parameterized to limit the search space (the iterations and model types to explore). After selection, the task returns a file per model type, with information indicating the total number of models explored.

Each time this task runs, it reads the collection of inputs and produces a collection of outputs. Disdat bundles these files and literals into one logical collection of related data. You, the user, can choose a human name for each bundle, such as Selection_Data or Model_Ranks in the figure below.

Bundles are immutable. Each bundle is given a unique id (UUID). Even if you force Disdat to re-run the same task with the same parameters, you'll simply get another bundle with a new UUID.
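As a sketch, here is how creating a bundle by hand might look from the command line, assuming dsdt add takes a bundle name followed by a local file or directory (see the CLI Reference for the exact syntax):

$ dsdt add Selection_Data ./training_sets   # create a bundle named Selection_Data from local files
$ dsdt ls Selection_Data                    # list bundles with that human name
$ dsdt add Selection_Data ./training_sets   # adding again yields a second bundle: same human name, new UUID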

Bundle MetaData

When Disdat creates a bundle after running your task (you can also create bundles manually), it records useful metadata:

  • User, date: The user that ran the task and the date on which it ran.

  • githash, git url: The git hash and repository URL of the code that created the bundle.

  • task start, stop, and duration: Timestamps for when the task started and stopped, and how long it ran.

This looks like:

$ dsdt lineage 11461d56-ab36-490d-82cc-3548b7178c8e
------ Lineage @ depth 0 -----
processing name: ForecastGroups_2019_09_10_2019_09_10T07302____0ff443ea56
uuid: 11461d56-ab36-490d-82cc-3548b7178c8e
creation date: 2019-09-10 00:30:29.505912
code repo: https://github.intuit.com/data-science/iep-pipeline
code hash: 2b07f8dd02bdea160481673fde6e5abb111a6ae5
code method:
git commit URL: https://github.intuit.com/data-science/iep-pipeline/commit/2b07f8dd02bdea160481673fde6e5abb111a6ae5
code branch: HEAD
Start 1568100628.7684238 Stop 1568100629.5045693 Duration 0.7361454963684082
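To find the UUID to pass to dsdt lineage, you would typically list the bundles in the current context first (a verbose flag may be needed to show UUIDs; flags vary, so they are omitted in this sketch):

$ dsdt ls                                             # list the bundles in the current context
$ dsdt lineage 11461d56-ab36-490d-82cc-3548b7178c8e   # then inspect the lineage of a specific bundle version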

Names: Your bundle is given three names by default: a human name, a processing name, and a UUID.

Lineage: Your bundle keeps a reference to all the input bundles used to create it.

Parameters: Your bundle keeps all the parameters of the task.
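A hedged sketch of inspecting these from the command line (the argument forms are assumptions; dsdt cat and dsdt lineage are listed in the CLI Reference):

$ dsdt cat Model_Ranks                                # print the files and literals a bundle holds, looked up by human name
$ dsdt lineage 11461d56-ab36-490d-82cc-3548b7178c8e   # show the recorded inputs and parameters for one bundle version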
(Figure caption: Data science tasks naturally input and output sets of files and parameters.)
(Figure caption: Disdat versions the input and output collections as Bundles.)