Bundles

The unit of data versioning

The bundle is a named collection of files and literals, and is the unit at which data is produced, versioned, and consumed. The data context (described next) is a view abstraction that gathers together one or more bundles, and assists with managing bundles across multiple locations.

Data science tasks often take files as input along with auxiliary information, and they output another set of files with some literals. The figure below illustrates how a model-selection task takes multiple files as input (three training sets) along with information about each. The task itself is parameterized to limit the search space (the number of iterations and the model types to explore). After selection, the task returns one file per model type, plus information indicating the total number of models explored.

Each time this task runs, it reads the collection of inputs and produces a collection of outputs. Disdat bundles these files and literals into one logical collection of related data. You, the user, can choose a human-readable name for each bundle, like Selection_Data or Model_Ranks in the figure below.

Bundles are immutable. Each bundle is given a unique ID (a UUID). Even if you force Disdat to re-run the same task with the same parameters, the existing bundle is not changed; you get another bundle with its own UUID.
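The behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Disdat's implementation or API: the Bundle class, the run_model_selection function, the file names, and the way the models_explored literal is computed are all hypothetical placeholders.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from types import MappingProxyType
from typing import Mapping, Tuple
from uuid import uuid4


@dataclass(frozen=True)
class Bundle:
    """Hypothetical stand-in for a bundle; NOT Disdat's actual class."""
    human_name: str
    files: Tuple[str, ...]
    literals: Mapping[str, object]
    # Every bundle gets a fresh UUID when it is created.
    uuid: str = field(default_factory=lambda: str(uuid4()))


def run_model_selection(training_files, iterations, model_types):
    """Pretend task: returns a brand-new output bundle on every call."""
    outputs = tuple(f"{model}_ranks.csv" for model in model_types)
    # Placeholder arithmetic, only here to produce a literal.
    literals = MappingProxyType({"models_explored": iterations * len(model_types)})
    return Bundle("Model_Ranks", outputs, literals)


inputs = ("train_a.csv", "train_b.csv", "train_c.csv")
first = run_model_selection(inputs, 10, ("linear", "tree"))
second = run_model_selection(inputs, 10, ("linear", "tree"))

# Same task, same parameters, same contents, but two distinct bundles.
print(first.files == second.files)
print(first.uuid != second.uuid)

# Attempting to modify a bundle in place fails.
try:
    first.human_name = "Renamed"
except FrozenInstanceError:
    print("bundles cannot be modified in place")
```

In this sketch the frozen dataclass plays the role of immutability, and the per-instance UUID is what distinguishes two runs whose contents are otherwise identical.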

Bundle Metadata

When Disdat creates a bundle after running your task (you can also create bundles manually), it records the following metadata:

  • Names: Your bundle is given three names by default: a human name, a processing name, and a UUID.

  • Lineage: Your bundle keeps a reference to all the input bundles used to create it.

  • Parameters: Your bundle keeps all the parameters of the task.

  • User, date: The user who ran the task and the date the bundle was created.

  • githash, git url: The git hash and repository URL of the code that created the bundle.

  • task start, stop, and duration: Timestamps for when and how long the task ran.

For example, the dsdt lineage command prints much of this metadata for a bundle:

$ dsdt lineage 11461d56-ab36-490d-82cc-3548b7178c8e
------ Lineage @ depth 0 -----
processing name: ForecastGroups_2019_09_10_2019_09_10T07302____0ff443ea56
uuid: 11461d56-ab36-490d-82cc-3548b7178c8e
creation date: 2019-09-10 00:30:29.505912
code repo: https://github.intuit.com/data-science/iep-pipeline
code hash: 2b07f8dd02bdea160481673fde6e5abb111a6ae5
code method:
git commit URL: https://githubs.com/data-science/iep-pipeline/commit/2b07f8dd02bdea160481673fde6e5abb111a6ae5
code branch: HEAD
Start 1568100628.7684238 Stop 1568100629.5045693 Duration 0.7361454963684082
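The timing line in the output above can be checked with a short Python sketch. It assumes the Start and Stop values are Unix epoch seconds, which is an inference from the values rather than something stated here, and the parse_timing helper is hypothetical, not part of Disdat.

```python
from datetime import datetime, timezone

# Timing line copied verbatim from the lineage output above.
timing_line = "Start 1568100628.7684238 Stop 1568100629.5045693 Duration 0.7361454963684082"


def parse_timing(line):
    """Hypothetical helper: split 'Start <t0> Stop <t1> Duration <d>' into floats."""
    tokens = line.split()
    return {tokens[i].lower(): float(tokens[i + 1]) for i in range(0, len(tokens), 2)}


timing = parse_timing(timing_line)

# The reported duration should equal stop - start, up to float rounding.
elapsed = timing["stop"] - timing["start"]
print(abs(elapsed - timing["duration"]) < 1e-6)  # True

# Assumption: the values are Unix epoch seconds, so they convert to a UTC time.
started = datetime.fromtimestamp(timing["start"], tz=timezone.utc)
print(started.isoformat(timespec="seconds"))  # 2019-09-10T07:30:28+00:00
```

Under that assumption the start time is 07:30 UTC, which matches the time fragment embedded in the processing name. The creation date of 00:30 would then be a local time seven hours behind UTC.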
