Bundles
The unit of data versioning
The bundle is a named collection of files and literals, and is the unit at which data is produced, versioned, and consumed. The data context (described next) is a view abstraction that gathers together one or more bundles and assists with managing bundles across multiple locations.
Data science tasks often take files as input along with auxiliary information, and they output another set of files along with some literals. The figure below illustrates how a task performing model selection takes multiple files as input (3 training sets) together with information about each. The task itself is parameterized to limit the search space (iterations and model types to explore). After selection, the task returns one file per model type, with information indicating the total number of models explored.
Each time this task runs, it reads the collection of inputs and produces a collection of outputs. Disdat bundles these files and literals into one logical collection of related data. You, the user, can choose a human name for each bundle, like Selection_Data or Model_Ranks in the figure below.
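The model selection example can be sketched in plain Python to make the "files plus literals" idea concrete. This is only an illustration, not Disdat's API: the `BundleSketch` class, the `select_models` function, the file names, and the row counts are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

Literal = Union[int, float, str, bool]


@dataclass
class BundleSketch:
    """Hypothetical stand-in for a bundle: a named set of files and literals."""
    human_name: str
    files: List[str] = field(default_factory=list)
    literals: Dict[str, Literal] = field(default_factory=dict)


# Input bundle: three training sets plus information about each (made-up values).
selection_data = BundleSketch(
    human_name="Selection_Data",
    files=["train_a.csv", "train_b.csv", "train_c.csv"],
    literals={"rows_a": 1000, "rows_b": 2500, "rows_c": 400},
)


def select_models(inputs: BundleSketch, iterations: int, model_types: List[str]) -> BundleSketch:
    """Toy task: returns one file per model type and a count of models explored."""
    if not inputs.files:
        raise ValueError("model selection needs at least one training set")
    explored = iterations * len(model_types)
    return BundleSketch(
        human_name="Model_Ranks",
        files=[f"{m}_rank.csv" for m in model_types],
        literals={"models_explored": explored},
    )


model_ranks = select_models(selection_data, iterations=10, model_types=["tree", "linear"])
print(model_ranks.human_name, model_ranks.files, model_ranks.literals)
```

The task parameters (`iterations`, `model_types`) are kept separate from the input bundle, mirroring the distinction the text draws between a task's inputs and its parameterization.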
When Disdat creates a bundle after running your task (you can also create bundles manually), it records useful metadata:
User, date: The user who ran the task and the date it ran.
githash, git url: The git hash and repository URL of the code that created the bundle.
task start, stop, and duration: Timestamps for when and how long the task ran.
This looks like:
Your bundle is given three names by default: human name, processing name, and a uuid.
Your bundle keeps a reference to all the input bundles used to create it.
Your bundle keeps all the parameters of the task.
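As a rough mental model, the metadata described above could be pictured as a single record per bundle. The sketch below is an assumption for illustration only: `BundleRecord`, its field names, the processing-name strings, and the placeholder user, hash, and URL values are hypothetical and do not reflect Disdat's actual schema.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


@dataclass
class BundleRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    human_name: str        # name chosen by the user
    processing_name: str   # second default name (format assumed here)
    user: str
    git_hash: str
    git_url: str
    start: datetime
    stop: datetime
    parameters: Dict[str, Any] = field(default_factory=dict)
    input_uuids: List[str] = field(default_factory=list)  # lineage
    bundle_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))

    @property
    def duration(self) -> timedelta:
        """How long the task ran."""
        return self.stop - self.start


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

selection_data = BundleRecord(
    human_name="Selection_Data",
    processing_name="load_training_sets",   # placeholder
    user="alice",                           # placeholder
    git_hash="0000000",                     # placeholder
    git_url="https://example.com/repo.git", # placeholder
    start=t0,
    stop=t0 + timedelta(minutes=1),
)

model_ranks = BundleRecord(
    human_name="Model_Ranks",
    processing_name="select_models",        # placeholder
    user="alice",
    git_hash="0000000",
    git_url="https://example.com/repo.git",
    start=t0 + timedelta(minutes=1),
    stop=t0 + timedelta(minutes=6),
    parameters={"iterations": 10, "model_types": ["tree", "linear"]},
    input_uuids=[selection_data.bundle_uuid],  # reference to the input bundle
)

print(model_ranks.human_name, model_ranks.processing_name, model_ranks.bundle_uuid)
print(model_ranks.duration)
```

Storing the input bundles' uuids rather than their human names keeps the lineage reference unambiguous in this sketch, since many bundles may share a human name while each has its own uuid.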