Cooking With a Laptop?
How are data analysis and the collection of provenance like cooking?
Data analysis begins with datasets, like those collected in the field and laboratory. These datasets are the foundation for everything that follows; they are the raw ingredients of the meal.
Next, analyses are performed on these datasets. There is a wide variety of possible analyses to perform, comparable to the multitude of ways to clean, slice, and flavor even the most basic combinations of ingredients.
But have you ever tried to make a dish with only a list of the ingredients? While stews and smoothies may work out, many dishes require certain steps to be carried out in a certain order or in a certain manner. Therefore, it is much easier (especially for an inexperienced chef like me!) if the recipe is more detailed.
Similarly, the logic of a data analysis is difficult to follow if only the raw data is provided, or if the description of the statistical analyses used is not sufficiently detailed. In the data analysis world, this detailed “recipe” is called provenance: information that describes the steps of the analysis and the history behind the final results. The goal of collecting and visualizing provenance, like that of a recipe, is to facilitate the reproducibility of the process that it describes.
Ecologists typically use the programming language R to perform data analyses. My mentors, Emery Boose, Barbara Lerner, and Matthew Lau, have been collecting and visualizing data provenance for R scripts using the tools RDataTracker and DDG Explorer. A script is a set of instructions that can be applied to any input of the same format to produce a corresponding output. RDataTracker parses an R script, converting each line into a process node in a Data Derivation Graph (DDG). It records files accessed by the script and intermediate datasets as data nodes, and the relationships between process nodes and data nodes become the edges of the graph. Nodes and edges are written to a Prov-JSON file, which provides a standard textual representation of a DDG. That file is then read by DDG Explorer, an application that lets the user interact with a visual representation of the DDG, with expandable and collapsible nodes, associated line numbers, and views of intermediate data nodes.
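To make the “recipe card” idea concrete, here is a rough Python sketch of what a tiny DDG could look like in PROV-JSON form. The two-line R script, the node identifiers, and the attribute names inside each node are invented for illustration, and the files RDataTracker actually writes carry more detail; the general shape, though, is the standard one: activities for process nodes, entities for data nodes, and used/wasGeneratedBy relations for the edges.

```python
import json

# Hypothetical provenance for an imagined two-line R script:
#   data <- read.csv("plots.csv")
#   means <- aggregate(biomass ~ plot, data, mean)
# Process nodes map to PROV "activities", data nodes to "entities",
# and the graph's edges to "used" / "wasGeneratedBy" relations.
ddg = {
    "activity": {                       # process nodes
        "p1": {"name": "read.csv",  "line": 1},
        "p2": {"name": "aggregate", "line": 2},
    },
    "entity": {                         # data nodes
        "d1": {"name": "plots.csv", "type": "File"},
        "d2": {"name": "data",      "type": "Data"},
        "d3": {"name": "means",     "type": "Data"},
    },
    "used": {                           # edges: a process consumed data
        "u1": {"prov:activity": "p1", "prov:entity": "d1"},
        "u2": {"prov:activity": "p2", "prov:entity": "d2"},
    },
    "wasGeneratedBy": {                 # edges: a process produced data
        "g1": {"prov:entity": "d2", "prov:activity": "p1"},
        "g2": {"prov:entity": "d3", "prov:activity": "p2"},
    },
}

# Serializing to JSON gives the kind of textual file a graph viewer can load.
print(json.dumps(ddg, indent=2))
```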
However, R is not the only programming language used for data analysis; Python is another popular choice. The two languages have different strengths, and in complicated analyses those strengths complement each other.
Think of Python and R as two chefs with different preparation techniques, tasked with making a single cake. One way to divide the work might be to assign each chef a different sub-process: Chef Python is responsible for making the batter, while Chef R is responsible for the icing. To work efficiently, the two chefs need to communicate with each other to avoid repeating steps.
A workflow is a sequence of sub-processes required to complete a process. For example, the workflow for baking a cake includes the sub-processes of mixing the batter and making the icing. The goal of this project is to provide provenance not only for individual Python and R scripts, but also for complete workflows consisting of multiple scripts. Provenance of an entire workflow provides a general overview and abstraction of the analysis.
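As a sketch of what such a workflow might look like, imagine two small scripts, one in Python and one in R, chained together through an intermediate file. The script and file names here are made up; the point is simply that each step’s output is the next step’s input, which is exactly the relationship a workflow-level DDG needs to capture.

```python
import subprocess

# Hypothetical two-step workflow: Chef Python makes the batter,
# Chef R makes the icing, and they communicate through files.
subprocess.run(
    ["python", "make_batter.py", "ingredients.csv", "batter.csv"],
    check=True,   # step 1: raw data -> intermediate result
)
subprocess.run(
    ["Rscript", "make_icing.R", "batter.csv", "cake.csv"],
    check=True,   # step 2: intermediate result -> final output
)
```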
One challenging aspect of provenance collection is the concept of granularity: the level of abstraction at which provenance is collected. Adjusting the granularity allows provenance to be as detailed or as abstract as the viewer needs. If you are already an experienced chef, you will not need details on how to separate egg whites from yolks or the best technique for whisking. However, an inexperienced chef might need step-by-step instructions.
In the data analysis world, fine-grained provenance provides information useful for debugging a script line by line, because each operation performed on the dataset can be examined individually to check that its result is as intended. Coarse-grained provenance, on the other hand, provides a simple flowchart suited to high-level understanding and clear communication.
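Here is a toy illustration of the difference (not the output of any particular tool): the same small analysis recorded at two granularities, with made-up step and file names.

```python
# Fine-grained: one record per statement, handy when debugging a script.
fine_grained = [
    {"step": "read file", "inputs": ["plots.csv"], "outputs": ["data"]},
    {"step": "drop NAs",  "inputs": ["data"],      "outputs": ["clean"]},
    {"step": "fit model", "inputs": ["clean"],     "outputs": ["model"]},
]

# Coarse-grained: one record for the whole script, enough for a flowchart.
coarse_grained = [
    {"step": "analysis.R", "inputs": ["plots.csv"], "outputs": ["model.rds"]},
]
```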
The first software I tried for Python provenance was StarFlow, which collects provenance from Python scripts at a coarse granularity, recording only when files are accessed. Like an over-simplified recipe, this did not provide much new insight for such a simple workflow. I therefore also tried another tool, NoWorkflow, which provides a much finer default granularity. Now the ongoing challenge is to make the information adaptable to the viewer's needs: control-flow loops, functions, and the scripts themselves become collapsible nodes, producing a dynamic visual representation of the workflow in DDG Explorer.
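To picture what a collapsible node is, consider a hypothetical little Python script like the one below. The whole script, the function, and the loop each correspond to a node that can be collapsed to a single box or expanded to show the statements inside; the file names and cleaning logic are invented for illustration.

```python
import csv

def clean(row):
    # The function body can appear as its own expandable group of nodes.
    return {key: value.strip() for key, value in row.items()}

# Reading the input file shows up as a data node for "plots.csv".
with open("plots.csv", newline="") as f:
    rows = list(csv.DictReader(f))

cleaned = []
for row in rows:                 # The loop can collapse into one node,
    cleaned.append(clean(row))   # or expand to one node per iteration.

# Writing the result produces a data node for the output file.
with open("clean.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(cleaned)
```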
Once Python scripts can be linked together into a workflow, the next step is to integrate provenance collection tools for Python and R into a single workflow. This will allow data scientists to collect provenance for complete and complex workflows. And when recipes get big, the results get even bigger and better. Yum.
Jen is a rising senior at Middlebury College studying Computer Science and Molecular Biology.