DAGs

All Prism projects are structured as directed acyclic graphs, or DAGs.

Important: Users do NOT need to be familiar with DAGs in order to use Prism! This page is purely for educational purposes.

A DAG is a conceptual representation of a series of tasks and the dependencies between them. The acronym itself spells out the rules these tasks must follow:

  • Directed: Every dependency has a direction. In general, each task has an input and an output, and the output of one task can serve as the input of another.

  • Acyclic: Circular references between tasks are not permitted. That is, no task can feed into another task that, directly or indirectly, feeds back into the first. For example, Task A cannot feed into Task B if Task B feeds into Task A. Such a cycle could cause an infinite loop within your program or pipeline.

  • Graph: In mathematics, a graph is a set of nodes connected by edges. In the context of data science and data engineering, the nodes are tasks and the edges are the dependencies between them.

In this documentation, we use the "task" (or "module") and "dependency" nomenclature rather than the more mathematically formal "node" and "edge".
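To make the "acyclic" rule concrete, here is a minimal sketch in plain Python. It is not Prism's API or its internal implementation; the `is_acyclic` helper and the task names are hypothetical, chosen only for illustration. Tasks and dependencies are stored in a dictionary, and the standard-library `graphlib` module (Python 3.9+) is used to detect circular references.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(dependencies):
    """Return True if the task graph contains no circular references.

    `dependencies` maps each task to the set of tasks it depends on.
    This helper is hypothetical and is not part of Prism.
    """
    try:
        # prepare() raises CycleError if the graph contains a cycle.
        TopologicalSorter(dependencies).prepare()
    except CycleError:
        return False
    return True


# Task A feeds into Task B, which feeds into Task C: a valid DAG.
valid = {"task_b": {"task_a"}, "task_c": {"task_b"}}

# Task A feeds into Task B and Task B feeds into Task A: not allowed.
circular = {"task_b": {"task_a"}, "task_a": {"task_b"}}

print(is_acyclic(valid))     # True
print(is_acyclic(circular))  # False
```

Each dictionary key is a task (node), and each entry in its set is a dependency (edge) pointing at that task.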

Here is an example of a data pipeline represented as a DAG.

Each task has an input and an output, and dependencies are clearly laid out via directed arrows. In addition, there are no circular references between tasks (e.g., Create Feature Set 2 doesn't feed back into Process Data).
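Because a DAG has no circular references, its tasks can always be arranged in an order where every task runs after the tasks it depends on. The sketch below illustrates this with the standard-library `graphlib` module (Python 3.9+). Only "Process Data" and "Create Feature Set 2" are named in the text above; the other task names and the exact dependencies are assumptions made for illustration, and none of this reflects how Prism schedules tasks internally.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
# Only "Process Data" and "Create Feature Set 2" come from the text above;
# the remaining names and edges are assumed.
pipeline = {
    "Process Data": {"Load Data"},
    "Create Feature Set 1": {"Process Data"},
    "Create Feature Set 2": {"Process Data"},
    "Train Model": {"Create Feature Set 1", "Create Feature Set 2"},
}

# static_order() yields tasks so that dependencies always come first.
order = list(TopologicalSorter(pipeline).static_order())

for step, task in enumerate(order, start=1):
    print(step, task)
```

The two feature-set tasks do not depend on each other, so either may appear first in the printed order. Both always come after "Process Data" and before "Train Model".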

If you want to learn more about DAGs and how they are used throughout data science, check out this article.