Claudia Misale received her PhD in Computer Science from the University of Torino on May 11, 2017, defending her thesis entitled “PiCo: A Domain-Specific Language for Data Analytics Pipelines”.
In her thesis, Claudia reviews and analyses state-of-the-art frameworks for data analytics, proposes a methodology to compare their expressiveness, and advocates the design of a novel C++ DSL for big data analytics called PiCo (Pipeline Composition). Unlike Spark, Flink and similar frameworks, PiCo is fully polymorphic and exhibits a clear separation between data and transformations. This, together with the careful C++/FastFlow implementation, eases application development, since data scientists can experiment with pipelines of different transformations without having to adapt the data type (or its memory layout). Types are inferred along the transformation pipeline in a “fluent” programming fashion. The clear separation between transformations, data types and their memory layout makes it possible to optimise data movement, memory usage and, ultimately, performance. Applications developed with PiCo exhibit a 10x smaller memory footprint than their Spark/Flink equivalents. The run-time support, written entirely in C++/FastFlow, makes it possible to generate the network of run-time support processes directly from the data processing pipeline, thus achieving the maximum scalability allowed by true data dependencies (well beyond the simple master-worker paradigm of Spark and Flink). Since PiCo is written in C++11/14, it is already open to hosting native GPU offloading, which paves the way for the convergence of analytics and machine learning. See more at DOI:10.5281/zenodo.579753
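To give a feel for what "fluent" composition with inferred types can look like in C++14, here is a minimal sketch. It is not PiCo's actual API: the names Stage, make_stage, then, run and run_example are hypothetical and were chosen only for this illustration. The point is that each transformation is written independently of the data it will process, and the compiler deduces every intermediate type along the chain.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical wrapper around a callable; illustrative only, not PiCo's API.
template <typename F>
struct Stage {
    F fn;

    // Chain a further transformation. The composed type is deduced by the
    // compiler, so the caller never names intermediate data types.
    template <typename G>
    auto then(G next) const {
        F first = fn;
        auto composed = [first, next](const auto& x) { return next(first(x)); };
        return Stage<decltype(composed)>{composed};
    }

    // Apply the whole chain to each element of a collection.
    template <typename In>
    auto run(const std::vector<In>& input) const {
        // Output element type is inferred from the chain itself.
        using Out = std::decay_t<decltype(fn(std::declval<const In&>()))>;
        std::vector<Out> output;
        output.reserve(input.size());
        for (const In& item : input) output.push_back(fn(item));
        assert(output.size() == input.size());
        return output;
    }
};

// Hypothetical helper that starts a chain from a single callable.
template <typename F>
Stage<F> make_stage(F f) {
    return Stage<F>{std::move(f)};
}

// Example chain: string -> length -> doubled length -> string.
inline std::vector<std::string> run_example(const std::vector<std::string>& words) {
    auto pipeline = make_stage([](const std::string& s) { return s.size(); })
                        .then([](std::size_t n) { return n * 2; })
                        .then([](std::size_t n) { return std::to_string(n); });
    return pipeline.run(words);
}
```

Calling run_example({"a", "bcd"}) walks each word through the three stages in order. Swapping or inserting a stage only requires that adjacent stages agree on their types, which the compiler checks at build time.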
Claudia is flying to New York today to start her career as a scientist at the IBM T.J. Watson Research Center, within Data-Centric Systems Solutions.
Congratulations, Claudia. It has been a pleasure working with you for the past four years.
PiCo: A Domain-Specific Language for Data Analytics Pipelines
Aldinucci, Marco; Tremblay, Guy
In the world of Big Data analytics, there is a series of tools aiming at simplifying the programming of applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only informal (and often confusing) semantics are generally provided, all share a common underlying model, namely the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects of Big Data analytics tools from a high-level perspective. This analysis can be considered a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low-level with high-level aspects, as often happens in state-of-the-art Big Data analytics frameworks.
From the user-level perspective, we think that clearer and simpler semantics are preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model, implemented as a Domain-Specific Language that sits on top of a stack of layers forming a prototypical framework for Big Data analytics.
The contribution of this thesis is twofold. First, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As a result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm, and we show how the analyzed tools fit in each level.
Second, we propose a programming environment based on this layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, essentially a DAG composition of processing elements. This model is intended to give the user a single interface for both stream and batch processing, completely hiding data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of bringing C++ into the Big Data world.
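The idea of one Pipeline definition serving both batch and stream processing can be sketched as follows. This is an illustrative C++14 toy, not the PiCo API: LinePipeline, add, to, run_batch, run_stream and make_shout_pipeline are hypothetical names, the stages are restricted to string-to-string functions, and only a linear chain (the simplest kind of DAG) is modelled, running sequentially rather than on FastFlow.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pipeline of string -> string stages (not PiCo's real API).
class LinePipeline {
public:
    using StageFn = std::function<std::string(const std::string&)>;
    using SinkFn = std::function<void(const std::string&)>;

    // Append a single processing stage.
    LinePipeline& add(StageFn stage) {
        stages_.push_back(std::move(stage));
        return *this;
    }

    // Compose with another pipeline by appending all of its stages.
    // Taken by value so that composing a pipeline with itself is safe.
    LinePipeline& to(LinePipeline other) {
        stages_.insert(stages_.end(), other.stages_.begin(), other.stages_.end());
        return *this;
    }

    // Batch mode: process a bounded collection and return the results.
    std::vector<std::string> run_batch(const std::vector<std::string>& items) const {
        std::vector<std::string> out;
        out.reserve(items.size());
        for (const auto& item : items) out.push_back(apply(item));
        return out;
    }

    // Stream mode: process items one at a time as they are read,
    // handing each result to the sink immediately.
    void run_stream(std::istream& source, const SinkFn& sink) const {
        assert(sink && "sink must be callable");
        std::string line;
        while (std::getline(source, line)) sink(apply(line));
    }

private:
    // Push one item through every stage in order.
    std::string apply(const std::string& item) const {
        std::string value = item;
        for (const auto& stage : stages_) value = stage(value);
        return value;
    }

    std::vector<StageFn> stages_;
};

// Example: two small pipelines composed into one.
inline LinePipeline make_shout_pipeline() {
    LinePipeline upper;
    upper.add([](const std::string& s) {
        std::string r = s;
        std::transform(r.begin(), r.end(), r.begin(),
                       [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
        return r;
    });
    LinePipeline exclaim;
    exclaim.add([](const std::string& s) { return s + "!"; });
    return upper.to(exclaim);
}
```

The same object returned by make_shout_pipeline() can be passed a std::vector through run_batch or an input stream through run_stream. The user code describes only the operations, while how items are fetched and delivered stays inside the pipeline class.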