# Claudia Misale

Formerly a PhD student in the Parallel Computing group; since June 2017, a Research Staff Member in the Cloud and Cognitive Solutions, Data Centric Solutions (DCS) group at the IBM T.J. Watson Research Center (NY).

Old webpage. Last updated on June 8th, 2017.

## Summary

Claudia Misale received her PhD from the Computer Science Department of the University of Torino in May 2017. She was a member of the Parallel Computing Alpha group and a research associate at the University of Torino from 2013 to 2017.

In her thesis, she proposed a unifying model for Big Data Analytics (BDA) tools, based on the dataflow model, according to which any application expressed in one of these frameworks can be formulated, at an abstract level, as a graph of functional-style operators. The formalization of this layered model led to the design and implementation of PiCo (Pipeline Composition), a C++ Domain Specific Language (DSL) for implementing BDA applications as compositions of pipelines and operators that transform data (e.g., map or reduce operators).
This model is intended to give the user a single interface for both stream and batch processing, completely hiding data management so that the programmer focuses only on operations. The operators exploit shared-memory parallelism by mapping an application onto a composition of FastFlow farms and pipelines.

In particular, the farm pattern is exploited to express parallelism both between operators (e.g., computing multiple statistical measures on the same data) and within an operator (e.g., a parallel implementation of the map function).
Data collections to be processed, either (bounded) data sets or (unbounded) streams, are split into *micro-batches*, which are streamed to the application as indivisible items.
PiCo can be found at https://github.com/alpha-unito/PiCo.

She is now a Research Staff Member in the Cloud and Cognitive Solutions, Data Centric Solutions (DCS) group at IBM T.J. Watson Research Center (NY) where she is working on optimizing the runtime of Apache Spark for IBM Platforms.

She has co-authored papers in international journals and conference proceedings.
She participated in the European FP7 STREP projects ParaPhrase and REPARA, and was a research intern in the High Performance System Software research group of the Data Centric Systems Department at IBM T.J. Watson Research Center in Yorktown Heights (NY), working on optimizing the Spark big data analytics framework for Data-Centric Systems (DCS).
Her research activities started from the optimization and parallelization of bioinformatics tools. Her research now focuses on high-performance computing, in particular on high-level programming models and patterns for multi-core, many-core, and distributed platforms, and on high-performance big data analytics for HPC platforms.

## Claudia Misale’s Research Interests

• High-level development tools and languages for parallel computing
• High-level models and patterns for distributed computing
• High-performance big data analytics for HPC platforms
• Parallel programming models
• High-performance tools for bioinformatics

## Achievements and Awards

• IBM Scholarship Program 2015/2016: $10,000 scholarship award for accomplishments during her internship at IBM T.J. Watson Research Center.
• Best paper award on real-world applications for “Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity” at the 22nd Euromicro Intl. Conference on Parallel, Distributed, and Network-Based Processing (PDP 2014), Feb. 2014, IEEE.
• Three-year PhD grant from the Italian Ministry of Education, Universities and Research; ranked 2nd in the competitive selection at the Computer Science Department of the University of Torino, 2012.
• MSc in Computer Science, summa cum laude (first-class honours degree), 2012.

## Publications

### 2018

• C. Misale, M. Drocco, G. Tremblay, and M. Aldinucci, “PiCo: a Novel Approach to Stream Data Analytics,” in Proc. of Euro-Par Workshops: 1st Intl. Workshop on Autonomic Solutions for Parallel and Distributed Data Stream Processing (Auto-DaSP 2017), Santiago de Compostela, Spain, 2018. doi:10.1007/978-3-319-75178-8_10
[BibTeX] [Abstract]

In this paper, we present a new C++ API with a fluent interface called PiCo (Pipeline Composition). PiCo’s programming model aims at making easier the programming of data analytics applications while preserving or enhancing their performance. This is attained through three key design choices: 1) unifying batch and stream data access models, 2) decoupling processing from data layout, and 3) exploiting a stream-oriented, scalable, efficient C++11 runtime system. PiCo proposes a programming model based on pipelines and operators that are polymorphic with respect to data types in the sense that it is possible to re-use the same algorithms and pipelines on different data models (e.g., streams, lists, sets, etc.). Preliminary results show that PiCo can attain better performances in terms of execution times and hugely improve memory utilization when compared to Spark and Flink in both batch and stream processing.

@inproceedings{pico:autodasp:17,
abstract = {In this paper, we present a new C++ API with a fluent interface called PiCo (Pipeline Composition). PiCo's programming model aims at making easier the programming of data analytics applications while preserving or enhancing their performance. This is attained through three key design choices: 1) unifying batch and stream data access models, 2) decoupling processing from data layout, and 3) exploiting a stream-oriented, scalable, efficient C++11 runtime system. PiCo proposes a programming model based on pipelines and operators that are polymorphic with respect to data types in the sense that it is possible to re-use the same algorithms and pipelines on different data models (e.g., streams, lists, sets, etc.). Preliminary results show that PiCo can attain better performances in terms of execution times and hugely improve memory utilization when compared to Spark and Flink in both batch and stream processing.},
address = {Santiago de Compostela, Spain},
author = {Claudia Misale and Maurizio Drocco and Guy Tremblay and Marco Aldinucci},
booktitle = {Proc. of Euro-Par Workshops: 1st Intl. Workshop on Autonomic Solutions for Parallel and Distributed Data Stream Processing (Auto-DaSP 2017)},
date-modified = {2018-01-21 16:08:28 +0000},
doi = {10.1007/978-3-319-75178-8_10},
month = aug,
publisher = {Springer},
series = {{LNCS}},
title = {PiCo: a Novel Approach to Stream Data Analytics},
volume = {10659},
year = {2018},
bdsk-url-1 = {https://dx.doi.org/10.1007/978-3-319-75178-8_10}
}

• M. Aldinucci, M. Drocco, C. Misale, and G. Tremblay, “Languages for Big Data analysis,” in Encyclopedia of Big Data Technologies, S. Sakr and A. Zomaya, Eds., Springer, 2018.
[BibTeX] [Abstract]

In this chapter, some of the most common tools for Big Data analytics are surveyed, inter-alia, Apache Spark, Flink, Storm, and Beam. They are compared against well-defined features concerning programming model (language expressivity and semantics), and execution model (parallel behaviour and run-time support). The implementation of a running example is provided for all of them.

@inbook{bigdata:encyclopedia:18,
abstract = {In this chapter, some of the most common tools for Big Data analytics are surveyed, inter-alia, Apache Spark, Flink, Storm, and Beam. They are compared against well-defined features concerning programming model (language expressivity and semantics), and execution model (parallel behaviour and run-time support). The implementation of a running example is provided for all of them.},
author = {Marco Aldinucci and Maurizio Drocco and Claudia Misale and Guy Tremblay},
chapter = {Languages for Big Data analysis},
date-modified = {2018-03-06 22:21:46 +0000},
editor = {S. Sakr and A. Zomaya},
publisher = {Springer},
title = {Encyclopedia of Big Data Technologies},
year = {2018}
}

### 2017

• C. Misale, “PiCo: A Domain-Specific Language for Data Analytics Pipelines,” PhD Thesis, 2017. doi:10.5281/zenodo.579753

In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models—for which only informal (and often confusing) semantics is generally provided—all share a common underlying model, namely, the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects about Big Data analytics tools from a high level perspective. This analysis can be considered as a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low level with high level aspects, as we are often used to see in state-of-the-art Big Data analytics frameworks. From the user-level perspective, we think that a clearer and simpler semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language, that is on top of a stack of layers that build a prototypical framework for Big Data analytics. The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As a result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level.
Second, we propose a programming environment based on such layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user a unique interface for both stream and batch processing, hiding completely data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world.

@phdthesis{17:pico:misale:thesis,
abstract = {In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models---for which only informal (and often confusing) semantics is generally provided---all share a common underlying model, namely, the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects about Big Data analytics tools from a high level perspective. This analysis can be considered as a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low level with high level aspects, as we are often used to see in state-of-the-art Big Data analytics frameworks.
From the user-level perspective, we think that a clearer and simple semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language, that is on top of a stack of layers that build a prototypical framework for Big Data analytics.
The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As a result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level.
Second, we propose a programming environment based on such layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user a unique interface for both stream and batch processing, hiding completely data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world.},
author = {Claudia Misale},
date-modified = {2017-06-19 15:55:21 +0000},
doi = {10.5281/zenodo.579753},
keywords = {fastflow, rephrase, toreador, repara, paraphrase},
month = may,
school = {Computer Science Department, University of Torino},
title = {PiCo: A Domain-Specific Language for Data Analytics Pipelines},
url = {https://iris.unito.it/retrieve/handle/2318/1633743/320170/Misale_thesis.pdf},
year = {2017},
bdsk-url-1 = {https://iris.unito.it/retrieve/handle/2318/1633743/320170/Misale_thesis.pdf},
bdsk-url-2 = {http://dx.doi.org/10.5281/zenodo.579753}
}

• C. Misale, M. Drocco, M. Aldinucci, and G. Tremblay, “A Comparison of Big Data Frameworks on a Layered Dataflow Model,” Parallel Processing Letters, vol. 27, iss. 01, pp. 1-20, 2017. doi:10.1142/S0129626417400035

In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only informal (and often confusing) semantics is generally provided, all share a common underlying model, namely, the Dataflow model. The Dataflow model we propose shows how various tools share the same expressiveness at different levels of abstraction. The contribution of this work is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm), thus making it easier to understand high-level data-processing applications written in such frameworks. Second, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level.

@article{17:bigdatasurvey:PPL,
abstract = {In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only informal (and often confusing) semantics is generally provided, all share a common underlying model, namely, the Dataflow model. The Dataflow model we propose shows how various tools share the same expressiveness at different levels of abstraction. The contribution of this work is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm), thus making it easier to understand high-level data-processing applications written in such frameworks. Second, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level.},
author = {Misale, Claudia and Drocco, Maurizio and Aldinucci, Marco and Tremblay, Guy},
date-modified = {2017-12-12 12:16:32 +0000},
date-published = {March 2017},
doi = {10.1142/S0129626417400035},
eprint = {http://www.worldscientific.com/doi/pdf/10.1142/S0129626417400035},
journal = {Parallel Processing Letters},
number = {01},
pages = {1--20},
title = {A Comparison of Big Data Frameworks on a Layered Dataflow Model},
url = {https://iris.unito.it/retrieve/handle/2318/1626287/303421/preprintPPL_4aperto.pdf},
volume = {27},
year = {2017},
bdsk-url-1 = {https://iris.unito.it/retrieve/handle/2318/1626287/303421/preprintPPL_4aperto.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1142/S0129626417400035}
}

### 2016

• C. Misale, M. Drocco, M. Aldinucci, and G. Tremblay, “A Comparison of Big Data Frameworks on a Layered Dataflow Model,” in Proc. of Intl. Workshop on High-Level Parallel Programming (HLPP), Muenster, Germany, 2016, pp. 1-19. doi:10.5281/zenodo.321866

In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only informal (and often confusing) semantics is generally provided, all share a common underlying model, namely, the Dataflow model. The Dataflow model we propose shows how various tools share the same expressiveness at different levels of abstraction. The contribution of this work is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm), thus making it easier to understand high-level data-processing applications written in such frameworks. Second, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level.

@inproceedings{16:bigdatasurvey:hlpp,
abstract = {In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only informal (and often confusing) semantics is generally provided, all share a common underlying model, namely, the Dataflow model. The Dataflow model we propose shows how various tools share the same expressiveness at different levels of abstraction. The contribution of this work is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm), thus making it easier to understand high-level data-processing applications written in such frameworks. Second, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level.},
author = {Claudia Misale and Maurizio Drocco and Marco Aldinucci and Guy Tremblay},
booktitle = {Proc. of Intl. Workshop on High-Level Parallel Programming (HLPP)},
date-modified = {2017-12-12 14:49:47 +0000},
doi = {10.5281/zenodo.321866},
month = jul,
pages = {1-19},
publisher = {arXiv.org},
title = {A Comparison of Big Data Frameworks on a Layered Dataflow Model},
url = {http://arxiv.org/pdf/1606.05293v1.pdf},
year = {2016},
bdsk-url-1 = {http://arxiv.org/pdf/1606.05293v1.pdf},
bdsk-url-2 = {http://dx.doi.org/10.5281/zenodo.321866}
}

• B. Nicolae, C. H. A. Costa, C. Misale, K. Katrinis, and Y. Park, “Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics,” IEEE Transactions on Parallel and Distributed Systems, vol. PP, iss. 99, 2016. doi:10.1109/TPDS.2016.2627558

Big data analytics is an indispensable tool in transforming science, engineering, medicine, health-care, finance and ultimately business itself. With the explosion of data sizes and need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. In this context, data shuffling, a particularly difficult transformation pattern, introduces important challenges. Specifically, data shuffling is a key component of complex computations that has a major impact on the overall performance and scalability. Thus, speeding up data shuffling is a critical goal. To this end, state-of-the-art solutions often rely on overlapping the data transfers with the shuffling phase. However, they employ simple mechanisms to decide how much data and where to fetch it from, which leads to sub-optimal performance and excessive auxiliary memory utilization for the purpose of prefetching. The latter aspect is a growing concern, given evidence that memory per computation unit is continuously decreasing while interconnect bandwidth is increasing. This paper contributes a novel shuffle data transfer strategy that addresses the two aforementioned dimensions by dynamically adapting the prefetching to the computation. We implemented this novel strategy in Spark, a popular in-memory data analytics framework. To demonstrate the benefits of our proposal, we run extensive experiments on an HPC cluster with large core count per node. Compared with the default Spark shuffle strategy, our proposal shows: up to 40% better performance with 50% less memory utilization for buffering and excellent weak scalability.

@article{16:shuffle:tpds:misale,
abstract = {Big data analytics is an indispensable tool in transforming science, engineering, medicine, health-care, finance and ultimately business itself. With the explosion of data sizes and need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. In this context, data shuffling, a particularly difficult transformation pattern, introduces important challenges. Specifically, data shuffling is a key component of complex computations that has a major impact on the overall performance and scalability. Thus, speeding up data shuffling is a critical goal. To this end, state-of-the-art solutions often rely on overlapping the data transfers with the shuffling phase. However, they employ simple mechanisms to decide how much data and where to fetch it from, which leads to sub-optimal performance and excessive auxiliary memory utilization for the purpose of prefetching. The latter aspect is a growing concern, given evidence that memory per computation unit is continuously decreasing while interconnect bandwidth is increasing. This paper contributes a novel shuffle data transfer strategy that addresses the two aforementioned dimensions by dynamically adapting the prefetching to the computation. We implemented this novel strategy in Spark, a popular in-memory data analytics framework. To demonstrate the benefits of our proposal, we run extensive experiments on an HPC cluster with large core count per node. Compared with the default Spark shuffle strategy, our proposal shows: up to 40\% better performance with 50\% less memory utilization for buffering and excellent weak scalability.},
author = {Bogdan Nicolae and Carlos H. A. Costa and Claudia Misale and Kostas Katrinis and Yoonho Park},
date-modified = {2017-04-01 21:55:16 +0000},
doi = {10.1109/TPDS.2016.2627558},
journal = {IEEE Transactions on Parallel and Distributed Systems},
keywords = {ibm},
number = {99},
title = {Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics},
url = {https://iris.unito.it/retrieve/handle/2318/1624908/295954/tpds_4aperto.pdf},
volume = {PP},
year = {2016},
bdsk-url-1 = {http://dx.doi.org/10.1109/TPDS.2016.2627558},
bdsk-url-2 = {https://iris.unito.it/retrieve/handle/2318/1624908/295954/tpds_4aperto.pdf}
}

• F. Tordini, M. Drocco, C. Misale, L. Milanesi, P. Liò, I. Merelli, M. Torquati, and M. Aldinucci, “NuChart-II: the road to a fast and scalable tool for Hi-C data analysis,” International Journal of High Performance Computing Applications, pp. 1-16, 2016. doi:10.1177/1094342016668567

Recent advances in molecular biology and bioinformatics techniques brought an explosion of information about the spatial organisation of the DNA in the nucleus of a cell. High-throughput molecular biology techniques provide a genome-wide capture of the spatial organization of chromosomes at unprecedented scales, which permits the identification of physical interactions between genetic elements located throughout a genome. Recent results have shown that there is a large correlation between co-localization and co-regulation of genes, but this important information is hampered by the lack of biologist-friendly analysis and visualisation software. In this work we present NuChart-II, an efficient and highly optimized tool for genomic data analysis that provides a gene-centric, graph-based representation of genomic information. While designing NuChart-II we addressed several common issues in the parallelisation of memory bound algorithms for shared-memory systems. With performance and usability in mind, NuChart-II is an R package that embeds a C++ engine: computing capabilities and memory hierarchy of multi-core architectures are fully exploited, while the versatile R environment for statistical analysis and data visualisation raises the level of abstraction and permits the orchestration of analysis and visualisation of genomic data.

@article{16:ijhpca:nuchart,
abstract = {Recent advances in molecular biology and bioinformatics techniques brought to an explosion of the information about the spatial organisation of the DNA in the nucleus of a cell. High-throughput molecular biology techniques provide a genome-wide capture of the spatial organization of chromosomes at unprecedented scales, which permit to identify physical interactions between genetic elements located throughout a genome. Recent results have shown that there is a large correlation between co-localization and co-regulation of genes, but these important information are hampered by the lack of biologists-friendly analysis and visualisation software. In this work we present NuChart-II, an efficient and highly optimized tool for genomic data analysis that provides a gene-centric, graph-based representation of genomic information. While designing NuChart-II we addressed several common issues in the parallelisation of memory bound algorithms for shared-memory systems. With performance and usability in mind, NuChart-II is a R package that embeds a C++ engine: computing capabilities and memory hierarchy of multi-core architectures are fully exploited, while the versatile R environment for statistical analysis and data visualisation rises the level of abstraction and permits to orchestrate analysis and visualisation of genomic data.},
author = {Fabio Tordini and Maurizio Drocco and Claudia Misale and Luciano Milanesi and Pietro Li{\o} and Ivan Merelli and Massimo Torquati and Marco Aldinucci},
date-modified = {2017-12-12 13:40:17 +0000},
doi = {10.1177/1094342016668567},
journal = {International Journal of High Performance Computing Applications},
keywords = {fastflow, bioinformatics, repara, rephrase, interomics, mimomics},
pages = {1--16},
title = {{NuChart-II}: the road to a fast and scalable tool for {Hi-C} data analysis},
url = {https://iris.unito.it/retrieve/handle/2318/1607126/238747/main.pdf},
year = {2016},
bdsk-url-1 = {http://hdl.handle.net/2318/1607126},
bdsk-url-2 = {http://dx.doi.org/10.1177/1094342016668567}
}

• M. Drocco, C. Misale, and M. Aldinucci, “A Cluster-As-Accelerator approach for SPMD-free Data Parallelism,” in Proc. of 24th Euromicro Intl. Conference on Parallel Distributed and network-based Processing (PDP), Crete, Greece, 2016, pp. 350-353. doi:10.1109/PDP.2016.97

In this paper we present a novel approach for functional-style programming of distributed-memory clusters, targeting data-centric applications. The programming model proposed is purely sequential, SPMD-free and based on high-level functional features introduced since C++11 specification. Additionally, we propose a novel cluster-as-accelerator design principle. In this scheme, cluster nodes act as general interpreters of user-defined functional tasks over node-local portions of distributed data structures. We envision coupling a simple yet powerful programming model with a lightweight, locality-aware distributed runtime as a promising step along the road towards high-performance data analytics, in particular under the perspective of the upcoming exascale era. We implemented the proposed approach in SkeDaTo, a prototyping C++ library of data-parallel skeletons exploiting cluster-as-accelerator at the bottom layer of the runtime software stack.

@inproceedings{skedato:pdp:16,
abstract = {In this paper we present a novel approach for functional-style programming of distributed-memory clusters, targeting data-centric applications. The programming model proposed is purely sequential, SPMD-free and based on high-level functional features introduced since C++11 specification. Additionally, we propose a novel cluster-as-accelerator design principle. In this scheme, cluster nodes act as general interpreters of user-defined functional tasks over node-local portions of distributed data structures. We envision coupling a simple yet powerful programming model with a lightweight, locality-aware distributed runtime as a promising step along the road towards high-performance data analytics, in particular under the perspective of the upcoming exascale era. We implemented the proposed approach in SkeDaTo, a prototyping C++ library of data-parallel skeletons exploiting cluster-as-accelerator at the bottom layer of the runtime software stack.},
author = {Maurizio Drocco and Claudia Misale and Marco Aldinucci},
booktitle = {Proc. of 24th Euromicro Intl. Conference on Parallel Distributed and network-based Processing (PDP)},
date-modified = {2017-12-12 14:49:31 +0000},
doi = {10.1109/PDP.2016.97},
keywords = {rephrase, fastflow},
pages = {350--353},
publisher = {IEEE},
title = {A Cluster-As-Accelerator approach for {SPMD}-free Data Parallelism},
year = {2016},
bdsk-url-1 = {http://hdl.handle.net/2318/1611858},
bdsk-url-3 = {http://dx.doi.org/10.1109/PDP.2016.97}
}

• B. Nicolae, C. H. A. Costa, C. Misale, K. Katrinis, and Y. Park, “Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics,” in IEEE/ACM 16th Intl. Symposium on Cluster, Cloud and Grid Computing, CCGrid 2016, Cartagena, Colombia, 2016. doi:10.1109/CCGrid.2016.85

Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance and ultimately business itself. With the explosion of data sizes and need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. However, this introduces important challenges, among which data shuffling is particularly difficult: on one hand it is a key part of the computation that has a major impact on the overall performance and scalability so its efficiency is paramount, while on the other hand it needs to operate with scarce memory in order to leave as much memory available for data caching. In this context, efficient scheduling of data transfers such that it addresses both dimensions of the problem simultaneously is non-trivial. State-of-the-art solutions often rely on simple approaches that yield sub optimal performance and resource usage. This paper contributes a novel shuffle data transfer strategy that dynamically adapts to the computation with minimal memory utilization, which we briefly underline as a series of design principles.

@inproceedings{16:ccgrid:misale,
abstract = {Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance and ultimately business itself. With the explosion of data sizes and need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. However, this introduces important challenges, among which data shuffling is particularly difficult: on one hand it is a key part of the computation that has a major impact on the overall performance and scalability so its efficiency is paramount, while on the other hand it needs to operate with scarce memory in order to leave as much memory available for data caching. In this context, efficient scheduling of data transfers such that it addresses both dimensions of the problem simultaneously is non-trivial. State-of-the-art solutions often rely on simple approaches that yield sub optimal performance and resource usage. This paper contributes a novel shuffle data transfer strategy that dynamically adapts to the computation with minimal memory utilization, which we briefly underline as a series of design principles.},
author = {Bogdan Nicolae and Carlos H. A. Costa and Claudia Misale and Kostas Katrinis and Yoonho Park},
booktitle = {{IEEE/ACM} 16th Intl. Symposium on Cluster, Cloud and Grid Computing, CCGrid 2016},
date-modified = {2016-07-26 16:02:54 +0200},
doi = {10.1109/CCGrid.2016.85},
keywords = {spark, ibm},
publisher = {IEEE},
title = {Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics},
url = {http://ieeexplore.ieee.org/document/7515716/},
year = {2016},
bdsk-url-1 = {http://ieeexplore.ieee.org/document/7515716/},
bdsk-url-2 = {http://dx.doi.org/10.1109/CCGrid.2016.85}
}

• M. Aldinucci, M. Danelutto, M. Drocco, P. Kilpatrick, C. Misale, G. P. Pezzi, and M. Torquati, “A Parallel Pattern for Iterative Stencil + Reduce,” Journal of Supercomputing, pp. 1-16, 2016. doi:10.1007/s11227-016-1871-z

We advocate the Loop-of-stencil-reduce pattern as a means of simplifying the implementation of data-parallel programs on heterogeneous multi-core platforms. Loop-of-stencil-reduce is general enough to subsume map, reduce, map-reduce, stencil, stencil-reduce, and, crucially, their usage in a loop in both data-parallel and streaming applications, or a combination of both. The pattern makes it possible to deploy a single stencil computation kernel on different GPUs. We discuss the implementation of Loop-of-stencil-reduce in FastFlow, a framework for the implementation of applications based on the parallel patterns. Experiments are presented to illustrate the use of Loop-of-stencil-reduce in developing data-parallel kernels running on heterogeneous systems.

@article{16:stencilreduce:jsupe,
abstract = {We advocate the Loop-of-stencil-reduce pattern as a means of simplifying the implementation of data-parallel programs on heterogeneous multi-core platforms. Loop-of-stencil-reduce is general enough to subsume map, reduce, map-reduce, stencil, stencil-reduce, and, crucially, their usage in a loop in both data-parallel and streaming applications, or a combination of both. The pattern makes it possible to deploy a single stencil computation kernel on different GPUs. We discuss the implementation of Loop-of-stencil-reduce in FastFlow, a framework for the implementation of applications based on the parallel patterns. Experiments are presented to illustrate the use of Loop-of-stencil-reduce in developing data-parallel kernels running on heterogeneous systems.},
author = {Marco Aldinucci and Marco Danelutto and Maurizio Drocco and Peter Kilpatrick and Claudia Misale and Guilherme {Peretti Pezzi} and Massimo Torquati},
date-modified = {2016-09-23 07:40:20 +0000},
doi = {10.1007/s11227-016-1871-z},
journal = {Journal of Supercomputing},
keywords = {nvidia, repara, rephrase},
pages = {1--16},
title = {A Parallel Pattern for Iterative Stencil + Reduce},
url = {http://arxiv.org/pdf/1609.04567v1.pdf},
year = {2016},
bdsk-url-1 = {http://dx.doi.org/10.1007/s11227-016-1871-z},
bdsk-url-2 = {http://arxiv.org/pdf/1609.04567v1.pdf}
}
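
The Loop-of-stencil-reduce scheme above can be sketched as a plain sequential C++ loop that alternates a stencil map with a reduction driving the loop condition. The function below is an illustrative toy (a 3-point averaging stencil with a max-residual reduction), not the paper's FastFlow implementation; in the real pattern both phases would run in parallel over grid partitions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy loop-of-stencil-reduce: repeat { stencil map; reduce } until the
// reduced value (max pointwise change) falls below a threshold.
double loop_of_stencil_reduce(std::vector<double>& grid,
                              double eps, int max_iters) {
    std::vector<double> next(grid.size());
    double residual = 0.0;
    for (int it = 0; it < max_iters; ++it) {
        // Stencil phase (data-parallel): 3-point average on interior points.
        for (std::size_t i = 1; i + 1 < grid.size(); ++i)
            next[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0;
        next.front() = grid.front();   // boundary values stay fixed
        next.back() = grid.back();
        // Reduce phase (parallel reduction): maximum pointwise change.
        residual = 0.0;
        for (std::size_t i = 0; i < grid.size(); ++i)
            residual = std::max(residual, std::fabs(next[i] - grid[i]));
        grid.swap(next);
        if (residual < eps) break;     // the reduce drives the loop condition
    }
    return residual;
}
```

The point of the pattern is that only this one kernel-plus-loop structure needs to be expressed; the framework can then deploy the same stencil kernel on multi-cores or GPUs without the programmer restructuring the code.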

### 2015

• F. Tordini, M. Drocco, C. Misale, L. Milanesi, P. Liò, I. Merelli, and M. Aldinucci, “Parallel Exploration of the Nuclear Chromosome Conformation with NuChart-II,” in Proc. of 23rd Euromicro Intl. Conference on Parallel Distributed and network-based Processing (PDP), 2015. doi:10.1109/PDP.2015.104

High-throughput molecular biology techniques are widely used to identify physical interactions between genetic elements located throughout the human genome. Chromosome Conformation Capture (3C) and other related techniques allow investigation of the spatial organisation of chromosomes in the cell’s natural state. Recent results have shown that there is a large correlation between co-localization and co-regulation of genes, but this important information is hampered by the lack of biologist-friendly analysis and visualisation software. In this work we introduce NuChart-II, a tool for Hi-C data analysis that provides a gene-centric view of the chromosomal neighbourhood in a graph-based manner. NuChart-II is an efficient and highly optimized C++ re-implementation of a previous prototype package developed in R. Representing Hi-C data using a graph-based approach overcomes the common view relying on genomic coordinates and permits the use of graph analysis techniques to explore the spatial conformation of a gene neighbourhood.

@inproceedings{nuchar:tool:15,
abstract = {High-throughput molecular biology techniques are widely used to identify physical interactions between genetic elements located throughout the human genome. Chromosome Conformation Capture (3C) and other related techniques allow investigation of the spatial organisation of chromosomes in the cell's natural state. Recent results have shown that there is a large correlation between co-localization and co-regulation of genes, but this important information is hampered by the lack of biologist-friendly analysis and visualisation software. In this work we introduce NuChart-II, a tool for Hi-C data analysis that provides a gene-centric view of the chromosomal neighbourhood in a graph-based manner. NuChart-II is an efficient and highly optimized C++ re-implementation of a previous prototype package developed in R. Representing Hi-C data using a graph-based approach overcomes the common view relying on genomic coordinates and permits the use of graph analysis techniques to explore the spatial conformation of a gene neighbourhood.},
author = {Fabio Tordini and Maurizio Drocco and Claudia Misale and Luciano Milanesi and Pietro Li{\o} and Ivan Merelli and Marco Aldinucci},
booktitle = {Proc. of 23rd Euromicro Intl. Conference on Parallel Distributed and network-based Processing (PDP)},
date-modified = {2017-12-12 13:55:10 +0000},
doi = {10.1109/PDP.2015.104},
keywords = {fastflow, bioinformatics, paraphrase, repara, impact},
month = mar,
publisher = {IEEE},
title = {Parallel Exploration of the Nuclear Chromosome Conformation with {NuChart-II}},
year = {2015},
bdsk-url-2 = {http://dx.doi.org/10.1109/PDP.2015.104}
}
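
As a rough illustration of the gene-centric, graph-based view described above (the data layout and function names are invented for this example and are not NuChart-II's actual API), pairwise contacts such as those produced by Hi-C experiments can be turned into an adjacency map and explored with a depth-bounded breadth-first search around a root gene:

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Collect all genes reachable from `root` within `depth` contact hops.
// Contacts are undirected pairs of gene names (illustrative only).
std::set<std::string> neighbourhood(
        const std::vector<std::pair<std::string, std::string>>& contacts,
        const std::string& root, int depth) {
    std::map<std::string, std::vector<std::string>> adj;
    for (const auto& c : contacts) {          // build undirected contact graph
        adj[c.first].push_back(c.second);
        adj[c.second].push_back(c.first);
    }
    std::set<std::string> seen{root};
    std::queue<std::pair<std::string, int>> frontier;
    frontier.push({root, 0});
    while (!frontier.empty()) {               // BFS bounded by `depth`
        auto [gene, d] = frontier.front();
        frontier.pop();
        if (d == depth) continue;             // do not expand past the bound
        for (const auto& g : adj[gene])
            if (seen.insert(g).second)        // enqueue each gene only once
                frontier.push({g, d + 1});
    }
    return seen;
}
```

The graph representation is what enables this kind of query: with genomic coordinates alone, a "two hops from gene A" question has no direct formulation.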

• M. Drocco, C. Misale, G. P. Pezzi, F. Tordini, and M. Aldinucci, “Memory-Optimised Parallel Processing of Hi-C Data,” in Proc. of 23rd Euromicro Intl. Conference on Parallel Distributed and network-based Processing (PDP), 2015, pp. 1-8. doi:10.1109/PDP.2015.63

This paper presents the optimisation efforts on the creation of a graph-based mapping representation of gene adjacency. The method is based on the Hi-C process, starting from Next Generation Sequencing data, and it analyses a huge amount of static data in order to produce maps for one or more genes. Straightforward parallelisation of this scheme does not yield acceptable performance on multicore architectures, since scalability is rather limited due to the memory-bound nature of the problem. This work focuses on the memory optimisations that can be applied to the graph construction algorithm and its (complex) data structures to derive a cache-oblivious algorithm and eventually to improve memory bandwidth utilisation. We used as a running example NuChart-II, a tool for annotation and statistical analysis of Hi-C data that creates a gene-centric neighborhood graph. The proposed approach, which is exemplified for Hi-C, addresses several common issues in the parallelisation of memory-bound algorithms for multicore. Results show that the proposed approach is able to increase the parallel speedup from 7x to 22x (on a 32-core platform). Finally, the proposed C++ implementation outperforms the first R NuChart prototype, with which it was not possible to complete the graph generation because of strong memory-saturation problems.

@inproceedings{nuchart:speedup:15,
abstract = {This paper presents the optimisation efforts on the creation of a graph-based mapping representation of gene adjacency. The method is based on the Hi-C process, starting from Next Generation Sequencing data, and it analyses a huge amount of static data in order to produce maps for one or more genes. Straightforward parallelisation of this scheme does not yield acceptable performance on multicore architectures, since scalability is rather limited due to the memory-bound nature of the problem. This work focuses on the memory optimisations that can be applied to the graph construction algorithm and its (complex) data structures to derive a cache-oblivious algorithm and eventually to improve memory bandwidth utilisation. We used as a running example NuChart-II, a tool for annotation and statistical analysis of Hi-C data that creates a gene-centric neighborhood graph. The proposed approach, which is exemplified for Hi-C, addresses several common issues in the parallelisation of memory-bound algorithms for multicore. Results show that the proposed approach is able to increase the parallel speedup from 7x to 22x (on a 32-core platform). Finally, the proposed C++ implementation outperforms the first R NuChart prototype, with which it was not possible to complete the graph generation because of strong memory-saturation problems.},
author = {Maurizio Drocco and Claudia Misale and Guilherme {Peretti Pezzi} and Fabio Tordini and Marco Aldinucci},
booktitle = {Proc. of 23rd Euromicro Intl. Conference on Parallel Distributed and network-based Processing (PDP)},
date-modified = {2017-12-12 14:45:09 +0000},
doi = {10.1109/PDP.2015.63},
keywords = {fastflow,bioinformatics, paraphrase, repara, impact},
month = mar,
pages = {1-8},
publisher = {IEEE},
title = {Memory-Optimised Parallel Processing of {Hi-C} Data},
year = {2015},
bdsk-url-2 = {http://dx.doi.org/10.1109/PDP.2015.63}
}

### 2014

• C. Misale, G. Ferrero, M. Torquati, and M. Aldinucci, “Sequence alignment tools: one parallel pattern to rule them all?,” BioMed Research International, 2014. doi:10.1155/2014/539410

In this paper we advocate a high-level programming methodology for Next Generation Sequencers (NGS) alignment tools, for both productivity and absolute performance. We analyse the problem of parallel alignment and review the parallelisation strategies of the most popular alignment tools, which can all be abstracted to a single parallel paradigm. We compare these tools against their porting onto the FastFlow pattern-based programming framework, which provides programmers with high-level parallel patterns. By using a high-level approach, programmers are liberated from all complex aspects of parallel programming, such as synchronisation protocols and task scheduling, gaining more possibility for seamless performance tuning. In this work we show some use cases in which, by using a high-level approach for parallelising NGS tools, it is possible to obtain comparable or even better absolute performance for all used datasets.

@article{bowtie-bwa:ff:multicore:biomed:14,
abstract = {In this paper we advocate a high-level programming methodology for Next Generation Sequencers (NGS) alignment tools, for both productivity and absolute performance. We analyse the problem of parallel alignment and review the parallelisation strategies of the most popular alignment tools, which can all be abstracted to a single parallel paradigm. We compare these tools against their porting onto the FastFlow pattern-based programming framework, which provides programmers with high-level parallel patterns. By using a high-level approach, programmers are liberated from all complex aspects of parallel programming, such as synchronisation protocols and task scheduling, gaining more possibility for seamless performance tuning. In this work we show some use cases in which, by using a high-level approach for parallelising NGS tools, it is possible to obtain comparable or even better absolute performance for all used datasets.},
author = {Claudia Misale and Giulio Ferrero and Massimo Torquati and Marco Aldinucci},
date-modified = {2015-09-27 12:16:28 +0000},
doi = {10.1155/2014/539410},
journal = {BioMed Research International},
keywords = {fastflow,bioinformatics, paraphrase, repara},
title = {Sequence alignment tools: one parallel pattern to rule them all?},
year = {2014},
bdsk-url-2 = {http://dx.doi.org/10.1155/2014/539410}
}

• C. Misale, “Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity,” in Proc. of Intl. Euromicro PDP 2014: Parallel Distributed and network-based Processing, Torino, Italy, 2014. doi:10.1109/PDP.2014.50

The implementation of DNA alignment tools for Bioinformatics leads to facing different problems that affect performance. A single alignment takes an amount of time that is not predictable, and different factors can affect performance: for instance, the length of the sequences can determine the computational grain of the task, and mismatches or insertions/deletions (indels) increase the time needed to complete an alignment. Moreover, alignment is a strongly memory-bound problem because of irregular memory access patterns and limitations in memory bandwidth. Over the years, many alignment tools have been implemented. A concrete example is Bowtie2, one of the fastest (concurrent, Pthread-based) state-of-the-art non-GPU-based alignment tools. Bowtie2 exploits concurrency by instantiating a pool of threads, which have access to a global input dataset, share the reference genome and have access to different objects for collecting alignment results. In this paper a modified implementation of Bowtie2 is presented, in which the concurrency structure has been changed. The proposed implementation exploits the task-farm skeleton pattern implemented as a Master-Worker. The Master-Worker pattern permits delegating dataset reading to the Master thread alone and making private to each Worker the data structures that are shared in the original version. Only the reference genome is left shared. As a further optimisation, the Master and each Worker were pinned on cores, and the reference genome was allocated interleaved among memory nodes. The proposed implementation is able to gain up to 10 speedup points over the original implementation.

@inproceedings{ff:bowtie2:pdp:14,
abstract = {The implementation of DNA alignment tools for Bioinformatics leads to facing different problems that affect performance. A single alignment takes an amount of time that is not predictable, and different factors can affect performance: for instance, the length of the sequences can determine the computational grain of the task, and mismatches or insertions/deletions (indels) increase the time needed to complete an alignment. Moreover, alignment is a strongly memory-bound problem because of irregular memory access patterns and limitations in memory bandwidth. Over the years, many alignment tools have been implemented. A concrete example is Bowtie2, one of the fastest (concurrent, Pthread-based) state-of-the-art non-GPU-based alignment tools. Bowtie2 exploits concurrency by instantiating a pool of threads, which have access to a global input dataset, share the reference genome and have access to different objects for collecting alignment results. In this paper a modified implementation of Bowtie2 is presented, in which the concurrency structure has been changed. The proposed implementation exploits the task-farm skeleton pattern implemented as a Master-Worker. The Master-Worker pattern permits delegating dataset reading to the Master thread alone and making private to each Worker the data structures that are shared in the original version. Only the reference genome is left shared. As a further optimisation, the Master and each Worker were pinned on cores, and the reference genome was allocated interleaved among memory nodes. The proposed implementation is able to gain up to 10 speedup points over the original implementation.},
author = {Claudia Misale},
booktitle = {Proc. of Intl. Euromicro PDP 2014: Parallel Distributed and network-based Processing},
date-modified = {2015-09-27 12:41:24 +0000},
doi = {10.1109/PDP.2014.50},
editor = {Marco Aldinucci and Daniele D'Agostino and Peter Kilpatrick},
keywords = {fastflow, paraphrase},
note = {(Best paper award)},
publisher = {IEEE},
title = {Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity},
year = {2014},
bdsk-url-2 = {http://dx.doi.org/10.1109/PDP.2014.50}
}
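
The Master-Worker structure described in the abstract can be sketched in plain C++11 threads. This is an illustrative toy, not Bowtie2's actual code: the Master alone reads the dataset and dispatches reads, each Worker keeps a private result slot, and only the reference "genome" (here just a string) is shared read-only; core pinning and interleaved allocation are omitted.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Minimal blocking task queue: Master pushes, Workers pop, close() ends it.
struct TaskQueue {
    std::queue<std::string> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    void push(std::string s) {
        { std::lock_guard<std::mutex> l(m); q.push(std::move(s)); }
        cv.notify_one();
    }
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !q.empty() || done; });
        if (q.empty()) return false;          // closed and drained
        out = std::move(q.front()); q.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> l(m); done = true; }
        cv.notify_all();
    }
};

// Stand-in for alignment: each Worker counts how many of its reads occur
// as substrings of the shared, read-only reference.
int master_worker_count(const std::vector<std::string>& dataset,
                        const std::string& reference, int n_workers) {
    TaskQueue queue;
    std::vector<int> hits(n_workers, 0);      // private per-Worker results
    std::vector<std::thread> workers;
    for (int w = 0; w < n_workers; ++w)
        workers.emplace_back([&, w] {
            std::string read;
            while (queue.pop(read))           // Workers pull tasks
                if (reference.find(read) != std::string::npos)
                    ++hits[w];                // no sharing between Workers
        });
    for (const auto& r : dataset) queue.push(r);  // Master alone dispatches
    queue.close();
    for (auto& t : workers) t.join();
    int total = 0;
    for (int h : hits) total += h;            // merge after Workers finish
    return total;
}
```

Keeping result objects private to each Worker and merging only at the end is the same idea that removes the shared-object contention of the original pool-of-threads design.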

• M. Aldinucci, M. Drocco, G. P. Pezzi, C. Misale, F. Tordini, and M. Torquati, “Exercising high-level parallel programming on streams: a systems biology use case,” in Proc. of 34th IEEE Intl. Conference on Distributed Computing Systems Workshops (ICDCSW), Madrid, Spain, 2014. doi:10.1109/ICDCSW.2014.38

The stochastic modelling of biological systems, coupled with Monte Carlo simulation of models, is an increasingly popular technique in Bioinformatics. The simulation-analysis workflow may result in a computationally expensive task, reducing the interactivity required in model tuning. In this work, we advocate high-level software design as a vehicle for building efficient and portable parallel simulators for a variety of platforms, ranging from multi-core platforms to GPGPUs to the cloud. In particular, the Calculus of Wrapped Compartments (CWC) parallel simulator for systems biology, equipped with on-line mining of results, which is designed according to the FastFlow pattern-based approach, is discussed as a running example. In this work, the CWC simulator is used as a paradigmatic example of a complex C++ application where the quality of results is correlated with both computation and I/O bounds, and where high-quality results might turn into big data. The FastFlow parallel programming framework, which advocates C++ pattern-based parallel programming, makes it possible to develop portable parallel code without relinquishing either run-time efficiency or performance tuning opportunities. Performance and effectiveness of the approach are validated on a variety of platforms, inter alia cache-coherent multi-cores, clusters of multi-cores (Ethernet and Infiniband) and the Amazon Elastic Compute Cloud.

@inproceedings{cwc:gpu:dcperf:14,
abstract = {The stochastic modelling of biological systems, coupled with Monte Carlo simulation of models, is an increasingly popular technique in Bioinformatics. The simulation-analysis workflow may result in a computationally expensive task, reducing the interactivity required in model tuning. In this work, we advocate high-level software design as a vehicle for building efficient and portable parallel simulators for a variety of platforms, ranging from multi-core platforms to GPGPUs to the cloud. In particular, the Calculus of Wrapped Compartments (CWC) parallel simulator for systems biology, equipped with on-line mining of results, which is designed according to the FastFlow pattern-based approach, is discussed as a running example. In this work, the CWC simulator is used as a paradigmatic example of a complex C++ application where the quality of results is correlated with both computation and I/O bounds, and where high-quality results might turn into big data. The FastFlow parallel programming framework, which advocates C++ pattern-based parallel programming, makes it possible to develop portable parallel code without relinquishing either run-time efficiency or performance tuning opportunities. Performance and effectiveness of the approach are validated on a variety of platforms, inter alia cache-coherent multi-cores, clusters of multi-cores (Ethernet and Infiniband) and the Amazon Elastic Compute Cloud.},
author = {Marco Aldinucci and Maurizio Drocco and Guilherme {Peretti Pezzi} and Claudia Misale and Fabio Tordini and Massimo Torquati},
booktitle = {Proc. of 34th IEEE Intl. Conference on Distributed Computing Systems Workshops (ICDCSW)},
date-modified = {2017-12-12 13:53:58 +0000},
doi = {10.1109/ICDCSW.2014.38},
keywords = {fastflow, gpu, bioinformatics, paraphrase, impact, nvidia},
publisher = {IEEE},
title = {Exercising high-level parallel programming on streams: a systems biology use case},
year = {2014},
bdsk-url-2 = {http://dx.doi.org/10.1109/ICDCSW.2014.38}
}
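
The simulate-then-analyse pipelining mentioned above (analysis starts emitting results while simulations are still running) can be sketched with two threads and a stream between them. This is an assumed structure for illustration, not the CWC simulator's code: the "simulation" is a deterministic stand-in, and the analysis stage folds each outcome into an online running mean as soon as it arrives.

```cpp
#include <cmath>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Two-stage pipeline: a simulator stage streams outcomes, an analyser stage
// consumes them incrementally, so statistics exist before all runs finish.
double pipelined_mean(int n_runs) {
    std::queue<double> stream;
    std::mutex m;
    std::condition_variable cv;
    bool finished = false;
    double mean = 0.0;
    std::thread simulator([&] {
        for (int i = 1; i <= n_runs; ++i) {
            double outcome = static_cast<double>(i);  // stand-in for one run
            { std::lock_guard<std::mutex> l(m); stream.push(outcome); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> l(m); finished = true; }
        cv.notify_one();
    });
    std::thread analyser([&] {
        long count = 0;
        for (;;) {
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [&] { return !stream.empty() || finished; });
            if (stream.empty()) break;                // producer done, drained
            double x = stream.front(); stream.pop();
            l.unlock();
            ++count;
            mean += (x - mean) / count;               // online mean update
        }
    });
    simulator.join();
    analyser.join();
    return mean;
}
```

Because the analyser needs only the running aggregate, its memory footprint stays constant no matter how many simulation trajectories are produced, which is the property that keeps "high-quality results" from turning into a big-data storage problem.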

• M. Aldinucci, M. Torquati, C. Spampinato, M. Drocco, C. Misale, C. Calcagno, and M. Coppo, “Parallel stochastic systems biology in the cloud,” Briefings in Bioinformatics, vol. 15, iss. 5, pp. 798-813, 2014. doi:10.1093/bib/bbt040

The stochastic modelling of biological systems, coupled with Monte Carlo simulation of models, is an increasingly popular technique in bioinformatics. The simulation-analysis workflow may prove computationally expensive, reducing the interactivity required in model tuning. In this work, we advocate high-level software design as a vehicle for building efficient and portable parallel simulators for the cloud. In particular, the Calculus of Wrapped Compartments (CWC) simulator for systems biology, which is designed according to the FastFlow pattern-based approach, is presented and discussed. Thanks to the FastFlow framework, the CWC simulator is designed as a high-level workflow that can simulate CWC models, merge simulation results and statistically analyse them in a single parallel workflow in the cloud. To improve interactivity, successive phases are pipelined in such a way that the workflow begins to output a stream of analysis results immediately after the simulation is started. Performance and effectiveness of the CWC simulator are validated on the Amazon Elastic Compute Cloud.

@article{cwc:cloud:bib:13,
abstract = {The stochastic modelling of biological systems, coupled with Monte Carlo simulation of models, is an increasingly popular technique in bioinformatics. The simulation-analysis workflow may prove computationally expensive, reducing the interactivity required in model tuning. In this work, we advocate high-level software design as a vehicle for building efficient and portable parallel simulators for the cloud. In particular, the Calculus of Wrapped Compartments (CWC) simulator for systems biology, which is designed according to the FastFlow pattern-based approach, is presented and discussed. Thanks to the FastFlow framework, the CWC simulator is designed as a high-level workflow that can simulate CWC models, merge simulation results and statistically analyse them in a single parallel workflow in the cloud. To improve interactivity, successive phases are pipelined in such a way that the workflow begins to output a stream of analysis results immediately after the simulation is started. Performance and effectiveness of the CWC simulator are validated on the Amazon Elastic Compute Cloud.},
author = {Marco Aldinucci and Massimo Torquati and Concetto Spampinato and Maurizio Drocco and Claudia Misale and Cristina Calcagno and Mario Coppo},
date-modified = {2015-09-27 12:33:52 +0000},
doi = {10.1093/bib/bbt040},
issn = {1467-5463},
journal = {Briefings in Bioinformatics},
keywords = {fastflow, bioinformatics, cloud, paraphrase, impact, biobits},
number = {5},
pages = {798-813},
title = {Parallel stochastic systems biology in the cloud},
volume = {15},
year = {2014},
bdsk-url-1 = {http://dx.doi.org/10.1093/bib/bbt040},
}

### 2013

• C. Misale, M. Aldinucci, and M. Torquati, “Memory affinity in multi-threading: the Bowtie2 case study,” in Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES) — Poster Abstracts, Fiuggi, Italy, 2013.

The diffusion of Next Generation Sequencing (NGS) has increased the amount of data obtainable by genomic experiments. From a DNA sample, an NGS run is able to produce millions of short sequences (called reads), which should be mapped onto a reference genome. In this paper, we analyse the performance of Bowtie2, a fast and popular DNA mapping tool. Bowtie2 exhibits a multithreaded implementation on top of pthreads, spin-locks and the SSE2 SIMD extension. From a parallel computing viewpoint, it is a paradigmatic example of software that must address three fundamental problems in shared-memory programming for cache-coherent multi-core platforms: synchronisation efficiency at very fine grain (due to short reads), load balancing (due to long reads), and efficient usage of the memory subsystem (due to SSE2 memory pressure). We compare the original implementation against an alternative implementation on top of the FastFlow pattern-based programming framework. The proposed design exploits the high-level farm pattern of FastFlow, which is implemented on top of non-blocking multi-threading and lock-less (CAS-free) queues, and provides the programmer with high-level mechanisms to tune task scheduling to achieve both load balancing and memory affinity. The proposed design, despite its high-level nature, is always faster and more scalable than the original one. The designs of both the original and the alternative version are presented along with their experimental evaluation on real-world data sets.

@inproceedings{ff:acaces:13,
abstract = {The diffusion of Next Generation Sequencing (NGS) has increased the amount of data obtainable by genomic experiments. From a DNA sample, an NGS run is able to produce millions of short sequences (called reads), which should be mapped onto a reference genome. In this paper, we analyse the performance of Bowtie2, a fast and popular DNA mapping tool. Bowtie2 exhibits a multithreaded implementation on top of pthreads, spin-locks and the SSE2 SIMD extension. From a parallel computing viewpoint, it is a paradigmatic example of software that must address three fundamental problems in shared-memory programming for cache-coherent multi-core platforms: synchronisation efficiency at very fine grain (due to short reads), load balancing (due to long reads), and efficient usage of the memory subsystem (due to SSE2 memory pressure). We compare the original implementation against an alternative implementation on top of the FastFlow pattern-based programming framework. The proposed design exploits the high-level farm pattern of FastFlow, which is implemented on top of non-blocking multi-threading and lock-less (CAS-free) queues, and provides the programmer with high-level mechanisms to tune task scheduling to achieve both load balancing and memory affinity. The proposed design, despite its high-level nature, is always faster and more scalable than the original one. The designs of both the original and the alternative version are presented along with their experimental evaluation on real-world data sets.},
author = {Claudia Misale and Marco Aldinucci and Massimo Torquati},
booktitle = {Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES) --- Poster Abstracts},
address = {Fiuggi, Italy},
title = {Memory affinity in multi-threading: the {Bowtie2} case study},
year = {2013}
}