## FastFlow Papers

### Articles

• F. Tordini, M. Drocco, C. Misale, L. Milanesi, P. Liò, I. Merelli, M. Torquati, and M. Aldinucci, “NuChart-II: the road to a fast and scalable tool for Hi-C data analysis,” International Journal of High Performance Computing Applications (IJHPCA), pp. 1-16, 2016. doi:10.1177/1094342016668567
[BibTeX] [Abstract]

Recent advances in molecular biology and bioinformatics techniques have led to an explosion of information about the spatial organisation of DNA in the nucleus of a cell. High-throughput molecular biology techniques provide a genome-wide capture of the spatial organisation of chromosomes at unprecedented scales, permitting the identification of physical interactions between genetic elements located throughout a genome. Recent results have shown a strong correlation between co-localisation and co-regulation of genes, but this important information is hampered by the lack of biologist-friendly analysis and visualisation software. In this work we present NuChart-II, an efficient and highly optimised tool for genomic data analysis that provides a gene-centric, graph-based representation of genomic information. While designing NuChart-II we addressed several common issues in the parallelisation of memory-bound algorithms for shared-memory systems. With performance and usability in mind, NuChart-II is an R package that embeds a C++ engine: the computing capabilities and memory hierarchy of multi-core architectures are fully exploited, while the versatile R environment for statistical analysis and data visualisation raises the level of abstraction and permits the orchestration of analysis and visualisation of genomic data.

@article{16:ijhpca:nuchart,
abstract = {Recent advances in molecular biology and bioinformatics techniques brought to an explosion of the information about the spatial organisation of the DNA in the nucleus of a cell. High-throughput molecular biology techniques provide a genome-wide capture of the spatial organization of chromosomes at unprecedented scales, which permit to identify physical interactions between genetic elements located throughout a genome. Recent results have shown that there is a large correlation between co-localization and co-regulation of genes, but these important information are hampered by the lack of biologists-friendly analysis and visualisation software. In this work we present NuChart-II, an efficient and highly optimized tool for genomic data analysis that provides a gene-centric, graph-based representation of genomic information. While designing NuChart-II we addressed several common issues in the parallelisation of memory bound algorithms for shared-memory systems. With performance and usability in mind, NuChart-II is a R package that embeds a C++ engine: computing capabilities and memory hierarchy of multi-core architectures are fully exploited, while the versatile R environment for statistical analysis and data visualisation rises the level of abstraction and permits to orchestrate analysis and visualisation of genomic data.},
author = {Fabio Tordini and Maurizio Drocco and Claudia Misale and Luciano Milanesi and Pietro Li{\`o} and Ivan Merelli and Massimo Torquati and Marco Aldinucci},
date-modified = {2016-10-09 21:55:39 +0000},
doi = {10.1177/1094342016668567},
journal = {International Journal of High Performance Computing Applications (IJHPCA)},
keywords = {fastflow, bioinformatics, repara, rephrase, interomics, mimomics},
pages = {1--16},
title = {{NuChart-II}: the road to a fast and scalable tool for {Hi-C} data analysis},
year = {2016},
bdsk-url-1 = {http://hdl.handle.net/2318/1607126},
bdsk-url-2 = {http://dx.doi.org/10.1177/1094342016668567}
}

• A. Bracciali, M. Aldinucci, M. Patterson, T. Marschall, N. Pisanti, I. Merelli, and M. Torquati, “pWhatsHap: efficient haplotyping for future generation sequencing,” BMC Bioinformatics, vol. 17, iss. Suppl 11, p. 342, 2016. doi:10.1186/s12859-016-1170-y
[BibTeX] [Abstract] [Download PDF]

Background: Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which one of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions typically have an exponential computational complexity. WhatsHap is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e., coverage, and is hence of particular interest given sequencing technology’s current trend towards longer fragments. Results: Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered pWhatsHap, a parallel, high-performance version of WhatsHap. pWhatsHap is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WhatsHap, pWhatsHap exhibits the same complexity, exploring a number of possible solutions that is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a significant reduction of the execution time for haplotyping, while the results enjoy the same high accuracy as those provided by WhatsHap, which increases with coverage. Conclusions: Due to its structure and the management of large datasets, the parallelisation of WhatsHap posed demanding technical challenges, which have been addressed by exploiting a high-level parallel programming framework. The result, pWhatsHap, is a freely available toolkit that improves the efficiency of the analysis of genomics information.

@article{16:pwhatshap:bmc,
abstract = {Background: Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which one of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions have typically an exponential computational complexity. WhatsHap is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e., coverage, and is hence of particular interest when considering sequencing technology's current trends that are producing longer fragments.
Results: Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered pWhatsHap, a parallel, high-performance version of WhatsHap. pWhatsHap is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WhatsHap, pWhatsHap exhibits the same complexity exploring a number of possible solutions which is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a relevant reduction of the execution time for haplotyping, while the provided results enjoy the same high accuracy as that provided by WhatsHap, which increases with coverage.
Conclusions: Due to its structure and management of the large datasets, the parallelisation of WhatsHap posed demanding technical challenges, which have been addressed exploiting a high-level parallel programming framework. The result, pWhatsHap, is a freely available toolkit that improves the efficiency of the analysis of genomics information.
},
author = {Andrea Bracciali and Marco Aldinucci and Murray Patterson and Tobias Marschall and Nadia Pisanti and Ivan Merelli and Massimo Torquati},
date-modified = {2016-10-17 17:28:27 +0000},
doi = {10.1186/s12859-016-1170-y},
journal = {BMC Bioinformatics},
keywords = {fastflow, paraphrase, rephrase},
number = {Suppl 11},
pages = {342},
title = {{pWhatsHap}: efficient haplotyping for future generation sequencing},
url = {http://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-016-1170-y?site=bmcbioinformatics.biomedcentral.com},
volume = {17},
year = {2016},
bdsk-url-1 = {http://hdl.handle.net/2318/1607125},
bdsk-url-2 = {http://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-016-1170-y?site=bmcbioinformatics.biomedcentral.com},
bdsk-url-3 = {http://dx.doi.org/10.1186/s12859-016-1170-y}
}

• M. Aldinucci, S. Campa, M. Danelutto, P. Kilpatrick, and M. Torquati, “Pool Evolution: A Parallel Pattern for Evolutionary and Symbolic Computing,” International Journal of Parallel Programming, vol. 44, iss. 3, pp. 531-551, 2016. doi:10.1007/s10766-015-0358-5
[BibTeX] [Abstract] [Download PDF]

We introduce a new parallel pattern derived from a specific application domain and show how it turns out to have application beyond its domain of origin. The pool evolution pattern models the parallel evolution of a population subject to mutations and evolving in such a way that a given fitness function is optimized. The pattern has been demonstrated to be suitable for capturing and modeling the parallel patterns underpinning various evolutionary algorithms, as well as other parallel patterns typical of symbolic computation. In this paper we introduce the pattern, discuss its implementation on modern multi/many-core architectures, and present experimental results obtained with FastFlow and Erlang implementations to assess its feasibility and scalability.

@article{pool:ijpp:15,
abstract = {We introduce a new parallel pattern derived from a specific application domain and show how it turns out to have application beyond its domain of origin. The pool evolution pattern models the parallel evolution of a population subject to mutations and evolving in such a way that a given fitness function is optimized. The pattern has been demonstrated to be suitable for capturing and modeling the parallel patterns underpinning various evolutionary algorithms, as well as other parallel patterns typical of symbolic computation. In this paper we introduce the pattern, we discuss its implementation on modern multi/many core architectures and finally present experimental results obtained with FastFlow and Erlang implementations to assess its feasibility and scalability.},
author = {Marco Aldinucci and Sonia Campa and Marco Danelutto and Peter Kilpatrick and Massimo Torquati},
date-added = {2015-03-21 22:15:47 +0000},
date-modified = {2015-09-24 11:15:53 +0000},
doi = {10.1007/s10766-015-0358-5},
issn = {0885-7458},
journal = {International Journal of Parallel Programming},
keywords = {fastflow, paraphrase, repara},
number = {3},
pages = {531--551},
publisher = {Springer US},
title = {Pool Evolution: A Parallel Pattern for Evolutionary and Symbolic Computing},
url = {http://calvados.di.unipi.it/storage/paper_files/2015_ff_pool_ijpp.pdf},
volume = {44},
year = {2016},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2015_ff_pool_ijpp.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1007/s10766-015-0358-5}
}

• M. Aldinucci, G. P. Pezzi, M. Drocco, C. Spampinato, and M. Torquati, “Parallel Visual Data Restoration on Multi-GPGPUs using Stencil-Reduce Pattern,” International Journal of High Performance Computing Applications, vol. 29, iss. 4, pp. 461-472, 2015. doi:10.1177/1094342014567907
[BibTeX] [Abstract] [Download PDF]

In this paper, a highly effective parallel filter for visual data restoration is presented. The filter is designed following a skeletal approach, using a newly proposed stencil-reduce, and has been implemented by way of the FastFlow parallel programming library. As a result of its high-level design, it is possible to run the filter seamlessly on a multicore machine, on multi-GPGPUs, or on both. The design and implementation of the filter are discussed, and an experimental evaluation is presented.

@article{ff:denoiser:ijhpca:15,
abstract = {In this paper, a highly effective parallel filter for visual data restoration is presented. The filter is designed following a skeletal approach, using a newly proposed stencil-reduce, and has been implemented by way of the FastFlow parallel programming library. As a result of its high-level design, it is possible to run the filter seamlessly on a multicore machine, on multi-GPGPUs, or on both. The design and implementation of the filter are discussed, and an experimental evaluation is presented.},
author = {Marco Aldinucci and Guilherme {Peretti Pezzi} and Maurizio Drocco and Concetto Spampinato and Massimo Torquati},
date-added = {2014-08-23 00:06:10 +0000},
date-modified = {2015-09-24 11:21:20 +0000},
doi = {10.1177/1094342014567907},
journal = {International Journal of High Performance Computing Applications},
keywords = {fastflow, paraphrase, impact, nvidia},
number = {4},
pages = {461-472},
title = {Parallel Visual Data Restoration on Multi-{GPGPUs} using Stencil-Reduce Pattern},
url = {http://calvados.di.unipi.it/storage/paper_files/2015_ff_stencilreduce_ijhpca.pdf},
volume = {29},
year = {2015},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2015_ff_stencilreduce_ijhpca.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1177/1094342014567907}
}

• I. Merelli, F. Tordini, M. Drocco, M. Aldinucci, P. Liò, and L. Milanesi, “Integrating Multi-omic features exploiting Chromosome Conformation Capture data,” Frontiers in Genetics, vol. 6, iss. 40, 2015. doi:10.3389/fgene.2015.00040
[BibTeX] [Abstract] [Download PDF]

The representation, integration and interpretation of omic data is a complex task, in particular considering the huge amount of information that is produced daily in molecular biology laboratories all around the world. The reason is that sequencing data regarding expression profiles, methylation patterns, and chromatin domains are difficult to harmonize in a systems biology view, since genome browsers only allow coordinate-based representations, discarding functional clusters created by the spatial conformation of the DNA in the nucleus. In this context, recent progress in high-throughput molecular biology techniques and bioinformatics has provided insights into chromatin interactions on a larger scale and offers formidable support for the interpretation of multi-omic data. In particular, a novel sequencing technique called Chromosome Conformation Capture (3C) allows the analysis of chromosome organization in the cell’s natural state. When performed genome-wide, this technique is usually called Hi-C. Inspired by service applications such as Google Maps, we developed NuChart, an R package that integrates Hi-C data to describe the chromosomal neighbourhood starting from information about gene positions, with the possibility of mapping genomic features such as methylation patterns and histone modifications, along with expression profiles, onto the resulting graphs. In this paper we show the importance of the NuChart application for the integration of multi-omic data in a systems biology fashion, with particular interest in cytogenetic applications of these techniques. Moreover, we demonstrate how the integration of multi-omic data can provide useful information for understanding why genes occupy specific positions inside the nucleus and how epigenetic patterns correlate with their expression.

@article{nuchart:frontiers:15,
abstract = {The representation, integration and interpretation of omic data is a complex task, in particular considering the huge amount of information that is daily produced in molecular biology laboratories all around the world. The reason is that sequencing data regarding expression profiles, methylation patterns, and chromatin domains is difficult to harmonize in a systems biology view, since genome browsers only allow coordinate-based representations, discarding functional clusters created by the spatial conformation of the DNA in the nucleus. In this context, recent progresses in high throughput molecular biology techniques and bioinformatics have provided insights into chromatin interactions on a larger scale and offer a formidable support for the interpretation of multi-omic data. In particular, a novel sequencing technique called Chromosome Conformation Capture (3C) allows the analysis of the chromosome organization in the cell's natural state. While performed genome wide, this technique is usually called Hi-C. Inspired by service applications such as Google Maps, we developed NuChart, an R package that integrates Hi-C data to describe the chromosomal neighbourhood starting from the information about gene positions, with the possibility of mapping on the achieved graphs genomic features such as methylation patterns and histone modifications, along with expression profiles. In this paper we show the importance of the NuChart application for the integration of multi-omic data in a systems biology fashion, with particular interest in cytogenetic applications of these techniques. Moreover, we demonstrate how the integration of multi-omic data can provide useful information in understanding why genes are in certain specific positions inside the nucleus and how epigenetic patterns correlate with their expression.},
author = {Merelli, Ivan and Tordini, Fabio and Drocco, Maurizio and Aldinucci, Marco and Li{\`o}, Pietro and Milanesi, Luciano},
date-added = {2015-02-01 16:38:47 +0000},
date-modified = {2015-09-24 11:23:10 +0000},
doi = {10.3389/fgene.2015.00040},
issn = {1664-8021},
journal = {Frontiers in Genetics},
keywords = {bioinformatics, fastflow, interomics, hirma, mimomics},
number = {40},
title = {Integrating Multi-omic features exploiting {Chromosome Conformation Capture} data},
url = {http://journal.frontiersin.org/Journal/10.3389/fgene.2015.00040/pdf},
volume = {6},
year = {2015},
bdsk-url-1 = {http://journal.frontiersin.org/Journal/10.3389/fgene.2015.00040/pdf},
bdsk-url-2 = {http://dx.doi.org/10.3389/fgene.2015.00040}
}

• M. Aldinucci, M. Torquati, C. Spampinato, M. Drocco, C. Misale, C. Calcagno, and M. Coppo, “Parallel stochastic systems biology in the cloud,” Briefings in Bioinformatics, vol. 15, iss. 5, pp. 798-813, 2014. doi:10.1093/bib/bbt040
[BibTeX] [Abstract] [Download PDF]

The stochastic modelling of biological systems, coupled with Monte Carlo simulation of models, is an increasingly popular technique in bioinformatics. The simulation-analysis workflow may prove computationally expensive, reducing the interactivity required in model tuning. In this work, we advocate high-level software design as a vehicle for building efficient and portable parallel simulators for the cloud. In particular, the Calculus of Wrapped Compartments (CWC) simulator for systems biology, which is designed according to the FastFlow pattern-based approach, is presented and discussed. Thanks to the FastFlow framework, the CWC simulator is designed as a high-level workflow that can simulate CWC models, merge simulation results and statistically analyse them in a single parallel workflow in the cloud. To improve interactivity, successive phases are pipelined in such a way that the workflow begins to output a stream of analysis results immediately after the simulation is started. The performance and effectiveness of the CWC simulator are validated on the Amazon Elastic Compute Cloud.

@article{cwc:cloud:bib:13,
abstract = {The stochastic modelling of biological systems, coupled with Monte Carlo simulation of models, is an increasingly popular technique in bioinformatics. The simulation-analysis workflow may result computationally expensive reducing the interactivity required in the model tuning. In this work, we advocate the high-level software design as a vehicle for building efficient and portable parallel simulators for the cloud. In particular, the Calculus of Wrapped Components (CWC) simulator for systems biology, which is designed according to the FastFlow pattern-based approach, is presented and discussed. Thanks to the FastFlow framework, the CWC simulator is designed as a high-level workflow that can simulate CWC models, merge simulation results and statistically analyse them in a single parallel workflow in the cloud. To improve interactivity, successive phases are pipelined in such a way that the workflow begins to output a stream of analysis results immediately after simulation is started. Performance and effectiveness of the CWC simulator are validated on the Amazon Elastic Compute Cloud.},
author = {Marco Aldinucci and Massimo Torquati and Concetto Spampinato and Maurizio Drocco and Claudia Misale and Cristina Calcagno and Mario Coppo},
date-added = {2014-12-21 17:49:54 +0000},
date-modified = {2015-09-27 12:33:52 +0000},
doi = {10.1093/bib/bbt040},
issn = {1467-5463},
journal = {Briefings in Bioinformatics},
keywords = {fastflow, bioinformatics, cloud, paraphrase, impact, biobits},
number = {5},
pages = {798-813},
title = {Parallel stochastic systems biology in the cloud},
url = {http://calvados.di.unipi.it/storage/paper_files/2013_ff_bio_cloud_briefings.pdf},
volume = {15},
year = {2014},
bdsk-url-1 = {http://dx.doi.org/10.1093/bib/bbt040},
bdsk-url-2 = {http://calvados.di.unipi.it/storage/paper_files/2013_ff_bio_cloud_briefings.pdf}
}

• M. Aldinucci, S. Ruggieri, and M. Torquati, “Decision Tree Building on Multi-Core using FastFlow,” Concurrency and Computation: Practice and Experience, vol. 26, iss. 3, pp. 800-820, 2014. doi:10.1002/cpe.3063
[BibTeX] [Abstract] [Download PDF]

The whole computer hardware industry has embraced multi-core. The extreme optimisation of sequential algorithms is therefore no longer sufficient to extract the real machine power, which can only be exploited via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable for parallelisation. This paper presents an in-depth study of the parallelisation of an implementation of the C4.5 algorithm for multi-core architectures. We characterise elapsed-time lower bounds for the forms of parallelisation adopted, and achieve close to optimal performance. Our implementation is based on the FastFlow parallel programming environment and requires minimal changes to the original sequential code.

@article{yadtff:ccpe:13,
abstract = {The whole computer hardware industry embraced multi-core. The extreme optimisation of sequential algorithms is then no longer sufficient to squeeze the real machine power, which can be only exploited via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable to be parallelised. This paper presents an in-depth study of the parallelisation of an implementation of the C4.5 algorithm for multi-core architectures. We characterise elapsed time lower bounds for the forms of parallelisations adopted, and achieve close to optimal performances. Our implementation is based on the FastFlow parallel programming environment and it requires minimal changes to the original sequential code.},
author = {Marco Aldinucci and Salvatore Ruggieri and Massimo Torquati},
date-added = {2014-12-21 17:46:33 +0000},
date-modified = {2015-09-27 12:17:52 +0000},
doi = {10.1002/cpe.3063},
journal = {Concurrency and Computation: Practice and Experience},
keywords = {fastflow, paraphrase},
number = {3},
pages = {800-820},
title = {Decision Tree Building on Multi-Core using {FastFlow}},
url = {http://calvados.di.unipi.it/storage/paper_files/2013_yadtff_ccpe.pdf},
volume = {26},
year = {2014},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2013_yadtff_ccpe.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1002/cpe.3063}
}

• C. Misale, G. Ferrero, M. Torquati, and M. Aldinucci, “Sequence alignment tools: one parallel pattern to rule them all?,” BioMed Research International, 2014. doi:10.1155/2014/539410
[BibTeX] [Abstract] [Download PDF]

In this paper we advocate a high-level programming methodology for Next Generation Sequencing (NGS) alignment tools, for both productivity and absolute performance. We analyse the problem of parallel alignment and review the parallelisation strategies of the most popular alignment tools, which can all be abstracted to a single parallel paradigm. We compare these tools against their porting onto the FastFlow pattern-based programming framework, which provides programmers with high-level parallel patterns. By using a high-level approach, programmers are liberated from all complex aspects of parallel programming, such as synchronisation protocols and task scheduling, gaining more opportunity for seamless performance tuning. In this work we show some use cases in which, by using a high-level approach for parallelising NGS tools, it is possible to obtain comparable or even better absolute performance for all the datasets used.

@article{bowtie-bwa:ff:multicore:biomed:14,
abstract = {In this paper we advocate high-level programming methodology for Next Generation Sequencers (NGS) alignment tools for both productivity and absolute performance. We analyse the problem of parallel alignment and review the parallelisation strategies of the most popular alignment tools, which can all be abstracted to a single parallel paradigm. We compare these tools against their porting onto the FastFlow pattern-based programming framework, which provides programmers with high-level parallel patterns. By using a high-level approach, programmers are liberated from all complex aspects of parallel programming, such as synchronisation protocols and task scheduling, gaining more possibility for seamless performance tuning. In this work we show some use case in which, by using a high-level approach for parallelising NGS tools, it is possible to obtain comparable or even better absolute performance for all used datasets.
},
author = {Claudia Misale and Giulio Ferrero and Massimo Torquati and Marco Aldinucci},
date-added = {2013-01-15 15:55:59 +0000},
date-modified = {2015-09-27 12:16:28 +0000},
doi = {10.1155/2014/539410},
journal = {BioMed Research International},
keywords = {fastflow,bioinformatics, paraphrase, repara},
title = {Sequence alignment tools: one parallel pattern to rule them all?},
url = {http://downloads.hindawi.com/journals/bmri/2014/539410.pdf},
year = {2014},
bdsk-url-1 = {http://downloads.hindawi.com/journals/bmri/2014/539410.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1155/2014/539410}
}

• M. Aldinucci, S. Campa, M. Danelutto, P. Kilpatrick, and M. Torquati, “Design patterns percolating to parallel programming framework implementation,” International Journal of Parallel Programming, vol. 42, iss. 6, pp. 1012-1031, 2014. doi:10.1007/s10766-013-0273-6
[BibTeX] [Abstract] [Download PDF]

Structured parallel programming is recognised as a viable and effective means of tackling parallel programming problems. Recently, a set of simple and powerful parallel building blocks (RISC-pb2l) has been proposed to support modelling and implementation of parallel frameworks. In this work we demonstrate how that same parallel building block set may be used to model both general purpose parallel programming abstractions, not usually listed in classical skeleton sets, and more specialized domain specific parallel patterns. We show how an implementation of RISC-pb2l can be realised via the FastFlow framework and present experimental evidence of the feasibility and efficiency of the approach.

@article{ijpp:patterns:13,
abstract = {Structured parallel programming is recognised as a viable and effective means of tackling parallel programming problems. Recently, a set of simple and powerful parallel building blocks (RISC-pb2l) has been proposed to support modelling and implementation of parallel frameworks. In this work we demonstrate how that same parallel building block set may be used to model both general purpose parallel programming abstractions, not usually listed in classical skeleton sets, and more specialized domain specific parallel patterns. We show how an implementation of RISC-pb2l can be realised via the FastFlow framework and present experimental evidence of the feasibility and efficiency of the approach.},
author = {Marco Aldinucci and Sonia Campa and Marco Danelutto and Peter Kilpatrick and Massimo Torquati},
date-added = {2014-12-21 17:47:21 +0000},
date-modified = {2015-09-27 12:32:37 +0000},
doi = {10.1007/s10766-013-0273-6},
issn = {0885-7458},
journal = {International Journal of Parallel Programming},
keywords = {fastflow, paraphrase},
number = {6},
pages = {1012-1031},
title = {Design patterns percolating to parallel programming framework implementation},
url = {http://calvados.di.unipi.it/storage/paper_files/2013_ijpp_patterns-web.pdf},
volume = {42},
year = {2014},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2013_ijpp_patterns.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1007/s10766-013-0273-6},
bdsk-url-3 = {http://calvados.di.unipi.it/storage/paper_files/2013_ijpp_patterns-web.pdf}
}

• M. Aldinucci, C. Calcagno, M. Coppo, F. Damiani, M. Drocco, E. Sciacca, S. Spinella, M. Torquati, and A. Troina, “On designing multicore-aware simulators for systems biology endowed with on-line statistics,” BioMed Research International, 2014. doi:10.1155/2014/207041
[BibTeX] [Abstract] [Download PDF]

This paper focuses on enabling methodologies for the design of a fully parallel, online, interactive tool aimed at supporting bioinformatics scientists. In particular, the features of these methodologies, supported by the FastFlow parallel programming framework, are shown on a simulation tool that performs the modeling, tuning, and sensitivity analysis of stochastic biological models. A stochastic simulation needs thousands of independent simulation trajectories, which turn into big data that should be analysed by statistics and data mining tools. In the considered approach the two stages are pipelined in such a way that the simulation stage streams out the partial results of all simulation trajectories to the analysis stage, which immediately produces a partial result. The simulation-analysis workflow is validated for the performance and effectiveness of the online analysis in capturing biological systems behavior, on a multicore platform with representative proof-of-concept biological systems. The exploited methodologies include pattern-based parallel programming and data streaming, which provide the software designers with key features such as performance portability and efficient in-memory (big) data management and movement. Two paradigmatic classes of biological systems exhibiting multistable and oscillatory behavior are used as a testbed.

@article{cwcsim:ff:multicore:biomed:14,
abstract = {The paper arguments are on enabling methodologies for the design of a fully parallel, online, interactive tool aiming to support the bioinformatics scientists. In particular, the features of these methodologies, supported by the FastFlow parallel programming framework, are shown on a simulation tool to perform the modeling, the tuning, and the sensitivity analysis of stochastic biological models. A stochastic simulation needs thousands of independent simulation trajectories turning into big data that should be analysed by statistic and data mining tools. In the considered approach the two stages are pipelined in such a way that the simulation stage streams out the partial results of all simulation trajectories to the analysis stage that immediately produces a partial result. The simulation-analysis workflow is validated for performance and effectiveness of the online analysis in capturing biological systems behavior on a multicore platform and representative proof-of-concept biological systems. The exploited methodologies include pattern-based parallel programming and data streaming that provide key features to the software designers such as performance portability and efficient in-memory (big) data management and movement. Two paradigmatic classes of biological systems exhibiting multistable and oscillatory behavior are used as a testbed.},
author = {Marco Aldinucci and Cristina Calcagno and Mario Coppo and Ferruccio Damiani and Maurizio Drocco and Eva Sciacca and Salvatore Spinella and Massimo Torquati and Angelo Troina},
date-added = {2014-06-26 21:30:32 +0000},
date-modified = {2015-09-27 12:17:05 +0000},
doi = {10.1155/2014/207041},
journal = {BioMed Research International},
keywords = {fastflow, bioinformatics, paraphrase, biobits},
title = {On designing multicore-aware simulators for systems biology endowed with on-line statistics},
url = {http://downloads.hindawi.com/journals/bmri/2014/207041.pdf},
year = {2014},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2014_ff_cwc_bmri.pdf},
bdsk-url-2 = {http://downloads.hindawi.com/journals/bmri/2014/207041.pdf},
bdsk-url-3 = {http://dx.doi.org/10.1155/2014/207041}
}

• M. Aldinucci, M. Danelutto, P. Kilpatrick, and M. Torquati, “Targeting heterogeneous architectures via macro data flow,” Parallel Processing Letters, vol. 22, iss. 2, 2012. doi:10.1142/S0129626412400063
[BibTeX] [Abstract] [Download PDF]

We propose a data flow based run time system as an efficient tool for supporting execution of parallel code on heterogeneous architectures hosting both multicore CPUs and GPUs. We discuss how the proposed run time system may be the target of both structured parallel applications developed using algorithmic skeletons/parallel design patterns and also more “domain specific” programming models. Experimental results demonstrating the feasibility of the approach are presented.

@article{mdf:hplgpu:ppl:12,
abstract = {We propose a data flow based run time system as an efficient tool for supporting execution of parallel code on heterogeneous architectures hosting both multicore CPUs and GPUs. We discuss how the proposed run time system may be the target of both structured parallel applications developed using algorithmic skeletons/parallel design patterns and also more ``domain specific'' programming models. Experimental results demonstrating the feasibility of the approach are presented.},
annote = {Extended version of Intl. Workshop on High-level Programming for Heterogeneous and Hierarchical Parallel Systems (HLPGPU)},
author = {Marco Aldinucci and Marco Danelutto and Peter Kilpatrick and Massimo Torquati},
date-added = {2012-04-25 13:20:40 +0000},
date-modified = {2015-09-27 12:55:11 +0000},
doi = {10.1142/S0129626412400063},
issn = {0129-6264},
journal = {Parallel Processing Letters},
keywords = {fastflow, paraphrase},
month = jun,
number = {2},
title = {Targeting heterogeneous architectures via macro data flow},
url = {http://calvados.di.unipi.it/storage/paper_files/2012_mdf_PPL-hplgpu.pdf},
volume = {22},
year = {2012},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2012_mdf_PPL-hplgpu.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1142/S0129626412400063}
}

• M. Aldinucci, A. Bracciali, and P. Liò, “Formal Synthetic Immunology,” ERCIM News, vol. 82, pp. 40-41, 2010.
[BibTeX] [Abstract] [Download PDF]

The human immune system fights pathogens using an articulated set of strategies whose function is to maintain in health the organism. A large effort to formally model such a complex system using a computational approach is currently underway, with the goal of developing a discipline for engineering "synthetic" immune responses. This requires the integration of a range of analysis techniques developed for formally reasoning about the behaviour of complex dynamical systems. Furthermore, a novel class of software tools has to be developed, capable of efficiently analysing these systems on widely accessible computing platforms, such as commodity multi-core architectures.

@article{stochkitff:ercimnews:10,
abstract = {The human immune system fights pathogens using an articulated set of strategies whose function is to maintain in health the organism. A large effort to formally model such a complex system using a computational approach is currently underway, with the goal of developing a discipline for engineering "synthetic" immune responses. This requires the integration of a range of analysis techniques developed for formally reasoning about the behaviour of complex dynamical systems. Furthermore, a novel class of software tools has to be developed, capable of efficiently analysing these systems on widely accessible computing platforms, such as commodity multi-core architectures.},
author = {Marco Aldinucci and Andrea Bracciali and Pietro Li{\`o}},
date-added = {2010-07-02 20:32:31 +0200},
date-modified = {2013-11-24 00:38:19 +0000},
issn = {0926-4981},
journal = {ERCIM News},
keywords = {bioinformatics, fastflow},
month = jul,
pages = {40-41},
title = {Formal Synthetic Immunology},
url = {http://ercim-news.ercim.eu/images/stories/EN82/EN82-web.pdf},
volume = {82},
year = {2010},
bdsk-url-1 = {http://ercim-news.ercim.eu/images/stories/EN82/EN82-web.pdf}
}

### Books

• M. Aldinucci, M. Danelutto, M. Meneghin, M. Torquati, and P. Kilpatrick, Efficient streaming applications on multi-core with FastFlow: The biosequence alignment test-bed, Elsevier, 2010, vol. 19. doi:10.3233/978-1-60750-530-3-273
[BibTeX] [Abstract] [Download PDF]

Shared-memory multi-core architectures are becoming increasingly popular. While their parallelism and peak performance is ever increasing, their efficiency is often disappointing due to memory fence overheads. In this paper we present FastFlow, a programming methodology based on lock-free queues explicitly designed for programming streaming applications on multi-cores. The potential of FastFlow is evaluated on micro-benchmarks and on the Smith-Waterman sequence alignment application, which exhibits a substantial speedup against the state-of-the-art multi-threaded implementation (SWPS3 x86/SSE2).

@book{fastflow:parco:09,
abstract = {Shared-memory multi-core architectures are becoming increasingly popular. While their parallelism and peak performance is ever increasing, their efficiency is often disappointing due to memory fence overheads. In this paper we present FastFlow, a programming methodology based on lock-free queues explicitly designed for programming streaming applications on multi-cores. The potential of FastFlow is evaluated on micro-benchmarks and on the Smith-Waterman sequence alignment application, which exhibits a substantial speedup against the state-of-the-art multi-threaded implementation (SWPS3 x86/SSE2).},
author = {Aldinucci, M. and Danelutto, M. and Meneghin, M. and Torquati, M. and Kilpatrick, P.},
doi = {10.3233/978-1-60750-530-3-273},
keywords = {fastflow},
language = {English},
opteditor = {Barbara Chapman and Fr{\'e}d{\'e}ric Desprez and Gerhard R. Joubert and Alain Lichnewsky and Frans Peters and Thierry Priol},
pages = {273-280},
publisher = {Elsevier},
series = {Advances in Parallel Computing},
title = {Efficient streaming applications on multi-core with FastFlow: The biosequence alignment test-bed},
url = {http://calvados.di.unipi.it/storage/paper_files/2009_fastflow_parco.pdf},
volume = {19},
year = {2010},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2009_fastflow_parco.pdf},
bdsk-url-2 = {http://dx.doi.org/10.3233/978-1-60750-530-3-273}
}

### In Collections

• M. Aldinucci, M. Danelutto, P. Kilpatrick, and M. Torquati, “FastFlow: high-level and efficient streaming on multi-core,” in Programming Multi-core and Many-core Computing Systems, S. Pllana and F. Xhafa, Eds., Wiley, 2017.
[BibTeX] [Abstract] [Download PDF]

A FastFlow short tutorial

@incollection{ff:wileybook:14,
abstract = {A FastFlow short tutorial},
annote = {ISBN: 0470936908},
author = {Marco Aldinucci and Marco Danelutto and Peter Kilpatrick and Massimo Torquati},
booktitle = {Programming Multi-core and Many-core Computing Systems},
chapter = {13},
date-added = {2011-06-18 18:28:00 +0200},
date-modified = {2014-12-31 14:14:28 +0000},
editor = {Sabri Pllana and Fatos Xhafa},
keywords = {fastflow},
publisher = {Wiley},
series = {Parallel and Distributed Computing},
title = {FastFlow: high-level and efficient streaming on multi-core},
url = {http://calvados.di.unipi.it/storage/paper_files/2011_FF_tutorial-draft.pdf},
year = {2017},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2011_FF_tutorial-draft.pdf}
}

• F. Tordini, I. Merelli, P. Liò, L. Milanesi, and M. Aldinucci, “NuchaRt: embedding high-level parallel computing in R for augmented Hi-C data analysis,” in Computational Intelligence Methods for Bioinformatics and Biostatistics, Cham (ZG): Springer International Publishing, 2016, vol. 9874, pp. 259-272. doi:10.1007/978-3-319-44332-4
[BibTeX] [Abstract] [Download PDF]

Recent advances in molecular biology and Bioinformatics techniques brought to an explosion of the information about the spatial organisation of the DNA in the nucleus. High-throughput chromosome conformation capture techniques provide a genome-wide capture of chromatin contacts at unprecedented scales, which permit to identify physical interactions between genetic elements located throughout the human genome. These important studies are hampered by the lack of biologists-friendly software. In this work we present NuchaRt, an R package that wraps NuChart-II, an efficient and highly optimized C++ tool for the exploration of Hi-C data. By raising the level of abstraction, NuchaRt proposes a high-performance pipeline that allows users to orchestrate analysis and visualisation of multi-omics data, making optimal use of the computing capabilities offered by modern multi-core architectures, combined with the versatile and well known R environment for statistical analysis and data visualisation.

@incollection{15:lnbi:nuchaRt,
abstract = {Recent advances in molecular biology and Bioinformatics techniques brought to an explosion of the information about the spatial organisation of the DNA in the nucleus. High-throughput chromosome conformation capture techniques provide a genome-wide capture of chromatin contacts at unprecedented scales, which permit to identify physical interactions between genetic elements located throughout the human genome. These important studies are hampered by the lack of biologists-friendly software. In this work we present NuchaRt, an R package that wraps NuChart-II, an efficient and highly optimized C++ tool for the exploration of Hi-C data. By raising the level of abstraction, NuchaRt proposes a high-performance pipeline that allows users to orchestrate analysis and visualisation of multi-omics data, making optimal use of the computing capabilities offered by modern multi-core architectures, combined with the versatile and well known R environment for statistical analysis and data visualisation.},
address = {Cham (ZG)},
author = {Fabio Tordini and Ivan Merelli and Pietro Li{\o} and Luciano Milanesi and Marco Aldinucci},
booktitle = {Computational Intelligence Methods for Bioinformatics and Biostatistics},
doi = {10.1007/978-3-319-44332-4},
isbn = {978-3-319-44331-7},
keywords = {fastflow, bioinformatics, repara, interomics, mimomics},
pages = {259--272},
publisher = {Springer International Publishing},
series = {{Lecture Notes in Computer Science}},
title = {{NuchaRt}: embedding high-level parallel computing in {R} for augmented {Hi-C} data analysis},
url = {http://link.springer.com/book/10.1007%2F978-3-319-44332-4},
volume = {9874},
year = {2016},
bdsk-url-1 = {http://link.springer.com/book/10.1007%2F978-3-319-44332-4},
bdsk-url-2 = {http://dx.doi.org/10.1007/978-3-319-44332-4}
}

• M. Danelutto and M. Torquati, “Structured Parallel Programming with “core” FastFlow,” in Central European Functional Programming School, V. Zsók, Z. Horváth, and L. Csató, Eds., Springer, 2015, vol. 8606, pp. 29-75. doi:10.1007/978-3-319-15940-9_2
[BibTeX] [Abstract] [Download PDF]

FastFlow is an open source, structured parallel programming framework originally conceived to support highly efficient stream parallel computation while targeting shared memory multi cores. Its efficiency mainly comes from the optimized implementation of the base communication mechanisms and from its layered design. FastFlow eventually provides the parallel applications programmers with a set of ready-to-use, parametric algorithmic skeletons modeling the most common parallelism exploitation patterns. The algorithmic skeleton provided by FastFlow may be freely nested to model more and more complex parallelism exploitation patterns. This tutorial describes the “core” FastFlow, that is the set of skeletons supported since version 1.0 in FastFlow, and outlines the recent advances aimed at (i) introducing new, higher level skeletons and (ii) targeting networked multi cores, possibly equipped with GPUs, in addition to single multi/many core processing elements.

@incollection{tutorial:ff:15,
abstract = {FastFlow is an open source, structured parallel programming framework originally conceived to support highly efficient stream parallel computation while targeting shared memory multi cores. Its efficiency mainly comes from the optimized implementation of the base communication mechanisms and from its layered design. FastFlow eventually provides the parallel applications programmers with a set of ready-to-use, parametric algorithmic skeletons modeling the most common parallelism exploitation patterns. The algorithmic skeleton provided by FastFlow may be freely nested to model more and more complex parallelism exploitation patterns. This tutorial describes the ``core'' FastFlow, that is the set of skeletons supported since version 1.0 in FastFlow, and outlines the recent advances aimed at (i) introducing new, higher level skeletons and (ii) targeting networked multi cores, possibly equipped with GPUs, in addition to single multi/many core processing elements.},
author = {Danelutto, Marco and Torquati, Massimo},
booktitle = {Central European Functional Programming School},
date-added = {2015-05-07 14:30:40 +0000},
date-modified = {2015-09-27 12:12:49 +0000},
doi = {10.1007/978-3-319-15940-9_2},
editor = {Zs{\'o}k, Vikt{\'o}ria and Horv{\'a}th, Zolt{\'a}n and Csat{\'o}, Lehel},
isbn = {978-3-319-15939-3},
keywords = {fastflow, paraphrase},
pages = {29-75},
publisher = {Springer},
series = {LNCS},
title = {Structured Parallel Programming with ``core'' FastFlow},
url = {http://dx.doi.org/10.1007/978-3-319-15940-9_2},
volume = {8606},
year = {2015},
bdsk-url-1 = {http://dx.doi.org/10.1007/978-3-319-15940-9_2}
}

• M. Aldinucci, S. Campa, F. Tordini, M. Torquati, and P. Kilpatrick, “An abstract annotation model for skeletons,” in Formal Methods for Components and Objects: Intl. Symposium, FMCO 2011, Torino, Italy, October 3-5, 2011, Revised Invited Lectures, B. Beckert, F. Damiani, F. S. de Boer, and M. M. Bonsangue, Eds., Springer, 2013, vol. 7542, pp. 257-276. doi:10.1007/978-3-642-35887-6_14
[BibTeX] [Abstract] [Download PDF]

Multi-core and many-core platforms are becoming increasingly heterogeneous and asymmetric. This significantly increases the porting and tuning effort required for parallel codes, which in turn often leads to a growing gap between peak machine power and actual application performance. In this work a first step toward the automated optimization of high level skeleton-based parallel code is discussed. The paper presents an abstract annotation model for skeleton programs aimed at formally describing suitable mapping of parallel activities on a high-level platform representation. The derived mapping and scheduling strategies are used to generate optimized run-time code.

@incollection{toolchain:fmco:11,
abstract = {Multi-core and many-core platforms are becoming increasingly heterogeneous and asymmetric. This significantly increases the porting and tuning effort required for parallel codes, which in turn often leads to a growing gap between peak machine power and actual application performance. In this work a first step toward the automated optimization of high level skeleton-based parallel code is discussed. The paper presents an abstract annotation model for skeleton programs aimed at formally describing suitable mapping of parallel activities on a high-level platform representation. The derived mapping and scheduling strategies are used to generate optimized run-time code.},
author = {Marco Aldinucci and Sonia Campa and Fabio Tordini and Massimo Torquati and Peter Kilpatrick},
booktitle = {Formal Methods for Components and Objects: Intl. Symposium, FMCO 2011, Torino, Italy, October 3-5, 2011, Revised Invited Lectures},
date-added = {2012-06-04 19:23:25 +0200},
date-modified = {2013-11-24 00:33:41 +0000},
doi = {10.1007/978-3-642-35887-6_14},
editor = {Bernhard Beckert and Ferruccio Damiani and Frank S. de Boer and Marcello M. Bonsangue},
isbn = {978-3-642-35886-9},
keywords = {fastflow, paraphrase},
pages = {257-276},
publisher = {Springer},
series = {LNCS},
title = {An abstract annotation model for skeletons},
url = {http://calvados.di.unipi.it/storage/paper_files/2013_fmco11_annotation.pdf},
volume = {7542},
year = {2013},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2013_fmco11_annotation.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1007/978-3-642-35887-6_14}
}

• M. Aldinucci, “Efficient Parallel MonteCarlo with FastFlow,” in HPC-Europa2: Science and Supercomputing in Europe, research highlights 2010, Cineca, 2010.
[BibTeX] [Abstract] [Download PDF]

The stochastic simulation of natural systems is very informative but happens to be computationally expensive. We present StochKit-FF, a parallel version of StochKit, a reference toolkit for stochastic simulations, that substantially improves StochKit performance on multi-core platforms.

@incollection{ff:hpc-europa:10,
abstract = {The stochastic simulation of natural systems is very informative but happens to be computationally expensive. We present StochKit-FF, a parallel version of StochKit, a reference toolkit for stochastic simulations, that substantially improves StochKit performance on multi-core platforms.},
author = {Marco Aldinucci},
booktitle = {HPC-Europa2: Science and Supercomputing in Europe, research highlights 2010},
date-added = {2011-06-18 18:43:19 +0200},
date-modified = {2013-11-24 00:40:04 +0000},
keywords = {bioinformatics, fastflow},
publisher = {Cineca},
title = {Efficient Parallel {MonteCarlo} with {FastFlow}},
url = {http://calvados.di.unipi.it/storage/paper_files/2010-ff_hpceuropa2_092-inform-Aldinucci.pdf},
year = {2010},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2010-ff_hpceuropa2_092-inform-Aldinucci.pdf}
}

### In Proceedings

• M. Aldinucci, M. Danelutto, D. D. Sensi, G. Mencagli, and M. Torquati, “Towards Power-Aware Data Pipelining on Multicores,” in Proceedings of the 10th International Symposium on High-Level Parallel Programming and Applications, Valladolid, Spain, 2017.
[BibTeX] [Abstract] [Download PDF]

Power consumption management has become a major concern in software development. Continuous streaming computations are usually composed by different modules, exchanging data through shared message queues. The selection of the algorithm used to access such queues (i.e., the concurrency control) is a critical aspect for both performance and power consumption. In this paper, we describe the design of an adaptive concurrency control algorithm for implementing power-efficient communications on shared memory multicores. The algorithm provides the throughput offered by a nonblocking implementation and the power efficiency of a blocking protocol. We demonstrate that our algorithm reduces the power consumption of data streaming computations without decreasing their throughput.

@inproceedings{17:hlpp:powerstream,
abstract = {Power consumption management has become a major concern in software development. Continuous streaming computations are usually composed by different modules, exchanging data through shared message queues. The selection of the algorithm used to access such queues (i.e., the concurrency control) is a critical aspect for both performance and power consumption. In this paper, we describe the design of an adaptive concurrency control algorithm for implementing power-efficient communications on shared memory multicores. The algorithm provides the throughput offered by a nonblocking implementation and the power efficiency of a blocking protocol. We demonstrate that our algorithm reduces the power consumption of data streaming computations without decreasing their throughput.},
address = {Valladolid, Spain},
author = {Marco Aldinucci and Marco Danelutto and Daniele De Sensi and Gabriele Mencagli and Massimo Torquati},
booktitle = {Proceedings of the 10th International Symposium on High-Level Parallel Programming and Applications},
date-added = {2017-07-13 09:02:32 +0000},
date-modified = {2017-07-13 09:05:21 +0000},
keywords = {rephrase, fastflow},
title = {Towards Power-Aware Data Pipelining on Multicores},
url = {https://iris.unito.it/retrieve/handle/2318/1644982/351415/17_HLPP_powerstream.pdf},
year = {2017},
bdsk-url-1 = {https://iris.unito.it/retrieve/handle/2318/1644982/351415/17_HLPP_powerstream.pdf}
}

• M. F. Dolz, D. del Rio Astorga, J. Fernández, J. D. García, F. García-Carballeira, M. Danelutto, and M. Torquati, “Embedding Semantics of the Single-Producer/Single-Consumer Lock-Free Queue into a Race Detection Tool,” in Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores, New York, NY, USA, 2016, pp. 20-29. doi:10.1145/2883404.2883406
[BibTeX] [Download PDF]
@inproceedings{16:PMAM:SPSC,
acmid = {2883406},
address = {New York, NY, USA},
author = {Dolz, Manuel F. and del Rio Astorga, David and Fern\'{a}ndez, Javier and Garc\'{\i}a, J. Daniel and Garc\'{\i}a-Carballeira, F{\'e}lix and Danelutto, Marco and Torquati, Massimo},
booktitle = {Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores},
date-modified = {2016-04-21 17:33:36 +0000},
doi = {10.1145/2883404.2883406},
isbn = {978-1-4503-4196-7},
keywords = {fastflow, repara},
location = {Barcelona, Spain},
numpages = {10},
pages = {20--29},
publisher = {ACM},
series = {PMAM'16},
title = {Embedding Semantics of the Single-Producer/Single-Consumer Lock-Free Queue into a Race Detection Tool},
url = {http://doi.acm.org/10.1145/2883404.2883406},
year = {2016},
bdsk-url-1 = {http://doi.acm.org/10.1145/2883404.2883406},
bdsk-url-2 = {http://dx.doi.org/10.1145/2883404.2883406}
}

• V. Janjic, C. Brown, K. MacKenzie, K. Hammond, M. Danelutto, M. Aldinucci, and J. D. Garcia, “RPL: A Domain-Specific Language for Designing and Implementing Parallel C++ Applications,” in Proc. of Intl. Euromicro PDP 2016: Parallel Distributed and network-based Processing, Crete, Greece, 2016. doi:10.1109/PDP.2016.122
[BibTeX] [Abstract] [Download PDF]

Parallelising sequential applications is usually a very hard job, due to many different ways in which an application can be parallelised and a large number of programming models (each with its own advantages and disadvantages) that can be used. In this paper, we describe a method to semi- automatically generate and evaluate different parallelisations of the same application, allowing programmers to find the best parallelisation without significant manual reengineering of the code. We describe a novel, high-level domain-specific language, Refactoring Pattern Language (RPL), that is used to represent the parallel structure of an application and to capture its extra-functional properties (such as service time). We then describe a set of RPL rewrite rules that can be used to generate alternative, but semantically equivalent, parallel structures (parallelisations) of the same application. We also describe the RPL Shell that can be used to evaluate these parallelisations, in terms of the desired extra-functional properties. Finally, we describe a set of C++ refactorings, targeting OpenMP, Intel TBB and FastFlow parallel programming models, that semi-automatically apply the desired parallelisation to the application’s source code, therefore giving a parallel version of the code. We demonstrate how the RPL and the refactoring rules can be used to derive efficient parallelisations of two realistic C++ use cases (Image Convolution and Ant Colony Optimisation).

@inproceedings{rpl:pdp:16,
abstract = {Parallelising sequential applications is usually a very hard job, due to many different ways in which an application can be parallelised and a large number of programming models (each with its own advantages and disadvantages) that can be used. In this paper, we describe a method to semi- automatically generate and evaluate different parallelisations of the same application, allowing programmers to find the best parallelisation without significant manual reengineering of the code. We describe a novel, high-level domain-specific language, Refactoring Pattern Language (RPL), that is used to represent the parallel structure of an application and to capture its extra-functional properties (such as service time). We then describe a set of RPL rewrite rules that can be used to generate alternative, but semantically equivalent, parallel structures (parallelisations) of the same application. We also describe the RPL Shell that can be used to evaluate these parallelisations, in terms of the desired extra-functional properties. Finally, we describe a set of C++ refactorings, targeting OpenMP, Intel TBB and FastFlow parallel programming models, that semi-automatically apply the desired parallelisation to the application's source code, therefore giving a parallel version of the code. We demonstrate how the RPL and the refactoring rules can be used to derive efficient parallelisations of two realistic C++ use cases (Image Convolution and Ant Colony Optimisation).},
address = {Crete, Greece},
author = {Vladimir Janjic and Christopher Brown and Kenneth MacKenzie and Kevin Hammond and Marco Danelutto and Marco Aldinucci and Jose Daniel Garcia},
booktitle = {Proc. of Intl. Euromicro PDP 2016: Parallel Distributed and network-based Processing},
date-modified = {2017-06-20 08:19:39 +0000},
doi = {10.1109/PDP.2016.122},
keywords = {rephrase, fastflow},
publisher = {IEEE},
title = {{RPL}: A Domain-Specific Language for Designing and Implementing Parallel C++ Applications},
url = {https://iris.unito.it/retrieve/handle/2318/1597172/299237/2016_jsupe_stencil_pp_4aperto.pdf},
year = {2016},
bdsk-url-1 = {http://hdl.handle.net/2318/1597172},
bdsk-url-2 = {https://iris.unito.it/retrieve/handle/2318/1597172/299237/2016_jsupe_stencil_pp_4aperto.pdf},
bdsk-url-3 = {http://dx.doi.org/10.1109/PDP.2016.122}
}

• M. Drocco, C. Misale, and M. Aldinucci, “A Cluster-As-Accelerator approach for SPMD-free Data Parallelism,” in Proc. of Intl. Euromicro PDP 2016: Parallel Distributed and network-based Processing, Crete, Greece, 2016, pp. 350-353. doi:10.1109/PDP.2016.97
[BibTeX] [Abstract] [Download PDF]

In this paper we present a novel approach for functional-style programming of distributed-memory clusters, targeting data-centric applications. The programming model proposed is purely sequential, SPMD-free and based on high-level functional features introduced since C++11 specification. Additionally, we propose a novel cluster-as-accelerator design principle. In this scheme, cluster nodes act as general interpreters of user-defined functional tasks over node-local portions of distributed data structures. We envision coupling a simple yet powerful programming model with a lightweight, locality-aware distributed runtime as a promising step along the road towards high-performance data analytics, in particular under the perspective of the upcoming exascale era. We implemented the proposed approach in SkeDaTo, a prototyping C++ library of data-parallel skeletons exploiting cluster-as-accelerator at the bottom layer of the runtime software stack.

@inproceedings{skedato:pdp:16,
abstract = {In this paper we present a novel approach for functional-style programming of distributed-memory clusters, targeting data-centric applications. The programming model proposed is purely sequential, SPMD-free and based on high-level functional features introduced since C++11 specification. Additionally, we propose a novel cluster-as-accelerator design principle. In this scheme, cluster nodes act as general interpreters of user-defined functional tasks over node-local portions of distributed data structures. We envision coupling a simple yet powerful programming model with a lightweight, locality-aware distributed runtime as a promising step along the road towards high-performance data analytics, in particular under the perspective of the upcoming exascale era. We implemented the proposed approach in SkeDaTo, a prototyping C++ library of data-parallel skeletons exploiting cluster-as-accelerator at the bottom layer of the runtime software stack.},
address = {Crete, Greece},
author = {Maurizio Drocco and Claudia Misale and Marco Aldinucci},
booktitle = {Proc. of Intl. Euromicro PDP 2016: Parallel Distributed and network-based Processing},
date-modified = {2016-04-21 17:33:00 +0000},
doi = {10.1109/PDP.2016.97},
keywords = {rephrase, fastflow},
pages = {350--353},
publisher = {IEEE},
title = {A Cluster-As-Accelerator approach for {SPMD}-free Data Parallelism},
url = {http://calvados.di.unipi.it/storage/paper_files/2016_pdp_skedato.pdf},
year = {2016},
bdsk-url-1 = {http://hdl.handle.net/2318/1611858},
bdsk-url-2 = {http://calvados.di.unipi.it/storage/paper_files/2016_pdp_skedato.pdf},
bdsk-url-3 = {http://dx.doi.org/10.1109/PDP.2016.97}
}

• F. Tordini, “A cloud solution for multi-omics data integration,” in Proceedings of the 16th IEEE International Conference on Scalable Computing and Communication, 2016, pp. 559-566. doi:10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.131
[BibTeX] [Abstract] [Download PDF]

Recent advances in molecular biology and Bioinformatics techniques have brought to an explosion of the information about the spatial organisation of the DNA inside the nucleus. In particular, 3C-based techniques are revealing the genome folding for many different cell types, and permit to create a more effective representation of the disposition of genes in the three-dimensional space. This information can be used to re-interpret heterogeneous genomic data (multi-omic) relying on 3D maps of the chromosome. The storage and computational requirements needed to accomplish such operations on raw sequenced data have to be fulfilled using HPC solutions, and the Cloud paradigm is a valuable and convenient means for delivering HPC to Bioinformatics. In this work we describe a data analysis work-flow that allows the integration and the interpretation of multi-omic data on a sort of “topographical” nuclear map, capable of representing the effective disposition of genes in a graph-based representation. We propose a cloud-based task farm pattern to orchestrate the services needed to accomplish genomic data analysis, where each service represents a special-purpose tool, playing a part in well known data analysis pipelines.

@inproceedings{16:scalcom:cloud,
abstract = {Recent advances in molecular biology and Bioinformatics techniques have brought to an explosion of the information about the spatial organisation of the DNA inside the nucleus. In particular, 3C-based techniques are revealing the genome folding for many different cell types, and permit to create a more effective representation of the disposition of genes in the three-dimensional space. This information can be used to re-interpret heterogeneous genomic data (multi-omic) relying on 3D maps of the chromosome. The storage and computational requirements needed to accomplish such operations on raw sequenced data have to be fulfilled using HPC solutions, and the Cloud paradigm is a valuable and convenient means for delivering HPC to Bioinformatics. In this work we describe a data analysis work-flow that allows the integration and the interpretation of multi-omic data on a sort of ``topographical'' nuclear map, capable of representing the effective disposition of genes in a graph-based representation. We propose a cloud-based task farm pattern to orchestrate the services needed to accomplish genomic data analysis, where each service represents a special-purpose tool, playing a part in well known data analysis pipelines.},
author = {Fabio Tordini},
booktitle = {Proceedings of the 16th IEEE International Conference on Scalable Computing and Communication},
date-modified = {2016-08-30 10:26:12 +0000},
doi = {10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.131},
keywords = {fastflow, bioinformatics, rephrase},
note = {Best paper award},
pages = {559--566},
publisher = {IEEE Computer Society},
title = {{A cloud solution for multi-omics data integration}},
url = {http://calvados.di.unipi.it/storage/paper_files/2016_cloudpipeline_scalcom.pdf},
year = {2016},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2016_cloudpipeline_scalcom.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.131}
}
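The task farm pattern named in this abstract can be sketched, in miniature, as a pool of workers pulling independent tasks from a shared queue. The following is a hedged illustration in plain C++ threads, not the paper's cloud deployment nor the FastFlow API; the `TaskFarm` class and its methods are names introduced here purely for illustration.

```cpp
// Minimal task farm sketch: an emitter submits independent tasks to a
// shared queue; worker threads ("services") pull and execute them.
#include <thread>
#include <mutex>
#include <queue>
#include <vector>
#include <atomic>
#include <functional>

class TaskFarm {
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
public:
    // Emitter side: enqueue one unit of work.
    void submit(std::function<void()> t) {
        std::lock_guard<std::mutex> lk(m_);
        tasks_.push(std::move(t));
    }
    // Worker side: spawn n_workers threads that drain the queue, then join.
    void run(unsigned n_workers) {
        auto worker = [this] {
            for (;;) {
                std::function<void()> t;
                {
                    std::lock_guard<std::mutex> lk(m_);
                    if (tasks_.empty()) return;   // no more work: worker exits
                    t = std::move(tasks_.front());
                    tasks_.pop();
                }
                t();  // execute one task outside the lock
            }
        };
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < n_workers; ++i) pool.emplace_back(worker);
        for (auto& th : pool) th.join();
    }
};
```

In the paper's setting each task would wrap a call to a special-purpose analysis service in a pipeline; here a task is just a `std::function<void()>`.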

• M. Aldinucci, M. Danelutto, M. Drocco, P. Kilpatrick, G. P. Pezzi, and M. Torquati, “The Loop-of-Stencil-Reduce paradigm,” in Proc. of Intl. Workshop on Reengineering for Parallelism in Heterogeneous Parallel Platforms (RePara), Helsinki, Finland, 2015, pp. 172-177. doi:10.1109/Trustcom.2015.628
[BibTeX] [Abstract] [Download PDF]

In this paper we advocate the Loop-of-Stencil-Reduce pattern as a way to simplify the parallel programming of heterogeneous platforms (multicore+GPUs). Loop-of-Stencil-Reduce is general enough to subsume map, reduce, map-reduce, stencil, stencil-reduce, and, crucially, their usage in a loop. It transparently targets (by using OpenCL) combinations of CPU cores and GPUs, and it makes it possible to simplify the deployment of a single stencil computation kernel on different GPUs. The paper discusses the implementation of Loop-of-Stencil-Reduce within the FastFlow parallel framework, considering a simple iterative data-parallel application as a running example (Game of Life) and a highly effective parallel filter for visual data restoration to assess performance. Thanks to the high-level design of Loop-of-Stencil-Reduce, it was possible to run the filter seamlessly on a multicore machine, on multiple GPUs, and on both.

@inproceedings{opencl:ff:ispa:15,
abstract = {In this paper we advocate the Loop-of-Stencil-Reduce pattern as a way to simplify the parallel programming of heterogeneous platforms (multicore+GPUs). Loop-of-Stencil-Reduce is general enough to subsume map, reduce, map-reduce, stencil, stencil-reduce, and, crucially, their usage in a loop. It transparently targets (by using OpenCL) combinations of CPU cores and GPUs, and it makes it possible to simplify the deployment of a single stencil computation kernel on different GPUs. The paper discusses the implementation of Loop-of-Stencil-Reduce within the FastFlow parallel framework, considering a simple iterative data-parallel application as a running example (Game of Life) and a highly effective parallel filter for visual data restoration to assess performance. Thanks to the high-level design of Loop-of-Stencil-Reduce, it was possible to run the filter seamlessly on a multicore machine, on multiple GPUs, and on both.},
address = {Helsinki, Finland},
author = {Marco Aldinucci and Marco Danelutto and Maurizio Drocco and Peter Kilpatrick and Guilherme {Peretti Pezzi} and Massimo Torquati},
booktitle = {Proc. of Intl. Workshop on Reengineering for Parallelism in Heterogeneous Parallel Platforms (RePara)},
date-added = {2015-07-05 09:48:33 +0000},
date-modified = {2015-09-24 11:14:56 +0000},
doi = {10.1109/Trustcom.2015.628},
keywords = {fastflow, repara, nvidia},
month = aug,
pages = {172-177},
publisher = {IEEE},
title = {The Loop-of-Stencil-Reduce paradigm},
url = {http://calvados.di.unipi.it/storage/paper_files/2015_RePara_ISPA.pdf},
year = {2015},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2015_RePara_ISPA.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1109/Trustcom.2015.628}
}
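The Loop-of-Stencil-Reduce pattern described in this abstract can be illustrated with a minimal sequential sketch: each iteration applies a stencil (map) phase over a grid, then a reduce phase computes a convergence flag that controls the loop. This plain C++ toy uses the paper's Game of Life running example; it is not the FastFlow/OpenCL implementation, and all names below are introduced here for illustration.

```cpp
// Loop-of-stencil-reduce sketch on Conway's Game of Life:
// loop { stencil over all cells; reduce to a "changed?" flag }.
#include <vector>
#include <cstddef>

using Grid = std::vector<std::vector<int>>;

// Stencil kernel: next state of one cell from its 3x3 neighbourhood.
// Out-of-range neighbours (unsigned wrap-around) are skipped as dead.
int life_step(const Grid& g, std::size_t r, std::size_t c) {
    int alive = 0;
    for (int dr = -1; dr <= 1; ++dr)
        for (int dc = -1; dc <= 1; ++dc) {
            if (dr == 0 && dc == 0) continue;
            std::size_t nr = r + dr, nc = c + dc;
            if (nr < g.size() && nc < g[0].size()) alive += g[nr][nc];
        }
    return (g[r][c] && (alive == 2 || alive == 3)) || (!g[r][c] && alive == 3);
}

// The loop-of-stencil-reduce structure: iterate until a fixed point
// (the reduction detects no change) or until max_iter iterations.
Grid loop_stencil_reduce(Grid g, int max_iter) {
    for (int it = 0; it < max_iter; ++it) {
        Grid next = g;
        for (std::size_t r = 0; r < g.size(); ++r)
            for (std::size_t c = 0; c < g[0].size(); ++c)
                next[r][c] = life_step(g, r, c);   // map/stencil phase
        bool changed = (next != g);                // reduce phase
        g = std::move(next);
        if (!changed) break;                       // loop condition
    }
    return g;
}
```

A real Loop-of-Stencil-Reduce instance would offload the stencil phase to OpenCL devices and fuse the reduction with it; the toy keeps only the loop structure the pattern names.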

• F. Tordini, M. Drocco, C. Misale, L. Milanesi, P. Liò, I. Merelli, and M. Aldinucci, “Parallel Exploration of the Nuclear Chromosome Conformation with NuChart-II,” in Proc. of Intl. Euromicro PDP 2015: Parallel Distributed and network-based Processing, 2015. doi:10.1109/PDP.2015.104
[BibTeX] [Abstract] [Download PDF]

High-throughput molecular biology techniques are widely used to identify physical interactions between genetic elements located throughout the human genome. Chromosome Conformation Capture (3C) and other related techniques allow investigation of the spatial organisation of chromosomes in the cell’s natural state. Recent results have shown that there is a large correlation between co-localization and co-regulation of genes, but this important information is hampered by the lack of biologist-friendly analysis and visualisation software. In this work we introduce NuChart-II, a tool for Hi-C data analysis that provides a gene-centric view of the chromosomal neighbourhood in a graph-based manner. NuChart-II is an efficient and highly optimized C++ re-implementation of a previous prototype package developed in R. Representing Hi-C data using a graph-based approach overcomes the common view relying on genomic coordinates and permits the use of graph analysis techniques to explore the spatial conformation of a gene neighbourhood.

@inproceedings{nuchar:tool:15,
abstract = {High-throughput molecular biology techniques are widely used to identify physical interactions between genetic elements located throughout the human genome. Chromosome Conformation Capture (3C) and other related techniques allow investigation of the spatial organisation of chromosomes in the cell's natural state. Recent results have shown that there is a large correlation between co-localization and co-regulation of genes, but this important information is hampered by the lack of biologist-friendly analysis and visualisation software. In this work we introduce NuChart-II, a tool for Hi-C data analysis that provides a gene-centric view of the chromosomal neighbourhood in a graph-based manner. NuChart-II is an efficient and highly optimized C++ re-implementation of a previous prototype package developed in R. Representing Hi-C data using a graph-based approach overcomes the common view relying on genomic coordinates and permits the use of graph analysis techniques to explore the spatial conformation of a gene neighbourhood.
},
author = {Fabio Tordini and Maurizio Drocco and Claudia Misale and Luciano Milanesi and Pietro Li{\o} and Ivan Merelli and Marco Aldinucci},
booktitle = {Proc. of Intl. Euromicro PDP 2015: Parallel Distributed and network-based Processing},
date-added = {2014-12-03 13:51:17 +0000},
date-modified = {2015-09-24 11:16:43 +0000},
doi = {10.1109/PDP.2015.104},
keywords = {fastflow, bioinformatics, paraphrase, repara, impact},
month = mar,
publisher = {IEEE},
title = {Parallel Exploration of the Nuclear Chromosome Conformation with {NuChart-II}},
url = {http://calvados.di.unipi.it/storage/paper_files/2015_pdp_nuchartff.pdf},
year = {2015},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2015_pdp_nuchartff.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1109/PDP.2015.104}
}

• M. Drocco, C. Misale, G. P. Pezzi, F. Tordini, and M. Aldinucci, “Memory-Optimised Parallel Processing of Hi-C Data,” in Proc. of Intl. Euromicro PDP 2015: Parallel Distributed and network-based Processing, 2015, pp. 1-8. doi:10.1109/PDP.2015.63
[BibTeX] [Abstract] [Download PDF]

This paper presents the optimisation efforts on the creation of a graph-based mapping representation of gene adjacency. The method is based on the Hi-C process, starting from Next Generation Sequencing data, and it analyses a huge amount of static data in order to produce maps for one or more genes. Straightforward parallelisation of this scheme does not yield acceptable performance on multicore architectures, since scalability is rather limited due to the memory-bound nature of the problem. This work focuses on the memory optimisations that can be applied to the graph construction algorithm and its (complex) data structures to derive a cache-oblivious algorithm and eventually to improve memory bandwidth utilisation. We used as a running example NuChart-II, a tool for annotation and statistical analysis of Hi-C data that creates a gene-centric neighbourhood graph. The proposed approach, which is exemplified for Hi-C, addresses several common issues in the parallelisation of memory-bound algorithms for multicore. Results show that the proposed approach is able to increase the parallel speedup from 7x to 22x (on a 32-core platform). Finally, the proposed C++ implementation outperforms the first R NuChart prototype, with which it was not possible to complete the graph generation because of severe memory-saturation problems.

@inproceedings{nuchart:speedup:15,
abstract = {This paper presents the optimisation efforts on the creation of a graph-based mapping representation of gene adjacency. The method is based on the Hi-C process, starting from Next Generation Sequencing data, and it analyses a huge amount of static data in order to produce maps for one or more genes. Straightforward parallelisation of this scheme does not yield acceptable performance on multicore architectures, since scalability is rather limited due to the memory-bound nature of the problem. This work focuses on the memory optimisations that can be applied to the graph construction algorithm and its (complex) data structures to derive a cache-oblivious algorithm and eventually to improve memory bandwidth utilisation. We used as a running example NuChart-II, a tool for annotation and statistical analysis of Hi-C data that creates a gene-centric neighbourhood graph. The proposed approach, which is exemplified for Hi-C, addresses several common issues in the parallelisation of memory-bound algorithms for multicore. Results show that the proposed approach is able to increase the parallel speedup from 7x to 22x (on a 32-core platform). Finally, the proposed C++ implementation outperforms the first R NuChart prototype, with which it was not possible to complete the graph generation because of severe memory-saturation problems.},
author = {Maurizio Drocco and Claudia Misale and Guilherme {Peretti Pezzi} and Fabio Tordini and Marco Aldinucci},
booktitle = {Proc. of Intl. Euromicro PDP 2015: Parallel Distributed and network-based Processing},
date-added = {2014-12-03 13:54:08 +0000},
date-modified = {2015-09-24 11:17:47 +0000},
doi = {10.1109/PDP.2015.63},
keywords = {fastflow,bioinformatics, paraphrase, repara, impact},
month = mar,
pages = {1-8},
publisher = {IEEE},
title = {Memory-Optimised Parallel Processing of {Hi-C} Data},
url = {http://calvados.di.unipi.it/storage/paper_files/2015_pdp_memopt.pdf},
year = {2015},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2015_pdp_memopt.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1109/PDP.2015.63}
}

• D. De Sensi, M. Danelutto, and M. Torquati, “Energy driven adaptivity in stream parallel computations,” in Proc. of Intl. Euromicro PDP 2015: Parallel Distributed and network-based Processing, Turku, Finland, 2015.
[BibTeX]
@inproceedings{ff:energy:pdp:15,
address = {Turku, Finland},
author = {Daniele De Sensi and Marco Danelutto and Massimo Torquati},
booktitle = {Proc. of Intl. Euromicro PDP 2015: Parallel Distributed and network-based Processing},
date-added = {2015-02-28 10:59:38 +0000},
date-modified = {2015-02-28 11:01:23 +0000},
keywords = {fastflow},
publisher = {IEEE},
title = {Energy driven adaptivity in stream parallel computations},
year = {2015},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2014_ff_infiniband_pdp.pdf}
}

• M. Aldinucci, A. Bracciali, T. Marschall, M. Patterson, N. Pisanti, and M. Torquati, “High-Performance Haplotype Assembly,” in Computational Intelligence Methods for Bioinformatics and Biostatistics – 11th International Meeting, CIBB 2014, Cambridge, UK, June 26-28, 2014, Revised Selected Papers, Cambridge, UK, 2015, pp. 245-258. doi:10.1007/978-3-319-24462-4_21
[BibTeX] [Abstract] [Download PDF]

The problem of Haplotype Assembly is an essential step in human genome analysis. It is typically formalised as the Minimum Error Correction (MEC) problem which is NP-hard. MEC has been approached using heuristics, integer linear programming, and fixed-parameter tractability (FPT), including approaches whose runtime is exponential in the length of the DNA fragments obtained by the sequencing process. Technological improvements are currently increasing fragment length, which drastically elevates computational costs for such methods. We present pWhatsHap, a multi-core parallelisation of WhatsHap, a recent FPT optimal approach to MEC. WhatsHap moves complexity from fragment length to fragment overlap and is hence of particular interest when considering sequencing technology’s current trends. pWhatsHap further improves the efficiency in solving the MEC problem, as shown by experiments performed on datasets with high coverage.

@inproceedings{14:ff:whatsapp:cibb,
abstract = {The problem of Haplotype Assembly is an essential step in human genome analysis. It is typically formalised as the Minimum Error Correction (MEC) problem which is NP-hard. MEC has been approached using heuristics, integer linear programming, and fixed-parameter tractability (FPT), including approaches whose runtime is exponential in the length of the DNA fragments obtained by the sequencing process. Technological improvements are currently increasing fragment length, which drastically elevates computational costs for such methods. We present pWhatsHap, a multi-core parallelisation of WhatsHap, a recent FPT optimal approach to MEC. WhatsHap moves complexity from fragment length to fragment overlap and is hence of particular interest when considering sequencing technology's current trends. pWhatsHap further improves the efficiency in solving the MEC problem, as shown by experiments performed on datasets with high coverage.},
address = {Cambridge, UK},
author = {Marco Aldinucci and Andrea Bracciali and Tobias Marschall and Murray Patterson and Nadia Pisanti and Massimo Torquati},
booktitle = {Computational Intelligence Methods for Bioinformatics and Biostatistics - 11th International Meeting, {CIBB} 2014, Cambridge, UK, June 26-28, 2014, Revised Selected Papers},
date-added = {2014-12-01 23:07:21 +0000},
date-modified = {2016-08-20 14:15:59 +0000},
doi = {10.1007/978-3-319-24462-4_21},
editor = {Clelia Di Serio and Pietro Li{\`o} and Alessandro Nonis and Roberto Tagliaferri},
keywords = {fastflow, bioinformatics},
pages = {245--258},
publisher = {Springer},
series = {{LNCS}},
title = {High-Performance Haplotype Assembly},
url = {http://calvados.di.unipi.it/storage/paper_files/2014_pHaplo_cibb.pdf},
volume = {8623},
year = {2015},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2014_pHaplo_cibb.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1007/978-3-319-24462-4_21}
}

• F. Tordini, M. Drocco, I. Merelli, L. Milanesi, P. Liò, and M. Aldinucci, “NuChart-II: a graph-based approach for the analysis and interpretation of Hi-C data,” in Computational Intelligence Methods for Bioinformatics and Biostatistics – 11th International Meeting, CIBB 2014, Cambridge, UK, June 26-28, 2014, Revised Selected Papers, Cambridge, UK, 2015, pp. 298-311. doi:10.1007/978-3-319-24462-4_25
[BibTeX] [Abstract] [Download PDF]

Long-range chromosomal associations between genomic regions, and their repositioning in the 3D space of the nucleus, are now considered to be key contributors to the regulation of gene expression, and important links have been highlighted with other genomic features involved in DNA rearrangements. Recent Chromosome Conformation Capture (3C) measurements performed with high-throughput sequencing (Hi-C) and molecular dynamics studies show that there is a large correlation between co-localization and co-regulation of genes, but this important research is hampered by the lack of biologist-friendly analysis and visualisation software. In this work we present NuChart-II, a software that allows the user to annotate and visualize a list of input genes with information relying on Hi-C data, integrating knowledge about genomic features that are involved in the chromosome spatial organization. This software works directly with sequenced reads to identify related Hi-C fragments, with the aim of creating gene-centric neighbourhood graphs on which multi-omics features can be mapped. NuChart-II is a highly optimized implementation of a previous prototype package developed in R, in which the graph-based representation of Hi-C data was tested. The prototype showed inevitable problems of scalability while working genome-wide on large datasets: particular attention has been paid to optimizing the data structures employed while constructing the neighbourhood graph, so as to foster an efficient parallel implementation of the software. The normalization of Hi-C data has been modified and improved, in order to provide a reliable estimation of proximity likelihood for the genes.

@inproceedings{14:ff:nuchart:cibb,
abstract = {Long-range chromosomal associations between genomic regions, and their repositioning in the 3D space of the nucleus, are now considered to be key contributors to the regulation of gene expression, and important links have been highlighted with other genomic features involved in DNA rearrangements. Recent Chromosome Conformation Capture (3C) measurements performed with high-throughput sequencing (Hi-C) and molecular dynamics studies show that there is a large correlation between co-localization and co-regulation of genes, but this important research is hampered by the lack of biologist-friendly analysis and visualisation software. In this work we present NuChart-II, a software that allows the user to annotate and visualize a list of input genes with information relying on Hi-C data, integrating knowledge about genomic features that are involved in the chromosome spatial organization. This software works directly with sequenced reads to identify related Hi-C fragments, with the aim of creating gene-centric neighbourhood graphs on which multi-omics features can be mapped. NuChart-II is a highly optimized implementation of a previous prototype package developed in R, in which the graph-based representation of Hi-C data was tested. The prototype showed inevitable problems of scalability while working genome-wide on large datasets: particular attention has been paid to optimizing the data structures employed while constructing the neighbourhood graph, so as to foster an efficient parallel implementation of the software. The normalization of Hi-C data has been modified and improved, in order to provide a reliable estimation of proximity likelihood for the genes.},
address = {Cambridge, UK},
author = {Fabio Tordini and Maurizio Drocco and Ivan Merelli and Luciano Milanesi and Pietro Li{\o} and Marco Aldinucci},
booktitle = {Computational Intelligence Methods for Bioinformatics and Biostatistics - 11th International Meeting, {CIBB} 2014, Cambridge, UK, June 26-28, 2014, Revised Selected Papers},
date-modified = {2015-09-24 11:22:30 +0000},
doi = {10.1007/978-3-319-24462-4_25},
editor = {Clelia Di Serio and Pietro Li{\`o} and Alessandro Nonis and Roberto Tagliaferri},
isbn = {978-3-319-24461-7},
keywords = {fastflow, bioinformatics, paraphrase, repara, interomics, mimomics, hirma},
pages = {298-311},
publisher = {Springer},
series = {{LNCS}},
title = {{NuChart-II}: a graph-based approach for the analysis and interpretation of {Hi-C} data},
url = {http://calvados.di.unipi.it/storage/paper_files/2014_nuchart_cibb.pdf},
volume = {8623},
year = {2015},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2014_nuchart_cibb.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1007/978-3-319-24462-4_25}
}

• M. Aldinucci, S. Campa, M. Danelutto, P. Kilpatrick, and M. Torquati, “Pool evolution: a domain specific parallel pattern,” in Proc. of the 7th Intl. Symposium on High-level Parallel Programming and Applications (HLPP), Amsterdam, The Netherlands, 2014.
[BibTeX] [Abstract] [Download PDF]

We introduce a new parallel pattern derived from a specific application domain and show how it turns out to have application beyond its domain of origin. The pool evolution pattern models the parallel evolution of a population subject to mutations and evolving in such a way that a given fitness function is optimized. The pattern has been demonstrated to be suitable for capturing and modeling the parallel patterns underpinning various evolutionary algorithms, as well as other parallel patterns typical of symbolic computation. In this paper we introduce the pattern, developed in the framework of the ParaPhrase EU-funded FP7 project, discuss its implementation on modern multi/many-core architectures, and finally present experimental results obtained with FastFlow and Erlang implementations to assess its feasibility and scalability.

@inproceedings{2014:ff:pool:hlpp,
abstract = {We introduce a new parallel pattern derived from a specific application domain and show how it turns out to have application beyond its domain of origin. The pool evolution pattern models the parallel evolution of a population subject to mutations and evolving in such a way that a given fitness function is optimized. The pattern has been demonstrated to be suitable for capturing and modeling the parallel patterns underpinning various evolutionary algorithms, as well as other parallel patterns typical of symbolic computation. In this paper we introduce the pattern, developed in the framework of the ParaPhrase EU-funded FP7 project, discuss its implementation on modern multi/many-core architectures, and finally present experimental results obtained with FastFlow and Erlang implementations to assess its feasibility and scalability.},
address = {Amsterdam, The Netherlands},
author = {Marco Aldinucci and Sonia Campa and Marco Danelutto and Peter Kilpatrick and Massimo Torquati},
booktitle = {Proc. of the 7th Intl. Symposium on High-level Parallel Programming and Applications (HLPP)},
date-modified = {2015-09-27 12:14:30 +0000},
keywords = {fastflow, paraphrase, repara},
month = jul,
title = {Pool evolution: a domain specific parallel pattern},
url = {http://calvados.di.unipi.it/storage/paper_files/2014_hlpp_pool.pdf},
year = {2014},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2014_hlpp_pool.pdf}
}
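The pool evolution pattern summarised above can be sketched as a loop of selection, mutation, and fitness-driven filtering over a candidate pool. The following is a minimal sequential illustration in plain C++, not the ParaPhrase/FastFlow or Erlang implementations; `pool_evolve` and its policy choices (keep the best half, mutate by a small random delta) are assumptions made here for the sketch.

```cpp
// Pool evolution sketch: evolve a pool of integer candidates towards
// maximising a caller-supplied fitness function.
#include <vector>
#include <algorithm>
#include <random>
#include <cstddef>

// selection: keep the best half of the pool;
// evolution: mutate the survivors by a random delta;
// filter:    offspring replace the worst half;
// termination: a fixed number of generations.
std::vector<int> pool_evolve(std::vector<int> pool,
                             int (*fitness)(int),
                             int n_gens, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> delta(-3, 3);
    auto by_fitness = [&](int a, int b) { return fitness(a) > fitness(b); };
    for (int gen = 0; gen < n_gens; ++gen) {
        std::sort(pool.begin(), pool.end(), by_fitness);    // selection
        std::size_t half = pool.size() / 2;
        std::vector<int> offspring;
        for (std::size_t i = 0; i < half; ++i)
            offspring.push_back(pool[i] + delta(rng));       // mutation
        // filter: offspring overwrite the worst half of the pool,
        // so the best candidate found so far is never lost
        std::copy(offspring.begin(), offspring.end(), pool.begin() + half);
    }
    std::sort(pool.begin(), pool.end(), by_fitness);
    return pool;
}
```

In the pattern proper, the selection, evolution, and filter phases are the parallelisable steps (each candidate can be evaluated and mutated independently); the sketch keeps only the phase structure.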

• M. Aldinucci, M. Torquati, M. Drocco, G. P. Pezzi, and C. Spampinato, “FastFlow: Combining Pattern-Level Abstraction and Efficiency in GPGPUs,” in GPU Technology Conference (GTC 2014), San Jose, CA, USA, 2014.
[BibTeX] [Abstract] [Download PDF]

Learn how FastFlow’s parallel patterns can be used to design parallel applications for execution on both CPUs and GPGPUs while avoiding most of the complex low-level detail needed to make them efficient, portable and rapid to prototype. As a use case, we will show the design and effectiveness of a novel universal image filtering template based on the variational approach.

@inproceedings{ff:gtc:2014,
abstract = {Learn how FastFlow's parallel patterns can be used to design parallel applications for execution on both CPUs and GPGPUs while avoiding most of the complex low-level detail needed to make them efficient, portable and rapid to prototype. As a use case, we will show the design and effectiveness of a novel universal image filtering template based on the variational approach.},
address = {San Jose, CA, USA},
author = {Marco Aldinucci and Massimo Torquati and Maurizio Drocco and Guilherme {Peretti Pezzi} and Concetto Spampinato},
booktitle = {GPU Technology Conference (GTC 2014)},
date-added = {2014-04-19 12:52:40 +0000},
date-modified = {2016-08-19 21:45:39 +0000},
keywords = {fastflow, gpu, nvidia, impact, paraphrase},
month = mar,
title = {FastFlow: Combining Pattern-Level Abstraction and Efficiency in {GPGPUs}},
url = {http://calvados.di.unipi.it/storage/talks/2014_S4729-Marco-Aldinucci.pdf},
year = {2014},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/talks/2014_S4729-Marco-Aldinucci.pdf}
}

• M. Aldinucci, M. Torquati, M. Drocco, G. P. Pezzi, and C. Spampinato, “An Overview of FastFlow: Combining Pattern-Level Abstraction and Efficiency in GPGPUs,” in GPU Technology Conference (GTC 2014), San Jose, CA, USA, 2014.
[BibTeX] [Abstract] [Download PDF]

Get an overview of how FastFlow’s parallel patterns can be used to design parallel applications for execution on both CPUs and GPGPUs while avoiding most of the complex low-level detail needed to make them efficient, portable and rapid to prototype. For a more detailed and technical review of FastFlow’s parallel patterns, as well as a use case showing the design and effectiveness of a novel universal image filtering template based on the variational approach, refer to the companion session.

@inproceedings{ff:gtc:2014:short,
abstract = {Get an overview of how FastFlow's parallel patterns can be used to design parallel applications for execution on both CPUs and GPGPUs while avoiding most of the complex low-level detail needed to make them efficient, portable and rapid to prototype. For a more detailed and technical review of FastFlow's parallel patterns, as well as a use case showing the design and effectiveness of a novel universal image filtering template based on the variational approach, refer to the companion session.},
address = {San Jose, CA, USA},
author = {Marco Aldinucci and Massimo Torquati and Maurizio Drocco and Guilherme {Peretti Pezzi} and Concetto Spampinato},
booktitle = {GPU Technology Conference (GTC 2014)},
date-added = {2014-04-13 23:20:52 +0000},
date-modified = {2016-08-19 21:45:51 +0000},
keywords = {fastflow, gpu, nvidia, impact, paraphrase},
month = mar,
title = {An Overview of FastFlow: Combining Pattern-Level Abstraction and Efficiency in {GPGPUs}},
url = {http://calvados.di.unipi.it/storage/talks/2014_S4585-Marco-Aldinucci.pdf},
year = {2014},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/talks/2014_S4585-Marco-Aldinucci.pdf}
}

• D. Buono, M. Danelutto, T. De Matteis, G. Mencagli, and M. Torquati, “A Lightweight Run-Time Support For Fast Dense Linear Algebra on Multi-Core,” in Proc. of the 12th International Conference on Parallel and Distributed Computing and Networks (PDCN 2014), 2014.
[BibTeX]
@inproceedings{ff:ffmdf:pdcn:14,
author = {Daniele Buono and Marco Danelutto and Tiziano De Matteis and Gabriele Mencagli and Massimo Torquati},
booktitle = {Proc. of the 12th International Conference on Parallel and Distributed Computing and Networks (PDCN 2014)},
date-modified = {2015-02-01 16:49:46 +0000},
keywords = {fastflow},
month = feb,
publisher = {IASTED, ACTA press},
title = {A Lightweight Run-Time Support For Fast Dense Linear Algebra on Multi-Core},
year = {2014}
}

• M. Drocco, M. Aldinucci, and M. Torquati, “A Dynamic Memory Allocator for heterogeneous platforms,” in Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES) — Poster Abstracts, Fiuggi, Italy, 2014.
[BibTeX] [Abstract] [Download PDF]

Modern computers are built upon heterogeneous multi-core/many-core architectures (e.g. a GPGPU connected to a multi-core CPU). Achieving peak performance on these architectures is hard and may require substantial programming effort. High-level programming patterns, coupled with efficient low-level runtime supports, have been proposed to relieve the programmer from worrying about low-level details such as synchronisation of racing processes, as well as the fine tuning needed to improve overall performance. Among these are (parallel) dynamic memory allocation and effective exploitation of the memory hierarchy. The memory allocator is often a bottleneck that severely limits program scalability, robustness and portability on parallel systems. In this work we introduce a novel memory allocator, based on FastFlow’s allocator and the recently proposed CUDA Unified Memory, which aims to efficiently integrate host and device memories into a single dynamically allocable memory space, accessible transparently by both host and device code.

@inproceedings{ff:acaces:14,
abstract = {Modern computers are built upon heterogeneous multi-core/many-core architectures (e.g. a GPGPU connected to a multi-core CPU). Achieving peak performance on these architectures is hard and may require substantial programming effort. High-level programming patterns, coupled with efficient low-level runtime supports, have been proposed to relieve the programmer from worrying about low-level details such as synchronisation of racing processes, as well as the fine tuning needed to improve overall performance. Among these are (parallel) dynamic memory allocation and effective exploitation of the memory hierarchy. The memory allocator is often a bottleneck that severely limits program scalability, robustness and portability on parallel systems.
In this work we introduce a novel memory allocator, based on FastFlow's allocator and the recently proposed CUDA Unified Memory, which aims to efficiently integrate host and device memories into a single dynamically allocable memory space, accessible transparently by both host and device code.},
address = {Fiuggi, Italy},
author = {Maurizio Drocco and Marco Aldinucci and Massimo Torquati},
booktitle = {Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES) -- Poster Abstracts},
date-modified = {2016-08-20 17:29:47 +0000},
keywords = {fastflow, nvidia},
publisher = {HiPEAC},
title = {A Dynamic Memory Allocator for heterogeneous platforms},
url = {http://calvados.di.unipi.it/storage/paper_files/2014_ACACES_ex-abstract.pdf},
year = {2014},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2014_ACACES_ex-abstract.pdf}
}

• M. Aldinucci, G. P. Pezzi, M. Drocco, F. Tordini, P. Kilpatrick, and M. Torquati, “Parallel video denoising on heterogeneous platforms,” in Proc. of Intl. Workshop on High-level Programming for Heterogeneous and Hierarchical Parallel Systems (HLPGPU), 2014.
[BibTeX] [Abstract] [Download PDF]

In this paper, a highly-effective parallel filter for video denoising is presented. The filter is designed using a skeletal approach, and has been implemented by way of the FastFlow parallel programming library. As a result of its high-level design, it is possible to run the filter seamlessly on a multi-core machine, on GPGPU(s), or on both. The design and the implementation of the filter are discussed, and an experimental evaluation is presented. Various mappings of the filtering stages are comparatively discussed.

@inproceedings{ff:video:hlpgpu:14,
abstract = {In this paper, a highly-effective parallel filter for video denoising is presented. The filter is designed using a skeletal approach, and has been implemented by way of the FastFlow parallel programming library. As a result of its high-level design, it is possible to run the filter seamlessly on a multi-core machine, on GPGPU(s), or on both. The design and the implementation of the filter are discussed, and an experimental evaluation is presented. Various mappings of the filtering stages are comparatively discussed.},
author = {Marco Aldinucci and Guilherme {Peretti Pezzi} and Maurizio Drocco and Fabio Tordini and Peter Kilpatrick and Massimo Torquati},
booktitle = {Proc. of Intl. Workshop on High-level Programming for Heterogeneous and Hierarchical Parallel Systems (HLPGPU)},
date-added = {2013-12-07 18:28:32 +0000},
date-modified = {2015-09-27 12:42:02 +0000},
keywords = {fastflow, paraphrase, impact},
title = {Parallel video denoising on heterogeneous platforms},
url = {http://calvados.di.unipi.it/storage/paper_files/2014_ff_video_denoiser_hlpgpu.pdf},
year = {2014},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2014_ff_video_denoiser_hlpgpu.pdf}
}

• M. Aldinucci, M. Drocco, G. P. Pezzi, C. Misale, F. Tordini, and M. Torquati, “Exercising high-level parallel programming on streams: a systems biology use case,” in Proc. of the 2014 IEEE 34th Intl. Conference on Distributed Computing Systems Workshops (ICDCS), Madrid, Spain, 2014. doi:10.1109/ICDCSW.2014.38
[BibTeX] [Abstract] [Download PDF]

The stochastic modelling of biological systems, coupled with Monte Carlo simulation of models, is an increasingly popular technique in Bioinformatics. The simulation-analysis workflow may result in a computationally expensive task, reducing the interactivity required in model tuning. In this work, we advocate high-level software design as a vehicle for building efficient and portable parallel simulators for a variety of platforms, ranging from multi-core platforms to GPGPUs to cloud. In particular, the Calculus of Wrapped Compartments (CWC) parallel simulator for systems biology, equipped with on-line mining of results and designed according to the FastFlow pattern-based approach, is discussed as a running example. In this work, the CWC simulator is used as a paradigmatic example of a complex C++ application where the quality of results is correlated with both computation and I/O bounds, and where high-quality results might turn into big data. The FastFlow parallel programming framework, which advocates C++ pattern-based parallel programming, makes it possible to develop portable parallel code without relinquishing either run-time efficiency or performance-tuning opportunities. Performance and effectiveness of the approach are validated on a variety of platforms, inter alia cache-coherent multi-cores, clusters of multi-cores (Ethernet and InfiniBand) and the Amazon Elastic Compute Cloud.

@inproceedings{cwc:gpu:dcperf:14,
abstract = {The stochastic modelling of biological systems, coupled with Monte Carlo simulation of models, is an increasingly popular technique in Bioinformatics. The simulation-analysis workflow may result in a computationally expensive task, reducing the interactivity required in model tuning. In this work, we advocate high-level software design as a vehicle for building efficient and portable parallel simulators for a variety of platforms, ranging from multi-core platforms to GPGPUs to cloud. In particular, the Calculus of Wrapped Compartments (CWC) parallel simulator for systems biology, equipped with on-line mining of results and designed according to the FastFlow pattern-based approach, is discussed as a running example. In this work, the CWC simulator is used as a paradigmatic example of a complex C++ application where the quality of results is correlated with both computation and I/O bounds, and where high-quality results might turn into big data. The FastFlow parallel programming framework, which advocates C++ pattern-based parallel programming, makes it possible to develop portable parallel code without relinquishing either run-time efficiency or performance-tuning opportunities. Performance and effectiveness of the approach are validated on a variety of platforms, inter alia cache-coherent multi-cores, clusters of multi-cores (Ethernet and InfiniBand) and the Amazon Elastic Compute Cloud.},
address = {Madrid, Spain},
author = {Marco Aldinucci and Maurizio Drocco and Guilherme {Peretti Pezzi} and Claudia Misale and Fabio Tordini and Massimo Torquati},
booktitle = {Proc. of the 2014 IEEE 34th Intl. Conference on Distributed Computing Systems Workshops (ICDCS)},
date-added = {2014-04-19 12:44:39 +0000},
date-modified = {2015-09-27 12:43:13 +0000},
doi = {10.1109/ICDCSW.2014.38},
keywords = {fastflow, gpu, bioinformatics, paraphrase, impact, nvidia},
publisher = {IEEE},
title = {Exercising high-level parallel programming on streams: a systems biology use case},
url = {http://calvados.di.unipi.it/storage/paper_files/2014_dcperf_cwc_gpu.pdf},
year = {2014},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2014_dcperf_cwc_gpu.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1109/ICDCSW.2014.38}
}

• M. Danelutto, L. Deri, D. De Sensi, and M. Torquati, “Deep Packet Inspection on Commodity Hardware using FastFlow,” in Parallel Computing: Accelerating Computational Science and Engineering (CSE) (Proc. of PARCO 2013, Munich, Germany), Munich, Germany, 2014, pp. 92-99.
[BibTeX]
@inproceedings{ff:DPI:14,
address = {Munich, Germany},
author = {Marco Danelutto and Luca Deri and Daniele De Sensi and Massimo Torquati},
booktitle = {Parallel Computing: Accelerating Computational Science and Engineering (CSE) (Proc. of {PARCO 2013}, Munich, Germany)},
editor = {Michael Bader and Arndt Bode and Hans-Joachim Bungartz and Michael Gerndt and Gerhard R. Joubert and Frans Peters},
keywords = {fastflow},
pages = {92 -- 99},
publisher = {IOS Press},
series = {Advances in Parallel Computing},
title = {Deep Packet Inspection on Commodity Hardware using FastFlow},
volume = {25},
year = {2014}
}

• M. Danelutto and M. Torquati, “Loop parallelism: a new skeleton perspective on data parallel patterns,” in Proc. of Intl. Euromicro PDP 2014: Parallel Distributed and network-based Processing, Torino, Italy, 2014. doi:10.1109/PDP.2014.13
[BibTeX] [Abstract] [Download PDF]

Traditionally, skeleton based parallel programming frameworks support data parallelism by providing the programmer with a comprehensive set of data parallel skeletons, based on different variants of map and reduce patterns. On the other hand, more conventional parallel programming frameworks provide application programmers with the possibility to introduce parallelism in the execution of loops with a relatively small programming effort. In this work, we discuss a “ParallelFor” skeleton provided within the FastFlow framework, aimed at filling the usability and expressivity gap between the classical data parallel skeleton approach and the loop parallelisation facilities offered by frameworks such as OpenMP and Intel TBB. By exploiting the low run-time overhead of the FastFlow parallel skeletons and the new facilities offered by the C++11 standard, our ParallelFor skeleton succeeds in obtaining comparable or better performance than both OpenMP and TBB on the Intel Phi many-core and Intel Nehalem multi-core for the set of benchmarks considered, yet requiring a comparable programming effort.

@inproceedings{ff:looppar:pdp:14,
abstract = {Traditionally, skeleton based parallel programming frameworks support data parallelism by providing the programmer with a comprehensive set of data parallel skeletons, based on different variants of map and reduce patterns. On the other hand, more conventional parallel programming frameworks provide application programmers with the possibility to introduce parallelism in the execution of loops with a relatively small programming effort. In this work, we discuss a ``ParallelFor'' skeleton provided within the FastFlow framework, aimed at filling the usability and expressivity gap between the classical data parallel skeleton approach and the loop parallelisation facilities offered by frameworks such as OpenMP and Intel TBB. By exploiting the low run-time overhead of the FastFlow parallel skeletons and the new facilities offered by the C++11 standard, our ParallelFor skeleton succeeds in obtaining comparable or better performance than both OpenMP and TBB on the Intel Phi many-core and Intel Nehalem multi-core for the set of benchmarks considered, yet requiring a comparable programming effort.},
address = {Torino, Italy},
author = {Marco Danelutto and Massimo Torquati},
booktitle = {Proc. of Intl. Euromicro PDP 2014: Parallel Distributed and network-based Processing},
date-added = {2014-02-15 16:53:29 +0000},
date-modified = {2015-09-27 12:42:31 +0000},
doi = {10.1109/PDP.2014.13},
editor = {Marco Aldinucci and Daniele D'Agostino and Peter Kilpatrick},
keywords = {fastflow, paraphrase},
publisher = {IEEE},
title = {Loop parallelism: a new skeleton perspective on data parallel patterns},
url = {http://calvados.di.unipi.it/storage/paper_files/2014_ff_looppar_pdp.pdf},
year = {2014},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2014_ff_looppar_pdp.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1109/PDP.2014.13}
}

• C. Misale, “Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity,” in Proc. of Intl. Euromicro PDP 2014: Parallel Distributed and network-based Processing, Torino, Italy, 2014. doi:10.1109/PDP.2014.50
[BibTeX] [Abstract] [Download PDF]

The implementation of DNA alignment tools for Bioinformatics leads to different problems that affect performance. A single alignment takes an amount of time that is not predictable, and different factors can affect performance: for instance, the length of sequences can determine the computational grain of the task, and mismatches or insertions/deletions (indels) increase the time needed to complete an alignment. Moreover, alignment is a strongly memory-bound problem because of irregular memory access patterns and limitations in memory bandwidth. Over the years, many alignment tools have been implemented. A concrete example is Bowtie2, one of the fastest (concurrent, Pthread-based) state-of-the-art non-GPU alignment tools. Bowtie2 exploits concurrency by instantiating a pool of threads, which have access to a global input dataset, share the reference genome and have access to different objects for collecting alignment results. In this paper a modified implementation of Bowtie2 is presented, in which the concurrency structure has been changed. The proposed implementation exploits the task-farm skeleton pattern implemented as a Master-Worker. The Master-Worker pattern delegates dataset reading to the Master thread only, and makes private to each Worker the data structures that are shared in the original version. Only the reference genome is left shared. As a further optimisation, the Master and each Worker were pinned on cores, and the reference genome was allocated interleaved among memory nodes. The proposed implementation is able to gain up to 10 speedup points over the original implementation.

@inproceedings{ff:bowtie2:pdp:14,
abstract = {The implementation of DNA alignment tools for Bioinformatics leads to different problems that affect performance. A single alignment takes an amount of time that is not predictable, and different factors can affect performance: for instance, the length of sequences can determine the computational grain of the task, and mismatches or insertions/deletions (indels) increase the time needed to complete an alignment. Moreover, alignment is a strongly memory-bound problem because of irregular memory access patterns and limitations in memory bandwidth. Over the years, many alignment tools have been implemented. A concrete example is Bowtie2, one of the fastest (concurrent, Pthread-based) state-of-the-art non-GPU alignment tools. Bowtie2 exploits concurrency by instantiating a pool of threads, which have access to a global input dataset, share the reference genome and have access to different objects for collecting alignment results. In this paper a modified implementation of Bowtie2 is presented, in which the concurrency structure has been changed. The proposed implementation exploits the task-farm skeleton pattern implemented as a Master-Worker. The Master-Worker pattern delegates dataset reading to the Master thread only, and makes private to each Worker the data structures that are shared in the original version. Only the reference genome is left shared. As a further optimisation, the Master and each Worker were pinned on cores, and the reference genome was allocated interleaved among memory nodes. The proposed implementation is able to gain up to 10 speedup points over the original implementation.},
address = {Torino, Italy},
author = {Claudia Misale},
booktitle = {Proc. of Intl. Euromicro PDP 2014: Parallel Distributed and network-based Processing},
date-added = {2013-12-07 18:25:55 +0000},
date-modified = {2015-09-27 12:41:24 +0000},
doi = {10.1109/PDP.2014.50},
editor = {Marco Aldinucci and Daniele D'Agostino and Peter Kilpatrick},
keywords = {fastflow, paraphrase},
note = {(Best paper award)},
publisher = {IEEE},
title = {Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity},
url = {http://calvados.di.unipi.it/storage/paper_files/2014_pdp_bowtieff.pdf},
year = {2014},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2014_pdp_bowtieff.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1109/PDP.2014.50}
}

• A. Secco, I. Uddin, G. P. Pezzi, and M. Torquati, “Message passing on InfiniBand RDMA for parallel run-time supports,” in Proc. of Intl. Euromicro PDP 2014: Parallel Distributed and network-based Processing, Torino, Italy, 2014. doi:10.1109/PDP.2014.23
[BibTeX] [Abstract] [Download PDF]

InfiniBand networks are commonly used in the high performance computing area. They offer RDMA-based operations that help to improve the performance of communication subsystems. In this paper, we propose a minimal message-passing communication layer providing the programmer with a point-to-point communication channel implemented by way of InfiniBand RDMA features. Differently from other libraries exploiting the InfiniBand features, such as the well-known Message Passing Interface (MPI), the proposed library is a communication layer only rather than a programming model, and can be easily used as a building block for high-level parallel programming frameworks. Evaluated on micro-benchmarks, the proposed RDMA-based communication channel implementation achieves performance comparable with highly optimised MPI/InfiniBand implementations. Finally, the flexibility of the communication layer is evaluated by integrating it within the FastFlow parallel framework, which currently supports TCP/IP networks (via the ZeroMQ communication library).

@inproceedings{ff:infiniband:pdp:14,
abstract = {InfiniBand networks are commonly used in the high performance computing area. They offer RDMA-based operations that help to improve the performance of communication subsystems. In this paper, we propose a minimal message-passing communication layer providing the programmer with a point-to-point communication channel implemented by way of InfiniBand RDMA features. Differently from other libraries exploiting the InfiniBand features, such as the well-known Message Passing Interface (MPI), the proposed library is a communication layer only rather than a programming model, and can be easily used as a building block for high-level parallel programming frameworks. Evaluated on micro-benchmarks, the proposed RDMA-based communication channel implementation achieves performance comparable with highly optimised MPI/InfiniBand implementations. Finally, the flexibility of the communication layer is evaluated by integrating it within the FastFlow parallel framework, which currently supports TCP/IP networks (via the ZeroMQ communication library).},
address = {Torino, Italy},
author = {Alessandro Secco and Irfan Uddin and Guilherme {Peretti Pezzi} and Massimo Torquati},
booktitle = {Proc. of Intl. Euromicro PDP 2014: Parallel Distributed and network-based Processing},
date-added = {2013-12-07 18:22:35 +0000},
date-modified = {2015-09-27 12:35:04 +0000},
doi = {10.1109/PDP.2014.23},
editor = {Marco Aldinucci and Daniele D'Agostino and Peter Kilpatrick},
keywords = {fastflow, paraphrase, impact},
publisher = {IEEE},
title = {Message passing on InfiniBand {RDMA} for parallel run-time supports},
url = {http://calvados.di.unipi.it/storage/paper_files/2014_ff_infiniband_pdp.pdf},
year = {2014},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2014_ff_infiniband_pdp.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1109/PDP.2014.23}
}

• M. Aldinucci, F. Tordini, M. Drocco, M. Torquati, and M. Coppo, “Parallel stochastic simulators in system biology: the evolution of the species,” in Proc. of Intl. Euromicro PDP 2013: Parallel Distributed and network-based Processing, Belfast, Northern Ireland, U.K., 2013. doi:10.1109/PDP.2013.66
[BibTeX] [Abstract] [Download PDF]

The stochastic simulation of biological systems is an increasingly popular technique in Bioinformatics. It is often an enlightening technique, especially for multi-stable systems whose dynamics can hardly be captured with ordinary differential equations. To be effective, stochastic simulations should be supported by powerful statistical analysis tools. The simulation-analysis workflow may however result in being computationally expensive, thus compromising the interactivity required in model tuning. In this work we advocate the high-level design of simulators for stochastic systems as a vehicle for building efficient and portable parallel simulators. In particular, the Calculus of Wrapped Compartments (CWC) simulator, which is designed according to the FastFlow pattern-based approach, is presented and discussed in this work. FastFlow has been extended to also support clusters of multi-cores with minimal coding effort, assessing the portability of the approach.

@inproceedings{ff_cwc_distr:pdp:13,
abstract = {The stochastic simulation of biological systems is an increasingly popular technique in Bioinformatics. It is often an enlightening technique, especially for multi-stable systems whose dynamics can hardly be captured with ordinary differential equations. To be effective, stochastic simulations should be supported by powerful statistical analysis tools. The simulation-analysis workflow may however result in being computationally expensive, thus compromising the interactivity required in model tuning. In this work we advocate the high-level design of simulators for stochastic systems as a vehicle for building efficient and portable parallel simulators. In particular, the Calculus of Wrapped Compartments (CWC) simulator, which is designed according to the FastFlow pattern-based approach, is presented and discussed in this work. FastFlow has been extended to also support clusters of multi-cores with minimal coding effort, assessing the portability of the approach.},
address = {Belfast, Northern Ireland, U.K.},
author = {Marco Aldinucci and Fabio Tordini and Maurizio Drocco and Massimo Torquati and Mario Coppo},
booktitle = {Proc. of Intl. Euromicro PDP 2013: Parallel Distributed and network-based Processing},
date-added = {2012-01-20 19:22:15 +0100},
date-modified = {2013-11-24 00:30:43 +0000},
doi = {10.1109/PDP.2013.66},
keywords = {fastflow, bioinformatics},
month = feb,
publisher = {IEEE},
title = {Parallel stochastic simulators in system biology: the evolution of the species},
url = {http://calvados.di.unipi.it/storage/paper_files/2013_cwc_d_PDP.pdf},
year = {2013},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2013_cwc_d_PDP.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1109/PDP.2013.66}
}

• D. Buono, M. Danelutto, S. Lametti, and M. Torquati, “Parallel Patterns for General Purpose Many-Core,” in Proc. of Intl. Euromicro PDP 2013: Parallel Distributed and network-based Processing, Belfast, Northern Ireland, U.K., 2013. doi:10.1109/PDP.2013.27
[BibTeX] [Abstract]

Efficient programming of general purpose many-core accelerators poses several challenging problems. The high number of cores available, the peculiarity of the interconnection network, and the complex memory hierarchy organization, all contribute to make efficient programming of such devices difficult. We propose to use parallel design patterns, implemented using algorithmic skeletons, to abstract and hide most of the difficulties related to the efficient programming of many-core accelerators. In particular, we discuss the porting of the FastFlow framework on the Tilera TilePro64 architecture and the results obtained running synthetic benchmarks as well as true application kernels. These results demonstrate the efficiency achieved while using patterns on the TilePro64 both to program stand-alone skeleton-based parallel applications and to accelerate existing sequential code.

@inproceedings{ff_tilera:pdp:13,
abstract = {Efficient programming of general purpose many-core accelerators poses several challenging problems. The high number of cores available, the peculiarity of the interconnection network, and the complex memory hierarchy organization, all contribute to make efficient programming of such devices difficult. We propose to use parallel design patterns, implemented using algorithmic skeletons, to abstract and hide most of the difficulties related to the efficient programming of many-core accelerators. In particular, we discuss the porting of the FastFlow framework on the Tilera TilePro64 architecture and the results obtained running synthetic benchmarks as well as true application kernels. These results demonstrate the efficiency achieved while using patterns on the TilePro64 both to program stand-alone skeleton-based parallel applications and to accelerate existing sequential code.},
address = {Belfast, Northern Ireland, U.K.},
author = {Daniele Buono and Marco Danelutto and Silvia Lametti and Massimo Torquati},
booktitle = {Proc. of Intl. Euromicro PDP 2013: Parallel Distributed and network-based Processing},
date-modified = {2013-11-24 00:31:22 +0000},
doi = {10.1109/PDP.2013.27},
keywords = {fastflow},
month = feb,
publisher = {IEEE},
title = {Parallel Patterns for General Purpose Many-Core},
year = {2013},
bdsk-url-1 = {http://dx.doi.org/10.1109/PDP.2013.27}
}

• M. Danelutto and M. Torquati, “A RISC building block set for structured parallel programming,” in Proc. of Intl. Euromicro PDP 2013: Parallel Distributed and network-based Processing, Belfast, Northern Ireland, U.K., 2013. doi:10.1109/PDP.2013.17
[BibTeX] [Abstract]

We propose a set of building blocks (RISC-pb2l) suitable for building high-level structured parallel programming frameworks. The set is designed following a RISC approach. RISC-pb2l is architecture independent, but the implementation of the different blocks may be specialized to make the best use of the target architecture's peculiarities. A number of optimizations may be designed to transform basic building-block compositions into more efficient ones, such that parallel application efficiency may be derived by construction rather than by debugging.

@inproceedings{RISCbb:pdp:13,
abstract = {We propose a set of building blocks (RISC-pb2l) suitable for building high-level structured parallel programming frameworks. The set is designed following a RISC approach. RISC-pb2l is architecture independent, but the implementation of the different blocks may be specialized to make the best use of the target architecture's peculiarities. A number of optimizations may be designed to transform basic building-block compositions into more efficient ones, such that parallel application efficiency may be derived by construction rather than by debugging.},
address = {Belfast, Northern Ireland, U.K.},
author = {Marco Danelutto and Massimo Torquati},
booktitle = {Proc. of Intl. Euromicro PDP 2013: Parallel Distributed and network-based Processing},
date-modified = {2015-09-27 12:46:20 +0000},
doi = {10.1109/PDP.2013.17},
keywords = {fastflow},
month = feb,
publisher = {IEEE},
title = {A RISC building block set for structured parallel programming},
year = {2013},
bdsk-url-1 = {http://dx.doi.org/10.1109/PDP.2013.17}
}

• M. Aldinucci, S. Campa, M. Danelutto, P. Kilpatrick, and M. Torquati, “Targeting Distributed Systems in FastFlow,” in Euro-Par 2012 Workshops, Proc. of the CoreGrid Workshop on Grids, Clouds and P2P Computing, 2013, pp. 47-56. doi:10.1007/978-3-642-36949-0_7
[BibTeX] [Abstract] [Download PDF]

FastFlow is a structured parallel programming framework targeting shared memory multi-core architectures. In this paper we introduce a FastFlow extension aimed at supporting a network of multi-core workstations as well. The extension supports the execution of FastFlow programs by coordinating, in a structured way, the fine-grain parallel activities running on a single workstation. We discuss the design and the implementation of this extension, presenting preliminary experimental results validating it on state-of-the-art networked multi-core nodes.

@inproceedings{ff:distr:cgs:12,
abstract = {FastFlow is a structured parallel programming framework targeting shared memory multi-core architectures. In this paper we introduce a FastFlow extension aimed at supporting a network of multi-core workstations as well. The extension supports the execution of FastFlow programs by coordinating, in a structured way, the fine-grain parallel activities running on a single workstation. We discuss the design and the implementation of this extension, presenting preliminary experimental results validating it on state-of-the-art networked multi-core nodes.},
author = {Marco Aldinucci and Sonia Campa and Marco Danelutto and Peter Kilpatrick and Massimo Torquati},
booktitle = {Euro-Par 2012 Workshops, Proc. of the CoreGrid Workshop on Grids, Clouds and P2P Computing},
date-added = {2012-07-23 21:22:03 +0000},
date-modified = {2015-09-27 12:47:54 +0000},
doi = {10.1007/978-3-642-36949-0_7},
keywords = {fastflow, paraphrase},
pages = {47-56},
publisher = {Springer},
series = {LNCS},
title = {Targeting Distributed Systems in FastFlow},
url = {http://calvados.di.unipi.it/storage/paper_files/2012_distr_ff_cgsymph.pdf},
volume = {7640},
year = {2013},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2012_distr_ff_cgsymph.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1007/978-3-642-36949-0_7}
}

• M. Aldinucci, S. Campa, P. Kilpatrick, and M. Torquati, “Structured Data Access Annotations for Massively Parallel Computations,” in Euro-Par 2012 Workshops, Proc. of the ParaPhrase Workshop on Parallel Processing, 2013, pp. 381-390. doi:10.1007/978-3-642-36949-0_42
[BibTeX] [Abstract] [Download PDF]

We describe an approach aimed at addressing the issue of joint exploitation of control (stream) and data parallelism in a skeleton based parallel programming environment, based on annotations and refactoring. Annotations drive efficient implementation of a parallel computation. Refactoring is used to transform the associated skeleton tree into a more efficient, functionally equivalent skeleton tree. In most cases, cost models are used to drive the refactoring process. We show how sample use case applications/kernels may be optimized and discuss preliminary experiments with FastFlow assessing the theoretical results.

@inproceedings{annotation:para:12,
abstract = {We describe an approach aimed at addressing the issue of joint exploitation of control (stream) and data parallelism in a skeleton based parallel programming environment, based on annotations and refactoring. Annotations drive efficient implementation of a parallel computation. Refactoring is used to transform the associated skeleton tree into a more efficient, functionally equivalent skeleton tree. In most cases, cost models are used to drive the refactoring process. We show how sample use case applications/kernels may be optimized and discuss preliminary experiments with FastFlow assessing the theoretical results.},
author = {Marco Aldinucci and Sonia Campa and Peter Kilpatrick and Massimo Torquati},
booktitle = {Euro-Par 2012 Workshops, Proc. of the ParaPhrase Workshop on Parallel Processing},
date-added = {2012-07-23 21:22:03 +0000},
date-modified = {2015-09-27 12:49:52 +0000},
doi = {10.1007/978-3-642-36949-0_42},
keywords = {fastflow, paraphrase},
pages = {381-390},
publisher = {Springer},
series = {LNCS},
title = {Structured Data Access Annotations for Massively Parallel Computations},
url = {http://calvados.di.unipi.it/storage/paper_files/2013_annot_europar_workshops.pdf},
volume = {7640},
year = {2013},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2013_annot_europar_workshops.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1007/978-3-642-36949-0_42}
}

• C. Misale, M. Aldinucci, and M. Torquati, “Memory affinity in multi-threading: the Bowtie2 case study,” in Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES) — Poster Abstracts, Fiuggi, Italy, 2013.
[BibTeX] [Abstract] [Download PDF]

The diffusion of Next Generation Sequencing (NGS) has increased the amount of data obtainable by genomic experiments. From a DNA sample, an NGS run is able to produce millions of short sequences (called reads), which should be mapped onto a reference genome. In this paper, we analyse the performance of Bowtie2, a fast and popular DNA mapping tool. Bowtie2 exhibits a multithreading implementation on top of pthreads, spin-locks and the SSE2 SIMD extension. From a parallel computing viewpoint, it is a paradigmatic example of software that requires addressing three fundamental problems in shared-memory programming for cache-coherent multi-core platforms: synchronisation efficiency at very fine grain (due to short reads), load balancing (due to long reads), and efficient usage of the memory subsystem (due to SSE2 memory pressure). We compare the original implementation against an alternative implementation on top of the FastFlow pattern-based programming framework. The proposed design exploits the high-level farm pattern of FastFlow, which is implemented on top of non-blocking multi-threading and lock-less (CAS-free) queues, and provides the programmer with high-level mechanisms to tune task scheduling to achieve both load balancing and memory affinity. The proposed implementation, despite its high-level design, is always faster and more scalable than the original one. The design of both the original and the alternative version will be presented along with their experimental evaluation on real-world data sets.

@inproceedings{ff:acaces:13,
abstract = {The diffusion of Next Generation Sequencing (NGS) has increased the amount of data obtainable by genomic experiments. From a DNA sample, an NGS run is able to produce millions of short sequences (called reads), which should be mapped onto a reference genome. In this paper, we analyse the performance of Bowtie2, a fast and popular DNA mapping tool. Bowtie2 exhibits a multithreading implementation on top of pthreads, spin-locks and the SSE2 SIMD extension. From a parallel computing viewpoint, it is a paradigmatic example of software that requires addressing three fundamental problems in shared-memory programming for cache-coherent multi-core platforms: synchronisation efficiency at very fine grain (due to short reads), load balancing (due to long reads), and efficient usage of the memory subsystem (due to SSE2 memory pressure). We compare the original implementation against an alternative implementation on top of the FastFlow pattern-based programming framework. The proposed design exploits the high-level farm pattern of FastFlow, which is implemented on top of non-blocking multi-threading and lock-less (CAS-free) queues, and provides the programmer with high-level mechanisms to tune task scheduling to achieve both load balancing and memory affinity. The proposed implementation, despite its high-level design, is always faster and more scalable than the original one. The design of both the original and the alternative version will be presented along with their experimental evaluation on real-world data sets.},
address = {Fiuggi, Italy},
author = {Claudia Misale and Marco Aldinucci and Massimo Torquati},
booktitle = {Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES) -- Poster Abstracts},
date-added = {2015-03-21 15:12:59 +0000},
date-modified = {2015-03-21 15:12:59 +0000},
isbn = {9789038221908},
keywords = {fastflow},
publisher = {HiPEAC},
title = {Memory affinity in multi-threading: the Bowtie2 case study},
url = {http://calvados.di.unipi.it/storage/paper_files/2013_ACACES_ex-abstract.pdf},
year = {2013},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2013_ACACES_ex-abstract.pdf}
}

• M. Aldinucci, C. Spampinato, M. Drocco, M. Torquati, and S. Palazzo, “A Parallel Edge Preserving Algorithm for Salt and Pepper Image Denoising,” in Proc. of 2nd Intl. Conference on Image Processing Theory Tools and Applications (IPTA), Istanbul, Turkey, 2012, pp. 97-102. doi:10.1109/IPTA.2012.6469567
[BibTeX] [Abstract] [Download PDF]

In this paper a two-phase filter for removing “salt and pepper” noise is proposed. In the first phase, an adaptive median filter is used to identify the set of the noisy pixels; in the second phase, these pixels are restored according to a regularization method, which contains a data-fidelity term reflecting the impulse noise characteristics. The algorithm, which exhibits good performance both in denoising and in restoration, can be easily and effectively parallelized to exploit the full power of multi-core CPUs and GPGPUs; the proposed implementation based on the FastFlow library achieves both close-to-ideal speedup and very good wall-clock execution figures.

@inproceedings{denoiser:ff:ipta:12,
abstract = {In this paper a two-phase filter for removing ``salt and pepper'' noise is proposed. In the first phase, an adaptive median filter is used to identify the set of the noisy pixels; in the second phase, these pixels are restored according to a regularization method, which contains a data-fidelity term reflecting the impulse noise characteristics. The algorithm, which exhibits good performance both in denoising and in restoration, can be easily and effectively parallelized to exploit the full power of multi-core CPUs and GPGPUs; the proposed implementation based on the FastFlow library achieves both close-to-ideal speedup and very good wall-clock execution figures.},
address = {Istanbul, Turkey},
author = {Marco Aldinucci and Concetto Spampinato and Maurizio Drocco and Massimo Torquati and Simone Palazzo},
booktitle = {Proc. of 2nd Intl. Conference on Image Processing Theory Tools and Applications (IPTA)},
date-added = {2012-06-04 18:38:01 +0200},
date-modified = {2015-09-27 12:53:53 +0000},
doi = {10.1109/IPTA.2012.6469567},
editor = {K. Djemal and M. Deriche and W. Puech and Osman N. Ucan},
isbn = {978-1-4673-2582-0},
keywords = {fastflow, impact},
month = oct,
pages = {97-102},
publisher = {IEEE},
title = {A Parallel Edge Preserving Algorithm for Salt and Pepper Image Denoising},
url = {http://calvados.di.unipi.it/storage/paper_files/2012_2phasedenoiser_ff_ipta.pdf},
year = {2012},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2012_2phasedenoiser_ff_ipta.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1109/IPTA.2012.6469567}
}
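
The two-phase scheme summarised in the abstract above can be illustrated with a small sketch. The code below implements only a simplified variant of the first phase (a fixed 3x3 window instead of the paper's adaptive one, and no second regularization phase); `median_denoise` is an illustrative name invented here, not the paper's code:

```cpp
#include <vector>
#include <algorithm>

// Simplified salt-and-pepper detection/restoration: only pixels at the
// extremes (0 or 255) are treated as noise candidates and replaced by the
// median of their 3x3 neighbourhood. Pixel values are ints in [0, 255],
// stored row-major in a w*h vector.
std::vector<int> median_denoise(const std::vector<int>& img, int w, int h) {
    std::vector<int> out = img;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int p = img[y * w + x];
            if (p != 0 && p != 255) continue;   // not a noise candidate
            std::vector<int> win;               // gather the 3x3 window
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int ny = y + dy, nx = x + dx;
                    if (ny >= 0 && ny < h && nx >= 0 && nx < w)
                        win.push_back(img[ny * w + nx]);
                }
            // median of the window replaces the candidate pixel
            std::nth_element(win.begin(), win.begin() + win.size() / 2, win.end());
            out[y * w + x] = win[win.size() / 2];
        }
    return out;
}
```

Each output pixel depends only on the input image, so rows (or tiles) can be processed independently, which is what makes the filter amenable to farm-style parallelisation.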

• M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Meneghin, and M. Torquati, “An Efficient Unbounded Lock-Free Queue for Multi-core Systems,” in Proc. of 18th Intl. Euro-Par 2012 Parallel Processing, Rhodes Island, Greece, 2012, pp. 662-673. doi:10.1007/978-3-642-32820-6_65
[BibTeX] [Abstract] [Download PDF]

The use of efficient synchronization mechanisms is crucial for implementing fine grained parallel programs on modern shared cache multi-core architectures. In this paper we study this problem by considering Single-Producer/Single-Consumer (SPSC) coordination using unbounded queues. A novel unbounded SPSC algorithm capable of reducing the raw synchronization latency and speeding up Producer-Consumer coordination is presented. The algorithm has been extensively tested on a shared-cache multi-core platform and a sketch proof of correctness is presented. The queues proposed have been used as basic building blocks to implement the FastFlow parallel framework, which has been demonstrated to offer very good performance for fine-grain parallel applications.

@inproceedings{ff:spsc:europar:12,
abstract = {The use of efficient synchronization mechanisms is crucial for implementing fine grained parallel programs on modern shared cache multi-core architectures. In this paper we study this problem by considering Single-Producer/Single-Consumer (SPSC) coordination using unbounded queues. A novel unbounded SPSC algorithm capable of reducing the raw synchronization latency and speeding up Producer-Consumer coordination is presented. The algorithm has been extensively tested on a shared-cache multi-core platform and a sketch proof of correctness is presented. The queues proposed have been used as basic building blocks to implement the FastFlow parallel framework, which has been demonstrated to offer very good performance for fine-grain parallel applications.},
address = {Rhodes Island, Greece},
author = {Marco Aldinucci and Marco Danelutto and Peter Kilpatrick and Massimiliano Meneghin and Massimo Torquati},
booktitle = {Proc. of 18th Intl. Euro-Par 2012 Parallel Processing},
date-added = {2011-04-19 10:22:00 +0200},
date-modified = {2015-09-27 12:55:20 +0000},
doi = {10.1007/978-3-642-32820-6_65},
keywords = {fastflow, paraphrase},
month = aug,
pages = {662-673},
publisher = {Springer},
series = {LNCS},
title = {An Efficient Unbounded Lock-Free Queue for Multi-core Systems},
url = {http://calvados.di.unipi.it/storage/paper_files/2012_spsc_europar.pdf},
volume = {7484},
year = {2012},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2012_spsc_europar.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1007/978-3-642-32820-6_65}
}
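
The Single-Producer/Single-Consumer coordination the paper studies can be sketched with a bounded lock-free ring buffer; the paper's unbounded queue chains such bounded buffers dynamically. The sketch below is a generic textbook illustration, not FastFlow's actual implementation:

```cpp
#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer lock-free ring buffer.
// One slot is kept empty so that full and empty states are distinguishable.
// Safe only when exactly one thread pushes and one thread pops.
template <typename T, std::size_t N>
class SpscQueue {
    T buf_[N + 1];
    std::atomic<std::size_t> head_{0};   // advanced by the consumer
    std::atomic<std::size_t> tail_{0};   // advanced by the producer
public:
    bool push(const T& v) {              // producer side
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % (N + 1);
        if (next == head_.load(std::memory_order_acquire))
            return false;                // queue full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release); // publish the slot
        return true;
    }
    bool pop(T& out) {                   // consumer side
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                // queue empty
        out = buf_[h];
        head_.store((h + 1) % (N + 1), std::memory_order_release);
        return true;
    }
};
```

Because each index is written by exactly one thread, no compare-and-swap is needed; a release store paired with an acquire load is the only synchronisation, which is the property the paper exploits for fine-grain streaming.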

• M. Aldinucci, M. Danelutto, L. Anardu, M. Torquati, and P. Kilpatrick, “Parallel patterns + Macro Data Flow for multi-core programming,” in Proc. of Intl. Euromicro PDP 2012: Parallel Distributed and network-based Processing, Garching, Germany, 2012, pp. 27-36. doi:10.1109/PDP.2012.44
[BibTeX] [Abstract] [Download PDF]

Data flow techniques have been around since the early ’70s when they were used in compilers for sequential languages. Shortly after their introduction they were also considered as a possible model for parallel computing, although the impact here was limited. Recently, however, data flow has been identified as a candidate for efficient implementation of various programming models on multi-core architectures. In most cases, however, the burden of determining data flow “macro” instructions is left to the programmer, while the compiler/run time system manages only the efficient scheduling of these instructions. We discuss a structured parallel programming approach supporting automatic compilation of programs to macro data flow and we show experimental results demonstrating the feasibility of the approach and the efficiency of the resulting “object” code on different classes of state-of-the-art multi-core architectures. The experimental results use different base mechanisms to implement the macro data flow run time support, from plain pthreads with condition variables to more modern and effective lock- and fence-free parallel frameworks. Experimental results comparing efficiency of the proposed approach with those achieved using other, more classical, parallel frameworks are also presented.

@inproceedings{dataflow:pdp:12,
abstract = {Data flow techniques have been around since the early '70s when they were used in compilers for sequential languages. Shortly after their introduction they were also considered as a possible model for parallel computing, although the impact here was limited. Recently, however, data flow has been identified as a candidate for efficient implementation of various programming models on multi-core architectures. In most cases, however, the burden of determining data flow ``macro'' instructions is left to the programmer, while the compiler/run time system manages only the efficient scheduling of these instructions. We discuss a structured parallel programming approach supporting automatic compilation of programs to macro data flow and we show experimental results demonstrating the feasibility of the approach and the efficiency of the resulting ``object'' code on different classes of state-of-the-art multi-core architectures. The experimental results use different base mechanisms to implement the
macro data flow run time support, from plain pthreads with condition variables to more modern and effective lock- and fence-free parallel frameworks. Experimental results comparing efficiency of the proposed approach with those achieved using other, more classical, parallel frameworks are also presented.},
address = {Garching, Germany},
author = {Marco Aldinucci and Marco Danelutto and Lorenzo Anardu and Massimo Torquati and Peter Kilpatrick},
booktitle = {Proc. of Intl. Euromicro PDP 2012: Parallel Distributed and network-based Processing},
date-added = {2012-10-24 17:29:14 +0000},
date-modified = {2013-11-24 00:35:34 +0000},
doi = {10.1109/PDP.2012.44},
keywords = {fastflow},
month = feb,
pages = {27-36},
publisher = {IEEE},
title = {Parallel patterns + Macro Data Flow for multi-core programming},
url = {http://calvados.di.unipi.it/storage/paper_files/2012_mdf_PDP.pdf},
year = {2012},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2012_mdf_PDP.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1109/PDP.2012.44}
}
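
The macro-data-flow execution model the paper compiles to can be sketched as a graph of coarse-grained instructions, each fired once all of its inputs are available. The sketch below runs sequentially and uses invented names (`MdfInstr`, `mdf_run`); the paper's runtime instead dispatches ready instructions onto parallel workers:

```cpp
#include <vector>
#include <queue>
#include <functional>

// A "macro instruction": a coarse-grained body plus dependency bookkeeping.
struct MdfInstr {
    std::function<void()> body;      // the work to run when fireable
    int missing_inputs;              // inputs not yet produced
    std::vector<int> successors;     // instructions unlocked by this one
};

// Fire every instruction whose dependencies are satisfied, propagating
// completion to successors (Kahn-style topological execution).
void mdf_run(std::vector<MdfInstr>& graph) {
    std::queue<int> ready;
    for (int i = 0; i < (int)graph.size(); ++i)
        if (graph[i].missing_inputs == 0) ready.push(i);
    while (!ready.empty()) {
        int id = ready.front(); ready.pop();
        graph[id].body();            // execute the macro instruction
        for (int s : graph[id].successors)
            if (--graph[s].missing_inputs == 0) ready.push(s);
    }
}
```

The ready queue is exactly where parallelism enters: any two instructions in it are independent by construction and may be handed to different threads.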

• F. Tordini, M. Aldinucci, and M. Torquati, “High-level lock-less programming for multicore,” in Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES) — Poster Abstracts, Fiuggi, Italy, 2012.
[BibTeX] [Abstract] [Download PDF]

Modern computers are built upon multi-core architectures. Achieving peak performance on these architectures is hard and may require a substantial programming effort. The synchronisation of many processes racing to access a common resource (the shared memory) has been a fundamental problem in parallel computing for years, and many solutions have been proposed to address this issue. Non-blocking synchronisation and transactional primitives have been envisioned as a way to reduce the memory wall problem. Despite being sometimes effective (and exhibiting a great momentum in the research community), they are only one facet of the problem, as their exploitation still requires non-trivial programming skills. With the non-blocking philosophy in mind, we propose high-level programming patterns that will relieve the programmer from worrying about low-level details such as synchronisation of racing processes, as well as those fine tunings needed to improve the overall performance, like proper (distributed) dynamic memory allocation and effective exploitation of the memory hierarchy.

@inproceedings{ff:acaces:12,
abstract = {Modern computers are built upon multi-core architectures. Achieving peak performance on these architectures is hard and may require a substantial programming effort. The synchronisation of many processes racing to access a common resource (the shared memory) has been a fundamental problem in parallel computing for years, and many solutions have been proposed to address this issue. Non-blocking synchronisation and transactional primitives have been envisioned as a way to reduce the memory wall problem. Despite being sometimes effective (and exhibiting a great momentum in the research community), they are only one facet of the problem, as their exploitation still requires non-trivial programming skills.
With the non-blocking philosophy in mind, we propose high-level programming patterns that will relieve the programmer from worrying about low-level details such as synchronisation of racing processes, as well as those fine tunings needed to improve the overall performance, like proper (distributed) dynamic memory allocation and effective exploitation of the memory hierarchy.},
address = {Fiuggi, Italy},
author = {Fabio Tordini and Marco Aldinucci and Massimo Torquati},
booktitle = {Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES) -- Poster Abstracts},
date-added = {2012-07-17 17:58:06 +0200},
date-modified = {2013-11-24 00:36:10 +0000},
isbn = {9789038219875},
keywords = {fastflow},
publisher = {HiPEAC},
title = {High-level lock-less programming for multicore},
url = {http://calvados.di.unipi.it/storage/paper_files/2012_ACACES_ex-abstract.pdf},
year = {2012},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2012_ACACES_ex-abstract.pdf}
}

• M. Aldinucci, M. Coppo, F. Damiani, M. Drocco, E. Sciacca, S. Spinella, M. Torquati, and A. Troina, “On Parallelizing On-Line Statistics for Stochastic Biological Simulations,” in Euro-Par 2011 Workshops, Proc. of the 2nd Workshop on High Performance Bioinformatics and Biomedicine (HiBB), Bordeaux, France, 2012, pp. 3-12. doi:10.1007/978-3-642-29740-3_2
[BibTeX] [Abstract] [Download PDF]

This work concerns a general technique to enrich parallel versions of stochastic simulators for biological systems with tools for on-line statistical analysis of the results. In particular, within the FastFlow parallel programming framework, we describe the methodology and the implementation of a parallel Monte Carlo simulation infrastructure extended with user-defined on-line data filtering and mining functions. The simulator and the on-line analysis were validated on large multi-core platforms and representative proof-of-concept biological systems.

@inproceedings{cwcsim:onlinestats:ff:hibb:11,
abstract = {This work concerns a general technique to enrich parallel versions of stochastic simulators for biological systems with tools for on-line statistical analysis of the results. In particular, within the FastFlow parallel programming framework, we describe the methodology and the implementation of a parallel Monte Carlo simulation infrastructure extended with user-defined on-line data filtering and mining functions. The simulator and the on-line analysis were validated on large multi-core platforms and representative proof-of-concept biological systems.},
address = {Bordeaux, France},
author = {Marco Aldinucci and Mario Coppo and Ferruccio Damiani and Maurizio Drocco and Eva Sciacca and Salvatore Spinella and Massimo Torquati and Angelo Troina},
booktitle = {Euro-Par 2011 Workshops, Proc. of the 2nd Workshop on High Performance Bioinformatics and Biomedicine (HiBB)},
date-added = {2010-08-15 00:50:09 +0200},
date-modified = {2013-11-24 00:35:51 +0000},
doi = {10.1007/978-3-642-29740-3_2},
editor = {Michael Alexander and Pasqua D'Ambra and Adam Belloum and George Bosilca and Mario Cannataro and Marco Danelutto and Beniamino Di Martino and Michael Gerndt and Emmanuel Jeannot and Raymond Namyst and Jean Roman and Stephen L. Scott and Jesper Larsson Tr{\"a}ff and Geoffroy Vall{\'e}e and Josef Weidendorfer},
keywords = {bioinformatics, fastflow},
pages = {3-12},
publisher = {Springer},
series = {LNCS},
title = {On Parallelizing On-Line Statistics for Stochastic Biological Simulations},
url = {http://calvados.di.unipi.it/storage/paper_files/2012_onlinestat_HiBB2011.pdf},
volume = {7156},
year = {2012},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2012_onlinestat_HiBB2011.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1007/978-3-642-29740-3_2}
}

• M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Meneghin, and M. Torquati, “Accelerating code on multi-cores with FastFlow,” in Proc. of 17th Intl. Euro-Par 2011 Parallel Processing, Bordeaux, France, 2011, pp. 170-181. doi:10.1007/978-3-642-23397-5_17
[BibTeX] [Abstract] [Download PDF]

FastFlow is a programming framework specifically targeting cache-coherent shared-memory multicores. It is implemented as a stack of C++ template libraries built on top of lock-free (and memory fence free) synchronization mechanisms. Its philosophy is to combine programmability with performance. In this paper a new FastFlow programming methodology aimed at supporting parallelization of existing sequential code via offloading onto a dynamically created software accelerator is presented. The new methodology has been validated using a set of simple micro-benchmarks and some real applications.

@inproceedings{ff:acc:europar:11,
abstract = {FastFlow is a programming framework specifically targeting cache-coherent shared-memory multicores. It is implemented as a stack of C++ template libraries built on top of lock-free (and memory fence free) synchronization mechanisms. Its philosophy is to combine programmability with performance. In this paper a new FastFlow programming methodology aimed at supporting parallelization of existing sequential code via offloading onto a dynamically created software accelerator is presented. The new methodology has been validated using a set of simple micro-benchmarks and some real applications.},
address = {Bordeaux, France},
author = {Marco Aldinucci and Marco Danelutto and Peter Kilpatrick and Massimiliano Meneghin and Massimo Torquati},
booktitle = {Proc. of 17th Intl. Euro-Par 2011 Parallel Processing},
date-added = {2012-06-04 18:35:57 +0200},
date-modified = {2013-12-12 00:46:59 +0000},
doi = {10.1007/978-3-642-23397-5_17},
editor = {E. Jeannot and R. Namyst and J. Roman},
keywords = {fastflow},
month = aug,
pages = {170-181},
publisher = {Springer},
series = {LNCS},
title = {Accelerating code on multi-cores with FastFlow},
url = {http://calvados.di.unipi.it/storage/paper_files/2011_fastflow_acc_europar.pdf},
volume = {6853},
year = {2011},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2011_fastflow_acc_europar.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1007/978-3-642-23397-5_17}
}
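
The self-offloading methodology described above, a sequential program shipping tasks to a dynamically created accelerator and collecting results later, can be approximated with standard C++ futures. This is a hedged sketch of the concept only; the class and method names are invented here and are not FastFlow's accelerator API:

```cpp
#include <future>
#include <vector>
#include <functional>
#include <utility>

// Toy "software accelerator": the sequential caller offloads independent
// tasks and keeps going; collect() acts as a barrier gathering all results
// in offload order. A real accelerator would reuse a fixed worker pool
// rather than spawning one async task per offload.
template <typename In, typename Out>
class Accelerator {
    std::function<Out(In)> work_;
    std::vector<std::future<Out>> pending_;
public:
    explicit Accelerator(std::function<Out(In)> f) : work_(std::move(f)) {}
    void offload(In task) {              // non-blocking from the caller's view
        pending_.push_back(std::async(std::launch::async, work_, task));
    }
    std::vector<Out> collect() {         // wait for all offloaded tasks
        std::vector<Out> out;
        for (auto& f : pending_) out.push_back(f.get());
        pending_.clear();
        return out;
    }
};
```

The point of the pattern, as in the paper, is that only the offload/collect calls touch the original sequential code; the hot loop's body moves into the accelerator unchanged.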

• M. Aldinucci, M. Coppo, F. Damiani, M. Drocco, M. Torquati, and A. Troina, “On Designing Multicore-Aware Simulators for Biological Systems,” in Proc. of Intl. Euromicro PDP 2011: Parallel Distributed and network-based Processing, Ayia Napa, Cyprus, 2011, pp. 318-325. doi:10.1109/PDP.2011.81
[BibTeX] [Abstract] [Download PDF]

The stochastic simulation of biological systems is an increasingly popular technique in bioinformatics. It is often an enlightening technique, which may however be computationally expensive. We discuss the main opportunities to speed it up on multi-core platforms, which pose new challenges for parallelisation techniques. These opportunities are developed in two general families of solutions involving both the single simulation and a bulk of independent simulations (either replicas or instances derived from parameter sweeps). The proposed solutions are tested on the parallelisation of the CWC simulator (Calculus of Wrapped Compartments), which is carried out by way of the FastFlow programming framework, making possible fast development and efficient execution on multi-cores.

@inproceedings{ff:cwc:pdp:11,
abstract = {The stochastic simulation of biological systems is an increasingly popular technique in bioinformatics. It is often an enlightening technique, which may however be computationally expensive. We discuss the main opportunities to speed it up on multi-core platforms, which pose new challenges for parallelisation techniques. These opportunities are developed in two general families of solutions involving both the single simulation and a bulk of independent simulations (either replicas or instances derived from parameter sweeps). The proposed solutions are tested on the parallelisation of the CWC simulator (Calculus of Wrapped Compartments), which is carried out by way of the FastFlow programming framework, making possible fast development and efficient execution on multi-cores.},
address = {Ayia Napa, Cyprus},
author = {Marco Aldinucci and Mario Coppo and Ferruccio Damiani and Maurizio Drocco and Massimo Torquati and Angelo Troina},
booktitle = {Proc. of Intl. Euromicro PDP 2011: Parallel Distributed and network-based Processing},
date-added = {2012-02-25 01:21:25 +0000},
date-modified = {2013-11-24 00:37:16 +0000},
doi = {10.1109/PDP.2011.81},
editor = {Yiannis Cotronis and Marco Danelutto and George Angelos Papadopoulos},
keywords = {fastflow},
month = feb,
pages = {318-325},
publisher = {IEEE},
title = {On Designing Multicore-Aware Simulators for Biological Systems},
url = {http://calvados.di.unipi.it/storage/paper_files/2011_ff_cwc_sim_PDP.pdf},
year = {2011},
bdsk-url-1 = {http://arxiv.org/pdf/1010.2438v2},
bdsk-url-2 = {http://calvados.di.unipi.it/storage/paper_files/2011_ff_cwc_sim_PDP.pdf},
bdsk-url-3 = {http://dx.doi.org/10.1109/PDP.2011.81}
}

• M. Aldinucci, S. Ruggieri, and M. Torquati, “Porting Decision Tree Algorithms to Multicore using FastFlow,” in Proc. of European Conference in Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Barcelona, Spain, 2010, pp. 7-23. doi:10.1007/978-3-642-15880-3_7
[BibTeX] [Abstract] [Download PDF]

The whole computer hardware industry embraced multicores. For these machines, the extreme optimisation of sequential algorithms is no longer sufficient to squeeze the real machine power, which can be only exploited via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable to be parallelised. This paper presents an approach for easy-yet-efficient porting of an implementation of the C4.5 algorithm on multicores. The parallel porting requires minimal changes to the original sequential code, and it is able to exploit up to 7X speedup on an Intel dual-quad core machine.

@inproceedings{fastflow_c45:emclpkdd,
abstract = {The whole computer hardware industry embraced multicores. For these machines, the extreme optimisation of sequential algorithms is no longer sufficient to squeeze the real machine power, which can be only exploited via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable to be parallelised. This paper presents an approach for easy-yet-efficient porting of an implementation of the C4.5 algorithm on multicores. The parallel porting requires minimal changes to the original sequential code, and it is able to exploit up to 7X speedup on an Intel dual-quad core machine.},
address = {Barcelona, Spain},
author = {Marco Aldinucci and Salvatore Ruggieri and Massimo Torquati},
booktitle = {Proc. of European Conference in Machine Learning and Knowledge Discovery in Databases (ECML PKDD)},
date-added = {2010-06-15 21:03:56 +0200},
date-modified = {2013-11-24 00:38:07 +0000},
doi = {10.1007/978-3-642-15880-3_7},
editor = {Jos{\'e} L. Balc{\'a}zar and Francesco Bonchi and Aristides Gionis and Mich{\`e}le Sebag},
keywords = {fastflow},
month = sep,
pages = {7-23},
publisher = {Springer},
series = {LNCS},
title = {Porting Decision Tree Algorithms to Multicore using {FastFlow}},
url = {http://calvados.di.unipi.it/storage/paper_files/2010_c45FF_ECMLPKDD.pdf},
volume = {6321},
year = {2010},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2010_c45FF_ECMLPKDD.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1007/978-3-642-15880-3_7}
}

• M. Aldinucci, M. Meneghin, and M. Torquati, “Efficient Smith-Waterman on multi-core with FastFlow,” in Proc. of Intl. Euromicro PDP 2010: Parallel Distributed and network-based Processing, Pisa, Italy, 2010, pp. 195-199. doi:10.1109/PDP.2010.93
[BibTeX] [Abstract] [Download PDF]

Shared memory multiprocessors have returned to popularity thanks to rapid spreading of commodity multi-core architectures. However, little attention has been paid to supporting effective streaming applications on these architectures. In this paper we describe FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than them on a given real world application: the speedup of FastFlow over other solutions may be substantial for fine grain tasks, for example +35% over OpenMP, +226% over Cilk, +96% over TBB for the alignment of protein P01111 against UniProt DB using the Smith-Waterman algorithm.

@inproceedings{fastflow:pdp:10,
abstract = {Shared memory multiprocessors have returned to popularity thanks to rapid spreading of commodity multi-core architectures. However, little attention has been paid to supporting effective streaming applications on these architectures. In this paper we describe FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than them on a given real world application: the speedup of FastFlow over other solutions may be substantial for fine grain tasks, for example +35% over OpenMP, +226% over Cilk, +96% over TBB for the alignment of protein P01111 against UniProt DB using the Smith-Waterman algorithm.},
address = {Pisa, Italy},
author = {Marco Aldinucci and Massimiliano Meneghin and Massimo Torquati},
booktitle = {Proc. of Intl. Euromicro PDP 2010: Parallel Distributed and network-based Processing},
date-added = {2007-10-26 01:02:32 +0200},
date-modified = {2013-11-24 00:38:51 +0000},
doi = {10.1109/PDP.2010.93},
editor = {Marco Danelutto and Tom Gross and Julien Bourgeois},
keywords = {fastflow},
month = feb,
pages = {195-199},
publisher = {IEEE},
title = {Efficient {Smith-Waterman} on multi-core with FastFlow},
url = {http://calvados.di.unipi.it/storage/paper_files/2010_fastflow_SW_PDP.pdf},
year = {2010},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2010_fastflow_SW_PDP.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1109/PDP.2010.93}
}
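
For reference, the sequential Smith-Waterman kernel that such a streaming framework parallelises over many database sequences is a simple dynamic program. The sketch below uses scalar match/mismatch/gap scores in place of a real substitution matrix and returns only the best local-alignment score:

```cpp
#include <string>
#include <vector>
#include <algorithm>

// Smith-Waterman local-alignment score with linear gap penalty.
// H[i][j] is the best score of an alignment ending at a[i-1], b[j-1];
// the zero floor is what makes the alignment local.
int smith_waterman(const std::string& a, const std::string& b,
                   int match = 2, int mismatch = -1, int gap = -2) {
    std::size_t n = a.size(), m = b.size();
    std::vector<std::vector<int>> H(n + 1, std::vector<int>(m + 1, 0));
    int best = 0;
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j) {
            int diag = H[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            H[i][j] = std::max({0, diag,            // restart, or extend diagonally
                                H[i - 1][j] + gap,  // gap in b
                                H[i][j - 1] + gap}); // gap in a
            best = std::max(best, H[i][j]);
        }
    return best;
}
```

Since each query-vs-database-entry comparison is independent, a farm can stream database sequences to workers running this kernel, which is the structure the paper benchmarks against Cilk, OpenMP, and TBB.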

### Master’s Theses

• P. Inaudi, “Progettazione e sviluppo di un provider libfabric per la rete ad alte prestazioni Ronniee/A3Cube,” Master Thesis, 2015.
[BibTeX]
@mastersthesis{tesi:inaudi:15,
author = {Paolo Inaudi},
keywords = {fastflow},
month = {October},
school = {Computer Science Department, University of Torino},
title = {Progettazione e sviluppo di un provider libfabric per la rete ad alte prestazioni Ronniee/A3Cube},
year = {2015}
}

• P. Viviani, “Parallel Computing Techniques for High Energy Physics,” Master Thesis, 2015.
[BibTeX] [Abstract]

Modern experimental achievements, with LHC results as a prominent but not exclusive representative, have disclosed a new range of challenges concerning theoretical computations. Tree-level QED calculations are no longer satisfactory due to the very small experimental uncertainty of precision e+ e- measurements, so Next-to-Leading and Next-to-Next-to-Leading Order calculations are required. At the same time, many-legs, high-order QCD processes needed to simulate LHC events are raising the bar of computational complexity even further. The drive for the present work has been the interest in calculating high-multiplicity Higgs boson processes with a dedicated software library (RECOLA) currently under development at the University of Torino, as well as the related technological challenges. This thesis undertakes the task of exploring the possibilities offered by present and upcoming computing technologies in order to face these challenges properly. The first two chapters outline the theoretical context and the available technologies. In chapter 3 a case study is examined in full detail, in order to explore the suitability of different parallel computing solutions. In chapter 4, some of those solutions are implemented in the context of the RECOLA library, allowing it to handle processes at a previously unexplored scale of complexity. Alongside, the potential of new, cost-effective parallel architectures is tested.

@mastersthesis{tesi:viviani:15,
abstract = {Modern experimental achievements, with LHC results as a prominent but not exclusive representative, have disclosed a new range of challenges concerning theoretical computations. Tree-level QED calculations are no longer satisfactory due to the very small experimental uncertainty of precision e+ e- measurements, so Next-to-Leading and Next-to-Next-to-Leading Order calculations are required. At the same time, many-legs, high-order QCD processes needed to simulate LHC events are raising the bar of computational complexity even further. The drive for the present work has been the interest in calculating high-multiplicity Higgs boson processes with a dedicated software library (RECOLA) currently under development at the University of Torino, as well as the related technological challenges.
This thesis undertakes the task of exploring the possibilities offered by present and upcoming computing technologies in order to face these challenges properly. The first two chapters outline the theoretical context and the available technologies. In chapter 3 a case study is examined in full detail, in order to explore the suitability of different parallel computing solutions. In chapter 4, some of those solutions are implemented in the context of the RECOLA library, allowing it to handle processes at a previously unexplored scale of complexity. Alongside, the potential of new, cost-effective parallel architectures is tested.},
author = {Paolo Viviani},
date-added = {2015-09-27 12:36:54 +0000},
date-modified = {2015-09-27 13:28:24 +0000},
keywords = {fastflow,impact},
school = {Physics Department, University of Torino},
title = {Parallel Computing Techniques for High Energy Physics},
year = {2015}
}

• M. Drocco, “Parallel stochastic simulators in systems biology: the evolution of the species,” Master Thesis, 2013.
[BibTeX] [Abstract] [Download PDF]

The stochastic simulation of biological systems is an increasingly popular technique in bioinformatics. It is often an enlightening technique, especially for multi-stable systems whose dynamics can hardly be captured with ordinary differential equations. To be effective, stochastic simulations should be supported by powerful statistical analysis tools. The simulation/analysis workflow may however result in being computationally expensive, thus compromising the interactivity required especially in model tuning. In this work we discuss the main opportunities to speed up the framework by parallelisation on modern multicore and hybrid multicore and distributed platforms, advocating the high-level design of simulators for stochastic systems as a vehicle for building efficient and portable parallel simulators endowed with on-line statistical analysis. In particular, the Calculus of Wrapped Compartments (CWC) Simulator, which is designed according to FastFlow's pattern-based approach, is presented and discussed in this work.

@mastersthesis{tesi:drocco:13,
abstract = {The stochastic simulation of biological systems is an increasingly popular technique in bioinformatics. It is often an enlightening technique, especially for multi-stable systems whose dynamics can hardly be captured with ordinary differential equations. To be effective, stochastic simulations should be supported by powerful statistical analysis tools. The simulation/analysis workflow may however result in being computationally expensive, thus compromising the interactivity required especially in model tuning. In this work we discuss the main opportunities to speed up the framework by parallelisation on modern multicore and hybrid multicore and distributed platforms, advocating the high-level design of simulators for stochastic systems as a vehicle for building efficient and portable parallel simulators endowed with on-line statistical analysis. In particular, the Calculus of Wrapped Compartments (CWC) Simulator, which is designed according to FastFlow's pattern-based approach, is presented and discussed in this work.},
author = {Maurizio Drocco},
date-modified = {2013-11-24 00:29:54 +0000},
keywords = {fastflow},
month = jul,
school = {Computer Science Department, University of Torino, Italy},
title = {Parallel stochastic simulators in systems biology: the evolution of the species},
url = {http://calvados.di.unipi.it/storage/paper_files/2013_tesi_drocco.pdf},
year = {2013},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2013_tesi_drocco.pdf}
}

### Dissertations

• M. Drocco, “Parallel Programming with Global Asynchronous Memory: Models, C++ APIs and Implementations,” PhD Thesis, 2017.
[BibTeX] [Abstract]

In the realm of High Performance Computing (HPC), message passing has been the programming paradigm of choice for over twenty years. The durable MPI (Message Passing Interface) standard, with send/receive communication, broadcast, gather/scatter, and reduction collectives, is still used to construct parallel programs where each communication is orchestrated by the developer, based on precise knowledge of data distribution and overheads; collective communications simplify the orchestration but might induce excessive synchronization. Early attempts to bring the shared-memory programming model, with its programming advantages, to distributed computing, referred to as the Distributed Shared Memory (DSM) model, faded away; one of the main issues was to combine performance and programmability with the memory consistency model. The recently proposed Partitioned Global Address Space (PGAS) model is a modern revamp of DSM that exposes data placement to enable optimizations based on locality, but it still addresses (simple) data-parallelism only and it relies on expensive sharing protocols. We advocate an alternative programming model for distributed computing based on a Global Asynchronous Memory (GAM), aiming to avoid coherency and consistency problems rather than solving them. We materialize GAM by designing and implementing a distributed smart pointers library, inspired by C++ smart pointers. In this model, public and private pointers (resembling C++ shared and unique pointers, respectively) are moved around instead of messages (i.e., data), thus relieving the user from the burden of minimizing transfers. On top of smart pointers, we propose a high-level C++ template library for writing applications in terms of dataflow-like networks, namely GAM nets, consisting of stateful processors exchanging pointers in fully asynchronous fashion.
We demonstrate the validity of the proposed approach, from the expressiveness perspective, by showing how GAM nets can be exploited to implement higher-level parallel programming models, such as data and task parallelism. As for the performance perspective, the execution of two non-toy benchmarks on a number of different small-scale HPC clusters exhibits both close-to-ideal scalability and negligible overhead with respect to state-of-the-art benchmark implementations. For instance, the GAM implementation of a high-quality video restoration filter sustains a 100 fps throughput over 70%-noisy high-quality video streams on a 4-node cluster of Graphics Processing Units (GPUs), with minimal programming effort.

@phdthesis{17:gam:drocco:thesis,
abstract = {In the realm of High Performance Computing (HPC), message passing
has been the programming paradigm of choice for over twenty years.
The durable MPI (Message Passing Interface) standard, with send/receive
communication,
broadcast, gather/scatter, and reduction collectives is still used to construct
parallel programs where each communication is orchestrated by the
de\-vel\-oper-based precise knowledge of data distribution and overheads;
collective communications simplify the orchestration but might induce excessive
synchronization.
Early attempts to bring shared-memory programming model---with its programming
adv\-antages---to distributed computing, referred as the Distributed Shared
Memory (DSM) model, faded away; one of the main issue was to combine
performance and programmability with the memory consistency model.
The recently proposed Partitioned Global Address Space (PGAS) model is a modern
revamp of DSM that exposes data placement to enable optimizations based on
locality, but it still addresses (simple) data-parallelism only and it relies
on expensive sharing protocols.
We advocate an alternative programming model for distributed computing based on
a Global Asynchronous Memory (GAM), aiming to \emph{avoid} coherency and
consistency problems rather than solving them.
We materialize GAM by designing and implementing a \emph{distributed smart
pointers} library, inspired by C++ smart pointers.
In this model, public and private pointers (resembling C++ shared and unique
pointers, respectively) are moved around instead of messages (i.e., data), thus
alleviating the user from the burden of minimizing transfers.
On top of smart pointers, we propose a high-level C++ template library for
writing applications in terms of dataflow-like networks, namely GAM nets,
consisting of stateful processors exchanging pointers in fully asynchronous
fashion.
We demonstrate the validity of the proposed approach, from the expressiveness
perspective, by showing how GAM nets can be exploited to implement higher-level
parallel programming models, such as data and task parallelism.
As for the performance perspective, the execution of two non-toy benchmarks on
a number of different small-scale HPC clusters exhibits both close-to-ideal
scalability and negligible overhead with respect to state-of-the-art benchmark
implementations.
For instance, the GAM implementation of a high-quality video restoration filter
sustains a 100 fps throughput over 70\%-noisy high-quality video streams on a
4-node cluster of Graphics Processing Units (GPUs), with minimal programming
effort.},
author = {Maurizio Drocco},
keywords = {fastflow, rephrase, toreador, repara, paraphrase},
month = {October},
note = {To appear},
school = {Computer Science Department, University of Torino},
title = {Parallel Programming with Global Asynchronous Memory: Models, {C++} {API}s and Implementations},
year = {2017}
}

• C. Misale, “PiCo: A Domain-Specific Language for Data Analytics Pipelines,” PhD Thesis, 2017. doi:10.5281/zenodo.579753
[BibTeX] [Abstract] [Download PDF]

In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models—for which only informal (and often confusing) semantics is generally provided—all share a common underlying model, namely, the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects about Big Data analytics tools from a high level perspective. This analysis can be considered as a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low level with high level aspects, as we are often used to see in state-of-the-art Big Data analytics frameworks. From the user-level perspective, we think that a clearer and simple semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language, that is on top of a stack of layers that build a prototypical framework for Big Data analytics. The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level.
Second, we propose a programming environment based on such layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user an unique interface for both stream and batch processing, hiding completely data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world.

@phdthesis{17:pico:misale:thesis,
abstract = {In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models---for which only informal (and often confusing) semantics is generally provided---all share a common underlying model, namely, the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects about Big Data analytics tools from a high level perspective. This analysis can be considered as a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low level with high level aspects, as we are often used to see in state-of-the-art Big Data analytics frameworks.

From the user-level perspective, we think that a clearer and simple semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language, that is on top of a stack of layers that build a prototypical framework for Big Data analytics.

The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level.

Second, we propose a programming environment based on such layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user an unique interface for both stream and batch processing, hiding completely data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world.},
author = {Claudia Misale},
date-added = {2017-06-19 15:15:52 +0000},
date-modified = {2017-06-19 15:55:21 +0000},
doi = {10.5281/zenodo.579753},
keywords = {fastflow, rephrase, toreador, repara, paraphrase},
month = may,
school = {Computer Science Department, University of Torino},
title = {PiCo: A Domain-Specific Language for Data Analytics Pipelines},
url = {https://iris.unito.it/retrieve/handle/2318/1633743/320170/Misale_thesis.pdf},
year = {2017},
bdsk-url-1 = {https://iris.unito.it/retrieve/handle/2318/1633743/320170/Misale_thesis.pdf},
bdsk-url-2 = {http://dx.doi.org/10.5281/zenodo.579753}
}

• F. Tordini, “The road towards a Cloud-based High-Performance solution for genomic data analysis,” PhD Thesis, 2016.
[BibTeX] [Abstract] [Download PDF]

Nowadays, molecular biology laboratories are delivering more and more data about DNA organisation, at increasing resolution and in a large number of samples. So much that genomic research is now facing many of the scale-out issues that high-performance computing has been addressing for years: they require powerful infrastructures with fast computing and storage capabilities, with substantial challenges in terms of data processing, statistical analysis and data representation. With this thesis we propose a high-performance pipeline for the analysis and interpretation of heterogeneous genomic information: beside performance, usability and availability are two essential requirements that novel Bioinformatics tools should satisfy. In this perspective, we propose and discuss our efforts towards a solid infrastructure for data processing and storage, where software that operates over data is exposed as a service, and is accessible by users through the Internet. We begin by presenting NuChart-II, a tool for the analysis and interpretation of spatial genomic information. With NuChart-II we propose a graph-based representation of genomic data, which can provide insights on the disposition of genomic elements in the DNA. We also discuss our approach for the normalisation of biases that affect raw sequenced data. We believe that many currently available tools for genomic data analysis are perceived as tricky and troublesome applications, that require highly specialised skills to obtain the desired outcomes. Concerning usability, we want to rise the level of abstraction perceived by the user, but maintain high performance and correctness while providing an exhaustive solution for data visualisation. We also intend to foster the availability of novel tools: in this work we also discuss a cloud solution that delivers computation and storage as dynamically allocated virtual resources via the Internet, while needed software is provided as a service. 
In this way, the computational demand of genomic research can be satisfied more economically by using lab-scale and enterprise-oriented technologies. Here we discuss our idea of a task farm for the integration of heterogeneous data resulting from different sequencing experiments: we believe that the integration of multi-omic features on a nuclear map can be a valuable mean for studying the interactions among genetic elements. This can reveal insights on biological mechanisms, such as genes regulation, translocations and epigenetic patterns.

@phdthesis{tordiniThesis16,
abstract = {Nowadays, molecular biology laboratories are delivering more and more data about DNA organisation, at increasing resolution and in a large number of samples. So much that genomic research is now facing many of the scale-out issues that high-performance computing has been addressing for years: they require powerful infrastructures with fast computing and storage capabilities, with substantial challenges in terms of data processing, statistical analysis and data representation.
With this thesis we propose a high-performance pipeline for the analysis and interpretation of heterogeneous genomic information: beside performance, usability and availability are two essential requirements that novel Bioinformatics tools should satisfy. In this perspective, we propose and discuss our efforts towards a solid infrastructure for data processing and storage, where software that operates over data is exposed as a service, and is accessible by users through the Internet.
We begin by presenting NuChart-II, a tool for the analysis and interpretation of spatial genomic information. With NuChart-II we propose a graph-based representation of genomic data, which can provide insights on the disposition of genomic elements in the DNA. We also discuss our approach for the normalisation of biases that affect raw sequenced data.
We believe that many currently available tools for genomic data analysis are perceived as tricky and troublesome applications, that require highly specialised skills to obtain the desired outcomes. Concerning usability, we want to rise the level of abstraction perceived by the user, but maintain high performance and correctness while providing an exhaustive solution for data visualisation.
We also intend to foster the availability of novel tools: in this work we also discuss a cloud solution that delivers computation and storage as dynamically allocated virtual resources via the Internet, while needed software is provided as a service. In this way, the computational demand of genomic research can be satisfied more economically by using lab-scale and enterprise-oriented technologies. Here we discuss our idea of a task farm for the integration of heterogeneous data resulting from different sequencing experiments: we believe that the integration of multi-omic features on a nuclear map can be a valuable mean for studying the interactions among genetic elements. This can reveal insights on biological mechanisms, such as genes regulation, translocations and epigenetic patterns.},
author = {Fabio Tordini},
keywords = {fastflow, bioinformatics},
month = apr,
school = {Computer Science Department, University of Torino, Italy},
title = {{The road towards a Cloud-based High-Performance solution for genomic data analysis}},
url = {http://calvados.di.unipi.it/storage/paper_files/2016_tordini_phdthesis.pdf},
year = {2016},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2016_tordini_phdthesis.pdf}
}

### Technical Reports

• M. Aldinucci, M. Danelutto, and M. Torquati, “FastFlow tutorial,” Università di Pisa, Dipartimento di Informatica, Italy, TR-12-04, 2012.
[BibTeX] [Download PDF]
@techreport{fastflow_tutorial:TR-12-04:12,
author = {Marco Aldinucci and Marco Danelutto and Massimo Torquati},
date-added = {2011-03-17 23:19:05 +0100},
date-modified = {2013-11-24 00:34:55 +0000},
institution = {Universit{\`a} di Pisa, Dipartimento di Informatica, Italy},
keywords = {fastflow},
month = mar,
number = {TR-12-04},
title = {FastFlow tutorial},
url = {http://compass2.di.unipi.it/TR/Files/TR-12-04.pdf.gz},
year = {2012},
bdsk-url-1 = {http://compass2.di.unipi.it/TR/Files/TR-12-04.pdf.gz}
}

• M. Aldinucci, M. Drocco, D. Giordano, C. Spampinato, and M. Torquati, “A Parallel Edge Preserving Algorithm for Salt and Pepper Image Denoising,” Università degli Studi di Torino, Dip. di Informatica, Italy, 138/2011, 2011.
[BibTeX] [Download PDF]
@techreport{ff:denoiser:tr138-2011,
author = {Marco Aldinucci and Maurizio Drocco and Daniela Giordano and Concetto Spampinato and Massimo Torquati},
date-added = {2010-12-08 19:31:00 +0100},
date-modified = {2013-11-24 00:36:56 +0000},
institution = {Universit{\`a} degli Studi di Torino, Dip. di Informatica, Italy},
keywords = {fastflow},
month = may,
number = {138/2011},
title = {A Parallel Edge Preserving Algorithm for Salt and Pepper Image Denoising},
url = {http://calvados.di.unipi.it/storage/paper_files/2012_2phasedenoiser_ff_ipta.pdf},
year = {2011},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/2012_2phasedenoiser_ff_ipta.pdf}
}

• M. Aldinucci, S. Ruggieri, and M. Torquati, “Porting Decision Tree Building and Pruning Algorithms to Multicore using FastFlow,” Università di Pisa, Dipartimento di Informatica, Italy, TR-11-06, 2011.
[BibTeX] [Download PDF]
@techreport{TR-11-06,
author = {Marco Aldinucci and Salvatore Ruggieri and Massimo Torquati},
date-added = {2012-04-15 18:40:07 +0000},
date-modified = {2013-11-24 00:37:04 +0000},
institution = {Universit{\`a} di Pisa, Dipartimento di Informatica, Italy},
keywords = {fastflow},
month = mar,
number = {TR-11-06},
title = {Porting Decision Tree Building and Pruning Algorithms to Multicore using FastFlow},
url = {http://compass2.di.unipi.it/TR/Files/TR-11-06.pdf.gz},
year = {2011},
bdsk-url-1 = {http://compass2.di.unipi.it/TR/Files/TR-11-06.pdf.gz}
}

• M. Torquati, “Single-Producer/Single-Consumer Queues on Shared Cache Multi-Core Systems,” Università di Pisa, Dipartimento di Informatica, Italy, TR-10-20, 2010.
[BibTeX] [Download PDF]
@techreport{ff:ubuffer:pdp:11,
author = {Massimo Torquati},
date-added = {2010-10-25 16:30:17 +0200},
date-modified = {2013-11-24 00:37:32 +0000},
institution = {Universit{\`a} di Pisa, Dipartimento di Informatica, Italy},
keywords = {fastflow},
month = dec,
number = {TR-10-20},
title = {Single-Producer/Single-Consumer Queues on Shared Cache Multi-Core Systems},
url = {http://compass2.di.unipi.it/TR/Files/TR-10-20.pdf.gz},
year = {2010},
bdsk-url-1 = {http://compass2.di.unipi.it/TR/Files/TR-10-20.pdf.gz}
}

• M. Aldinucci, M. Coppo, F. Damiani, M. Drocco, M. Torquati, and A. Troina, “On Designing Multicore-Aware Simulators for Biological Systems,” Università degli Studi di Torino, Dipartimento di Informatica, Italy, 131/2010, 2010.
[BibTeX]
@techreport{ff:cwc:pdp:11-tr,
author = {Marco Aldinucci and Mario Coppo and Ferruccio Damiani and Maurizio Drocco and Massimo Torquati and Angelo Troina},
date-added = {2011-05-19 19:07:36 +0200},
date-modified = {2013-11-24 00:38:00 +0000},
institution = {Universit{\`a} degli Studi di Torino, Dipartimento di Informatica, Italy},
keywords = {fastflow},
month = oct,
number = {131/2010},
title = {On Designing Multicore-Aware Simulators for Biological Systems},
year = {2010}
}

• M. Aldinucci, A. Bracciali, P. Liò, A. Sorathiya, and M. Torquati, “StochKit-FF: Efficient Systems Biology on Multicore Architectures,” Università di Pisa, Dipartimento di Informatica, Italy, TR-10-12, 2010. doi:10.1007/978-3-642-21878-1_21
[BibTeX] [Abstract] [Download PDF]

The stochastic modelling of biological systems is an informative, and in some cases, very adequate technique, which may however result in being more expensive than other modelling approaches, such as differential equations. We present StochKit-FF, a parallel version of StochKit, a reference toolkit for stochastic simulations. StochKit-FF is based on the FastFlow programming toolkit for multicores and exploits the novel concept of selective memory. We experiment StochKit-FF on a model of HIV infection dynamics, with the aim of extracting information from efficiently run experiments, here in terms of average and variance and, on a longer term, of more structured data.

@techreport{stochkit-ff:tr-10-12,
abstract = {The stochastic modelling of biological systems is an informative, and in some cases, very adequate technique, which may however result in being more expensive than other modelling approaches, such as differential equations. We present StochKit-FF, a parallel version of StochKit, a reference toolkit for stochastic simulations. StochKit-FF is based on the FastFlow programming toolkit for multicores and exploits the novel concept of selective memory. We experiment StochKit-FF on a model of HIV infection dynamics, with the aim of extracting information from efficiently run experiments, here in terms of average and variance and, on a longer term, of more structured data.},
author = {Marco Aldinucci and Andrea Bracciali and Pietro Li{\`o} and Anil Sorathiya and Massimo Torquati},
date-added = {2010-06-27 16:39:46 +0200},
date-modified = {2013-11-24 00:38:32 +0000},
doi = {10.1007/978-3-642-21878-1_21},
institution = {Universit{\`a} di Pisa, Dipartimento di Informatica, Italy},
keywords = {fastflow},
month = jul,
number = {TR-10-12},
title = {{StochKit-FF}: Efficient Systems Biology on Multicore Architectures},
url = {http://calvados.di.unipi.it/storage/paper_files/TR-10-12.pdf},
year = {2010},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/TR-10-12.pdf},
bdsk-url-2 = {http://dx.doi.org/10.1007/978-3-642-21878-1_21}
}

• M. Aldinucci, S. Ruggieri, and M. Torquati, “Porting Decision Tree Algorithms to Multicore using FastFlow,” Università di Pisa, Dipartimento di Informatica, Italy, TR-10-11, 2010.
[BibTeX] [Abstract] [Download PDF]

The whole computer hardware industry embraced multicores. For these machines, the extreme optimisation of sequential algorithms is no longer sufficient to squeeze the real machine power, which can be only exploited via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable to be parallelised. This paper presents an approach for easy-yet-efficient porting of an implementation of the C4.5 algorithm on multicores. The parallel porting requires minimal changes to the original sequential code, and it is able to exploit up to 7X speedup on an Intel dual-quad core machine.

@techreport{fastflow_c45:tr-10-11,
abstract = {The whole computer hardware industry embraced multicores. For these machines, the extreme optimisation of sequential algorithms is no longer sufficient to squeeze the real machine power, which can be only exploited via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable to be parallelised. This paper presents an approach for easy-yet-efficient porting of an implementation of the C4.5 algorithm on multicores. The parallel porting requires minimal changes to the original sequential code, and it is able to exploit up to 7X speedup on an Intel dual-quad core machine.},
author = {Marco Aldinucci and Salvatore Ruggieri and Massimo Torquati},
date-added = {2010-07-11 16:54:09 +0200},
date-modified = {2013-11-24 00:38:41 +0000},
institution = {Universit{\`a} di Pisa, Dipartimento di Informatica, Italy},
keywords = {fastflow},
month = may,
number = {TR-10-11},
title = {Porting Decision Tree Algorithms to Multicore using {FastFlow}},
url = {http://calvados.di.unipi.it/storage/paper_files/TR-10-11.pdf},
year = {2010},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/TR-10-11.pdf}
}

• M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Meneghin, and M. Torquati, “Accelerating sequential programs using FastFlow and self-offloading,” Università di Pisa, Dipartimento di Informatica, Italy, TR-10-03, 2010.
[BibTeX] [Abstract]

Shared memory multiprocessors come back to popularity thanks to rapid spreading of commodity multi-core architectures. As ever, shared memory programs are fairly easy to write and quite hard to optimise; providing multi-core programmers with optimising tools and programming frameworks is a nowadays challenge. Few efforts have been done to support effective streaming applications on these architectures. In this paper we introduce FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than all of them in a set of micro-benchmarks and on a real world application; the speedup edge of FastFlow over other solutions might be bold for fine grain tasks, as an example +35% on OpenMP, +226% on Cilk, +96% on TBB for the alignment of protein P01111 against UniProt DB using Smith-Waterman algorithm.

@techreport{fastflow_acc:tr-10-03,
abstract = {Shared memory multiprocessors come back to popularity thanks to rapid spreading of commodity multi-core architectures. As ever, shared memory programs are fairly easy to write and quite hard to optimise; providing multi-core programmers with optimising tools and programming frameworks is a nowadays challenge. Few efforts have been done to support effective streaming applications on these architectures. In this paper we introduce FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than all of them in a set of micro-benchmarks and on a real world application; the speedup edge of FastFlow over other solutions might be bold for fine grain tasks, as an example +35% on OpenMP, +226% on Cilk, +96% on TBB for the alignment of protein
P01111 against UniProt DB using Smith-Waterman algorithm.},
author = {Marco Aldinucci and Marco Danelutto and Peter Kilpatrick and Massimiliano Meneghin and Massimo Torquati},
date-added = {2009-09-08 16:14:34 +0200},
date-modified = {2013-11-24 00:39:01 +0000},
institution = {Universit{\`a} di Pisa, Dipartimento di Informatica, Italy},
keywords = {fastflow},
month = feb,
number = {TR-10-03},
title = {Accelerating sequential programs using {FastFlow} and self-offloading},
year = {2010}
}

• M. Aldinucci, M. Torquati, and M. Meneghin, “FastFlow: Efficient Parallel Streaming Applications on Multi-core,” Università di Pisa, Dipartimento di Informatica, Italy, TR-09-12, 2009.
[BibTeX] [Abstract] [Download PDF]

Shared memory multiprocessors come back to popularity thanks to rapid spreading of commodity multi-core architectures. As ever, shared memory programs are fairly easy to write and quite hard to optimise; providing multi-core programmers with optimising tools and programming frameworks is a nowadays challenge. Few efforts have been done to support effective streaming applications on these architectures. In this paper we introduce FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than all of them in a set of micro-benchmarks and on a real world application; the speedup edge of FastFlow over other solutions might be bold for fine grain tasks, as an example +35% on OpenMP, +226% on Cilk, +96% on TBB for the alignment of protein P01111 against UniProt DB using Smith-Waterman algorithm.

@techreport{fastflow:tr-09-12,
abstract = {Shared memory multiprocessors come back to popularity thanks to rapid spreading of commodity multi-core architectures. As ever, shared memory programs are fairly easy to write and quite hard to optimise; providing multi-core programmers with optimising tools and programming frameworks is a nowadays challenge. Few efforts have been done to support effective streaming applications on these architectures. In this paper we introduce FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than all of them in a set of micro-benchmarks and on a real world application; the speedup edge of FastFlow over other solutions might be bold for fine grain tasks, as an example +35% on OpenMP, +226% on Cilk, +96% on TBB for the alignment of protein
P01111 against UniProt DB using Smith-Waterman algorithm.},
author = {Marco Aldinucci and Massimo Torquati and Massimiliano Meneghin},
date-added = {2010-02-13 16:20:18 +0100},
date-modified = {2013-11-24 00:39:38 +0000},
institution = {Universit{\`a} di Pisa, Dipartimento di Informatica, Italy},
keywords = {fastflow},
month = sep,
number = {TR-09-12},
title = {{FastFlow}: Efficient Parallel Streaming Applications on Multi-core},
url = {http://arxiv.org/abs/0909.1187},
year = {2009},
bdsk-url-1 = {http://calvados.di.unipi.it/storage/paper_files/TR-09-12.pdf},
bdsk-url-2 = {http://arxiv.org/abs/0909.1187}
}

End of FastFlow papers