Selected for two talks at the NVidia 2014 GPU Technology Conference (GTC 2014)

FastFlow: combining pattern-level abstraction and efficiency in GPGPUs

Speaker: Marco Aldinucci
Slides: talk slides

Date: March 24-27, 2014, at the San Jose McEnery Convention Center in San Jose, California.

Abstract: The shift toward GPGPU technology has many drivers, such as memory and power efficiency, which are likely to sustain this trend for several years to come. In the long term, writing efficient, portable, and correct parallel programs must be no more onerous than writing sequential programs. To date, GPGPU programming has not embraced much more than low-level languages such as NVidia CUDA and OpenCL, which mainly provide kernel offloading. OpenACC addresses loop parallelism but still lacks full support for other paradigms, such as streaming and nesting.

The FastFlow parallel programming environment, designed to support efficient streaming on cache-coherent multi-core platforms, is realized as a C++ header-only template library providing parallel patterns such as pipeline, farm, master-worker, map, reduce, MapReduce, divide&conquer, and their compositions. FastFlow has recently been extended to support distributed platforms, GPGPUs (via CUDA and OpenCL), and other accelerators (Tilera, MIC). FastFlow can be compiled on a variety of platforms, such as Linux/Win/MacOS on x86, Linux/ARM, Linux/PPC, etc.

FastFlow’s parallel patterns can be arbitrarily nested and relieve the programmer of the burden of data sharing, synchronization, and communication.
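To make the idea concrete, here is a minimal sketch of the "map" pattern in plain C++ threads. This is illustrative only, not FastFlow's actual API: the programmer supplies a sequential per-element function, while the pattern handles partitioning, worker management, and the final join.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Illustrative sketch (not the FastFlow API): a "map" pattern applies a
// sequential per-element function over a container in parallel. The
// pattern, not the programmer, partitions the data and joins the workers.
void parallel_map(std::vector<int>& data,
                  const std::function<int(int)>& f,
                  unsigned nworkers = 4) {
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + nworkers - 1) / nworkers;
    for (unsigned w = 0; w < nworkers; ++w) {
        workers.emplace_back([&, w] {
            std::size_t lo = w * chunk;
            std::size_t hi = std::min(lo + chunk, data.size());
            for (std::size_t i = lo; i < hi; ++i) data[i] = f(data[i]);
        });
    }
    for (auto& t : workers) t.join();  // implicit barrier: map is collective
}
```

In FastFlow the same division of labour applies, but the workers, channels, and scheduling are provided by the library, so nesting a map inside a pipeline stage requires no extra synchronization code from the user.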

Thanks to its layered design, new patterns can be easily defined as parametric graphs on top of the next layer down, i.e. the streaming network layer. The bottom layer implements nodes as threads or processes and low-latency channels (~20 clock cycles for a shared-memory core-to-core message). Patterns using GPGPUs can benefit from the overlapping of computation and communication (via CUDA streams) and from self-adaptive dispatch policies to exploit multiple devices.
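The cycle figure quoted above comes from lock-free messaging between cores. As a hedged sketch of the concept (this is not FastFlow's actual queue implementation), a bounded single-producer/single-consumer ring buffer needs only two atomic indices and no locks on the fast path:

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Illustrative sketch, not FastFlow's actual channel: a bounded
// single-producer/single-consumer ring buffer. With exactly one writer
// and one reader, two atomic indices suffice; neither push nor pop ever
// takes a lock, which is how a core-to-core message can cost tens of cycles.
template <typename T, std::size_t N>
class SpscChannel {
    std::array<T, N> buf_;
    std::atomic<std::size_t> head_{0};  // advanced by the consumer
    std::atomic<std::size_t> tail_{0};  // advanced by the producer
public:
    bool push(const T& v) {             // producer side only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {                  // consumer side only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;     // empty
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```

The acquire/release pairing ensures the consumer never reads a slot before the producer's write to it is visible, and vice versa for slot reuse.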

As a testbed, an efficient variational image restoration template is used. Variational methods are rarely used in practice due to their high computational cost. Given a traditional noise-reduction filter, the template generates an efficient parallel variational filter running on both multi-cores and GPUs. The computation proceeds in two steps:

  1. Accurate detection of the location of noise using a parallel implementation of the original filter.
  2. Edge-preserving restoration of outlier pixels by iterated application of a variational filter.

Phase 1 is coded by plugging the code of a standard sequential filter into a FastFlow map.
Phase 2 is implemented as a GPU kernel (as a map-reduce pattern).
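The map-reduce shape of phase 2 can be sketched as follows. The names and the simple four-neighbour smoothing rule here are illustrative assumptions, not the actual variational functional used in the work: each iteration is a map (update every pixel flagged as noisy from its neighbours) followed by a reduce (a global residual used as the convergence test), which is exactly the structure a map-reduce pattern offloads to the GPU.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hedged sketch of the phase-2 iteration (illustrative smoothing rule,
// not the paper's variational filter). `noisy` marks the outlier pixels
// detected in phase 1; only those are restored, preserving good pixels.
double restore_step(std::vector<double>& img, const std::vector<int>& noisy,
                    int w, int h) {
    std::vector<double> next = img;
    // map: local update of each flagged pixel from its 4 neighbours
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            int i = y * w + x;
            if (!noisy[i]) continue;
            next[i] = 0.25 * (img[i - 1] + img[i + 1] + img[i - w] + img[i + w]);
        }
    // reduce: sum of per-pixel changes, used as the convergence measure
    double residual = 0.0;
    for (std::size_t i = 0; i < img.size(); ++i)
        residual += std::fabs(next[i] - img[i]);
    img.swap(next);
    return residual;
}
```

The caller iterates `restore_step` until the returned residual falls below a threshold; on the GPU, the map becomes a per-pixel kernel and the reduce a parallel reduction.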

Phases 1 and 2 are pipelined to ensure high (up to real-time) throughput in video applications. The approach obtains a significant quality edge over traditional filters.

This work was partially supported by the NVidia academic programme (CUDA Research Center 2013) and the EU FP7 ParaPhrase project.

FastFlow’s technical details have been published in a number of scientific papers since 2009; see FastFlow papers and Application demo.


About Marco Aldinucci

Marco Aldinucci has been an assistant professor at the Computer Science Department of the University of Torino since 2008. Previously, he was a researcher at the University of Pisa and the Italian National Research Agency. He is the author of over a hundred papers in international journals and conference proceedings (Google Scholar h-index 21). He has participated in over 20 national and international research projects concerning parallel and autonomic computing. He is the recipient of the HPC Advisory Council University Award 2011 and the NVidia Research Award 2013. He has been leading the “Low-Level Virtualization and Platform-Specific Deployment” work package within the EU-STREP FP7 ParaPhrase (Parallel Patterns for Adaptive Heterogeneous Multicore Systems) project and the GPGPU work package within the IMPACT project (Innovative Methods for Particle Colliders at the Terascale), and he is the contact person for the University of Torino in the European Network of Excellence on High Performance and Embedded Architecture and Compilation. In the last year (March 2012 – March 2013) he delivered five invited talks at international workshops. Together with Massimo Torquati, he co-designed the FastFlow programming framework and several other programming frameworks and libraries for parallel computing. His research is focused on parallel and distributed computing.