FastFlow: combining pattern-level abstraction and efficiency in GPGPUs
Speaker: Marco Aldinucci
Slides: talks slides
Date: March 24-27 at the San Jose McEnery Convention Center in San Jose, California.
Abstract: The shift toward GPGPU technology has many drivers, such as memory and power efficiency, which are likely to sustain this trend for several years to come. In the long term, writing efficient, portable, and correct parallel programs must be no more onerous than writing sequential programs. To date, GPGPU programming has not moved much beyond low-level languages such as NVidia CUDA and OpenCL, which mainly provide kernel offloading. OpenACC addresses loop parallelism, but still lacks full support for other paradigms, such as streaming and nesting.
The FastFlow parallel programming environment, designed to support efficient streaming on cache-coherent multi-core platforms, is realized as a C++ header-only template library providing parallel patterns such as pipeline, farm, master-worker, map, reduce, MapReduce, and divide-and-conquer, as well as their composition. FastFlow has recently been extended to support distributed platforms, GPGPUs (via CUDA and OpenCL), and other accelerators (Tilera, MIC). FastFlow can be compiled on various platforms, such as Linux/Win/MacOS/x86, Linux/Arm, Linux/PPC, etc.
FastFlow’s parallel patterns can be arbitrarily nested and relieve the programmer of the burden of data sharing, synchronization and communications.
Thanks to its layered design, new patterns can be easily defined as parametric graphs on top of the next layer down, i.e. the streaming network layer. The bottom layer implements nodes as threads or processes, connected by low-latency channels (~20 clock cycles for a shared-memory core-to-core message). Patterns using GPGPUs can benefit from overlapping computation and communication (via CUDA streams) and from self-adaptive dispatch policies that exploit multiple devices.
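To make the idea of a core-to-core channel concrete, here is a minimal sketch of a single-producer/single-consumer ring buffer in standard C++. It is an illustration only: the class name `SpscChannel` and its methods are hypothetical, it is not FastFlow's code or API, and no claim is made that it reaches the latency quoted above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-producer/single-consumer channel (not FastFlow's API).
// Exactly one thread may call try_push and exactly one may call try_pop.
template <typename T>
class SpscChannel {
public:
    explicit SpscChannel(std::size_t capacity)
        : buf_(capacity + 1), head_(0), tail_(0) {  // one slot stays empty
        assert(capacity > 0);
    }

    // Producer side. Returns false when the ring is full.
    bool try_push(const T& item) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % buf_.size();
        if (next == head_.load(std::memory_order_acquire)) return false;
        buf_[tail] = item;
        // Publish the slot only after the item has been written.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false when the ring is empty.
    bool try_pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;
        out = buf_[head];
        // Release the slot only after the item has been read.
        head_.store((head + 1) % buf_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T> buf_;
    std::atomic<std::size_t> head_;  // next slot to read
    std::atomic<std::size_t> tail_;  // next slot to write
};
```

Because each index is written by one thread only, the sketch needs no locks and no read-modify-write instructions, which is the property that makes such channels cheap. A pipeline stage would typically spin on `try_pop` for its input and on `try_push` for its output.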
As a testbed, an efficient variational image restoration template is used. Variational methods are rarely used due to their high computational cost. Given a traditional noise reduction filter, the template generates an efficient parallel variational filter running on both multi-core CPUs and GPUs. The computation proceeds in two phases:
- Accurate detection of the location of noise using a parallel implementation of the original filter.
- Edge-preserving restoration of outlier pixels by iterated application of a variational filter.
Phase 1 is coded by plugging the code of a standard sequential filter into a FastFlow map.
Phase 2 is implemented as a GPU kernel structured as a map-reduce pattern.
Phases 1 and 2 are pipelined to ensure high, or even real-time, throughput in video applications. The approach delivers significantly better restoration quality than traditional approaches.
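The detect-then-restore structure can be sketched sequentially in standard C++ as follows. Everything here is an assumption made for illustration: the `Image` type, the function names, the 3x3 median filter used as the detector, and the thresholds are hypothetical. In particular, the restoration step uses plain neighbour averaging as a stand-in for the variational filter, which the abstract does not specify; averaging is not edge-preserving. The sketch only shows the shape of the computation: a map over pixels for detection, then an iterated map over flagged pixels with a reduce that measures convergence.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical grayscale image, row-major, with clamped border access.
struct Image {
    int w, h;
    std::vector<float> px;
    float at(int x, int y) const {
        x = std::min(std::max(x, 0), w - 1);
        y = std::min(std::max(y, 0), h - 1);
        return px[static_cast<std::size_t>(y) * w + x];
    }
};

// Phase 1 (map): flag pixels that differ strongly from their 3x3 median.
// The median filter is an assumed example of a "traditional" filter.
std::vector<bool> detect_noise(const Image& img, float threshold) {
    std::vector<bool> noisy(img.px.size(), false);
    for (int y = 0; y < img.h; ++y) {
        for (int x = 0; x < img.w; ++x) {
            float win[9];
            int k = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    win[k++] = img.at(x + dx, y + dy);
            std::nth_element(win, win + 4, win + 9);  // win[4] is the median
            const std::size_t i = static_cast<std::size_t>(y) * img.w + x;
            noisy[i] = std::fabs(img.px[i] - win[4]) > threshold;
        }
    }
    return noisy;
}

// Phase 2 (iterated map + reduce): update flagged pixels only.
// Map: replace each flagged pixel by the mean of its 4 neighbours
// (stand-in for the variational update). Reduce: sum of absolute changes.
// Returns the number of iterations performed.
int restore(Image& img, const std::vector<bool>& noisy, float tol, int max_iters) {
    assert(noisy.size() == img.px.size());
    for (int it = 1; it <= max_iters; ++it) {
        const Image prev = img;  // read from the previous iterate (Jacobi style)
        float change = 0.0f;
        for (int y = 0; y < img.h; ++y) {
            for (int x = 0; x < img.w; ++x) {
                const std::size_t i = static_cast<std::size_t>(y) * img.w + x;
                if (!noisy[i]) continue;
                const float v = 0.25f * (prev.at(x - 1, y) + prev.at(x + 1, y) +
                                         prev.at(x, y - 1) + prev.at(x, y + 1));
                change += std::fabs(v - prev.px[i]);
                img.px[i] = v;
            }
        }
        if (change < tol) return it;  // converged
    }
    return max_iters;
}
```

Both loops over pixels are independent per pixel, which is what makes them natural candidates for a map on multi-core or GPU, while the accumulated `change` is the reduce. On a video stream, `detect_noise` for frame n+1 could run while `restore` is still working on frame n, which is the pipelining described above.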
This work was partially supported by the NVidia academic programme (CUDA research center 2013) and the EU FP7 Paraphrase project.