Adriano Marques Garcia

Postdoctoral Researcher
Computer Science Department, University of Turin
Via Pessinetto 12, 10149 Torino – Italy 
Email: adriano.marquesgarcia@unito.it
ORCID: 0000-0003-4796-773X

Short Bio

Adriano Marques Garcia is a postdoctoral researcher at the University of Turin. He received his Ph.D. with honors in Computer Science from the Pontifical Catholic University of Rio Grande do Sul, with a thesis on easing the benchmarking of parallel stream processing on multi-cores. He also holds a Master's Degree in Electrical Engineering from the Federal University of Pampa, with a thesis on a new parallel benchmark suite for evaluating the performance and energy consumption of parallel programming interfaces. Adriano has also worked as a Research Fellow in industry at SAP, researching and developing new methods for fault recovery in distributed data-flow graphs.

Open Source Software

  • Creator and maintainer of SPBench, a framework for benchmarking C++ stream processing applications. The main goal of SPBench is to enable users to easily create custom benchmarks from real-world stream processing applications and evaluate multiple parallel programming interfaces.
  • Creator and maintainer of PAMPAR, a parallel benchmark suite that provides a broad set of benchmarks (micro-benchmarks, kernels, and pseudo-applications), all parallelized using four state-of-the-art parallel programming interfaces: OpenMP, POSIX Threads, MPI-1.0, and MPI-2.0.

Achievements

  • [2023] Winner of the 2nd Best PhD Thesis Award at IEEE SBAC-PAD 2023.
  • [2020] Winner of ICCSA 2020 Best Paper Award.
  • [2014] Winner of an 18-month visiting student scholarship in Computer Science at Dublin Business School, Dublin, Ireland.

Publications

2023

  • A. M. Garcia, D. Griebler, C. Schepke, J. D. García, J. F. Muñoz, and L. G. Fernandes, “Performance and Programmability of GrPPI for Parallel Stream Processing on Multi-Cores,” Preprint (V1), pp. 1-23, 2023. doi:10.21203/rs.3.rs-3539774/v1
    [BibTeX] [Abstract] [Download PDF]

    GrPPI library aims to simplify the burdening task of parallel programming. It provides a unified, abstract, and generic layer while promising minimal overhead on performance. Although it supports stream parallelism, GrPPI lacks an evaluation regarding representative performance metrics for this domain, such as throughput and latency. This work evaluates GrPPI focused on parallel stream processing. We compare the performance, memory usage, and programmability of GrPPI against handwritten parallel code. For this, we use the benchmarking framework SPBench to build custom GrPPI benchmarks and benchmarks with handwritten parallel code using the same backends supported by GrPPI. The basis of the benchmarks is real applications, such as Lane Detection, Bzip2, Face Recognizer, and Ferret. Experiments show that while performance is competitive with handwritten code in many cases, the infeasibility of fine-tuning GrPPI is a crucial drawback. Despite this, programmability experiments estimate that GrPPI has the potential to reduce by about three times the development time of parallel applications.

    @online{GARCIA:JS:24,
    author={Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Jos\'{e} Daniel Garc\'{i}a and Javier Fern\'{a}ndez Mu\~{n}oz and Luiz Gustavo Fernandes},
    title = {{Performance and Programmability of GrPPI for Parallel Stream Processing on Multi-Cores}},
    pages={1-23},
    publisher={Springer},
    journal = {Preprint (V1)},
    month={Nov.},
    year={2023},
    version = 1,
    doi={10.21203/rs.3.rs-3539774/v1},
    url={http://dx.doi.org/10.21203/rs.3.rs-3539774/v1},
    pubstate = {preprint},
    abstract = {GrPPI library aims to simplify the burdening task of parallel programming. It provides a unified, abstract, and generic layer while promising minimal overhead on performance. Although it supports stream parallelism, GrPPI lacks an evaluation regarding representative performance metrics for this domain, such as throughput and latency. This work evaluates GrPPI focused on parallel stream processing. We compare the performance, memory usage, and programmability of GrPPI against handwritten parallel code. For this, we use the benchmarking framework SPBench to build custom GrPPI benchmarks and benchmarks with handwritten parallel code using the same backends supported by GrPPI. The basis of the benchmarks is real applications, such as Lane Detection, Bzip2, Face Recognizer, and Ferret. Experiments show that while performance is competitive with handwritten code in many cases, the infeasibility of fine-tuning GrPPI is a crucial drawback. Despite this, programmability experiments estimate that GrPPI has the potential to reduce by about three times the development time of parallel applications.}
    }

  • A. M. Garcia, “Easing the benchmarking of parallel stream processing on multi-cores,” PhD Thesis, Porto Alegre, Brazil, 2023.
    [BibTeX] [Abstract] [Download PDF]

    In today’s fast-changing data-driven world, there is increasing demand for real-time/low-latency data processing. Stream processing is a technique that envisages processing data as it becomes available, enabling near real-time data processing. Stream processing applications must resort to parallelism techniques to speed up processing and to cope with processing large volumes of data. Although there are parallel programming interfaces (PPIs) that add several abstraction layers, parallelism in stream processing is still a difficult task, usually demanding expert knowledge to achieve desired performance levels. This generates a lot of research effort toward boosting parallel stream processing performance and making parallel programming more accessible. Typically, benchmarks are used to evaluate the PPIs and new solutions in this context. However, there are a number of limitations in existing benchmarks, including not addressing some categories of stream processing applications, few or no parameterization options, difficulty extending the benchmarks to other PPIs, lack of appropriate performance metrics, poor usability, only targeting JVM-based languages, and others. This work proposes a framework called SPBench for creating custom benchmarks and evaluating parallel stream processing. Our main goal is to ease the benchmarking process in parallel stream processing, including the creation, building, execution, tuning, and evaluation of benchmarks. Therefore, this doctoral dissertation provides the following main scientific contributions: (I) A framework that simplifies the benchmarking of stream processing applications, providing an API and a command-line interface to simplify, reuse code, customize, extend, and evaluate different aspects or properties regarding parallel stream processing. (II) A parallel C++ benchmark suite for stream processing that includes real-world applications and the most state-of-the-art Parallel Programming Interfaces (PPIs) in this context. (III) A comprehensive comparative study of the most popular PPIs leveraging C++ stream parallelism. (IV) Mechanisms for dynamic data stream frequency simulation in stream processing applications, with a set of algorithms for generating the literature’s most commonly used data stream frequency patterns and an analysis of the data frequency impact on the performance of stream processing applications. (V) An analysis of the performance impact of micro-batch sizing on stream processing applications, including mechanisms for real-time and dynamic batching management, allowing users to adjust batch sizes on the fly based either on specific size targets or time intervals. We test the SPBench framework with five real-world applications of video/image processing, data compression, and fraud detection. We show the benefits of SPBench by using it in combination with PPIs to generate parallel stream processing benchmarks and conduct various analyses. Overall, the results showed that the high-level abstractions of PPIs can cause significant performance penalties when they hide fine-tuning mechanisms. In the data frequency experiments, the FastFlow PPI benefited more from varying frequency scenarios than TBB in our test cases. Finally, the experimental results showed that the potential performance advantage of using micro-batches on multi-cores tends to show up only in specific scenarios.

    @phdthesis{GARCIA:PHD:23,
    author={Adriano Marques Garcia},
    title={{Easing the benchmarking of parallel stream processing on multi-cores}},
    numpages={214},
    school={School of Technology - PUCRS},
    address={Porto Alegre, Brazil},
    month={Mar},
    year={2023},
    url={https://tede2.pucrs.br/tede2/handle/tede/10884},
    abstract={In today's fast-changing data-driven world, there is increasing demand for real-time/low-latency data processing. Stream processing is a technique that envisages processing data as it becomes available, enabling near real-time data processing. Stream processing applications must resort to parallelism techniques to speed up processing and to cope with processing large volumes of data. Although there are parallel programming interfaces (PPIs) that add several abstraction layers, parallelism in stream processing is still a difficult task, usually demanding expert knowledge to achieve desired performance levels. This generates a lot of research effort toward boosting parallel stream processing performance and making parallel programming more accessible. Typically, benchmarks are used to evaluate the PPIs and new solutions in this context. However, there are a number of limitations in existing benchmarks, including not addressing some categories of stream processing applications, few or no parameterization options, difficulty extending the benchmarks to other PPIs, lack of appropriate performance metrics, poor usability, only targeting JVM-based languages, and others. This work proposes a framework called SPBench for creating custom benchmarks and evaluating parallel stream processing. Our main goal is to ease the benchmarking process in parallel stream processing, including the creation, building, execution, tuning, and evaluation of benchmarks. Therefore, this doctoral dissertation provides the following main scientific contributions: (I) A framework that simplifies the benchmarking of stream processing applications, providing an API and a command-line interface to simplify, reuse code, customize, extend, and evaluate different aspects or properties regarding parallel stream processing. (II) A parallel C++ benchmark suite for stream processing that includes real-world applications and the most state-of-the-art Parallel Programming Interfaces (PPIs) in this context. (III) A comprehensive comparative study of the most popular PPIs leveraging C++ stream parallelism. (IV) Mechanisms for dynamic data stream frequency simulation in stream processing applications, with a set of algorithms for generating the literature's most commonly used data stream frequency patterns and an analysis of the data frequency impact on the performance of stream processing applications. (V) An analysis of the performance impact of micro-batch sizing on stream processing applications, including mechanisms for real-time and dynamic batching management, allowing users to adjust batch sizes on the fly based either on specific size targets or time intervals. We test the SPBench framework with five real-world applications of video/image processing, data compression, and fraud detection. We show the benefits of SPBench by using it in combination with PPIs to generate parallel stream processing benchmarks and conduct various analyses. Overall, the results showed that the high-level abstractions of PPIs can cause significant performance penalties when they hide fine-tuning mechanisms. In the data frequency experiments, the FastFlow PPI benefited more from varying frequency scenarios than TBB in our test cases. Finally, the experimental results showed that the potential performance advantage of using micro-batches on multi-cores tends to show up only in specific scenarios.},
    }

  • A. M. Garcia, D. Griebler, C. Schepke, A. S. Santos, J. D. García, J. F. Muñoz, and L. G. Fernandes, “A Latency, Throughput, and Programmability Perspective of GrPPI for Streaming on Multi-cores,” in 31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Naples, Italy, 2023, pp. 164-168. doi:10.1109/PDP59025.2023.00033
    [BibTeX] [Abstract] [Download PDF]

    Several solutions aim to simplify the burdening task of parallel programming. The GrPPI library is one of them. It allows users to implement parallel code for multiple backends through a unified, abstract, and generic layer while promising minimal overhead on performance. An outspread evaluation of GrPPI regarding stream parallelism with representative metrics for this domain, such as throughput and latency, was not yet done. In this work, we evaluate GrPPI focused on stream processing. We evaluate performance, memory usage, and programming effort and compare them against handwritten parallel code. For this, we use the benchmarking framework SPBench to build custom GrPPI benchmarks. The basis of the benchmarks is real applications, such as Lane Detection, Bzip2, Face Recognizer, and Ferret. Experiments show that while performance is competitive with handwritten code in some cases, in other cases, the infeasibility of fine-tuning GrPPI is a crucial drawback. Despite this, programmability experiments estimate that GrPPI has the potential to reduce by about three times the development time of parallel applications.

    @inproceedings{GARCIA:PDP:23,
    author={Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Andr\'{e} Sacilotto Santos and Jos\'{e} Daniel Garc\'{i}a and Javier Fern\'{a}ndez Mu\~{n}oz and Luiz Gustavo Fernandes},
    title={{A Latency, Throughput, and Programmability Perspective of GrPPI for Streaming on Multi-cores}},
    booktitle={31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
    series={PDP'23},
    pages={164-168},
    publisher={IEEE},
    address={Naples, Italy},
    month={March},
    year={2023},
    doi={10.1109/PDP59025.2023.00033},
    url={https://doi.org/10.1109/PDP59025.2023.00033},
    abstract={Several solutions aim to simplify the burdening task of parallel programming. The GrPPI library is one of them. It allows users to implement parallel code for multiple backends through a unified, abstract, and generic layer while promising minimal overhead on performance. An outspread evaluation of GrPPI regarding stream parallelism with representative metrics for this domain, such as throughput and latency, was not yet done. In this work, we evaluate GrPPI focused on stream processing. We evaluate performance, memory usage, and programming effort and compare them against handwritten parallel code. For this, we use the benchmarking framework SPBench to build custom GrPPI benchmarks. The basis of the benchmarks is real applications, such as Lane Detection, Bzip2, Face Recognizer, and Ferret. Experiments show that while performance is competitive with handwritten code in some cases, in other cases, the infeasibility of fine-tuning GrPPI is a crucial drawback. Despite this, programmability experiments estimate that GrPPI has the potential to reduce by about three times the development time of parallel applications.},
    }

  • A. M. Garcia, D. Griebler, C. Schepke, and L. G. Fernandes, “Micro-batch and data frequency for stream processing on multi-cores,” The Journal of Supercomputing, vol. In press, iss. In press, pp. 1-39, 2023. doi:10.1007/s11227-022-05024-y
    [BibTeX] [Abstract] [Download PDF]

    Latency and throughput are often critical performance metrics in stream processing. Applications’ performance can fluctuate depending on the input stream. This unpredictability is due to the variety in data arrival frequency and size, complexity, and other factors. Researchers are constantly investigating new ways to mitigate the impact of these variations on performance with self-adaptive techniques involving elasticity or micro-batching. However, there is a lack of benchmarks capable of creating test scenarios to further evaluate these techniques. This work extends and improves the SPBench benchmarking framework to support dynamic micro-batching and data stream frequency management. We also propose a set of algorithms that generate the most commonly used frequency patterns for benchmarking stream processing in related work. This allows the creation of a wide variety of test scenarios. To validate our solution, we use SPBench to create custom benchmarks and evaluate the impact of micro-batching and data stream frequency on the performance of Intel TBB and FastFlow, two libraries that leverage stream parallelism for multi-core architectures. Our results demonstrated that our test cases did not benefit from micro-batches on multi-cores. For different data stream frequency configurations, TBB ensured the lowest latency, while FastFlow assured higher throughput in shorter pipelines.

    @article{GARCIA:JS:23,
    title = {Micro-batch and data frequency for stream processing on multi-cores},
    author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes},
    url = {https://doi.org/10.1007/s11227-022-05024-y},
    doi = {10.1007/s11227-022-05024-y},
    year = {2023},
    date = {2023-01-01},
    journal = {The Journal of Supercomputing},
    volume = {In press},
    number = {In press},
    pages = {1-39},
    publisher = {Springer},
    abstract = {Latency and throughput are often critical performance metrics in stream processing. Applications’ performance can fluctuate depending on the input stream. This unpredictability is due to the variety in data arrival frequency and size, complexity, and other factors. Researchers are constantly investigating new ways to mitigate the impact of these variations on performance with self-adaptive techniques involving elasticity or micro-batching. However, there is a lack of benchmarks capable of creating test scenarios to further evaluate these techniques. This work extends and improves the SPBench benchmarking framework to support dynamic micro-batching and data stream frequency management. We also propose a set of algorithms that generate the most commonly used frequency patterns for benchmarking stream processing in related work. This allows the creation of a wide variety of test scenarios. To validate our solution, we use SPBench to create custom benchmarks and evaluate the impact of micro-batching and data stream frequency on the performance of Intel TBB and FastFlow, two libraries that leverage stream parallelism for multi-core architectures. Our results demonstrated that our test cases did not benefit from micro-batches on multi-cores. For different data stream frequency configurations, TBB ensured the lowest latency, while FastFlow assured higher throughput in shorter pipelines.},
    keywords = {},
    pubstate = {published}
    }

2022

  • A. M. Garcia, D. Griebler, C. Schepke, and L. G. Fernandes, “SPBench: a framework for creating benchmarks of stream processing applications,” Computing, vol. In press, iss. In press, pp. 1-23, 2022. doi:10.1007/s00607-021-01025-6
    [BibTeX] [Abstract] [Download PDF]

    In a fast-changing data-driven world, real-time data processing systems are becoming ubiquitous in everyday applications. The increasing data we produce, such as audio, video, image, and text, demands quick and efficient computation. Stream Parallelism allows accelerating this computation for real-time processing, but it is still a challenging task, mostly reserved for experts. In this paper, we present SPBench, a framework for benchmarking stream processing applications. It aims to support users with a set of real-world stream processing applications, which are made accessible through an Application Programming Interface (API) and executable via a Command Line Interface (CLI) to create custom benchmarks. We tested SPBench by implementing parallel benchmarks with Intel Threading Building Blocks (TBB), FastFlow, and SPar. This evaluation provided useful insights and revealed the feasibility of the proposed framework in terms of usage, customization, and performance analysis. SPBench proved to be a high-level, reusable, extensible, and easy-to-use abstraction for building parallel stream processing benchmarks on multi-core architectures.

    @article{GARCIA:Computing:22,
    title = {SPBench: a framework for creating benchmarks of stream processing applications},
    author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes},
    url = {https://doi.org/10.1007/s00607-021-01025-6},
    doi = {10.1007/s00607-021-01025-6},
    year = {2022},
    date = {2022-01-01},
    journal = {Computing},
    volume = {In press},
    number = {In press},
    pages = {1-23},
    publisher = {Springer},
    abstract = {In a fast-changing data-driven world, real-time data processing systems are becoming ubiquitous in everyday applications. The increasing data we produce, such as audio, video, image, and text, demands quick and efficient computation. Stream Parallelism allows accelerating this computation for real-time processing, but it is still a challenging task, mostly reserved for experts. In this paper, we present SPBench, a framework for benchmarking stream processing applications. It aims to support users with a set of real-world stream processing applications, which are made accessible through an Application Programming Interface (API) and executable via a Command Line Interface (CLI) to create custom benchmarks. We tested SPBench by implementing parallel benchmarks with Intel Threading Building Blocks (TBB), FastFlow, and SPar. This evaluation provided useful insights and revealed the feasibility of the proposed framework in terms of usage, customization, and performance analysis. SPBench proved to be a high-level, reusable, extensible, and easy-to-use abstraction for building parallel stream processing benchmarks on multi-core architectures.},
    keywords = {},
    pubstate = {published}
    }

  • A. M. Garcia, D. Griebler, C. Schepke, and L. G. Fernandes, “Evaluating Micro-batch and Data Frequency for Stream Processing Applications on Multi-cores,” in 30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Valladolid, Spain, 2022, pp. 10-17. doi:10.1109/PDP55904.2022.00011
    [BibTeX] [Abstract] [Download PDF]

    In stream processing, data arrives constantly and is often unpredictable. It can show large fluctuations in arrival frequency, size, complexity, and other factors. These fluctuations can strongly impact application latency and throughput, which are critical factors in this domain. Therefore, there is a significant amount of research on self-adaptive techniques involving elasticity or micro-batching as a way to mitigate this impact. However, there is a lack of benchmarks and tools for helping researchers to investigate micro-batching and data stream frequency implications. In this paper, we extend a benchmarking framework to support dynamic micro-batching and data stream frequency management. We used it to create custom benchmarks and compare latency and throughput aspects from two different parallel libraries. We validate our solution through an extensive analysis of the impact of micro-batching and data stream frequency on stream processing applications using Intel TBB and FastFlow, which are two libraries that leverage stream parallelism on multi-core architectures. Our results demonstrated up to 33\% throughput gain over latency using micro-batches. Additionally, while TBB ensures lower latency, FastFlow ensures higher throughput in the parallel applications for different data stream frequency configurations.

    @inproceedings{GARCIA:PDP:22,
    title = {Evaluating Micro-batch and Data Frequency for Stream Processing Applications on Multi-cores},
    author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes},
    url = {https://doi.org/10.1109/PDP55904.2022.00011},
    doi = {10.1109/PDP55904.2022.00011},
    year = {2022},
    date = {2022-04-01},
    booktitle = {30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
    pages = {10-17},
    publisher = {IEEE},
    address = {Valladolid, Spain},
    series = {PDP'22},
    abstract = {In stream processing, data arrives constantly and is often unpredictable. It can show large fluctuations in arrival frequency, size, complexity, and other factors. These fluctuations can strongly impact application latency and throughput, which are critical factors in this domain. Therefore, there is a significant amount of research on self-adaptive techniques involving elasticity or micro-batching as a way to mitigate this impact. However, there is a lack of benchmarks and tools for helping researchers to investigate micro-batching and data stream frequency implications. In this paper, we extend a benchmarking framework to support dynamic micro-batching and data stream frequency management. We used it to create custom benchmarks and compare latency and throughput aspects from two different parallel libraries. We validate our solution through an extensive analysis of the impact of micro-batching and data stream frequency on stream processing applications using Intel TBB and FastFlow, which are two libraries that leverage stream parallelism on multi-core architectures. Our results demonstrated up to 33\% throughput gain over latency using micro-batches. Additionally, while TBB ensures lower latency, FastFlow ensures higher throughput in the parallel applications for different data stream frequency configurations.},
    keywords = {},
    pubstate = {published}
    }

2021

  • A. M. Garcia, D. Griebler, C. Schepke, and L. G. Fernandes, “Introducing a Stream Processing Framework for Assessing Parallel Programming Interfaces,” in 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Valladolid, Spain, 2021, pp. 84-88. doi:10.1109/PDP52278.2021.00021
    [BibTeX] [Abstract] [Download PDF]

    Stream Processing applications are spread across different sectors of industry and people’s daily lives. The increasing data we produce, such as audio, video, image, and text, demands quick and efficient computation. This can be achieved through Stream Parallelism, which is still a challenging task, mostly reserved for experts. We introduce a Stream Processing framework for assessing Parallel Programming Interfaces (PPIs). Our framework targets multi-core architectures and C++ stream processing applications, providing an API that abstracts the details of the stream operators of these applications. Therefore, users can easily identify all the basic operators and implement parallelism through different PPIs. In this paper, we present the proposed framework, implement three applications using its API, and show how it works by using it to parallelize and evaluate the applications with the PPIs Intel TBB, FastFlow, and SPar. The performance results were consistent with the literature.

    @inproceedings{GARCIA:PDP:21,
    title = {Introducing a Stream Processing Framework for Assessing Parallel Programming Interfaces},
    author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes},
    url = {https://doi.org/10.1109/PDP52278.2021.00021},
    doi = {10.1109/PDP52278.2021.00021},
    year = {2021},
    date = {2021-03-01},
    booktitle = {29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
    pages = {84-88},
    publisher = {IEEE},
    address = {Valladolid, Spain},
    series = {PDP'21},
    abstract = {Stream Processing applications are spread across different sectors of industry and people's daily lives. The increasing data we produce, such as audio, video, image, and text, demands quick and efficient computation. This can be achieved through Stream Parallelism, which is still a challenging task, mostly reserved for experts. We introduce a Stream Processing framework for assessing Parallel Programming Interfaces (PPIs). Our framework targets multi-core architectures and C++ stream processing applications, providing an API that abstracts the details of the stream operators of these applications. Therefore, users can easily identify all the basic operators and implement parallelism through different PPIs. In this paper, we present the proposed framework, implement three applications using its API, and show how it works by using it to parallelize and evaluate the applications with the PPIs Intel TBB, FastFlow, and SPar. The performance results were consistent with the literature.},
    keywords = {},
    pubstate = {published}
    }

2020

  • A. M. Garcia, M. Serpa, D. Griebler, C. Schepke, L. G. Fernandes, and P. O. A. Navaux, “The Impact of CPU Frequency Scaling on Power Consumption of Computing Infrastructures,” in International Conference on Computational Science and its Applications (ICCSA), Cagliari, Italy, 2020, pp. 142-157. doi:10.1007/978-3-030-58817-5_12
    [BibTeX] [Abstract] [Download PDF]

    As the demand for computing power increases, new architectures emerge to obtain better performance. Reducing the power and energy consumption of these architectures is one of the main challenges to achieving high-performance computing. Current research trends aim at developing new software and hardware techniques to achieve the best performance and energy trade-offs. In this work, we investigate the impact of different CPU frequency scaling techniques, such as ondemand, performance, and powersave, on the power and energy consumption of a multi-core-based computing infrastructure. We apply these techniques in PAMPAR, a parallel benchmark suite implemented in PThreads, OpenMP, MPI-1, and MPI-2 (spawn). We measure the energy and execution time of 10 benchmarks, varying the number of threads. Our results show that although powersave consumes up to 43.1\% less power than the performance and ondemand governors, it consumes three times more energy due to the longer execution time. Our experiments also show that the performance governor consumes up to 9.8\% more energy than ondemand for CPU-bound benchmarks. Finally, our results show that PThreads has the lowest power consumption, consuming less than the sequential version for memory-bound benchmarks. Regarding performance, the performance governor achieved 3\% better performance than ondemand.

    @inproceedings{GARCIA:ICCSA:20,
    title = {The Impact of CPU Frequency Scaling on Power Consumption of Computing Infrastructures},
    author = {Adriano Marques Garcia and Matheus Serpa and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes and Philippe O. A. Navaux},
    url = {https://doi.org/10.1007/978-3-030-58817-5_12},
    doi = {10.1007/978-3-030-58817-5_12},
    year = {2020},
    date = {2020-07-01},
    booktitle = {International Conference on Computational Science and its Applications (ICCSA)},
    volume = {12254},
    pages = {142-157},
    publisher = {Springer},
    address = {Cagliari, Italy},
    series = {ICCSA'20},
    abstract = {As the demand for computing power increases, new architectures emerge to obtain better performance. Reducing the power and energy consumption of these architectures is one of the main challenges to achieving high-performance computing. Current research trends aim at developing new software and hardware techniques to achieve the best performance and energy trade-offs. In this work, we investigate the impact of different CPU frequency scaling techniques, such as ondemand, performance, and powersave, on the power and energy consumption of a multi-core-based computing infrastructure. We apply these techniques in PAMPAR, a parallel benchmark suite implemented in PThreads, OpenMP, MPI-1, and MPI-2 (spawn). We measure the energy and execution time of 10 benchmarks, varying the number of threads. Our results show that although powersave consumes up to 43.1\% less power than the performance and ondemand governors, it consumes three times more energy due to the longer execution time. Our experiments also show that the performance governor consumes up to 9.8\% more energy than ondemand for CPU-bound benchmarks. Finally, our results show that PThreads has the lowest power consumption, consuming less than the sequential version for memory-bound benchmarks. Regarding performance, the performance governor achieved 3\% better performance than ondemand.},
    keywords = {},
    pubstate = {published}
    }

  • A. M. Garcia, C. Schepke, and A. Girardi, “PAMPAR: A new parallel benchmark for performance and energy consumption evaluation,” Concurrency and Computation: Practice and Experience, vol. 32, iss. 20, p. e5504, 2020. doi:10.1002/cpe.5504
    [BibTeX] [Abstract] [Download PDF]

    This paper presents PAMPAR, a new benchmark to evaluate the performance and energy consumption of different Parallel Programming Interfaces (PPIs). The benchmark is composed of 11 algorithms implemented in PThreads, OpenMP, MPI-1, and MPI-2 (spawn) PPIs. Previous studies have used some of these pseudo-applications to perform this type of evaluation in different architectures since there is no benchmark that offers this variety of PPIs and communication models. In this work, we measure the energy and performance of each pseudo-application in a single architecture, varying the number of threads/processes. We also organize the pseudo-applications according to their memory accesses, floating-point operations, and branches. The goal is to show that this set of pseudo-applications has enough features to build a parallel benchmark. The results show that there is no single best case that provides both better performance and low energy consumption in the presented scenarios. Moreover, the pseudo-applications' usage of the system resources is different enough to represent different scenarios and be efficient as a benchmark.

    @article{GARCIA:CCPE:20,
    author = {Adriano Marques Garcia and Claudio Schepke and Alessandro Girardi},
    title = {PAMPAR: A new parallel benchmark for performance and energy consumption evaluation},
    journal = {Concurrency and Computation: Practice and Experience},
    volume = {32},
    number = {20},
    pages = {e5504},
    keywords = {MPI, OpenMP, parallel benchmark, performance, power consumption, PThreads},
    doi = {10.1002/cpe.5504},
    url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.5504},
    eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.5504},
    abstract = {This paper presents PAMPAR, a new benchmark to evaluate the performance and energy consumption of different Parallel Programming Interfaces (PPIs). The benchmark is composed of 11 algorithms implemented in PThreads, OpenMP, MPI-1, and MPI-2 (spawn) PPIs. Previous studies have used some of these pseudo-applications to perform this type of evaluation in different architectures since there is no benchmark that offers this variety of PPIs and communication models. In this work, we measure the energy and performance of each pseudo-application in a single architecture, varying the number of threads/processes. We also organize the pseudo-applications according to their memory accesses, floating-point operations, and branches. The goal is to show that this set of pseudo-applications has enough features to build a parallel benchmark. The results show that there is no single best case that provides both better performance and low energy consumption in the presented scenarios. Moreover, the pseudo-applications' usage of the system resources is different enough to represent different scenarios and be efficient as a benchmark.},
    year = {2020}
    }

2019

  • A. M. Garcia, “Towards a Benchmark for Performance and Power Consumption Evaluation of Parallel Programming Interfaces,” Master's Thesis, Federal University of Pampa, Alegrete, Brazil, 2019.
    [BibTeX] [Abstract] [Download PDF]

    This work presents a set of pseudo-applications and proposes them to be used as a benchmark to evaluate the performance and power consumption of different Parallel Programming Interfaces (PPIs). The set consists of 11 algorithms implemented in PThreads, OpenMP, MPI-1, and MPI-2 (spawn) PPIs. These PPIs were chosen because they are compatible with most of the current multi-core architectures. Previous studies have used some of these pseudo-applications to perform this type of evaluation in different architectures since there is no benchmark that offers this variety of PPIs and communication models. Recent related works that compare PPIs have looked for different alternatives to solve the problem since the available parallel benchmarks do not meet this demand. The goal of this work is to propose the use of these pseudo-applications as a benchmark to evaluate the performance and power consumption of different PPIs. To achieve this goal, we analyze the behavior of pseudo-applications and PPIs with respect to cache access, branches, and floating-point operations. The results of these experiments showed that there is a good balance among pseudo-applications that make more or less intensive use of these parameters. In addition, we conducted a case study to evaluate the performance, energy consumption, and power consumption (power dissipation) of these pseudo-applications. The results show that the pseudo-applications generally have a good performance. Although the total energy consumption is, in some cases, 300 times greater among different MPI pseudo-applications, this difference does not appear in the power consumption. The PPIs and the pseudo-applications use the hardware resources in a very dynamic way, and our results show that they are able to represent different scenarios. Therefore, they can be used as a parallel benchmark. Keywords: benchmark, performance, energy consumption.

    @mastersthesis{GARCIA:MR:19,
    author={Adriano Marques Garcia},
    title={{Towards a Benchmark for Performance and Power Consumption Evaluation of Parallel Programming Interfaces}},
    numpages={80},
    school={Federal University of Pampa},
    address={Alegrete, Brazil},
    month={Mar},
    year={2019},
    url={https://repositorio.unipampa.edu.br/jspui/handle/riu/4136},
    abstract={This work presents a set of pseudo-applications and proposes them to be used as a benchmark to evaluate the performance and power consumption of different Parallel Programming Interfaces (PPIs). The set consists of 11 algorithms implemented in PThreads, OpenMP, MPI-1, and MPI-2 (spawn) PPIs. These PPIs were chosen because they are compatible with most of the current multi-core architectures. Previous studies have used some of these pseudo-applications to perform this type of evaluation in different architectures since there is no benchmark that offers this variety of PPIs and communication models. Recent related works that compare PPIs have looked for different alternatives to solve the problem since the available parallel benchmarks do not meet this demand. The goal of this work is to propose the use of these pseudo-applications as a benchmark to evaluate the performance and power consumption of different PPIs. To achieve this goal, we analyze the behavior of pseudo-applications and PPIs with respect to cache access, branches, and floating-point operations. The results of these experiments showed that there is a good balance among pseudo-applications that make more or less intensive use of these parameters. In addition, we conducted a case study to evaluate the performance, energy consumption, and power consumption (power dissipation) of these pseudo-applications. The results show that the pseudo-applications generally have a good performance. Although the total energy consumption is, in some cases, 300 times greater among different MPI pseudo-applications, this difference does not appear in the power consumption. The PPIs and the pseudo-applications use the hardware resources in a very dynamic way, and our results show that they are able to represent different scenarios. Therefore, they can be used as a parallel benchmark. Keywords: benchmark, performance, energy consumption.}
    }

  • A. M. Garcia, C. Schepke, A. G. Girardi, and S. A. da Silva, “A New Parallel Benchmark for Performance Evaluation and Energy Consumption,” in High Performance Computing for Computational Science – VECPAR 2018, Cham, 2019, pp. 188–201.
    [BibTeX] [Abstract]

    This paper presents a new benchmark to evaluate performance and energy consumption of different Parallel Programming Interfaces (PPIs). The benchmark is composed of 11 algorithms implemented in PThreads, OpenMP, MPI-1 and MPI-2 (spawn) PPIs. Previous studies have used some of these applications to perform this type of evaluation in different architectures, since there is no benchmark that offers this variety of PPIs and communication models. In this work we measure the energy and performance of each application in a single architecture, varying the number of threads/processes. The goal is to show that this set of applications has enough features to form a parallel benchmark. The results show that there is no single best case that provides both better performance and low energy consumption in the presented scenarios. However, PThreads and OpenMP achieve the best trade-offs between performance and energy in most cases.

    @InProceedings{GARCIA:VECPAR:19,
    author="Garcia, Adriano Marques
    and Schepke, Claudio
    and Girardi, Alessandro Gon{\c{c}}alves
    and da Silva, Sherlon Almeida",
    editor="Senger, Hermes
    and Marques, Osni
    and Garcia, Rogerio
    and Pinheiro de Brito, Tatiana
    and Iope, Rog{\'e}rio
    and Stanzani, Silvio
    and Gil-Costa, Veronica",
    title="A New Parallel Benchmark for Performance Evaluation and Energy Consumption",
    booktitle="High Performance Computing for Computational Science -- VECPAR 2018",
    year="2019",
    publisher="Springer International Publishing",
    address="Cham",
    pages="188--201",
    abstract="This paper presents a new benchmark to evaluate performance and energy consumption of different Parallel Programming Interfaces (PPIs). The benchmark is composed of 11 algorithms implemented in PThreads, OpenMP, MPI-1 and MPI-2 (spawn) PPIs. Previous studies have used some of these applications to perform this type of evaluation in different architectures, since there is no benchmark that offers this variety of PPIs and communication models. In this work we measure the energy and performance of each application in a single architecture, varying the number of threads/processes. The goal is to show that this set of applications has enough features to form a parallel benchmark. The results show that there is no single best case that provides both better performance and low energy consumption in the presented scenarios. However, PThreads and OpenMP achieve the best trade-offs between performance and energy in most cases.",
    isbn="978-3-030-15996-2"
    }

2018

  • A. M. Garcia, C. Schepke, A. G. Girardi, and S. A. da Silva, “Power Consumption of Parallel Programming Interfaces in Multicore Architectures: A Case Study,” in 2018 Symposium on High Performance Computing Systems (WSCAD), 2018, pp. 77-83.
    [BibTeX] [Download PDF]
    @INPROCEEDINGS{GARCIA:WSCAD:18,
    author={Adriano Marques Garcia and Claudio Schepke and Alessandro Gonçalves Girardi and Sherlon Almeida da Silva},
    booktitle={2018 Symposium on High Performance Computing Systems (WSCAD)},
    title={Power Consumption of Parallel Programming Interfaces in Multicore Architectures: A Case Study},
    year={2018},
    pages={77-83},
    url={https://doi.org/10.1109/WSCAD.2018.00021},
    doi={10.1109/WSCAD.2018.00021}}