If you prefer to download a single file with all LARCC references, you can find it at this link. You can also follow new publications via RSS.
Additionally, the publications are also available on the LARCC profile on Google Scholar.
2024
@article{Supercomputing,
  title     = {Enhancing self-adaptation for efficient decision-making at run-time in streaming applications on multicores},
  author    = {Adriano Vogel and Marco Danelutto and Massimo Torquati and Dalvan Griebler and Luiz Gustavo Fernandes},
  url       = {https://link.springer.com/article/10.1007/s11227-024-06191-w},
  doi       = {10.1007/s11227-024-06191-w},
  year      = {2024},
  date      = {2024-06-21},
  journal   = {The Journal of Supercomputing},
  abstract  = {Parallel computing is very important to accelerate the performance of computing applications. Moreover, parallel applications are expected to continue executing in more dynamic environments and react to changing conditions. In this context, applying self-adaptation is a potential solution to achieve a higher level of autonomic abstractions and runtime responsiveness. In our research, we aim to explore and assess the possible abstractions attainable through the transparent management of parallel executions by self-adaptation. Our primary objectives are to expand the adaptation space to better reflect real-world applications and assess the potential for self-adaptation to enhance efficiency. We provide the following scientific contributions: (I) A conceptual framework to improve the designing of self-adaptation; (II) A new decision-making strategy for applications with multiple parallel stages; (III) A comprehensive evaluation of the proposed decision-making strategy compared to the state-of-the-art. The results demonstrate that the proposed conceptual framework can help design and implement self-adaptive strategies that are more modular and reusable. The proposed decision-making strategy provides significant gains in accuracy compared to the state-of-the-art, increasing the parallel applications’ performance and efficiency.},
  keywords  = {multicore, Parallel computing, Stream processing},
  pubstate  = {published},
  tppubtype = {article}
}
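The decision-making idea described in this abstract — observing a running parallel application and adjusting its configuration at run-time — can be illustrated with a minimal, self-contained sketch. This is our own illustration of a MAPE-style control step, not the paper's strategy; all names and thresholds are assumptions.

```python
# Illustrative sketch (not the paper's code): one decision step of a
# self-adaptive control loop that moves the replica count of a parallel
# stage toward a throughput target. Names and the 10% tolerance are ours.

def decide_replicas(current, measured_tput, target_tput, max_replicas, tol=0.1):
    """Return a new replica count based on the measured throughput."""
    if measured_tput < target_tput * (1 - tol):   # underperforming: scale up
        return min(current + 1, max_replicas)
    if measured_tput > target_tput * (1 + tol):   # overprovisioned: scale down
        return max(current - 1, 1)
    return current                                # inside tolerance: keep as-is
```

In a real runtime this step would run periodically inside a monitor/decide/act loop, with the measured throughput coming from instrumentation of the parallel stages.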
@article{MENCAGLI:JPDC:24,
  title     = {General-purpose data stream processing on heterogeneous architectures with WindFlow},
  author    = {Gabriele Mencagli and Massimo Torquati and Dalvan Griebler and Alessandra Fais and Marco Danelutto},
  url       = {https://www.sciencedirect.com/science/article/pii/S0743731523001521},
  doi       = {10.1016/j.jpdc.2023.104782},
  year      = {2024},
  date      = {2024-02-01},
  journal   = {Journal of Parallel and Distributed Computing},
  volume    = {184},
  pages     = {104782},
  publisher = {Elsevier},
  abstract  = {Many emerging applications analyze data streams by running graphs of communicating tasks called operators. To develop and deploy such applications, Stream Processing Systems (SPSs) like Apache Storm and Flink have been made available to researchers and practitioners. They exhibit imperative or declarative programming interfaces to develop operators running arbitrary algorithms working on structured or unstructured data streams. In this context, the interest in leveraging hardware acceleration with GPUs has become more pronounced in high-throughput use cases. Unfortunately, GPU acceleration has been studied for relational operators working on structured streams only, while non-relational operators have often been overlooked. This paper presents WindFlow, a library supporting the seamless GPU offloading of general partitioned-stateful operators, extending the range of operators that benefit from hardware acceleration. Its design provides high throughput still exposing a high-level API to users compared with the raw utilization of GPUs in Apache Flink.},
  keywords  = {GPGPU, Stream processing},
  pubstate  = {published},
  tppubtype = {article}
}
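The key abstraction in this entry, a partitioned-stateful operator, keeps one piece of state per key so that key partitions can be processed independently (and hence offloaded in parallel). A minimal illustration of the idea, in plain Python rather than WindFlow's C++ API (class and method names are ours):

```python
# Illustrative sketch: a partitioned-stateful operator keeps independent
# per-key state, routed by hashing the key onto a partition. Here the
# per-key state is just a running count.
from collections import defaultdict

class KeyedCounter:
    def __init__(self, partitions=4):
        # one isolated state table per partition
        self.partitions = [defaultdict(int) for _ in range(partitions)]

    def process(self, key, value=1):
        part = self.partitions[hash(key) % len(self.partitions)]
        part[key] += value
        return part[key]
```

Because partitions never share state, each one could be assigned to a different worker (or a GPU batch) without synchronization on the hot path — the property WindFlow exploits for offloading.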
2023
@inproceedings{LEONARCZYK:Euro-ParW:23,
  title     = {Evaluation of Adaptive Micro-batching Techniques for GPU-accelerated Stream Processing},
  author    = {Ricardo Leonarczyk and Dalvan Griebler and Gabriele Mencagli and Marco Danelutto},
  url       = {https://doi.org/},
  year      = {2023},
  date      = {2023-08-01},
  booktitle = {Euro-ParW 2023: Parallel Processing Workshops},
  pages     = {1-8},
  publisher = {Springer},
  address   = {Limassol},
  series    = {Euro-ParW'23},
  keywords  = {GPGPU, Stream processing},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
@inproceedings{LEONARCZYK:ERAD:23,
  title     = {Avaliação da Auto-Adaptação de Micro-Lote para aplicação de Processamento de Streaming em GPUs},
  author    = {Ricardo Leonarczyk and Dalvan Griebler},
  url       = {https://doi.org/10.5753/eradrs.2023.229267},
  doi       = {10.5753/eradrs.2023.229267},
  year      = {2023},
  date      = {2023-05-01},
  booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
  pages     = {123-124},
  publisher = {Sociedade Brasileira de Computação},
  address   = {Porto Alegre, Brazil},
  abstract  = {Este artigo apresenta uma avaliação de algoritmos para regular a latência através da auto-adaptação de micro-lote em sistemas de processamento de streaming acelerados por GPU. Os resultados demonstraram que o algoritmo com o fator de adaptação fixo conseguiu ficar por mais tempo na região de latência especificada para a aplicação.},
  keywords  = {GPGPU, Self-adaptation, Stream processing},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
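This abstract's "fixed adaptation factor" approach — keeping latency inside a specified target region by resizing the micro-batch — can be sketched in a few lines. This is our own illustration of the general technique, not the paper's algorithm; names, bounds, and the factor value are assumptions.

```python
# Illustrative sketch of latency regulation via micro-batch self-adaptation
# with a fixed adaptation factor: shrink the batch when latency is above the
# target region, grow it when below, and keep it when inside the region.

def adapt_batch(batch, latency_ms, low_ms, high_ms, factor=2, max_batch=4096):
    if latency_ms > high_ms:                  # too slow: smaller batches
        return max(batch // factor, 1)
    if latency_ms < low_ms:                   # headroom: larger batches
        return min(batch * factor, max_batch)
    return batch                              # inside the target region
```

With a fixed factor the batch size converges geometrically toward the region but can oscillate around it, which is precisely the behavior such evaluations measure.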
@inproceedings{larcc:DOPKE:ERAD:23,
  title     = {Estudo Sobre Spark nas Aplicações de Processamento de Log e Análise de Cliques},
  author    = {Luan Dopke and Dalvan Griebler},
  url       = {https://doi.org/10.5753/eradrs.2023.229298},
  doi       = {10.5753/eradrs.2023.229298},
  year      = {2023},
  date      = {2023-05-01},
  booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
  pages     = {85-88},
  publisher = {Sociedade Brasileira de Computação},
  address   = {Porto Alegre, Brazil},
  abstract  = {O uso de aplicações de processamento de dados de fluxo contínuo vem crescendo cada vez mais, dado este fato o presente estudo visa mensurar o desempenho do framework Apache Spark Structured Streaming perante o framework Apache Storm nas aplicações de fluxo contínuo de dados, estas sendo processamento de logs e análise de cliques. Os resultados demonstram melhor desempenho para o Apache Storm em ambas as aplicações.},
  keywords  = {Benchmark, Stream processing},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
@inproceedings{larcc:FIM:ERAD:23,
  title     = {Implementação e Avaliação do Paralelismo de Flink nas Aplicações de Processamento de Log e Análise de Cliques},
  author    = {Gabriel Rustick Fim and Dalvan Griebler},
  url       = {https://doi.org/10.5753/eradrs.2023.229290},
  doi       = {10.5753/eradrs.2023.229290},
  year      = {2023},
  date      = {2023-05-01},
  booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
  pages     = {69-72},
  publisher = {Sociedade Brasileira de Computação},
  address   = {Porto Alegre, Brazil},
  abstract  = {Este trabalho visou implementar e avaliar o desempenho das aplicações de Processamento de Log e Análise de Cliques no Apache Flink, comparando o desempenho com Apache Storm em um ambiente computacional distribuído. Os resultados mostram que a execução em Flink apresenta um consumo de recursos relativamente menor quando comparada a execução em Storm, mas possui um desvio padrão alto expondo um desbalanceamento de carga em execuções onde algum componente da aplicação é replicado.},
  keywords  = {Parallel programming, Stream processing},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
@article{ANDRADE:CSI:2023,
  title     = {A parallel programming assessment for stream processing applications on multi-core systems},
  author    = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Luiz Gustavo Fernandes},
  url       = {https://doi.org/10.1016/j.csi.2022.103691},
  doi       = {10.1016/j.csi.2022.103691},
  year      = {2023},
  date      = {2023-03-01},
  journal   = {Computer Standards & Interfaces},
  volume    = {84},
  pages     = {103691},
  publisher = {Elsevier},
  abstract  = {Multi-core systems are any computing device nowadays and stream processing applications are becoming recurrent workloads, demanding parallelism to achieve the desired quality of service. As soon as data, tasks, or requests arrive, they must be computed, analyzed, or processed. Since building such applications is not a trivial task, the software industry must adopt parallel APIs (Application Programming Interfaces) that simplify the exploitation of parallelism in hardware for accelerating time-to-market. In the last years, research efforts in academia and industry provided a set of parallel APIs, increasing productivity to software developers. However, a few studies are seeking to prove the usability of these interfaces. In this work, we aim to present a parallel programming assessment regarding the usability of parallel API for expressing parallelism on the stream processing application domain and multi-core systems. To this end, we conducted an empirical study with beginners in parallel application development. The study covered three parallel APIs, reporting several quantitative and qualitative indicators involving developers. Our contribution also comprises a parallel programming assessment methodology, which can be replicated in future assessments. This study revealed important insights such as recurrent compile-time and programming logic errors performed by beginners in parallel programming, as well as the programming effort, challenges, and learning curve. Moreover, we collected the participants’ opinions about their experience in this study to understand deeply the results achieved.},
  keywords  = {Parallel programming, Stream processing},
  pubstate  = {published},
  tppubtype = {article}
}
@inproceedings{GARCIA:PDP:23,
  title     = {A Latency, Throughput, and Programmability Perspective of GrPPI for Streaming on Multi-cores},
  author    = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and José Daniel García and Javier Fernández Muñoz and Luiz Gustavo Fernandes},
  url       = {https://doi.org/10.1109/PDP59025.2023.00033},
  doi       = {10.1109/PDP59025.2023.00033},
  year      = {2023},
  date      = {2023-03-01},
  booktitle = {31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
  pages     = {164-168},
  publisher = {IEEE},
  address   = {Naples, Italy},
  series    = {PDP'23},
  abstract  = {Several solutions aim to simplify the burdening task of parallel programming. The GrPPI library is one of them. It allows users to implement parallel code for multiple backends through a unified, abstract, and generic layer while promising minimal overhead on performance. An outspread evaluation of GrPPI regarding stream parallelism with representative metrics for this domain, such as throughput and latency, was not yet done. In this work, we evaluate GrPPI focused on stream processing. We evaluate performance, memory usage, and programming effort and compare them against handwritten parallel code. For this, we use the benchmarking framework SPBench to build custom GrPPI benchmarks. The basis of the benchmarks is real applications, such as Lane Detection, Bzip2, Face Recognizer, and Ferret. Experiments show that while performance is competitive with handwritten code in some cases, in other cases, the infeasibility of fine-tuning GrPPI is a crucial drawback. Despite this, programmability experiments estimate that GrPPI has the potential to reduce by about three times the development time of parallel applications.},
  keywords  = {Benchmark, Stream processing},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
@article{GARCIA:Computing:23,
  title     = {SPBench: a framework for creating benchmarks of stream processing applications},
  author    = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes},
  url       = {https://doi.org/10.1007/s00607-021-01025-6},
  doi       = {10.1007/s00607-021-01025-6},
  year      = {2023},
  date      = {2023-01-01},
  journal   = {Computing},
  volume    = {105},
  number    = {5},
  pages     = {1077-1099},
  publisher = {Springer},
  abstract  = {In a fast-changing data-driven world, real-time data processing systems are becoming ubiquitous in everyday applications. The increasing data we produce, such as audio, video, image, and, text are demanding quickly and efficiently computation. Stream Parallelism allows accelerating this computation for real-time processing. But it is still a challenging task and most reserved for experts. In this paper, we present SPBench, a framework for benchmarking stream processing applications. It aims to support users with a set of real-world stream processing applications, which are made accessible through an Application Programming Interface (API) and executable via Command Line Interface (CLI) to create custom benchmarks. We tested SPBench by implementing parallel benchmarks with Intel Threading Building Blocks (TBB), FastFlow, and SPar. This evaluation provided useful insights and revealed the feasibility of the proposed framework in terms of usage, customization, and performance analysis. SPBench demonstrated to be a high-level, reusable, extensible, and easy of use abstraction to build parallel stream processing benchmarks on multi-core architectures.},
  keywords  = {Benchmark, Stream processing},
  pubstate  = {published},
  tppubtype = {article}
}
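At its core, a stream-processing benchmark harness of the kind this entry describes wraps a user-supplied operator and records the domain's two key metrics, per-item latency and overall throughput. The sketch below is our own minimal illustration of that measurement pattern; the function name and result keys are assumptions, not SPBench's actual API.

```python
# Illustrative sketch of what a stream-benchmark harness measures:
# per-item latency and end-to-end throughput around a user operator.
import time

def run_benchmark(operator, items):
    latencies = []
    start = time.perf_counter()
    for item in items:
        t0 = time.perf_counter()
        operator(item)                       # the code under test
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "items": len(items),
        "throughput": len(items) / elapsed,          # items per second
        "avg_latency": sum(latencies) / len(latencies),  # seconds per item
    }
```

A real framework additionally controls the input source, batching, and parallel backend, so that only the operator implementation varies between compared libraries.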
@article{GARCIA:JS:23,
  title     = {Micro-batch and data frequency for stream processing on multi-cores},
  author    = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes},
  url       = {https://doi.org/10.1007/s11227-022-05024-y},
  doi       = {10.1007/s11227-022-05024-y},
  year      = {2023},
  date      = {2023-01-01},
  journal   = {The Journal of Supercomputing},
  volume    = {79},
  number    = {8},
  pages     = {9206-9244},
  publisher = {Springer},
  abstract  = {Latency or throughput is often critical performance metrics in stream processing. Applications’ performance can fluctuate depending on the input stream. This unpredictability is due to the variety in data arrival frequency and size, complexity, and other factors. Researchers are constantly investigating new ways to mitigate the impact of these variations on performance with self-adaptive techniques involving elasticity or micro-batching. However, there is a lack of benchmarks capable of creating test scenarios to further evaluate these techniques. This work extends and improves the SPBench benchmarking framework to support dynamic micro-batching and data stream frequency management. We also propose a set of algorithms that generates the most commonly used frequency patterns for benchmarking stream processing in related work. It allows the creation of a wide variety of test scenarios. To validate our solution, we use SPBench to create custom benchmarks and evaluate the impact of micro-batching and data stream frequency on the performance of Intel TBB and FastFlow. These are two libraries that leverage stream parallelism for multi-core architectures. Our results demonstrated that our test cases did not benefit from micro-batches on multi-cores. For different data stream frequency configurations, TBB ensured the lowest latency, while FastFlow assured higher throughput in shorter pipelines.},
  keywords  = {Benchmark, Self-adaptation, Stream processing},
  pubstate  = {published},
  tppubtype = {article}
}
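One of the frequency patterns commonly used in such benchmark scenarios is a sinusoidal ("wave") arrival rate oscillating between a minimum and a maximum items-per-second frequency. A minimal sketch of generating such a pattern, with our own function and parameter names (the paper's actual algorithms may differ):

```python
# Illustrative sketch: generate a sinusoidal input-frequency pattern for
# benchmark scenarios, oscillating between min_freq and max_freq over one
# full period of `steps` samples.
import math

def wave_pattern(steps, min_freq, max_freq):
    """Target items-per-second for each step, following one sine period."""
    mid = (max_freq + min_freq) / 2
    amp = (max_freq - min_freq) / 2
    return [mid + amp * math.sin(2 * math.pi * i / steps) for i in range(steps)]
```

A benchmark driver would then delay item emission so the instantaneous input rate follows the generated curve, stressing the self-adaptive mechanisms under test.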
2022
@article{LOFF:COLA:22,
  title     = {Combining stream with data parallelism abstractions for multi-cores},
  author    = {Júnior Löff and Renato Barreto Hoffmann and Dalvan Griebler and Luiz Gustavo Fernandes},
  url       = {https://doi.org/10.1016/j.cola.2022.101160},
  doi       = {10.1016/j.cola.2022.101160},
  year      = {2022},
  date      = {2022-12-01},
  journal   = {Journal of Computer Languages},
  volume    = {73},
  pages     = {101160},
  publisher = {Elsevier},
  abstract  = {Stream processing applications have seen an increasing demand with the raised availability of sensors, IoT devices, and user data. Modern systems can generate millions of data items per day that require to be processed timely. To deal with this demand, application programmers must consider parallelism to exploit the maximum performance of the underlying hardware resources. In this work, we introduce improvements to stream processing applications by exploiting fine-grained data parallelism (via Map and MapReduce) inside coarse-grained stream parallelism stages. The improvements are including techniques for identifying data parallelism in sequential codes, a new language, semantic analysis, and a set of definition and transformation rules to perform source-to-source parallel code generation. Moreover, we investigate the feasibility of employing higher-level programming abstractions to support the proposed optimizations. For that, we elect SPar programming model as a use case, and extend it by adding two new attributes to its language and implementing our optimizations as a new algorithm in the SPar compiler. We conduct a set of experiments in representative stream processing and data-parallel applications. The results showed that our new compiler algorithm is efficient and that performance improved by up to 108.4x in data-parallel applications. Furthermore, experiments evaluating stream processing applications towards the composition of stream and data parallelism revealed new insights. The results showed that such composition may improve latencies by up to an order of magnitude. Also, it enables programmers to exploit different degrees of stream and data parallelism to accomplish a balance between throughput and latency according to their necessity.},
  keywords  = {Parallel programming, Stream processing},
  pubstate  = {published},
  tppubtype = {article}
}
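The composition this entry studies — fine-grained data parallelism (a Map) nested inside a coarse-grained stream stage — can be illustrated independently of SPar's compiler. The sketch below is our own plain-Python analogue of the idea, not SPar's generated code; names are assumptions.

```python
# Illustrative sketch of composing stream with data parallelism: a pipeline
# stage consumes micro-batches one at a time (the stream dimension) and
# applies a data-parallel map over the items of each batch.
from concurrent.futures import ThreadPoolExecutor

def data_parallel_stage(batches, func, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batches:                  # stream dimension: batch order kept
            yield list(pool.map(func, batch))  # data dimension: items in parallel
```

Tuning the split between stream replicas and per-batch workers is exactly the throughput/latency trade-off the abstract describes.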
Rockenbach, Dinei A; Löff, Júnior; Araujo, Gabriell; Griebler, Dalvan; Fernandes, Luiz G High-Level Stream and Data Parallelism in C++ for GPUs Inproceedings doi XXVI Brazilian Symposium on Programming Languages (SBLP), pp. 41-49, ACM, Uberlândia, Brazil, 2022. Abstract | Links | BibTeX | Tags: GPGPU, Parallel programming, Stream processing @inproceedings{ROCKENBACH:SBLP:22, title = {High-Level Stream and Data Parallelism in C++ for GPUs}, author = {Dinei A Rockenbach and Júnior Löff and Gabriell Araujo and Dalvan Griebler and Luiz G Fernandes}, url = {https://doi.org/10.1145/3561320.3561327}, doi = {10.1145/3561320.3561327}, year = {2022}, date = {2022-10-01}, booktitle = {XXVI Brazilian Symposium on Programming Languages (SBLP)}, pages = {41-49}, publisher = {ACM}, address = {Uberlândia, Brazil}, series = {SBLP'22}, abstract = {GPUs are massively parallel processors that allow solving problems that are not viable to traditional processors like CPUs. However, implementing applications for GPUs is challenging to programmers as it requires parallel programming to efficiently exploit the GPU resources. In this sense, parallel programming abstractions, notably domain-specific languages, are fundamental for improving programmability. SPar is a high-level Domain-Specific Language (DSL) that allows expressing stream and data parallelism in the serial code through annotations using C++ attributes. This work elaborates on a methodology and tool for GPU code generation by introducing new attributes to SPar language and transformation rules to SPar compiler. 
These new contributions, besides the gains in simplicity and code reduction compared to CUDA and OpenCL, enabled SPar to achieve higher throughput when exploring combined CPU and GPU parallelism, and when using batching.}, keywords = {GPGPU, Parallel programming, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Kessler, Christoph; Ernstsson, August; Fernandes, Luiz Gustavo Analyzing Programming Effort Model Accuracy of High-Level Parallel Programs for Stream Processing Inproceedings doi 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2022), pp. 229-232, IEEE, Gran Canaria, Spain, 2022. Abstract | Links | BibTeX | Tags: Parallel programming, Stream processing @inproceedings{ANDRADE:SEAA:22, title = {Analyzing Programming Effort Model Accuracy of High-Level Parallel Programs for Stream Processing}, author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Christoph Kessler and August Ernstsson and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/SEAA56994.2022.00043}, doi = {10.1109/SEAA56994.2022.00043}, year = {2022}, date = {2022-09-01}, booktitle = {48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2022)}, pages = {229-232}, publisher = {IEEE}, address = {Gran Canaria, Spain}, series = {SEAA'22}, abstract = {Over the years, several Parallel Programming Models (PPMs) have supported the abstraction of programming complexity for parallel computer systems. However, few studies aim to evaluate the productivity reached by such abstractions since this is a complex task that involves human beings. There are several studies to develop predictive methods to estimate the effort required to program applications in software engineering. In order to evaluate the reliability of such metrics, it is necessary to assess the accuracy in different programming domains. In this work, we used the data of an experiment conducted with beginners in parallel programming to determine the effort required for implementing stream parallelism using FastFlow, SPar, and TBB. Our results show that some traditional software effort estimation models, such as COCOMO II, fall short, while Putnam's model could be an alternative for high-level PPMs evaluation. 
To overcome the limitations of existing models, we plan to create a parallelism-aware model to evaluate applications in this domain in future work.}, keywords = {Parallel programming, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo Evaluating Micro-batch and Data Frequency for Stream Processing Applications on Multi-cores Inproceedings doi 30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 10-17, IEEE, Valladolid, Spain, 2022. Abstract | Links | BibTeX | Tags: Benchmark, Stream processing @inproceedings{GARCIA:PDP:22, title = {Evaluating Micro-batch and Data Frequency for Stream Processing Applications on Multi-cores}, author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/PDP55904.2022.00011}, doi = {10.1109/PDP55904.2022.00011}, year = {2022}, date = {2022-04-01}, booktitle = {30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {10-17}, publisher = {IEEE}, address = {Valladolid, Spain}, series = {PDP'22}, abstract = {In stream processing, data arrives constantly and is often unpredictable. It can show large fluctuations in arrival frequency, size, complexity, and other factors. These fluctuations can strongly impact application latency and throughput, which are critical factors in this domain. Therefore, there is a significant amount of research on self-adaptive techniques involving elasticity or micro-batching as a way to mitigate this impact. However, there is a lack of benchmarks and tools for helping researchers to investigate micro-batching and data stream frequency implications. In this paper, we extend a benchmarking framework to support dynamic micro-batching and data stream frequency management. We used it to create custom benchmarks and compare latency and throughput aspects from two different parallel libraries. 
We validate our solution through an extensive analysis of the impact of micro-batching and data stream frequency on stream processing applications using Intel TBB and FastFlow, which are two libraries that leverage stream parallelism on multi-core architectures. Our results demonstrated up to 33% throughput gain over latency using micro-batches. Additionally, while TBB ensures lower latency, FastFlow ensures higher throughput in the parallel applications for different data stream frequency configurations.}, keywords = {Benchmark, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
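The micro-batching mechanism this entry evaluates can be illustrated with a minimal sketch (Python, illustrative only; the paper's framework targets C++ with Intel TBB and FastFlow): items are grouped into fixed-size batches, trading a bounded amount of latency for fewer per-item scheduling decisions.

```python
from typing import Iterable, Iterator, List

def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group a (possibly unbounded) stream into fixed-size micro-batches.

    The final batch may be smaller if the stream ends mid-batch."""
    batch: List[int] = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch          # one downstream hand-off per batch, not per item
            batch = []
    if batch:                    # flush the partial tail batch
        yield batch

# Example: 7 items with batch_size=3 yield batches of sizes 3, 3 and 1.
batches = list(micro_batches(range(7), 3))
```

Dynamic micro-batching, as studied in the paper, would additionally adjust `batch_size` at run-time in response to the observed data stream frequency.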
Mencagli, Gabriele; Griebler, Dalvan; Danelutto, Marco Towards Parallel Data Stream Processing on System-on-Chip CPU+GPU Devices Inproceedings doi 30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 34-38, IEEE, Valladolid, Spain, 2022. Abstract | Links | BibTeX | Tags: GPGPU, IoT, Stream processing @inproceedings{MENCAGLI:PDP:22, title = {Towards Parallel Data Stream Processing on System-on-Chip CPU+GPU Devices}, author = {Gabriele Mencagli and Dalvan Griebler and Marco Danelutto}, url = {https://doi.org/10.1109/PDP55904.2022.00014}, doi = {10.1109/PDP55904.2022.00014}, year = {2022}, date = {2022-04-01}, booktitle = {30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {34-38}, publisher = {IEEE}, address = {Valladolid, Spain}, series = {PDP'22}, abstract = {Data Stream Processing is a pervasive computing paradigm with a wide spectrum of applications. Traditional streaming systems exploit the processing capabilities provided by homogeneous Clusters and Clouds. Due to the transition to streaming systems suitable for IoT/Edge environments, there has been an urgent need for new streaming frameworks and tools tailored for embedded platforms, often available as System-on-Chips composed of a small multicore CPU and an integrated on-chip GPU. Exploiting this hybrid hardware requires special care in the runtime system design. In this paper, we discuss the support provided by the WindFlow library, showing its design principles and its effectiveness on the NVIDIA Jetson Nano board.}, keywords = {GPGPU, IoT, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
Fim, Gabriel; Welter, Greice; Löff, Júnior; Griebler, Dalvan Compressão de Dados em Clusters HPC com Flink, MPI e SPar Inproceedings doi Anais da XXII Escola Regional de Alto Desempenho da Região Sul, pp. 29-32, Sociedade Brasileira de Computação, Curitiba, Brazil, 2022. Abstract | Links | BibTeX | Tags: Parallel programming, Stream processing @inproceedings{larcc:FIM:ERAD:22, title = {Compressão de Dados em Clusters HPC com Flink, MPI e SPar}, author = {Gabriel Fim and Greice Welter and Júnior Löff and Dalvan Griebler}, url = {https://doi.org/10.5753/eradrs.2022.19153}, doi = {10.5753/eradrs.2022.19153}, year = {2022}, date = {2022-04-01}, booktitle = {Anais da XXII Escola Regional de Alto Desempenho da Região Sul}, pages = {29-32}, publisher = {Sociedade Brasileira de Computação}, address = {Curitiba, Brazil}, abstract = {This work evaluates the performance of the Bzip2 data compression algorithm with the stream processing tools Apache Flink, MPI, and SPar on a Beowulf cluster. The results show that the best-performing versions relative to the sequential time are MPI and SPar, with speed-ups of 7.6 and 7.2, respectively.}, keywords = {Parallel programming, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
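A rough sketch of the block-parallel Bzip2 compression this entry benchmarks (illustrative Python using only the standard library, not the paper's Flink/MPI/SPar implementations): because bzip2 streams may be concatenated, independent blocks can be compressed concurrently and their outputs joined.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

def parallel_bzip2(data: bytes, block_size: int = 64 * 1024, workers: int = 4) -> bytes:
    """Compress independent blocks concurrently and concatenate the results.

    bzip2 allows concatenated streams, so bz2.decompress() recovers the
    original bytes from the joined per-block streams."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Farm pattern: one compression task per block; map preserves order.
        compressed = list(pool.map(bz2.compress, blocks))
    return b"".join(compressed)

payload = b"stream processing " * 10_000
packed = parallel_bzip2(payload)
assert bz2.decompress(packed) == payload
```

Threads suffice here because CPython's `bz2.compress` releases the GIL during compression; a cluster version, as in the paper, distributes the same per-block tasks over nodes instead.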
Gomes, Márcio Miguel; da Righi, Rodrigo Rosa; da Costa, Cristiano André; Griebler, Dalvan Steam++: An Extensible End-to-end Framework for Developing IoT Data Processing Applications in the Fog Journal Article doi International Journal of Computer Science & Information Technology, 14 (1), pp. 31-51, 2022. Abstract | Links | BibTeX | Tags: Cloud computing, IoT, Stream processing @article{GOMES:IJCSIT:22, title = {Steam++: An Extensible End-to-end Framework for Developing IoT Data Processing Applications in the Fog}, author = {Márcio Miguel Gomes and Rodrigo Rosa da Righi and Cristiano André da Costa and Dalvan Griebler}, url = {http://dx.doi.org/10.5121/ijcsit.2022.14103}, doi = {10.5121/ijcsit.2022.14103}, year = {2022}, date = {2022-02-01}, journal = {International Journal of Computer Science & Information Technology}, volume = {14}, number = {1}, pages = {31-51}, publisher = {AIRCC}, abstract = {IoT applications usually rely on cloud computing services to perform data analysis such as filtering, aggregation, classification, pattern detection, and prediction. When applied to specific domains, the IoT needs to deal with unique constraints. Besides the hostile environment such as vibration and electromagnetic interference, resulting in malfunction, noise, and data loss, industrial plants often have Internet access restricted or unavailable, forcing us to design stand-alone fog and edge computing solutions. In this context, we present STEAM++, a lightweight and extensible framework for real-time data stream processing and decision-making in the network edge, targeting hardware-limited devices, besides proposing a micro-benchmark methodology for assessing embedded IoT applications. In real-case experiments in a semiconductor industry, we processed an entire data flow, from values sensing, processing and analysing data, detecting relevant events, and finally, publishing results to a dashboard.
On average, the application consumed less than 500kb RAM and 1.0% of CPU usage, processing up to 239 data packets per second and reducing the output data size to 14% of the input raw data size when notifying events.}, keywords = {Cloud computing, IoT, Stream processing}, pubstate = {published}, tppubtype = {article} } |
Hoffmann, Renato Barreto; Löff, Júnior; Griebler, Dalvan; Fernandes, Luiz Gustavo OpenMP as runtime for providing high-level stream parallelism on multi-cores Journal Article doi The Journal of Supercomputing, 78 (1), pp. 7655-7676, 2022. Abstract | Links | BibTeX | Tags: Parallel programming, Stream processing @article{HOFFMANN:Jsuper:2022, title = {OpenMP as runtime for providing high-level stream parallelism on multi-cores}, author = {Renato Barreto Hoffmann and Júnior Löff and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s11227-021-04182-9}, doi = {10.1007/s11227-021-04182-9}, year = {2022}, date = {2022-01-01}, journal = {The Journal of Supercomputing}, volume = {78}, number = {1}, pages = {7655-7676}, publisher = {Springer}, address = {New York, United States}, abstract = {OpenMP is an industry and academic standard for parallel programming. However, using it for developing parallel stream processing applications is complex and challenging. OpenMP lacks key programming mechanisms and abstractions for this particular domain. To tackle this problem, we used a high-level parallel programming framework (named SPar) for automatically generating parallel OpenMP code. We achieved this by leveraging SPar’s language and its domain-specific code annotations for simplifying the complexity and verbosity added by OpenMP in this application domain. Consequently, we implemented a new compiler algorithm in SPar for automatically generating parallel code targeting the OpenMP runtime using source-to-source code transformations. The experiments in four different stream processing applications demonstrated that the execution time of SPar was improved up to 25.42% when using the OpenMP runtime. Additionally, our abstraction over OpenMP introduced at most 1.72% execution time overhead when compared to handwritten parallel codes. 
Furthermore, SPar significantly reduces the total source lines of code required to express parallelism with respect to plain OpenMP parallel codes.}, keywords = {Parallel programming, Stream processing}, pubstate = {published}, tppubtype = {article} } |
2021 |
Pieper, Ricardo; Löff, Júnior; Hoffmann, Renato Barreto; Griebler, Dalvan; Fernandes, Luiz Gustavo High-level and Efficient Structured Stream Parallelism for Rust on Multi-cores Journal Article Journal of Computer Languages, na (na), pp. na, 2021. Abstract | BibTeX | Tags: Parallel programming, Stream processing @article{PIEPER:COLA:21, title = {High-level and Efficient Structured Stream Parallelism for Rust on Multi-cores}, author = {Ricardo Pieper and Júnior Löff and Renato Barreto Hoffmann and Dalvan Griebler and Luiz Gustavo Fernandes}, year = {2021}, date = {2021-07-01}, journal = {Journal of Computer Languages}, volume = {na}, number = {na}, pages = {na}, publisher = {Elsevier}, abstract = {This work aims at contributing with a structured parallel programming abstraction for Rust in order to provide ready-to-use parallel patterns that abstract low-level and architecture-dependent details from application programmers. We focus on stream processing applications running on shared-memory multi-core architectures (i.e., video processing, compression, and others). Therefore, we provide a new high-level and efficient parallel programming abstraction for expressing stream parallelism, named Rust-SSP. We also created a new stream benchmark suite for Rust that represents real-world scenarios and has different application characteristics and workloads. Our benchmark suite is an initiative to assess existing parallelism abstraction for this domain, as parallel implementations using these abstractions were provided. The results revealed that Rust-SSP achieved up to 41.1% better performance than other solutions.
In terms of programmability, the results revealed that Rust-SSP requires the smallest number of extra lines of code to enable stream parallelism.}, keywords = {Parallel programming, Stream processing}, pubstate = {published}, tppubtype = {article} } |
Gomes, Márcio Miguel; da Righi, Rodrigo Rosa; da Costa, Cristiano André; Griebler, Dalvan Simplifying IoT data stream enrichment and analytics in the edge Journal Article doi Computers & Electrical Engineering, 92, pp. 107110, 2021. Abstract | Links | BibTeX | Tags: IoT, Stream processing @article{GOMES:CEE:21, title = {Simplifying IoT data stream enrichment and analytics in the edge}, author = {Márcio Miguel Gomes and Rodrigo Rosa da Righi and Cristiano André da Costa and Dalvan Griebler}, url = {https://doi.org/10.1016/j.compeleceng.2021.107110}, doi = {10.1016/j.compeleceng.2021.107110}, year = {2021}, date = {2021-06-01}, journal = {Computers & Electrical Engineering}, volume = {92}, pages = {107110}, publisher = {Elsevier}, abstract = {Edge devices are usually limited in resources. They often send data to the cloud, where techniques such as filtering, aggregation, classification, pattern detection, and prediction are performed. This process results in critical issues such as data loss, high response time, and overhead. On the other hand, processing data in the edge is not a simple task due to devices’ heterogeneity, resource limitations, a variety of programming languages and standards. In this context, this work proposes STEAM, a framework for developing data stream processing applications in the edge targeting hardware-limited devices. As the main contribution, STEAM enables the development of applications for different platforms, with standardized functions and class structures that use consolidated IoT data formats and communication protocols. Moreover, the experiments revealed the viability of stream processing in the edge resulting in the reduction of response time without compromising the quality of results.}, keywords = {IoT, Stream processing}, pubstate = {published}, tppubtype = {article} } |
Vogel, Adriano; Mencagli, Gabriele; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Online and Transparent Self-adaptation of Stream Parallel Patterns Journal Article doi Computing, 105 (5), pp. 1039-1057, 2021. Abstract | Links | BibTeX | Tags: Parallel programming, Self-adaptation, Stream processing @article{VOGEL:Computing:23, title = {Online and Transparent Self-adaptation of Stream Parallel Patterns}, author = {Adriano Vogel and Gabriele Mencagli and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s00607-021-00998-8}, doi = {10.1007/s00607-021-00998-8}, year = {2021}, date = {2021-05-01}, journal = {Computing}, volume = {105}, number = {5}, pages = {1039-1057}, publisher = {Springer}, abstract = {Several real-world parallel applications are becoming more dynamic and long-running, demanding online (at run-time) adaptations. Stream processing is a representative scenario that computes data items arriving in real-time and where parallel executions are necessary. However, it is challenging for humans to monitor and manually self-optimize complex and long-running parallel executions continuously. Moreover, although high-level and structured parallel programming aims to facilitate parallelism, several issues still need to be addressed for improving the existing abstractions. In this paper, we extend self-adaptiveness for supporting autonomous and online changes of the parallel pattern compositions. Online self-adaptation is achieved with an online profiler that characterizes the applications, which is combined with a new self-adaptive strategy and a model for smooth transitions on reconfigurations. The solution provides a new abstraction layer that enables application programmers to define non-functional requirements instead of hand-tuning complex configurations. Hence, we contribute with additional abstractions and flexible self-adaptation for responsiveness at run-time.
The proposed solution is evaluated with applications having different processing characteristics, workloads, and configurations. The results show that it is possible to provide additional abstractions, flexibility, and responsiveness while achieving performance comparable to the best static configuration executions.}, keywords = {Parallel programming, Self-adaptation, Stream processing}, pubstate = {published}, tppubtype = {article} } |
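The decision-making loop behind such self-adaptation can be reduced to a toy threshold controller (illustrative Python; the paper's strategy reconfigures parallel patterns in the SPar/FastFlow runtime, and all names below are invented for this sketch): monitor a congestion metric and nudge the number of worker replicas up or down within bounds.

```python
def adapt_replicas(current: int, queue_len: int,
                   target: int = 100, step: int = 1,
                   lo: int = 1, hi: int = 16) -> int:
    """One iteration of a threshold-based controller: add a replica when the
    pending queue exceeds the target, remove one when the queue drains, and
    hold inside the hysteresis band to avoid oscillating reconfigurations."""
    if queue_len > target:
        return min(current + step, hi)    # backlog building up: add a replica
    if queue_len < target // 2:
        return max(current - step, lo)    # mostly idle: remove a replica
    return current                        # within the band: keep configuration

# Simulated run: a load burst drives the replica count up, then back down.
replicas = 4
for q in [150, 180, 120, 60, 30, 10]:
    replicas = adapt_replicas(replicas, q)
```

Real strategies, as in the paper, replace the raw queue length with profiled metrics (throughput, latency targets) and smooth the transitions, but the monitor-decide-act cycle has this shape.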
Vanzan, Anthony; Fim, Gabriel; Welter, Greice; Griebler, Dalvan Aceleração da Classificação de Lavouras de Milho com MPI e Estratégias de Paralelismo Inproceedings doi 21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS), pp. 49-52, Sociedade Brasileira de Computação, Joinville, RS, Brazil, 2021. Abstract | Links | BibTeX | Tags: Agriculture, Distributed computing, Parallel programming, Stream processing @inproceedings{larcc:DL_Classificaiton_MPI:ERAD:21, title = {Aceleração da Classificação de Lavouras de Milho com MPI e Estratégias de Paralelismo}, author = {Anthony Vanzan and Gabriel Fim and Greice Welter and Dalvan Griebler}, url = {https://doi.org/10.5753/eradrs.2021.14772}, doi = {10.5753/eradrs.2021.14772}, year = {2021}, date = {2021-04-01}, booktitle = {21th Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)}, pages = {49-52}, publisher = {Sociedade Brasileira de Computação}, address = {Joinville, RS, Brazil}, abstract = {This work aimed to accelerate the execution of a crop classification algorithm over aerial images. To this end, different parallel versions were implemented with the MPI library in Python. The evaluation was conducted in two computing environments. We conclude that execution time decreases as more parallel resources are used, and that the dynamic work distribution strategy is more efficient.}, keywords = {Agriculture, Distributed computing, Parallel programming, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
Vogel, Adriano; Mencagli, Gabriele; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Towards On-the-fly Self-Adaptation of Stream Parallel Patterns Inproceedings doi 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 89-93, IEEE, Valladolid, Spain, 2021. Abstract | Links | BibTeX | Tags: Self-adaptation, Stream processing @inproceedings{VOGEL:PDP:21, title = {Towards On-the-fly Self-Adaptation of Stream Parallel Patterns}, author = {Adriano Vogel and Gabriele Mencagli and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, doi = {10.1109/PDP52278.2021.00022}, year = {2021}, date = {2021-03-01}, booktitle = {29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {89-93}, publisher = {IEEE}, address = {Valladolid, Spain}, series = {PDP'21}, abstract = {Stream processing applications compute streams of data and provide insightful results in a timely manner, where parallel computing is necessary for accelerating the application executions. Considering that these applications are becoming increasingly dynamic and long-running, a potential solution is to apply dynamic runtime changes. However, it is challenging for humans to continuously monitor and manually self-optimize the executions. In this paper, we propose self-adaptiveness of the parallel patterns used, enabling flexible on-the-fly adaptations. The proposed solution is evaluated with an existing programming framework and running experiments with a synthetic and a real-world application. The results show that the proposed solution is able to dynamically self-adapt to the most suitable parallel pattern configuration and achieve performance competitive with the best static cases. 
The feasibility of the proposed solution encourages future optimizations and other applicabilities.}, keywords = {Self-adaptation, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
Vogel, Adriano; Griebler, Dalvan; Fernandes, Luiz Gustavo Providing High-level Self-adaptive Abstractions for Stream Parallelism on Multi-cores Journal Article doi Software: Practice and Experience, 51 (6), pp. 1194-1217, 2021. Abstract | Links | BibTeX | Tags: Self-adaptation, Stream processing @article{VOGEL:SPE:21, title = {Providing High-level Self-adaptive Abstractions for Stream Parallelism on Multi-cores}, author = {Adriano Vogel and Dalvan Griebler and Luiz Gustavo Fernandes}, doi = {10.1002/spe.2948}, year = {2021}, date = {2021-01-01}, journal = {Software: Practice and Experience}, volume = {51}, number = {6}, pages = {1194-1217}, publisher = {Wiley Online Library}, abstract = {Stream processing applications are common computing workloads that demand parallelism to increase their performance. As in the past, parallel programming remains a difficult task for application programmers. The complexity increases when application programmers must set non-intuitive parallelism parameters, i.e. the degree of parallelism. The main problem is that state-of-the-art libraries use a static degree of parallelism and are not sufficiently abstracted for developing stream processing applications. In this paper, we propose a self-adaptive regulation of the degree of parallelism to provide higher-level abstractions. Flexibility is provided to programmers with two new self-adaptive strategies, one is for performance experts, and the other abstracts the need to set a performance goal. We evaluated our solution using compiler transformation rules to generate parallel code with the SPar domain-specific language. The experimental results with real-world applications highlighted higher abstraction levels without significant performance degradation in comparison to static executions. 
The strategy for performance experts achieved slightly higher performance than the one that works without user-defined performance goals.}, keywords = {Self-adaptation, Stream processing}, pubstate = {published}, tppubtype = {article} } |
2020 |
Bordin, Maycon Viana; Griebler, Dalvan; Mencagli, Gabriele; Geyer, Claudio F R; Fernandes, Luiz Gustavo DSPBench: a Suite of Benchmark Applications for Distributed Data Stream Processing Systems Journal Article doi IEEE Access, 8 (na), pp. 222900-222917, 2020. Abstract | Links | BibTeX | Tags: Benchmark, Stream processing @article{BORDIN:IEEEAccess:20, title = {DSPBench: a Suite of Benchmark Applications for Distributed Data Stream Processing Systems}, author = {Maycon Viana Bordin and Dalvan Griebler and Gabriele Mencagli and Claudio F R Geyer and Luiz Gustavo Fernandes}, doi = {10.1109/ACCESS.2020.3043948}, year = {2020}, date = {2020-12-01}, journal = {IEEE Access}, volume = {8}, number = {na}, pages = {222900-222917}, publisher = {IEEE}, abstract = {Systems enabling the continuous processing of large data streams have recently attracted the attention of the scientific community and industrial stakeholders. Data Stream Processing Systems (DSPSs) are complex and powerful frameworks able to ease the development of streaming applications in distributed computing environments like clusters and clouds. Several systems of this kind have been released and are currently maintained as open source projects, like Apache Storm and Spark Streaming. Some benchmark applications have often been used by the scientific community to test and evaluate new techniques to improve the performance and usability of DSPSs. However, the existing benchmark suites lack representative workloads coming from the wide set of application domains that can leverage the benefits offered by the stream processing paradigm in terms of near real-time performance. The goal of this paper is to present a new benchmark suite composed of 15 applications coming from areas like Finance, Telecommunications, Sensor Networks, Social Networks and others. 
This paper describes in detail the nature of these applications, their full workload characterization in terms of selectivity, processing cost, input size and overall memory occupation. In addition, it exemplifies the usefulness of our benchmark suite to compare real DSPSs by selecting Apache Storm and Spark Streaming for this analysis.}, keywords = {Benchmark, Stream processing}, pubstate = {published}, tppubtype = {article} } |
Stein, Charles M; Rockenbach, Dinei A; Griebler, Dalvan; Torquati, Massimo; Mencagli, Gabriele; Danelutto, Marco; Fernandes, Luiz Gustavo Latency‐aware adaptive micro‐batching techniques for streamed data compression on graphics processing units Journal Article doi Concurrency and Computation: Practice and Experience, na (na), pp. e5786, 2020. Abstract | Links | BibTeX | Tags: GPGPU, Stream processing @article{STEIN:CCPE:20, title = {Latency‐aware adaptive micro‐batching techniques for streamed data compression on graphics processing units}, author = {Charles M Stein and Dinei A Rockenbach and Dalvan Griebler and Massimo Torquati and Gabriele Mencagli and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1002/cpe.5786}, doi = {10.1002/cpe.5786}, year = {2020}, date = {2020-05-01}, journal = {Concurrency and Computation: Practice and Experience}, volume = {na}, number = {na}, pages = {e5786}, publisher = {Wiley Online Library}, abstract = {Stream processing is a parallel paradigm used in many application domains. With the advance of graphics processing units (GPUs), their usage in stream processing applications has increased as well. The efficient utilization of GPU accelerators in streaming scenarios requires to batch input elements in microbatches, whose computation is offloaded on the GPU leveraging data parallelism within the same batch of data. Since data elements are continuously received based on the input speed, the bigger the microbatch size the higher the latency to completely buffer it and to start the processing on the device. Unfortunately, stream processing applications often have strict latency requirements that need to find the best size of the microbatches and to adapt it dynamically based on the workload conditions as well as according to the characteristics of the underlying device and network. 
In this work, we aim at implementing latency‐aware adaptive microbatching techniques and algorithms for streaming compression applications targeting GPUs. The evaluation is conducted using the Lempel‐Ziv‐Storer‐Szymanski compression application considering different input workloads. As a general result of our work, we noticed that algorithms with elastic adaptation factors respond better for stable workloads, while algorithms with narrower targets respond better for highly unbalanced workloads.}, keywords = {GPGPU, Stream processing}, pubstate = {published}, tppubtype = {article} } |
Vogel, Adriano; Rista, Cassiano; Justo, Gabriel; Ewald, Endrius; Griebler, Dalvan; Mencagli, Gabriele; Fernandes, Luiz Gustavo Parallel Stream Processing with MPI for Video Analytics and Data Visualization Inproceedings doi High Performance Computing Systems, pp. 102-116, Springer, Cham, 2020. Abstract | Links | BibTeX | Tags: Stream processing @inproceedings{VOGEL:CCIS:20, title = {Parallel Stream Processing with MPI for Video Analytics and Data Visualization}, author = {Adriano Vogel and Cassiano Rista and Gabriel Justo and Endrius Ewald and Dalvan Griebler and Gabriele Mencagli and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/978-3-030-41050-6_7}, doi = {10.1007/978-3-030-41050-6_7}, year = {2020}, date = {2020-02-01}, booktitle = {High Performance Computing Systems}, volume = {1171}, pages = {102-116}, publisher = {Springer}, address = {Cham}, series = {Communications in Computer and Information Science (CCIS)}, abstract = {The amount of data generated is increasing exponentially. However, processing data and producing fast results is a technological challenge. Parallel stream processing can be implemented for handling high frequency and big data flows. The MPI parallel programming model offers low-level and flexible mechanisms for dealing with distributed architectures such as clusters. This paper aims to use it to accelerate video analytics and data visualization applications so that insight can be obtained as soon as the data arrives. Experiments were conducted with a Domain-Specific Language for Geospatial Data Visualization and a Person Recognizer video application. We applied the same stream parallelism strategy and two task distribution strategies. The dynamic task distribution achieved better performance than the static distribution in the HPC cluster. The data visualization achieved lower throughput with respect to the video analytics due to the I/O intensive operations. 
Also, the MPI programming model shows promising performance outcomes for stream processing applications.}, keywords = {Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
2019 |
Vogel, Adriano; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Seamless Parallelism Management for Multi-core Stream Processing Inproceedings doi Advances in Parallel Computing, Proceedings of the International Conference on Parallel Computing (ParCo), pp. 533-542, IOS Press, Prague, Czech Republic, 2019. Abstract | Links | BibTeX | Tags: Stream processing @inproceedings{VOGEL:PARCO:19, title = {Seamless Parallelism Management for Multi-core Stream Processing}, author = {Adriano Vogel and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.3233/APC200082}, doi = {10.3233/APC200082}, year = {2019}, date = {2019-09-01}, booktitle = {Advances in Parallel Computing, Proceedings of the International Conference on Parallel Computing (ParCo)}, volume = {36}, pages = {533-542}, publisher = {IOS Press}, address = {Prague, Czech Republic}, series = {ParCo'19}, abstract = {Video streaming applications have critical performance requirements for dealing with fluctuating workloads and providing results in real-time. As a consequence, the majority of these applications demand parallelism for delivering quality of service to users. Although high-level and structured parallel programming aims at facilitating parallelism exploitation, there are still several issues to be addressed for increasing/improving existing parallel programming abstractions. In this paper, we aim at employing self-adaptivity for stream processing in order to seamlessly manage the application parallelism configurations at run-time, where a new strategy alleviates from application programmers the need to set time-consuming and error-prone parallelism parameters. The new strategy was implemented and validated on SPar. 
The results have shown that the proposed solution increases the level of abstraction and achieves competitive performance.}, keywords = {Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
Rockenbach, Dinei A; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo High-Level Stream Parallelism Abstractions with SPar Targeting GPUs Inproceedings doi Parallel Computing is Everywhere, Proceedings of the International Conference on Parallel Computing (ParCo), pp. 543-552, IOS Press, Prague, Czech Republic, 2019. Abstract | Links | BibTeX | Tags: GPGPU, Stream processing @inproceedings{ROCKENBACH:PARCO:19, title = {High-Level Stream Parallelism Abstractions with SPar Targeting GPUs}, author = {Dinei A Rockenbach and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.3233/APC200083}, doi = {10.3233/APC200083}, year = {2019}, date = {2019-09-01}, booktitle = {Parallel Computing is Everywhere, Proceedings of the International Conference on Parallel Computing (ParCo)}, volume = {36}, pages = {543-552}, publisher = {IOS Press}, address = {Prague, Czech Republic}, series = {ParCo'19}, abstract = {The combined exploitation of stream and data parallelism is demonstrating encouraging performance results in the literature for heterogeneous architectures, which are present in every computer system today. However, providing parallel software efficiently targeting those architectures requires significant programming effort and expertise. The SPar domain-specific language already represents a solution to this problem, providing proven high-level programming abstractions for multi-core architectures. In this paper, we enrich the SPar language by adding support for GPUs. New transformation rules are designed for generating parallel code using stream and data parallel patterns. 
Our experiments revealed that these transformation rules are able to improve performance while the high-level programming abstractions are maintained.}, keywords = {GPGPU, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
Vogel, Adriano; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Minimizing Self-Adaptation Overhead in Parallel Stream Processing for Multi-Cores Inproceedings doi Euro-Par 2019: Parallel Processing Workshops, pp. 12, Springer, Göttingen, Germany, 2019. Abstract | Links | BibTeX | Tags: Self-adaptation, Stream processing @inproceedings{VOGEL:adaptive-overhead:AutoDaSP:19, title = {Minimizing Self-Adaptation Overhead in Parallel Stream Processing for Multi-Cores}, author = {Adriano Vogel and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/978-3-030-48340-1_3}, doi = {10.1007/978-3-030-48340-1_3}, year = {2019}, date = {2019-08-01}, booktitle = {Euro-Par 2019: Parallel Processing Workshops}, volume = {11997}, pages = {12}, publisher = {Springer}, address = {Göttingen, Germany}, series = {Lecture Notes in Computer Science}, abstract = {Stream processing paradigm is present in several applications that apply computations over continuous data flowing in the form of streams (e.g., video feeds, image, and data analytics). Employing self-adaptivity to stream processing applications can provide higher-level programming abstractions and autonomic resource management. However, there are cases where the performance is suboptimal. In this paper, the goal is to optimize parallelism adaptations in terms of stability and accuracy, which can improve the performance of parallel stream processing applications. Therefore, we present a new optimized self-adaptive strategy that is experimentally evaluated. 
The proposed solution provided high-level programming abstractions, reduced the adaptation overhead, and achieved a competitive performance with the best static executions.}, keywords = {Self-adaptation, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
Griebler, Dalvan; Vogel, Adriano; De Sensi, Daniele ; Danelutto, Marco; Fernandes, Luiz Gustavo Simplifying and implementing service level objectives for stream parallelism Journal Article doi Journal of Supercomputing, 76 , pp. 4603-4628, 2019, ISSN: 0920-8542. Abstract | Links | BibTeX | Tags: Self-adaptation, Stream processing @article{GRIEBLER:JS:19, title = {Simplifying and implementing service level objectives for stream parallelism}, author = {Dalvan Griebler and Adriano Vogel and Daniele {De Sensi} and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s11227-019-02914-6}, doi = {10.1007/s11227-019-02914-6}, issn = {0920-8542}, year = {2019}, date = {2019-06-01}, journal = {Journal of Supercomputing}, volume = {76}, pages = {4603-4628}, publisher = {Springer}, abstract = {Increasing attention has been given to providing service level objectives (SLOs) in stream processing applications due to performance and energy requirements, and because of the need to impose limits on resource usage while improving system utilization. Since current and next-generation computing systems intrinsically offer parallel architectures, the software has to naturally exploit the architecture's parallelism. Implementing and meeting SLOs in existing applications is not a trivial task for application programmers, since the software development process, besides the parallelism exploitation, requires the implementation of autonomic algorithms or strategies. This is a system-oriented programming approach and requires the management of multiple knobs and sensors (e.g., the number of threads to use, the clock frequency of the cores, etc.) so that the system can self-adapt at runtime. In this work, we introduce a new and simpler way to define SLOs in the application’s source code, by abstracting from the programmer all the details relative to self-adaptive system implementation. 
The application programmer specifies which parts of the code to parallelize and the related SLOs that should be enforced. To reach this goal, source-to-source code transformation rules are implemented in our compiler, which automatically generates self-adaptive strategies to enforce, at runtime, the user-expressed objectives. The experiments highlighted promising results with simpler, effective, and efficient SLO implementations for real-world applications.}, keywords = {Self-adaptation, Stream processing}, pubstate = {published}, tppubtype = {article} } |
Rockenbach, Dinei A; Stein, Charles Michael; Griebler, Dalvan; Mencagli, Gabriele; Torquati, Massimo; Danelutto, Marco; Fernandes, Luiz Gustavo Stream Processing on Multi-cores with GPUs: Parallel Programming Models' Challenges Inproceedings doi International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 834-841, IEEE, Rio de Janeiro, Brazil, 2019. Abstract | Links | BibTeX | Tags: GPGPU, Stream processing @inproceedings{ROCKENBACH:stream-multigpus:IPDPSW:19, title = {Stream Processing on Multi-cores with GPUs: Parallel Programming Models' Challenges}, author = {Dinei A Rockenbach and Charles Michael Stein and Dalvan Griebler and Gabriele Mencagli and Massimo Torquati and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/IPDPSW.2019.00137}, doi = {10.1109/IPDPSW.2019.00137}, year = {2019}, date = {2019-05-01}, booktitle = {International Parallel and Distributed Processing Symposium Workshops (IPDPSW)}, pages = {834-841}, publisher = {IEEE}, address = {Rio de Janeiro, Brazil}, series = {IPDPSW'19}, abstract = {The stream processing paradigm is used in several scientific and enterprise applications in order to continuously compute results out of data items coming from data sources such as sensors. The full exploitation of the potential parallelism offered by current heterogeneous multi-cores equipped with one or more GPUs is still a challenge in the context of stream processing applications. In this work, our main goal is to present the parallel programming challenges that the programmer has to face when exploiting CPUs and GPUs' parallelism at the same time using traditional programming models. We highlight the parallelization methodology in two use-cases (the Mandelbrot Streaming benchmark and the PARSEC's Dedup application) to demonstrate the issues and benefits of using heterogeneous parallel hardware. 
The experiments conducted demonstrate how a high-level parallel programming model targeting stream processing, like the one offered by SPar, can be used to reduce the programming effort while still offering a good level of performance compared with state-of-the-art programming models.}, keywords = {GPGPU, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
Stein, Charles M; Stein, Joao V; Boz, Leonardo; Rockenbach, Dinei A; Griebler, Dalvan Mandelbrot Streaming para Sistemas Multi-core com GPUs Inproceedings 19th Escola Regional de Alto Desempenho da Região Sul (ERAD/RS), Sociedade Brasileira de Computação, Três de Maio, RS, Brazil, 2019. Abstract | Links | BibTeX | Tags: GPGPU, Stream processing @inproceedings{larcc:mandelbrot_multicore_GPU:ERAD:19, title = {Mandelbrot Streaming para Sistemas Multi-core com GPUs}, author = {Charles M Stein and Joao V Stein and Leonardo Boz and Dinei A Rockenbach and Dalvan Griebler}, url = {http://larcc.setrem.com.br/wp-content/uploads/2019/04/192109.pdf}, year = {2019}, date = {2019-04-01}, booktitle = {19th Escola Regional de Alto Desempenho da Região Sul (ERAD/RS)}, publisher = {Sociedade Brasileira de Computação}, address = {Três de Maio, RS, Brazil}, abstract = {This work aims to exploit parallelism in the Mandelbrot Streaming application on multi-core architectures with GPUs, using the FastFlow, TBB, and SPar libraries with CUDA. The parallelism was implemented based on the farm pattern, achieving a speedup of 16x on the multi-core system and of 77x in a multi-core environment with two GPUs. The results show better performance when using GPUs, although opportunities for future improvements were identified.}, keywords = {GPGPU, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
Stein, Charles Michael; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Stream Parallelism on the LZSS Data Compression Application for Multi-Cores with GPUs Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 247-251, IEEE, Pavia, Italy, 2019. Abstract | Links | BibTeX | Tags: GPGPU, Stream processing @inproceedings{STEIN:LZSS-multigpu:PDP:19, title = {Stream Parallelism on the LZSS Data Compression Application for Multi-Cores with GPUs}, author = {Charles Michael Stein and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/EMPDP.2019.8671624}, doi = {10.1109/EMPDP.2019.8671624}, year = {2019}, date = {2019-02-01}, booktitle = {27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {247-251}, publisher = {IEEE}, address = {Pavia, Italy}, series = {PDP'19}, abstract = {GPUs have been used to accelerate different data parallel applications. The challenge consists in using GPUs to accelerate stream processing applications. Our goal is to investigate and evaluate whether stream parallel applications may benefit from parallel execution on both CPU and GPU cores. In this paper, we introduce new parallel algorithms for the Lempel-Ziv-Storer-Szymanski (LZSS) data compression application. We implemented the algorithms targeting both CPUs and GPUs. GPUs have been used with CUDA and OpenCL to exploit inner algorithm data parallelism. Outer stream parallelism has been exploited using CPU cores through SPar. The parallel implementation of LZSS achieved a 135-fold speedup using a multi-core CPU and two GPUs. 
We also observed speedups in applications where we did not expect them using the same combined data-stream parallel exploitation techniques.}, keywords = {GPGPU, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
2018 |
Vogel, Adriano; Griebler, Dalvan; De Sensi, Daniele; Danelutto, Marco; Fernandes, Luiz Gustavo Autonomic and Latency-Aware Degree of Parallelism Management in SPar Inproceedings doi Euro-Par 2018: Parallel Processing Workshops, pp. 28-39, Springer, Turin, Italy, 2018. Abstract | Links | BibTeX | Tags: Self-adaptation, Stream processing @inproceedings{VOGEL:Adaptive-Latency-SPar:AutoDaSP:18, title = {Autonomic and Latency-Aware Degree of Parallelism Management in SPar}, author = {Adriano Vogel and Dalvan Griebler and Daniele {De Sensi} and Marco Danelutto and Luiz Gustavo Fernandes}, url = {http://dx.doi.org/10.1007/978-3-030-10549-5_3}, doi = {10.1007/978-3-030-10549-5_3}, year = {2018}, date = {2018-08-01}, booktitle = {Euro-Par 2018: Parallel Processing Workshops}, pages = {28-39}, publisher = {Springer}, address = {Turin, Italy}, series = {Lecture Notes in Computer Science}, abstract = {Stream processing applications have become a representative workload in current computing systems. A significant part of these applications demands parallelism to increase performance. However, programmers often face a trade-off between coding productivity and performance when introducing parallelism. SPar was created to balance this trade-off for application programmers by using the C++11 attributes' annotation mechanism. In SPar and other programming frameworks for stream processing applications, the manual definition of the number of replicas to be used for the stream operators is a challenge. In addition to that, low latency is required by several stream processing applications. We noted that explicit latency requirements are poorly considered in state-of-the-art parallel programming frameworks. Since there is a direct relationship between the number of replicas and the latency of the application, in this work we propose an autonomic and adaptive strategy to choose the proper number of replicas in SPar to address latency constraints. 
We experimentally evaluated the implemented strategy and demonstrated its effectiveness on a real-world application, showing that our adaptive strategy can provide higher abstraction levels while automatically managing the latency.}, keywords = {Self-adaptation, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
Stein, Charles Programação Paralela para GPU em Aplicações de Processamento Stream Undergraduate Thesis Undergraduate Thesis, 2018. Abstract | Links | BibTeX | Tags: GPGPU, Stream processing @misc{larcc:charles_stein:TCC:18, title = {Programação Paralela para GPU em Aplicações de Processamento Stream}, author = {Charles Stein}, url = {http://larcc.setrem.com.br/wp-content/uploads/2018/11/TCC_SETREM__Charles_Stein_1.pdf}, year = {2018}, date = {2018-06-01}, address = {Três de Maio, RS, Brazil}, school = {Sociedade Educacional Três de Maio (SETREM)}, abstract = {Stream processing applications are used in many areas. They usually require real-time processing and have a high computational load, so the parallelization of this type of application is necessary. The use of GPUs can hypothetically increase the performance of these stream processing applications. This work presents the study and parallel software implementation for GPUs of stream processing applications. Applications from different areas were chosen and parallelized for CPU and GPU. A set of experiments was conducted and the results achieved were analyzed. The Sobel, LZSS, Dedup, and Black-Scholes applications were parallelized. The Sobel filter did not gain performance, while LZSS, Dedup, and Black-Scholes obtained speedups of 36x, 13x, and 6.9x, respectively. In addition to performance, the source lines of code of the implementations with the CUDA and OpenCL libraries were measured in order to analyze code intrusion. The tests performed showed that in some applications the use of GPUs is advantageous, while in other applications there are no significant gains compared to the parallel versions on CPU.}, howpublished = {Undergraduate Thesis}, keywords = {GPGPU, Stream processing}, pubstate = {published}, tppubtype = {misc} } |
Stein, Charles M; Griebler, Dalvan Explorando o Paralelismo de Stream em CPU e de Dados em GPU na Aplicação de Filtro Sobel Inproceedings 18th Escola Regional de Alto Desempenho do Estado do Rio Grande do Sul (ERAD/RS), pp. 137-140, Sociedade Brasileira de Computação, Porto Alegre, RS, Brazil, 2018. Abstract | Links | BibTeX | Tags: GPGPU, Stream processing @inproceedings{larcc:stream_gpu_cuda:ERAD:18, title = {Explorando o Paralelismo de Stream em CPU e de Dados em GPU na Aplicação de Filtro Sobel}, author = {Charles M Stein and Dalvan Griebler}, url = {http://larcc.setrem.com.br/wp-content/uploads/2018/04/LARCC_ERAD_IC_Stein_2018.pdf}, year = {2018}, date = {2018-04-01}, booktitle = {18th Escola Regional de Alto Desempenho do Estado do Rio Grande do Sul (ERAD/RS)}, pages = {137-140}, publisher = {Sociedade Brasileira de Computação}, address = {Porto Alegre, RS, Brazil}, abstract = {The goal of this study is the combined parallelization of the stream on the CPU and of the data on the GPU using a Sobel filter application. A performance evaluation of OpenCL, OpenACC, and CUDA was carried out with the matrix multiplication algorithm in order to choose the tool to be used with SPar. We concluded that, although the GPU achieves a speedup of 11.81x with CUDA, using only the CPU with SPar is more advantageous for this application.}, keywords = {GPGPU, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |