If you prefer to download a single file with all LARCC references, you can find it at this link. You can also follow new publications via RSS.
Additionally, the publications are available on the LARCC profile on Google Scholar.
2024
@article{MENCAGLI:JPDC:24,
  title     = {General-purpose data stream processing on heterogeneous architectures with WindFlow},
  author    = {Gabriele Mencagli and Massimo Torquati and Dalvan Griebler and Alessandra Fais and Marco Danelutto},
  url       = {https://www.sciencedirect.com/science/article/pii/S0743731523001521},
  doi       = {10.1016/j.jpdc.2023.104782},
  year      = {2024},
  date      = {2024-02-01},
  journal   = {Journal of Parallel and Distributed Computing},
  volume    = {184},
  pages     = {104782},
  publisher = {Elsevier},
  abstract  = {Many emerging applications analyze data streams by running graphs of communicating tasks called operators. To develop and deploy such applications, Stream Processing Systems (SPSs) like Apache Storm and Flink have been made available to researchers and practitioners. They exhibit imperative or declarative programming interfaces to develop operators running arbitrary algorithms working on structured or unstructured data streams. In this context, the interest in leveraging hardware acceleration with GPUs has become more pronounced in high-throughput use cases. Unfortunately, GPU acceleration has been studied for relational operators working on structured streams only, while non-relational operators have often been overlooked. This paper presents WindFlow, a library supporting the seamless GPU offloading of general partitioned-stateful operators, extending the range of operators that benefit from hardware acceleration. Its design provides high throughput still exposing a high-level API to users compared with the raw utilization of GPUs in Apache Flink.},
  keywords  = {GPGPU, Stream processing},
  pubstate  = {published},
  tppubtype = {article}
}
2023
@inproceedings{LEONARCZYK:Euro-ParW:23,
  title     = {Evaluation of Adaptive Micro-batching Techniques for GPU-accelerated Stream Processing},
  author    = {Ricardo Leonarczyk and Dalvan Griebler and Gabriele Mencagli and Marco Danelutto},
  url       = {https://doi.org/},
  year      = {2023},
  date      = {2023-08-01},
  booktitle = {Euro-ParW 2023: Parallel Processing Workshops},
  pages     = {1-8},
  publisher = {Springer},
  address   = {Limassol},
  series    = {Euro-ParW'23},
  keywords  = {GPGPU, Stream processing},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
@inproceedings{LEONARCZYK:ERAD:23,
  title     = {Avaliação da Auto-Adaptação de Micro-Lote para aplicação de Processamento de Streaming em GPUs},
  author    = {Ricardo Leonarczyk and Dalvan Griebler},
  url       = {https://doi.org/10.5753/eradrs.2023.229267},
  doi       = {10.5753/eradrs.2023.229267},
  year      = {2023},
  date      = {2023-05-01},
  booktitle = {Anais da XXIII Escola Regional de Alto Desempenho da Região Sul},
  pages     = {123-124},
  publisher = {Sociedade Brasileira de Computação},
  address   = {Porto Alegre, Brazil},
  abstract  = {Este artigo apresenta uma avaliação de algoritmos para regular a latência através da auto-adaptação de micro-lote em sistemas de processamento de streaming acelerados por GPU. Os resultados demonstraram que o algoritmo com o fator de adaptação fixo conseguiu ficar por mais tempo na região de latência especificada para a aplicação.},
  keywords  = {GPGPU, Self-adaptation, Stream processing},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
@article{ARAUJO:SPE:23,
  title     = {NAS Parallel Benchmarks with CUDA and Beyond},
  author    = {Gabriell Araujo and Dalvan Griebler and Dinei A Rockenbach and Marco Danelutto and Luiz Gustavo Fernandes},
  url       = {https://doi.org/10.1002/spe.3056},
  doi       = {10.1002/spe.3056},
  year      = {2023},
  date      = {2023-01-01},
  journal   = {Software: Practice and Experience},
  volume    = {53},
  number    = {1},
  pages     = {53-80},
  publisher = {Wiley},
  abstract  = {NAS Parallel Benchmarks (NPB) is a standard benchmark suite used in the evaluation of parallel hardware and software. Several research efforts from academia have made these benchmarks available with different parallel programming models beyond the original versions with OpenMP and MPI. This work joins these research efforts by providing a new CUDA implementation for NPB. Our contribution covers different aspects beyond the implementation. First, we define design principles based on the best programming practices for GPUs and apply them to each benchmark using CUDA. Second, we provide ease of use parametrization support for configuring the number of threads per block in our version. Third, we conduct a broad study on the impact of the number of threads per block in the benchmarks. Fourth, we propose and evaluate five strategies for helping to find a better number of threads per block configuration. The results have revealed relevant performance improvement solely by changing the number of threads per block, showing performance improvements from 8% up to 717% among the benchmarks. Fifth, we conduct a comparative analysis with the literature, evaluating performance, memory consumption, code refactoring required, and parallelism implementations. The performance results have shown up to 267% improvements over the best benchmarks versions available. We also observe the best and worst design choices, concerning code size and the performance trade-off. Lastly, we highlight the challenges of implementing parallel CFD applications for GPUs and how the computations impact the GPU's behavior.},
  keywords  = {Benchmark, GPGPU, Parallel programming},
  pubstate  = {published},
  tppubtype = {article}
}
2022
@inproceedings{ROCKENBACH:SBLP:22,
  title     = {High-Level Stream and Data Parallelism in C++ for GPUs},
  author    = {Dinei A Rockenbach and Júnior Löff and Gabriell Araujo and Dalvan Griebler and Luiz G Fernandes},
  url       = {https://doi.org/10.1145/3561320.3561327},
  doi       = {10.1145/3561320.3561327},
  year      = {2022},
  date      = {2022-10-01},
  booktitle = {XXVI Brazilian Symposium on Programming Languages (SBLP)},
  pages     = {41-49},
  publisher = {ACM},
  address   = {Uberlândia, Brazil},
  series    = {SBLP'22},
  abstract  = {GPUs are massively parallel processors that allow solving problems that are not viable to traditional processors like CPUs. However, implementing applications for GPUs is challenging to programmers as it requires parallel programming to efficiently exploit the GPU resources. In this sense, parallel programming abstractions, notably domain-specific languages, are fundamental for improving programmability. SPar is a high-level Domain-Specific Language (DSL) that allows expressing stream and data parallelism in the serial code through annotations using C++ attributes. This work elaborates on a methodology and tool for GPU code generation by introducing new attributes to SPar language and transformation rules to SPar compiler. These new contributions, besides the gains in simplicity and code reduction compared to CUDA and OpenCL, enabled SPar achieve of higher throughput when exploring combined CPU and GPU parallelism, and when using batching.},
  keywords  = {GPGPU, Parallel programming, Stream processing},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
@inproceedings{MENCAGLI:PDP:22,
  title     = {Towards Parallel Data Stream Processing on System-on-Chip CPU+GPU Devices},
  author    = {Gabriele Mencagli and Dalvan Griebler and Marco Danelutto},
  url       = {https://doi.org/10.1109/PDP55904.2022.00014},
  doi       = {10.1109/PDP55904.2022.00014},
  year      = {2022},
  date      = {2022-04-01},
  booktitle = {30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
  pages     = {34-38},
  publisher = {IEEE},
  address   = {Valladolid, Spain},
  series    = {PDP'22},
  abstract  = {Data Stream Processing is a pervasive computing paradigm with a wide spectrum of applications. Traditional streaming systems exploit the processing capabilities provided by homogeneous Clusters and Clouds. Due to the transition to streaming systems suitable for IoT/Edge environments, there has been the urgent need of new streaming frameworks and tools tailored for embedded platforms, often available as System-on-Chips composed of a small multicore CPU and an integrated on-chip GPU. Exploiting this hybrid hardware requires special care in the runtime system design. In this paper, we discuss the support provided by the WindFlow library, showing its design principles and its effectiveness on the NVIDIA Jetson Nano board.},
  keywords  = {GPGPU, IoT, Stream processing},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
@inproceedings{SCHEER:ERAD:22,
  title     = {Encontrando a Configuração de Threads por Bloco para os Kernels NPB-CUDA com Q-Learning},
  author    = {Claudio Scheer and Gabriell Araujo and Dalvan Griebler and Felipe Meneguzzi and Luiz Gustavo Fernandes},
  url       = {https://doi.org/10.5753/eradrs.2022.19191},
  doi       = {10.5753/eradrs.2022.19191},
  year      = {2022},
  date      = {2022-04-01},
  booktitle = {Anais da XXII Escola Regional de Alto Desempenho da Região Sul},
  pages     = {119-120},
  publisher = {Sociedade Brasileira de Computação},
  address   = {Curitiba, Brazil},
  abstract  = {Este trabalho apresenta um novo método que utiliza aprendizado de máquina para prever a melhor configuração de threads por bloco para aplicações de GPUs. Os resultados foram similares a estratégias manuais.},
  keywords  = {Benchmark, Deep learning, GPGPU},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
2020
@article{STEIN:CCPE:20,
  title     = {Latency-aware adaptive micro-batching techniques for streamed data compression on graphics processing units},
  author    = {Charles M Stein and Dinei A Rockenbach and Dalvan Griebler and Massimo Torquati and Gabriele Mencagli and Marco Danelutto and Luiz Gustavo Fernandes},
  url       = {https://doi.org/10.1002/cpe.5786},
  doi       = {10.1002/cpe.5786},
  year      = {2020},
  date      = {2020-05-01},
  journal   = {Concurrency and Computation: Practice and Experience},
  volume    = {na},
  number    = {na},
  pages     = {e5786},
  publisher = {Wiley Online Library},
  abstract  = {Stream processing is a parallel paradigm used in many application domains. With the advance of graphics processing units (GPUs), their usage in stream processing applications has increased as well. The efficient utilization of GPU accelerators in streaming scenarios requires to batch input elements in microbatches, whose computation is offloaded on the GPU leveraging data parallelism within the same batch of data. Since data elements are continuously received based on the input speed, the bigger the microbatch size the higher the latency to completely buffer it and to start the processing on the device. Unfortunately, stream processing applications often have strict latency requirements that need to find the best size of the microbatches and to adapt it dynamically based on the workload conditions as well as according to the characteristics of the underlying device and network. In this work, we aim at implementing latency-aware adaptive microbatching techniques and algorithms for streaming compression applications targeting GPUs. The evaluation is conducted using the Lempel-Ziv-Storer-Szymanski compression application considering different input workloads. As a general result of our work, we noticed that algorithms with elastic adaptation factors respond better for stable workloads, while algorithms with narrower targets respond better for highly unbalanced workloads.},
  keywords  = {GPGPU, Stream processing},
  pubstate  = {published},
  tppubtype = {article}
}
@inproceedings{ARAUJO:PDP:20,
  title     = {Efficient NAS Parallel Benchmark Kernels with CUDA},
  author    = {Gabriell {de Araujo} and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes},
  url       = {https://doi.org/10.1109/PDP50117.2020.00009},
  doi       = {10.1109/PDP50117.2020.00009},
  year      = {2020},
  date      = {2020-03-01},
  booktitle = {28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
  pages     = {9-16},
  publisher = {IEEE},
  address   = {Västerås, Sweden},
  series    = {PDP'20},
  abstract  = {NAS Parallel Benchmarks (NPB) are one of the standard benchmark suites used to evaluate parallel hardware and software. There are many research efforts trying to provide different parallel versions apart from the original OpenMP and MPI. Concerning GPU accelerators, there are only the OpenCL and OpenACC available as consolidated versions. Our goal is to provide an efficient parallel implementation of the five NPB kernels with CUDA. Our contribution covers different aspects. First, best parallel programming practices were followed to implement NPB kernels using CUDA. Second, the support of larger workloads (class B and C) allow to stress and investigate the memory of robust GPUs. Third, we show that it is possible to make NPB efficient and suitable for GPUs although the benchmarks were designed for CPUs in the past. We succeed in achieving double performance with respect to the state-of-the-art in some cases as well as implementing efficient memory usage. Fourth, we discuss new experiments comparing performance and memory usage against OpenACC and OpenCL state-of-the-art versions using a relative new GPU architecture. The experimental results also revealed that our version is the best one for all the NPB kernels compared to OpenACC and OpenCL. The greatest differences were observed for the FT and EP kernels.},
  keywords  = {Benchmark, GPGPU},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
2019
@inproceedings{ROCKENBACH:PARCO:19,
  title     = {High-Level Stream Parallelism Abstractions with SPar Targeting GPUs},
  author    = {Dinei A Rockenbach and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes},
  url       = {https://doi.org/10.3233/APC200083},
  doi       = {10.3233/APC200083},
  year      = {2019},
  date      = {2019-09-01},
  booktitle = {Parallel Computing is Everywhere, Proceedings of the International Conference on Parallel Computing (ParCo)},
  volume    = {36},
  pages     = {543-552},
  publisher = {IOS Press},
  address   = {Prague, Czech Republic},
  series    = {ParCo'19},
  abstract  = {The combined exploitation of stream and data parallelism is demonstrating encouraging performance results in the literature for heterogeneous architectures, which are present on every computer systems today. However, provide parallel software efficiently targeting those architectures requires significant programming effort and expertise. The SPar domain-specific language already represents a solution to this problem providing proven high-level programming abstractions for multi-core architectures. In this paper, we enrich the SPar language adding support for GPUs. New transformation rules are designed for generating parallel code using stream and data parallel patterns. Our experiments revealed that these transformations rules are able to improve performance while the high-level programming abstractions are maintained.},
  keywords  = {GPGPU, Stream processing},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
@inproceedings{ROCKENBACH:stream-multigpus:IPDPSW:19,
  title     = {Stream Processing on Multi-cores with GPUs: Parallel Programming Models' Challenges},
  author    = {Dinei A Rockenbach and Charles Michael Stein and Dalvan Griebler and Gabriele Mencagli and Massimo Torquati and Marco Danelutto and Luiz Gustavo Fernandes},
  url       = {https://doi.org/10.1109/IPDPSW.2019.00137},
  doi       = {10.1109/IPDPSW.2019.00137},
  year      = {2019},
  date      = {2019-05-01},
  booktitle = {International Parallel and Distributed Processing Symposium Workshops (IPDPSW)},
  pages     = {834-841},
  publisher = {IEEE},
  address   = {Rio de Janeiro, Brazil},
  series    = {IPDPSW'19},
  abstract  = {The stream processing paradigm is used in several scientific and enterprise applications in order to continuously compute results out of data items coming from data sources such as sensors. The full exploitation of the potential parallelism offered by current heterogeneous multi-cores equipped with one or more GPUs is still a challenge in the context of stream processing applications. In this work, our main goal is to present the parallel programming challenges that the programmer has to face when exploiting CPUs and GPUs' parallelism at the same time using traditional programming models. We highlight the parallelization methodology in two use-cases (the Mandelbrot Streaming benchmark and the PARSEC's Dedup application) to demonstrate the issues and benefits of using heterogeneous parallel hardware. The experiments conducted demonstrate how a high-level parallel programming model targeting stream processing like the one offered by SPar can be used to reduce the programming effort still offering a good level of performance if compared with state-of-the-art programming models.},
  keywords  = {GPGPU, Stream processing},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
@inproceedings{larcc:mandelbrot_multicore_GPU:ERAD:19,
  title     = {Mandelbrot Streaming para Sistemas Multi-core com GPUs},
  author    = {Charles M Stein and Joao V Stein and Leonardo Boz and Dinei A Rockenbach and Dalvan Griebler},
  url       = {http://larcc.setrem.com.br/wp-content/uploads/2019/04/192109.pdf},
  year      = {2019},
  date      = {2019-04-01},
  booktitle = {19th Escola Regional de Alto Desempenho da Região Sul (ERAD/RS)},
  publisher = {Sociedade Brasileira de Computação},
  address   = {Três de Maio, RS, Brazil},
  abstract  = {Este trabalho visa explorar o paralelismo na aplicação Mandelbrot Streaming para arquiteturas multi-core com GPUs, usando as bibliotecas FastFlow, TBB e SPar com CUDA. A implementação do paralelismo foi baseada no padrão farm, alcançando speedup de 16x no sistema multi-core e de 77x em um ambiente multi-core com duas GPUs. Os resultados evidenciam um melhor desempenho no uso de GPUs embora tenham sido identificadas futuras melhorias.},
  keywords  = {GPGPU, Stream processing},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
Stein, Charles M; Rockenbach, Dinei A; Griebler, Dalvan Paralelização do Dedup para Sistemas Multi-core com GPUs Inproceedings 19th Escola Regional de Alto Desempenho da Região Sul (ERAD/RS), Sociedade Brasileira de Computação, Três de Maio, RS, Brazil, 2019. Abstract | Links | BibTeX | Tags: GPGPU @inproceedings{larcc:paralelizacao_multicore_GPU:ERAD:19, title = {Paralelização do Dedup para Sistemas Multi-core com GPUs}, author = {Charles M Stein and Dinei A Rockenbach and Dalvan Griebler}, url = {http://larcc.setrem.com.br/wp-content/uploads/2019/04/192087.pdf}, year = {2019}, date = {2019-04-01}, booktitle = {19th Escola Regional de Alto Desempenho da Região Sul (ERAD/RS)}, publisher = {Sociedade Brasileira de Computação}, address = {Três de Maio, RS, Brazil}, abstract = {O maior volume de dados gerado, trafegado e processado aumenta a demanda por mais poder de processamento e por algoritmos de compressão eficientes. Este trabalho tem como objetivo explorar o paralelismo de stream para arquiteturas multi-core com GPUs na aplicação Dedup, usando SPar com CUDA e OpenCL. Apesar do desempenho não ser o esperado, o artigo contribui com uma análise detalhada dos resultados e sugestões futuras de melhorias.}, keywords = {GPGPU}, pubstate = {published}, tppubtype = {inproceedings} } The growing volume of data generated, transferred, and processed increases the demand for more processing power and for efficient compression algorithms. This work aims to exploit stream parallelism on multi-core architectures with GPUs in the Dedup application, using SPar with CUDA and OpenCL. Although the performance was not as expected, the paper contributes a detailed analysis of the results and suggestions for future improvements. |
Stein, Charles Michael; Griebler, Dalvan; Danelutto, Marco; Fernandes, Luiz Gustavo Stream Parallelism on the LZSS Data Compression Application for Multi-Cores with GPUs Inproceedings doi 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 247-251, IEEE, Pavia, Italy, 2019. Abstract | Links | BibTeX | Tags: GPGPU, Stream processing @inproceedings{STEIN:LZSS-multigpu:PDP:19, title = {Stream Parallelism on the LZSS Data Compression Application for Multi-Cores with GPUs}, author = {Charles Michael Stein and Dalvan Griebler and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/EMPDP.2019.8671624}, doi = {10.1109/EMPDP.2019.8671624}, year = {2019}, date = {2019-02-01}, booktitle = {27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {247-251}, publisher = {IEEE}, address = {Pavia, Italy}, series = {PDP'19}, abstract = {GPUs have been used to accelerate different data parallel applications. The challenge consists in using GPUs to accelerate stream processing applications. Our goal is to investigate and evaluate whether stream parallel applications may benefit from parallel execution on both CPU and GPU cores. In this paper, we introduce new parallel algorithms for the Lempel-Ziv-Storer-Szymanski (LZSS) data compression application. We implemented the algorithms targeting both CPUs and GPUs. GPUs have been used with CUDA and OpenCL to exploit inner algorithm data parallelism. Outer stream parallelism has been exploited using CPU cores through SPar. The parallel implementation of LZSS achieved 135 fold speedup using a multi-core CPU and two GPUs. 
We also observed speedups in applications where we were not expecting to get them using the same combined data-stream parallel exploitation techniques.}, keywords = {GPGPU, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } |
2018 |
Stein, Charles Programação Paralela para GPU em Aplicações de Processamento Stream Undergraduate Thesis, 2018. Abstract | Links | BibTeX | Tags: GPGPU, Stream processing @misc{larcc:charles_stein:TCC:18, title = {Programação Paralela para GPU em Aplicações de Processamento Stream}, author = {Charles Stein}, url = {http://larcc.setrem.com.br/wp-content/uploads/2018/11/TCC_SETREM__Charles_Stein_1.pdf}, year = {2018}, date = {2018-06-01}, address = {Três de Maio, RS, Brazil}, school = {Sociedade Educacional Três de Maio (SETREM)}, abstract = {Stream processing applications are used in many areas. They usually require real-time processing and have a high computational load. The parallelization of this type of application is therefore necessary. The use of GPUs can hypothetically increase the performance of these stream processing applications. This work presents the study and parallel software implementation for GPU on stream processing applications. Applications from different areas were chosen and parallelized for CPU and GPU. A set of experiments was conducted and the results achieved were analyzed. The Sobel, LZSS, Dedup, and Black-Scholes applications were parallelized. The Sobel filter did not gain performance, while LZSS, Dedup, and Black-Scholes obtained speedups of 36x, 13x, and 6.9x, respectively. In addition to performance, the source lines of code of the implementations with the CUDA and OpenCL libraries were measured in order to analyze the code intrusion. The tests performed showed that in some applications the use of GPU is advantageous, while in other applications there are no significant gains when compared to the parallel versions on CPU.}, howpublished = {Undergraduate Thesis}, keywords = {GPGPU, Stream processing}, pubstate = {published}, tppubtype = {misc} } |
Stein, Charles M; Griebler, Dalvan Explorando o Paralelismo de Stream em CPU e de Dados em GPU na Aplicação de Filtro Sobel Inproceedings 18th Escola Regional de Alto Desempenho do Estado do Rio Grande do Sul (ERAD/RS), pp. 137-140, Sociedade Brasileira de Computação, Porto Alegre, RS, Brazil, 2018. Abstract | Links | BibTeX | Tags: GPGPU, Stream processing @inproceedings{larcc:stream_gpu_cuda:ERAD:18, title = {Explorando o Paralelismo de Stream em CPU e de Dados em GPU na Aplicação de Filtro Sobel}, author = {Charles M Stein and Dalvan Griebler}, url = {http://larcc.setrem.com.br/wp-content/uploads/2018/04/LARCC_ERAD_IC_Stein_2018.pdf}, year = {2018}, date = {2018-04-01}, booktitle = {18th Escola Regional de Alto Desempenho do Estado do Rio Grande do Sul (ERAD/RS)}, pages = {137-140}, publisher = {Sociedade Brasileira de Computação}, address = {Porto Alegre, RS, Brazil}, abstract = {O objetivo deste estudo é a paralelização combinada do stream em CPU e dos dados em GPU usando uma aplicação de filtro Sobel. Foi realizada uma avaliação do desempenho de OpenCL, OpenACC e CUDA com o algoritmo de multiplicação de matrizes para escolha da ferramenta a ser usada com a SPar. Concluiu-se que, apesar de a GPU apresentar um speedup de 11.81x com CUDA, o uso exclusivo da CPU com a SPar é mais vantajoso nesta aplicação.}, keywords = {GPGPU, Stream processing}, pubstate = {published}, tppubtype = {inproceedings} } The goal of this study is the combined parallelization of the stream on the CPU and of the data on the GPU using a Sobel filter application. The performance of OpenCL, OpenACC, and CUDA was evaluated with a matrix multiplication algorithm in order to choose the tool to be used with SPar. It was concluded that, although the GPU achieves an 11.81x speedup with CUDA, exclusive use of the CPU with SPar is more advantageous in this application. |