Papers
2026
2025
- Analyzing Performance in Stencil Operations using the Sobel Filter
  A GPU vs. CPU performance study of the Sobel edge detection algorithm using OpenMP, CUDA, and OpenMP GPU offload on Perlmutter's A100 nodes, with CUDA achieving 91.4% occupancy at its optimal configuration.
- Comparing LIKWID Hardware Metrics to Performance across DGEMM Implementations
  Using the LIKWID performance monitoring tool to correlate L2/L3 cache accesses and retired instruction counts with DGEMM runtime, revealing that naive implementations can incur up to 600x more L3 accesses than CBLAS.
- Comparison of Serial and OpenMP Parallel Performance in Vector Matrix Multiplication
  Benchmarking basic, vectorized, OpenMP-parallel, and CBLAS vector-matrix multiplication on Perlmutter, finding that compiler vectorization delivers a 4-5x speedup and that 4 threads is the optimal OpenMP thread count.
- Runtime Efficiency Gains in Matrix Multiplication with Blocking and Copy Optimizations
  Evaluating naive, blocked-with-copy (BMMCO), and CBLAS DGEMM implementations on Perlmutter CPU nodes, showing how cache-aware blocking improves performance from 123 MFLOP/s to over 4600 MFLOP/s.
- Summation Method Performance Analysis
  A performance study comparing direct, vector, and indirect summation on the Perlmutter supercomputer, showing that serial array access is roughly 100x faster than random access.