Papers | Benjamin Klein

Papers

2026

Limitations of Simple Additive and Multiplicative Scoring Methods

A critical review of Simple Additive Scoring (SAS) and Simple Multiplicative Scoring (SMS) in education, healthcare, GIS, and human development, showing where simple aggregation creates dangerous logic failures.

2025

Analyzing Performance in Stencil Operations using the Sobel Filter

A GPU vs. CPU performance study of the Sobel edge detection algorithm using OpenMP, CUDA, and OpenMP GPU offload on Perlmutter's A100 nodes, with CUDA achieving 91.4% occupancy at optimal configuration.

Comparing LIKWID Hardware Metrics to Performance across DGEMM Implementations

Using the LIKWID performance monitoring tool to correlate L2/L3 cache accesses and retired instruction counts with DGEMM runtime, revealing that naive implementations can incur up to 600x more L3 accesses than CBLAS.

Comparison of Serial and OpenMP Parallel Performance in Vector Matrix Multiplication

Benchmarking basic, vectorized, OpenMP-parallel, and CBLAS vector matrix multiplication on Perlmutter, finding that compiler vectorization delivers a 4-5x speedup and that 4 threads is the optimal OpenMP thread count.

Runtime Efficiency Gains in Matrix Multiplication with Blocking and Copy Optimizations

Evaluating naive, blocked-with-copy (BMMCO), and CBLAS DGEMM implementations on Perlmutter CPU nodes, showing how cache-aware blocking improves performance from 123 MFLOP/s to over 4600 MFLOP/s.

Summation Method Performance Analysis

A performance study comparing direct, vector, and indirect summation on the Perlmutter supercomputer, showing serial array access is roughly 100x faster than random access.