1 min read
Runtime Efficiency Gains in Matrix Multiplication with Blocking and Copy Optimizations

This post benchmarks three DGEMM implementations on Perlmutter (NERSC): a naive three-loop algorithm, a blocked matrix multiply with copy optimization (BMMCO), and the CBLAS reference library. The naive approach falls to 123 MFLOP/s at N=2048 due to poor cache reuse, while BMMCO with block sizes of 16 or 32 reaches 4300–4600 MFLOP/s. CBLAS peaks at over 50,000 MFLOP/s, illustrating the further gap that vectorization and deeper hardware-specific optimizations can close.