Analyzes three implementations of the Sobel edge detection filter (OpenMP on CPU, CUDA on GPU, and OpenMP GPU offload) running on Perlmutter’s A100 GPU nodes at NERSC. The CPU OpenMP implementation scales near-linearly up to 8 threads, where it reaches a 7.5x speedup. On the GPU, CUDA launched with 4096 blocks of 256 threads each achieves 91.4% occupancy and a 0.49 ms runtime, outperforming the OpenMP offload approach (0.62 ms, 66.8% occupancy). A key finding: increasing the block count improves GPU utilization faster than increasing the thread count per block.
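For readers unfamiliar with the filter, the following is a minimal sketch of what a CPU OpenMP Sobel pass might look like. It is not the code that was benchmarked: the function name `sobel_filter`, the row-major `float` image layout, and the choice to leave border pixels at zero are all assumptions made for illustration. Each output pixel depends only on its 3x3 input neighborhood, so rows can be distributed across threads without synchronization, which is consistent with the near-linear CPU scaling reported above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the benchmarked code): applies the 3x3 Sobel
// operator to a row-major grayscale image and returns the gradient magnitude.
// Assumption: border pixels are left at 0 rather than padded or mirrored.
std::vector<float> sobel_filter(const std::vector<float>& in, int width, int height) {
    // Standard Sobel kernels for the horizontal (gx) and vertical (gy) gradients.
    static const float kGx[3][3] = {{-1.0f, 0.0f, 1.0f},
                                    {-2.0f, 0.0f, 2.0f},
                                    {-1.0f, 0.0f, 1.0f}};
    static const float kGy[3][3] = {{-1.0f, -2.0f, -1.0f},
                                    { 0.0f,  0.0f,  0.0f},
                                    { 1.0f,  2.0f,  1.0f}};

    std::vector<float> out(in.size(), 0.0f);

    // Rows are independent, so they can be split across threads.
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            float gx = 0.0f;
            float gy = 0.0f;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const float p = in[static_cast<std::size_t>(y + dy) * width + (x + dx)];
                    gx += kGx[dy + 1][dx + 1] * p;
                    gy += kGy[dy + 1][dx + 1] * p;
                }
            }
            out[static_cast<std::size_t>(y) * width + x] = std::sqrt(gx * gx + gy * gy);
        }
    }
    return out;
}
```

As a sanity check, a 4x4 image whose left two columns are 0 and right two columns are 1 (a vertical step edge) yields a gradient magnitude of 4 at the interior pixels, while a uniform image yields 0. Built with an OpenMP-enabled compiler (for example `-fopenmp` with GCC), the thread count can be varied through the `OMP_NUM_THREADS` environment variable.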