Why are MPI + OpenMP codes sometimes slower than MPI alone?
There are four common reasons why this might happen:
• A significant fraction of the runtime is spent in code that is not parallelized with OpenMP, or that contains a serializing construct such as a critical section or atomic operation.
• The loops being parallelized with OpenMP are too small to offset the overhead of creating threads.
• The OpenMP threads span more than one memory domain and see NUMA effects. On Hopper this corresponds to using more than six threads per MPI task.
• There are data consistency effects that lead to extraneous data movement (false sharing of cache lines).
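As an illustration of the loop-level reasons above, the following is a minimal sketch in C, not a definitive recipe. The function names `scale` and `sum` and the threshold `MIN_PARALLEL_N` are hypothetical, and the threshold value is a placeholder that should be chosen by measurement. The `if` clause keeps short loops serial so that thread overhead is not paid for them. The `reduction` clause gives each thread a private accumulator, which avoids both a critical section or atomic update on a shared variable and repeated writes to one shared cache line.

```c
#include <stddef.h>

/* Hypothetical threshold (an assumption, not a measured value):
   loops shorter than this run serially. */
#define MIN_PARALLEL_N 10000L

/* Multiply every element of x by a.
   The if clause disables threading for short loops. */
void scale(double *x, long n, double a)
{
    #pragma omp parallel for if(n >= MIN_PARALLEL_N) schedule(static)
    for (long i = 0; i < n; ++i)
        x[i] *= a;
}

/* Sum the elements of x.
   reduction(+:total) gives each thread its own private partial sum
   and combines them once at the end, instead of serializing every
   update through a critical section or atomic operation. */
double sum(const double *x, long n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total) if(n >= MIN_PARALLEL_N) schedule(static)
    for (long i = 0; i < n; ++i)
        total += x[i];
    return total;
}
```

Compiled without OpenMP support the pragmas are ignored and the functions run serially, which makes it easy to compare timings for the same source with and without threading.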