Is there a typical parallelism scenario for hybrid MPI / OpenMP codes?
A typical scenario is to use MPI for domain decomposition with four or eight MPI processes per node and then use the remaining cores for OpenMP threads of parallelism within each domain. Frequently, this additional parallelism is at the loop level but the more computation per thread, the better. The threads belonging to each MPI process carry out their computation until some synchronization point or until they’ve completed. It is important to remember not to use more than six OpenMP threads per NUMA node (24 per node).