Can the CPU and GPU run in parallel?
Yes. Kernel invocation in CUDA is asynchronous, so the driver returns control to the application as soon as it has launched the kernel, and the CPU can continue working while the GPU executes. When measuring performance, call cudaDeviceSynchronize() (which replaces the deprecated cudaThreadSynchronize()) before stopping the timer, to ensure that all device operations have completed. The blocking CUDA functions that perform memory copies (such as cudaMemcpy()) and those that control graphics interoperability are synchronous, and implicitly wait for all kernels to complete.
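The timing pattern described above might look like the following minimal sketch. The kernel `scale` and the helper `timed_launch_ms` are hypothetical names invented for illustration, and the sketch assumes a current CUDA toolkit, so it calls cudaDeviceSynchronize() rather than the older cudaThreadSynchronize().

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: scales each element in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: launches the kernel, waits for the device to finish,
// and returns the elapsed wall-clock time in milliseconds (negative on error).
double timed_launch_ms(float *d_data, int n) {
    auto start = std::chrono::steady_clock::now();

    // The launch returns to the host without waiting for the kernel to finish.
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);

    // Independent CPU work placed here can overlap with the running kernel.

    // Block until all device work has completed before reading the timer.
    if (cudaDeviceSynchronize() != cudaSuccess) return -1.0;

    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d_data, 0, n * sizeof(float));

    double ms = timed_launch_ms(d_data, n);
    std::printf("kernel + sync took %.3f ms\n", ms);

    cudaFree(d_data);
    return ms < 0.0 ? 1 : 0;
}
```

Without the synchronization call, the timer would typically capture only the launch overhead, not the kernel's execution time.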