Non-parallel for loop in a parallel block
I have a parallel block that spawns a number of threads. All of these threads should then run one "shared" for loop, which itself contains several parallel for loops. Something like this, for example:
// 1. The parallel region spawns a number of threads.
#pragma omp parallel
{
    // 2. Each thread does something before it enters the loop below.
    doSomethingOnEachThreadAsPreparation();

    // 3. This loop should be run by all threads synchronously; i belongs
    //    to all threads simultaneously.
    //    Basically there is only one variable i. When all threads reach this
    //    loop, i is at first set to zero.
    for (int i = 0; i < 100; i++)
    {
        // 4. Then each thread calls this function (this happens in parallel).
        doSomethingOnEachThreadAtTheStartOfEachIteration();

        // 5. Then all threads work on this for loop in parallel.
        #pragma omp for
        for (int k = 0; k < 100000000; k++)
            doSomethingVeryTimeConsumingInParallel(k);

        // 6. After the parallel for loop there is (always) an implicit barrier.
        // 7. When all threads have finished the for loop, they call this method in parallel.
        doSomethingOnEachThreadAfterEachIteration();

        // 8. Here should be another barrier. Once every thread has finished
        //    the call above, they jump back to the top of the for loop,
        //    where i is set to i + 1. If the loop condition
        //    holds, continue at 4., otherwise go to 9.
    }

    // 9. When the "non-parallel" loop has finished, each thread continues.
    doSomethingMoreOnEachThread();
}
I think it might already be possible to do this with #pragma omp single and a shared i variable, but I am no longer sure.
What the functions actually do is irrelevant; this is about the control flow. I added comments describing what I want.
If I understand correctly, the loop at 3. would normally create an i variable for each thread, and the loop header would normally not be executed by just a single thread. But that is exactly what I want here.
You can run the for loop in all threads. Depending on your algorithm, synchronization may be required after every iteration (as below) or only once at the end of all iterations.
#pragma omp parallel
{
    // enter the parallel region
    doSomethingOnEachThreadAsPreparation();
    // done in parallel by all threads

    for (int i = 0; i < 100; i++)
    {
        doSomethingOnEachThreadAtTheStartOfEachIteration();

        // parallelize the for loop
        #pragma omp for
        for (int k = 0; k < 100000000; k++)
            doSomethingVeryTimeConsumingInParallel(k);
        // implicit barrier

        doSomethingOnEachThreadAfterEachIteration();

        #pragma omp barrier
        // A barrier may be required here so that all iterations
        // stay synchronous, but if the algorithm does not need it,
        // performance will be better without the barrier.
    }

    doSomethingMoreOnEachThread();
    // still in parallel
}
As Zulan pointed out, surrounding the main for loop with omp single in order to re-enter a parallel section afterwards does not work unless you use nested parallelism. In that case, threads would be re-created at every iteration, which would cause a major slowdown.
omp_set_nested(1);
#pragma omp parallel
{
    // enter the parallel region
    doSomethingOnEachThreadAsPreparation();
    // done in parallel by all threads

    #pragma omp single
    // only one thread runs the loop
    for (int i = 0; i < 100; i++)
    {
        #pragma omp parallel
        {
            // Create a new nested parallel section.
            // New threads are created, and this will
            // certainly degrade performance.
            doSomethingOnEachThreadAtTheStartOfEachIteration();

            // and we parallelize the for loop
            #pragma omp for
            for (int k = 0; k < 100000000; k++)
                doSomethingVeryTimeConsumingInParallel(k);
            // implicit barrier

            doSomethingOnEachThreadAfterEachIteration();
        }
        // we leave the nested parallel section (implicit barrier)
    }
    // we leave the single section

    doSomethingMoreOnEachThread();
    // and we continue running in parallel
}