How do I properly parallelise a for loop using OpenMP?
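I am testing OpenMP for C++ because my software will rely heavily on the speed gained from processor parallelisation.
I get strange results when running the following code:
- The parallel speedup is smaller than I expected.
- Without the -O flag, the parallelised code runs slower.
I am using the g++ compiler, version 7.3.0, on Ubuntu 18.04 with an i5-8600 CPU and 16 GB of RAM.
Output: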
Output 1 (Not allowed to embed yet since I'm a new member)
Transcript:
.../OpenMPTest$ g++ -O3 -o openmp main.cpp -fopenmp
.../OpenMPTest$ ./openmp
6 processors used.
Linear action took: 2.87415 seconds.
Parallel action took: 0.99954 seconds.
Output 2
.../OpenMPTest$ g++ -o openmp main.cpp -fopenmp
.../OpenMPTest$ ./openmp
6 processors used.
Linear action took: 25.7037 seconds.
Parallel action took: 68.0485 seconds.
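As you can see, with 6 processors I only get a speedup of about 2.9x. If I omit the -O flag, the program runs much slower, even though it still uses all 6 processors at 100% utilisation (checked with htop).
Why is this? Also, what can I do to achieve a 6x speedup?
Source code: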
#include <iostream>
#include <cstdlib>   // EXIT_SUCCESS
#include <ctime>
#include <ratio>
#include <chrono>
#include <array>
#include <omp.h>

int main() {
    using namespace std::chrono;
    const int big_number = 1000000000;
    std::array<double, 6> array = { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };

    // Sequential
    high_resolution_clock::time_point start_linear = high_resolution_clock::now();
    for(int i = 0; i < 6; i++) {
        for(int j = 0; j < big_number; j++) {
            array[i]++;
        }
    }
    high_resolution_clock::time_point end_linear = high_resolution_clock::now();

    // Parallel
    high_resolution_clock::time_point start_parallel = high_resolution_clock::now();
    array = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
    #pragma omp parallel
    {
        #pragma omp for
        for(int i = 0; i < 6; i++) {
            for(int j = 0; j < big_number; j++) {
                array[i]++;
            }
        }
    }
    high_resolution_clock::time_point end_parallel = high_resolution_clock::now();

    // Stats.
    std::cout << omp_get_num_procs() << " processors used." << std::endl << std::endl;
    duration<double> time_span = duration_cast<duration<double>>(end_linear - start_linear);
    std::cout << "Linear action took: " << time_span.count() << " seconds." << std::endl << std::endl;
    time_span = duration_cast<duration<double>>(end_parallel - start_parallel);
    std::cout << "Parallel action took: " << time_span.count() << " seconds." << std::endl << std::endl;
    return EXIT_SUCCESS;
}
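Your code appears to suffer from false sharing.
Avoid having different threads access the same cache line. A better approach is to avoid sharing variables between threads as far as possible.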
#include <iostream>
#include <cstdlib>   // EXIT_SUCCESS
#include <ctime>
#include <ratio>
#include <chrono>
#include <array>
#include <omp.h>

int main() {
    using namespace std::chrono;
    const int big_number = 1000000000;
    // 6 counters spaced 8 doubles (64 bytes) apart, so that each one sits on its
    // own cache line (assuming 64-byte lines); unlisted elements are zero-initialised.
    alignas(64) std::array<double, 6*8> array = { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };

    // Sequential
    high_resolution_clock::time_point start_linear = high_resolution_clock::now();
    for(int i = 0; i < 6; i++) {
        for(int j = 0; j < big_number; j++) {
            array[i]++;
        }
    }
    high_resolution_clock::time_point end_linear = high_resolution_clock::now();

    // Parallel
    high_resolution_clock::time_point start_parallel = high_resolution_clock::now();
    array = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
    #pragma omp parallel
    {
        #pragma omp for
        for(int i = 0; i < 6; i++) {
            for(int j = 0; j < big_number; j++) {
                array[i*8]++;   // stride of 8 doubles keeps each thread on a separate cache line
            }
        }
    }
    high_resolution_clock::time_point end_parallel = high_resolution_clock::now();

    // Stats.
    std::cout << omp_get_num_procs() << " processors used." << std::endl << std::endl;
    duration<double> time_span = duration_cast<duration<double>>(end_linear - start_linear);
    std::cout << "Linear action took: " << time_span.count() << " seconds." << std::endl << std::endl;
    time_span = duration_cast<duration<double>>(end_parallel - start_parallel);
    std::cout << "Parallel action took: " << time_span.count() << " seconds." << std::endl << std::endl;
    return EXIT_SUCCESS;
}
8 processors used.
Linear action took: 26.9021 seconds.
Parallel action took: 6.41319 seconds.
You can read this.