pthread 和 printf 的 C 性能都很差
Poor C performance with both pthread and printf
我正在使用大型数组测试 Linux 的 c 代码以测量线程性能,当线程增加到最大内核数(Intel 4770 为 8 个)时,应用程序的扩展性非常好,但这仅适用于我代码的纯数学部分。
如果我为结果数组添加 printf 部分,那么时间会变得太大,从几秒到几分钟,即使重定向到一个文件,当 printf 时,这些数组应该只添加几秒钟。
代码:
(gcc 7.5.0-Ubuntu 18.04)
没有 printf 循环:
gcc -O3 -m64 exp_multi.c -pthread -lm
使用 printf 循环:
gcc -DPRINT_ARRAY -O3 -m64 exp_multi.c -pthread -lm
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 8
static double xv[MAXSIZE];
static double yv[MAXSIZE];
/* gcc -O3 -m64 exp_multi.c -pthread -lm */
void* run(void *received_Val){
int single_val = *((int *) received_Val);
int r;
int i;
double p;
for (r = 0; r < REIT; r++) {
p = XXX + 0.00001*single_val*MAXSIZE/num_threads;
for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i++) {
xv[i]=p;
yv[i]=exp(p);
p += 0.00001;
}
}
return 0;
}
int main(){
int i;
pthread_t tid[num_threads];
for (i=0;i<num_threads;i++){
int *arg = malloc(sizeof(*arg));
if ( arg == NULL ) {
fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
exit(1);
}
*arg = i;
pthread_create(&(tid[i]), NULL, run, arg);
}
for(i=0; i<num_threads; i++)
{
pthread_join(tid[i], NULL);
}
#ifdef PRINT_ARRAY
for (i=0;i<MAXSIZE;i++){
printf("x=%.20lf, e^x=%.20lf\n",xv[i],yv[i]);
}
#endif
return 0;
}
pthread_create 中的 malloc 按照 post.
中的建议将整数作为最后一个参数传递
我试过了,没有成功,clang,添加free(tid)指令,避免使用malloc指令,反向循环,只有1个一维数组,1个线程版本没有pthread。
EDIT2:我认为 exp 函数是处理器资源密集型的,可能受到处理器生成实现的每核缓存或 SIMD 资源的影响。以下示例代码基于许可代码 .
此代码无论有没有 printf 循环都运行得很快,来自 math.h 的 exp 在几年前得到了改进,它可以快 40 倍左右,至少在 Intel 4770 (Haswell) 上是这样,这个link 是数学库 vs SSE2 的已知测试代码,现在数学的 exp 速度应该接近针对 float 和 x8 并行计算优化的 AVX2 算法。
测试结果:expf 与其他 SSE2 算法 (exp_ps):
sinf .. -> 55.5 millions of vector evaluations/second -> 12 cycles/value
cosf .. -> 57.3 millions of vector evaluations/second -> 11 cycles/value
sincos (x87) .. -> 9.1 millions of vector evaluations/second -> 71 cycles/value
expf .. -> 61.4 millions of vector evaluations/second -> 11 cycles/value
logf .. -> 55.6 millions of vector evaluations/second -> 12 cycles/value
cephes_sinf .. -> 52.5 millions of vector evaluations/second -> 12 cycles/value
cephes_cosf .. -> 41.9 millions of vector evaluations/second -> 15 cycles/value
cephes_expf .. -> 18.3 millions of vector evaluations/second -> 35 cycles/value
cephes_logf .. -> 20.2 millions of vector evaluations/second -> 32 cycles/value
sin_ps .. -> 54.1 millions of vector evaluations/second -> 12 cycles/value
cos_ps .. -> 54.8 millions of vector evaluations/second -> 12 cycles/value
sincos_ps .. -> 54.6 millions of vector evaluations/second -> 12 cycles/value
exp_ps .. -> 42.6 millions of vector evaluations/second -> 15 cycles/value
log_ps .. -> 41.0 millions of vector evaluations/second -> 16 cycles/value
/* Performance test exp(x) algorithm
based on AVX implementation of Giovanni Garberoglio
Copyright (C) 2020 Antonio R.
AVX implementation of exp:
Modified code from this source: https://github.com/reyoung/avx_mathfun
Based on "sse_mathfun.h", by Julien Pommier
http://gruntthepeon.free.fr/ssemath/
Copyright (C) 2012 Giovanni Garberoglio
Interdisciplinary Laboratory for Computational Science (LISC)
Fondazione Bruno Kessler and University of Trento
via Sommarive, 18
I-38123 Trento (Italy)
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
(this is the zlib license)
*/
/* gcc -O3 -m64 -Wall -mavx2 -march=haswell expc.c -lm */
#include <stdio.h>
#include <immintrin.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
__m256 exp256_ps(__m256 x) {
/*
To increase the compatibility across different compilers the original code is
converted to plain AVX2 intrinsics code without ingenious macro's,
gcc style alignment attributes etc.
Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
This modified code is not thoroughly tested!
*/
__m256 exp_hi = _mm256_set1_ps(88.3762626647949f);
__m256 exp_lo = _mm256_set1_ps(-88.3762626647949f);
__m256 cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256 inv_LOG2EF = _mm256_set1_ps(0.693147180559945f);
__m256 cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256 cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256 cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256 cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256 cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256 cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256 fx;
__m256i imm0;
__m256 one = _mm256_set1_ps(1.0f);
x = _mm256_min_ps(x, exp_hi);
x = _mm256_max_ps(x, exp_lo);
/* express exp(x) as exp(g + n*log(2)) */
fx = _mm256_mul_ps(x, cephes_LOG2EF);
fx = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256 z = _mm256_mul_ps(fx, inv_LOG2EF);
x = _mm256_sub_ps(x, z);
z = _mm256_mul_ps(x,x);
__m256 y = cephes_exp_p0;
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p1);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p2);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p3);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p4);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p5);
y = _mm256_mul_ps(y, z);
y = _mm256_add_ps(y, x);
y = _mm256_add_ps(y, one);
/* build 2^n */
imm0 = _mm256_cvttps_epi32(fx);
imm0 = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
imm0 = _mm256_slli_epi32(imm0, 23);
__m256 pow2n = _mm256_castsi256_ps(imm0);
y = _mm256_mul_ps(y, pow2n);
return y;
}
int main(){
int r;
int i;
float p;
static float xv[MAXSIZE];
static float yv[MAXSIZE];
float *xp;
float *yp;
for (r = 0; r < REIT; r++) {
p = XXX;
xp = xv;
yp = yv;
for (i = 0; i < MAXSIZE; i += 8) {
__m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
__m256 y = exp256_ps(x);
_mm256_store_ps(xp,x);
_mm256_store_ps(yp,y);
xp += 8;
yp += 8;
p += 0.00008;
}
}
for (i=0;i<MAXSIZE;i++){
printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
}
return 0;
}
为了比较,这是数学库中的 exp (x)、单线程和浮点数的代码示例。
#include <stdio.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
/* gcc -O3 -m64 exp_st.c -lm */
int main(){
int r;
int i;
float p;
static float xv[MAXSIZE];
static float yv[MAXSIZE];
for (r = 0; r < REIT; r++) {
p = XXX;
for (i = 0; i < MAXSIZE; i++) {
xv[i]=p;
yv[i]=expf(p);
p += 0.00001;
}
}
for (i=0;i<MAXSIZE;i++){
printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
}
return 0;
}
解决方案:正如 Andreas Wenzel 所说,gcc 编译器足够聪明,它决定没有必要将结果实际写入数组,这些写入已被编译器优化掉。在我根据新信息进行了新的性能测试之后,或者在我犯了几个错误之前或者我假设了错误的事实之前,结果似乎更清楚了:exp (double arg), or expf( float arg) which is x2+ exp(double arg) ,已得到改进,但它不像快速 AVX2 算法(x8 并行浮点参数)那样快,它比 SSE2 算法(x4 并行浮点参数)快 6 倍左右。以下是一些结果,正如英特尔超线程 CPUs 所预期的那样,SSE2 算法除外:
exp(双参数)单线程:18 分 46 秒
exp(双参数)4 个线程:5 分 4 秒
exp(双参数)8 个线程:4 分 28 秒
expf (float arg) 单线程:7 分 32 秒
expf(浮点参数)4 个线程:1 分 58 秒
expf (float arg) 8 个线程:1 分 41 秒
相对误差**:
i x y = expf(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.518329620e+00 y = 8.059411496e-02 exp_dbl = 8.059411715e-02 rel_err =-2.716802480e-08
i = 372000 x =-1.278343201e+00 y = 2.784983218e-01 exp_dbl = 2.784983343e-01 rel_err =-4.490403948e-08
i = 496000 x =-3.867173195e-02 y = 9.620664716e-01 exp_dbl = 9.620664730e-01 rel_err =-1.481617428e-09
i = 620000 x = 1.201261759e+00 y = 3.324308872e+00 exp_dbl = 3.324308753e+00 rel_err = 3.571995830e-08
i = 744000 x = 2.441616058e+00 y = 1.149159718e+01 exp_dbl = 1.149159684e+01 rel_err = 2.955980805e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.921588898e+00 y = 1.372204742e+02 exp_dbl = 1.372204694e+02 rel_err = 3.563072184e-08
*SSE2算法通过Julien Pommier,x6,8速度从一个线程提升到8个线程。我的性能测试代码使用 aligned(16) 作为传递给库的 vector/4 浮点数组的 typedef 联合,而不是对齐整个浮点数组。这可能会导致性能降低,至少对于其他 AVX2 代码而言,它在多线程方面的性能改进似乎对英特尔超线程也有好处,但在较低的速度下,x2.5-x1.5 之间的时间增加了。也许 SSE2 代码可以通过更好的数组对齐来加速,但我无法改进:
exp_ps(x4 并行浮点参数)单线程:12 分 7 秒
exp_ps(x4 并行浮点参数)4 个线程:3 分 10 秒
exp_ps(x4 并行浮点参数)8 个线程:1 分 46 秒
相对误差**:
i x y = exp_ps(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.518329620e+00 y = 8.059412241e-02 exp_dbl = 8.059411715e-02 rel_err = 6.527768787e-08
i = 372000 x =-1.278343201e+00 y = 2.784983218e-01 exp_dbl = 2.784983343e-01 rel_err =-4.490403948e-08
i = 496000 x =-3.977407143e-02 y = 9.610065222e-01 exp_dbl = 9.610065335e-01 rel_err =-1.174323454e-08
i = 620000 x = 1.200158238e+00 y = 3.320642233e+00 exp_dbl = 3.320642334e+00 rel_err =-3.054731957e-08
i = 744000 x = 2.441616058e+00 y = 1.149159622e+01 exp_dbl = 1.149159684e+01 rel_err =-5.342903415e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.921588898e+00 y = 1.372204742e+02 exp_dbl = 1.372204694e+02 rel_err = 3.563072184e-08
AVX2算法(x8并行浮点参数)单线程:1分45秒
AVX2 算法(x8 并行浮点参数)4 个线程:28 秒
AVX2 算法(x8 并行浮点参数)8 个线程:27 秒
相对误差**:
i x y = exp256_ps(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.516632080e+00 y = 8.073104918e-02 exp_dbl = 8.073104510e-02 rel_err = 5.057888540e-08
i = 372000 x =-1.279417157e+00 y = 2.781994045e-01 exp_dbl = 2.781993997e-01 rel_err = 1.705288467e-08
i = 496000 x =-3.954863176e-02 y = 9.612231851e-01 exp_dbl = 9.612232069e-01 rel_err =-2.269774967e-08
i = 620000 x = 1.199879169e+00 y = 3.319715738e+00 exp_dbl = 3.319715775e+00 rel_err =-1.119642824e-08
i = 744000 x = 2.440370798e+00 y = 1.147729492e+01 exp_dbl = 1.147729571e+01 rel_err =-6.896860199e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.923286438e+00 y = 1.374535980e+02 exp_dbl = 1.374536045e+02 rel_err =-4.676466368e-08
**相对错误是相同的,因为SSE2和AVX2的代码使用相同的算法,而且很可能也是库函数exp(x)的代码。
源码AVX2算法多线程
/* Performance test of a multithreaded exp(x) algorithm
based on AVX implementation of Giovanni Garberoglio
Copyright (C) 2020 Antonio R.
AVX implementation of exp:
Modified code from this source: https://github.com/reyoung/avx_mathfun
Based on "sse_mathfun.h", by Julien Pommier
http://gruntthepeon.free.fr/ssemath/
Copyright (C) 2012 Giovanni Garberoglio
Interdisciplinary Laboratory for Computational Science (LISC)
Fondazione Bruno Kessler and University of Trento
via Sommarive, 18
I-38123 Trento (Italy)
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
(this is the zlib license)
*/
/* gcc -O3 -m64 -mavx2 -march=haswell expc_multi.c -pthread -lm */
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 4
typedef float FLOAT32[MAXSIZE] __attribute__((aligned(4)));
static FLOAT32 xv;
static FLOAT32 yv;
__m256 exp256_ps(__m256 x) {
/*
To increase the compatibility across different compilers the original code is
converted to plain AVX2 intrinsics code without ingenious macro's,
gcc style alignment attributes etc.
Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
This modified code is not thoroughly tested!
*/
__m256 exp_hi = _mm256_set1_ps(88.3762626647949f);
__m256 exp_lo = _mm256_set1_ps(-88.3762626647949f);
__m256 cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256 inv_LOG2EF = _mm256_set1_ps(0.693147180559945f);
__m256 cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256 cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256 cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256 cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256 cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256 cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256 fx;
__m256i imm0;
__m256 one = _mm256_set1_ps(1.0f);
x = _mm256_min_ps(x, exp_hi);
x = _mm256_max_ps(x, exp_lo);
/* express exp(x) as exp(g + n*log(2)) */
fx = _mm256_mul_ps(x, cephes_LOG2EF);
fx = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256 z = _mm256_mul_ps(fx, inv_LOG2EF);
x = _mm256_sub_ps(x, z);
z = _mm256_mul_ps(x,x);
__m256 y = cephes_exp_p0;
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p1);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p2);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p3);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p4);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p5);
y = _mm256_mul_ps(y, z);
y = _mm256_add_ps(y, x);
y = _mm256_add_ps(y, one);
/* build 2^n */
imm0 = _mm256_cvttps_epi32(fx);
imm0 = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
imm0 = _mm256_slli_epi32(imm0, 23);
__m256 pow2n = _mm256_castsi256_ps(imm0);
y = _mm256_mul_ps(y, pow2n);
return y;
}
void* run(void *received_Val){
int single_val = *((int *) received_Val);
int r;
int i;
float p;
float *xp;
float *yp;
for (r = 0; r < REIT; r++) {
p = XXX + 0.00001*single_val*MAXSIZE/num_threads;
xp = xv + single_val*MAXSIZE/num_threads;
yp = yv + single_val*MAXSIZE/num_threads;
for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i += 8) {
__m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
__m256 y = exp256_ps(x);
_mm256_store_ps(xp,x);
_mm256_store_ps(yp,y);
xp += 8;
yp += 8;
p += 0.00008;
}
}
return 0;
}
int main(){
int i;
pthread_t tid[num_threads];
for (i=0;i<num_threads;i++){
int *arg = malloc(sizeof(*arg));
if ( arg == NULL ) {
fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
exit(1);
}
*arg = i;
pthread_create(&(tid[i]), NULL, run, arg);
}
for(i=0; i<num_threads; i++)
{
pthread_join(tid[i], NULL);
}
for (i=0;i<MAXSIZE;i++){
printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
}
return 0;
}
图表概览:
没有 printf 循环的 exp(双参数),不是真正的性能,正如 Andreas Wenzel 发现的那样,当结果不会是 printf 时,gcc 不会计算 exp(x),即使是 float 版本也慢得多,因为它的汇编指令不同。尽管 graph 可能对某些仅使用低级 CPU cache/registers 的组装算法有用。
expf (float arg) 实际性能或使用 printf 循环
AVX2算法,性能最佳。
我认为这与 pthread
没有太大关系,因为您的代码似乎只在线程加入后才调用 printf
。相反,性能不佳可能是由于需要在打印循环的每次迭代中从 xv
和 yv
数组读取缓存未命中。
当您不在程序末尾打印数组时,gcc 编译器足够聪明,可以意识到计算结果没有可观察到的效果。因此,编译器决定没有必要将结果实际写入数组,因为这些结果从未被使用过。相反,这些写入被编译器优化掉了。
此外,当您不打印结果时,库函数 exp
没有任何可观察到的效果,前提是调用它的输入不会高到会导致 floating-point溢出(这会导致函数 raise a floating point exception)。这也允许编译器优化这些函数调用。
正如你在assembly instructions emitted by the gcc compiler for your code which does not print the results中看到的,编译后的程序不会无条件地调用函数exp
,而是测试函数exp
的输入是否会更高比 7.09e2
(以确保不会发生溢出)。只有发生溢出时,程序才会跳转到调用函数exp
的代码。下面是相关的汇编代码:
ucomisd xmm1, xmm3
jnb .L9
在上面的汇编代码中,CPU寄存器xmm3
包含double-precisionfloating-point值7.09e2
。如果输入高于此常量,函数 exp
将导致 floating-point 溢出,因为结果无法用 double-precision floating-point 值表示。
由于您的输入始终有效且低到不会导致 floating-point 溢出,您的程序永远不会执行此跳转,因此它永远不会实际调用函数 exp
.
这解释了为什么不打印结果时您的代码速度如此之快。如果您不打印结果,您的编译器将确定计算没有可观察到的效果,因此它将优化它们。
因此,如果想让编译器真正执行计算,就必须保证计算有一些可观察的效果。这并不意味着您必须实际打印所有结果(几兆字节大)。如果您只打印一行取决于所有结果(例如所有结果的总和)就足够了。
但是,如果您将对库函数 exp
的函数调用替换为对其他自定义函数的调用,那么,至少在我的测试中,编译器不够智能,无法实现该函数调用没有明显的影响。在那种情况下,即使不打印计算结果,也无法优化函数调用。
出于上述原因,如果您想比较这两个函数的性能,您必须通过确保结果具有可观察的效果来确保计算确实发生。否则,您 运行 编译器至少会优化部分计算的风险,并且比较将不公平。
我正在使用大型数组测试 Linux 的 c 代码以测量线程性能,当线程增加到最大内核数(Intel 4770 为 8 个)时,应用程序的扩展性非常好,但这仅适用于我代码的纯数学部分。
如果我为结果数组添加 printf 部分,那么时间会变得太大,从几秒到几分钟,即使重定向到一个文件,当 printf 时,这些数组应该只添加几秒钟。
代码:
(gcc 7.5.0-Ubuntu 18.04)
没有 printf 循环:
gcc -O3 -m64 exp_multi.c -pthread -lm
使用 printf 循环:
gcc -DPRINT_ARRAY -O3 -m64 exp_multi.c -pthread -lm
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 8
static double xv[MAXSIZE];
static double yv[MAXSIZE];
/* gcc -O3 -m64 exp_multi.c -pthread -lm */
void* run(void *received_Val){
int single_val = *((int *) received_Val);
int r;
int i;
double p;
for (r = 0; r < REIT; r++) {
p = XXX + 0.00001*single_val*MAXSIZE/num_threads;
for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i++) {
xv[i]=p;
yv[i]=exp(p);
p += 0.00001;
}
}
return 0;
}
int main(){
int i;
pthread_t tid[num_threads];
for (i=0;i<num_threads;i++){
int *arg = malloc(sizeof(*arg));
if ( arg == NULL ) {
fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
exit(1);
}
*arg = i;
pthread_create(&(tid[i]), NULL, run, arg);
}
for(i=0; i<num_threads; i++)
{
pthread_join(tid[i], NULL);
}
#ifdef PRINT_ARRAY
for (i=0;i<MAXSIZE;i++){
printf("x=%.20lf, e^x=%.20lf\n",xv[i],yv[i]);
}
#endif
return 0;
}
pthread_create 中的 malloc 按照 post.
中的建议将整数作为最后一个参数传递我试过了,没有成功,clang,添加free(tid)指令,避免使用malloc指令,反向循环,只有1个一维数组,1个线程版本没有pthread。
EDIT2:我认为 exp 函数是处理器资源密集型的,可能受到处理器生成实现的每核缓存或 SIMD 资源的影响。以下示例代码基于许可代码
此代码无论有没有 printf 循环都运行得很快,来自 math.h 的 exp 在几年前得到了改进,它可以快 40 倍左右,至少在 Intel 4770 (Haswell) 上是这样,这个link 是数学库 vs SSE2 的已知测试代码,现在数学的 exp 速度应该接近针对 float 和 x8 并行计算优化的 AVX2 算法。
测试结果:expf 与其他 SSE2 算法 (exp_ps):
sinf .. -> 55.5 millions of vector evaluations/second -> 12 cycles/value
cosf .. -> 57.3 millions of vector evaluations/second -> 11 cycles/value
sincos (x87) .. -> 9.1 millions of vector evaluations/second -> 71 cycles/value
expf .. -> 61.4 millions of vector evaluations/second -> 11 cycles/value
logf .. -> 55.6 millions of vector evaluations/second -> 12 cycles/value
cephes_sinf .. -> 52.5 millions of vector evaluations/second -> 12 cycles/value
cephes_cosf .. -> 41.9 millions of vector evaluations/second -> 15 cycles/value
cephes_expf .. -> 18.3 millions of vector evaluations/second -> 35 cycles/value
cephes_logf .. -> 20.2 millions of vector evaluations/second -> 32 cycles/value
sin_ps .. -> 54.1 millions of vector evaluations/second -> 12 cycles/value
cos_ps .. -> 54.8 millions of vector evaluations/second -> 12 cycles/value
sincos_ps .. -> 54.6 millions of vector evaluations/second -> 12 cycles/value
exp_ps .. -> 42.6 millions of vector evaluations/second -> 15 cycles/value
log_ps .. -> 41.0 millions of vector evaluations/second -> 16 cycles/value
/* Performance test exp(x) algorithm
based on AVX implementation of Giovanni Garberoglio
Copyright (C) 2020 Antonio R.
AVX implementation of exp:
Modified code from this source: https://github.com/reyoung/avx_mathfun
Based on "sse_mathfun.h", by Julien Pommier
http://gruntthepeon.free.fr/ssemath/
Copyright (C) 2012 Giovanni Garberoglio
Interdisciplinary Laboratory for Computational Science (LISC)
Fondazione Bruno Kessler and University of Trento
via Sommarive, 18
I-38123 Trento (Italy)
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
(this is the zlib license)
*/
/* gcc -O3 -m64 -Wall -mavx2 -march=haswell expc.c -lm */
#include <stdio.h>
#include <immintrin.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
__m256 exp256_ps(__m256 x) {
/*
To increase the compatibility across different compilers the original code is
converted to plain AVX2 intrinsics code without ingenious macro's,
gcc style alignment attributes etc.
Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
This modified code is not thoroughly tested!
*/
__m256 exp_hi = _mm256_set1_ps(88.3762626647949f);
__m256 exp_lo = _mm256_set1_ps(-88.3762626647949f);
__m256 cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256 inv_LOG2EF = _mm256_set1_ps(0.693147180559945f);
__m256 cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256 cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256 cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256 cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256 cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256 cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256 fx;
__m256i imm0;
__m256 one = _mm256_set1_ps(1.0f);
x = _mm256_min_ps(x, exp_hi);
x = _mm256_max_ps(x, exp_lo);
/* express exp(x) as exp(g + n*log(2)) */
fx = _mm256_mul_ps(x, cephes_LOG2EF);
fx = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256 z = _mm256_mul_ps(fx, inv_LOG2EF);
x = _mm256_sub_ps(x, z);
z = _mm256_mul_ps(x,x);
__m256 y = cephes_exp_p0;
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p1);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p2);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p3);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p4);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p5);
y = _mm256_mul_ps(y, z);
y = _mm256_add_ps(y, x);
y = _mm256_add_ps(y, one);
/* build 2^n */
imm0 = _mm256_cvttps_epi32(fx);
imm0 = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
imm0 = _mm256_slli_epi32(imm0, 23);
__m256 pow2n = _mm256_castsi256_ps(imm0);
y = _mm256_mul_ps(y, pow2n);
return y;
}
int main(){
int r;
int i;
float p;
static float xv[MAXSIZE];
static float yv[MAXSIZE];
float *xp;
float *yp;
for (r = 0; r < REIT; r++) {
p = XXX;
xp = xv;
yp = yv;
for (i = 0; i < MAXSIZE; i += 8) {
__m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
__m256 y = exp256_ps(x);
_mm256_store_ps(xp,x);
_mm256_store_ps(yp,y);
xp += 8;
yp += 8;
p += 0.00008;
}
}
for (i=0;i<MAXSIZE;i++){
printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
}
return 0;
}
为了比较,这是数学库中的 exp (x)、单线程和浮点数的代码示例。
#include <stdio.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
/* gcc -O3 -m64 exp_st.c -lm */
int main(){
int r;
int i;
float p;
static float xv[MAXSIZE];
static float yv[MAXSIZE];
for (r = 0; r < REIT; r++) {
p = XXX;
for (i = 0; i < MAXSIZE; i++) {
xv[i]=p;
yv[i]=expf(p);
p += 0.00001;
}
}
for (i=0;i<MAXSIZE;i++){
printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
}
return 0;
}
解决方案:正如 Andreas Wenzel 所说,gcc 编译器足够聪明,它决定没有必要将结果实际写入数组,这些写入已被编译器优化掉。在我根据新信息进行了新的性能测试之后,或者在我犯了几个错误之前或者我假设了错误的事实之前,结果似乎更清楚了:exp (double arg), or expf( float arg) which is x2+ exp(double arg) ,已得到改进,但它不像快速 AVX2 算法(x8 并行浮点参数)那样快,它比 SSE2 算法(x4 并行浮点参数)快 6 倍左右。以下是一些结果,正如英特尔超线程 CPUs 所预期的那样,SSE2 算法除外:
exp(双参数)单线程:18 分 46 秒
exp(双参数)4 个线程:5 分 4 秒
exp(双参数)8 个线程:4 分 28 秒
expf (float arg) 单线程:7 分 32 秒
expf(浮点参数)4 个线程:1 分 58 秒
expf (float arg) 8 个线程:1 分 41 秒
相对误差**:
i x y = expf(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.518329620e+00 y = 8.059411496e-02 exp_dbl = 8.059411715e-02 rel_err =-2.716802480e-08
i = 372000 x =-1.278343201e+00 y = 2.784983218e-01 exp_dbl = 2.784983343e-01 rel_err =-4.490403948e-08
i = 496000 x =-3.867173195e-02 y = 9.620664716e-01 exp_dbl = 9.620664730e-01 rel_err =-1.481617428e-09
i = 620000 x = 1.201261759e+00 y = 3.324308872e+00 exp_dbl = 3.324308753e+00 rel_err = 3.571995830e-08
i = 744000 x = 2.441616058e+00 y = 1.149159718e+01 exp_dbl = 1.149159684e+01 rel_err = 2.955980805e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.921588898e+00 y = 1.372204742e+02 exp_dbl = 1.372204694e+02 rel_err = 3.563072184e-08
*SSE2算法通过Julien Pommier,x6,8速度从一个线程提升到8个线程。我的性能测试代码使用 aligned(16) 作为传递给库的 vector/4 浮点数组的 typedef 联合,而不是对齐整个浮点数组。这可能会导致性能降低,至少对于其他 AVX2 代码而言,它在多线程方面的性能改进似乎对英特尔超线程也有好处,但在较低的速度下,x2.5-x1.5 之间的时间增加了。也许 SSE2 代码可以通过更好的数组对齐来加速,但我无法改进:
exp_ps(x4 并行浮点参数)单线程:12 分 7 秒
exp_ps(x4 并行浮点参数)4 个线程:3 分 10 秒
exp_ps(x4 并行浮点参数)8 个线程:1 分 46 秒
相对误差**:
i x y = exp_ps(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.518329620e+00 y = 8.059412241e-02 exp_dbl = 8.059411715e-02 rel_err = 6.527768787e-08
i = 372000 x =-1.278343201e+00 y = 2.784983218e-01 exp_dbl = 2.784983343e-01 rel_err =-4.490403948e-08
i = 496000 x =-3.977407143e-02 y = 9.610065222e-01 exp_dbl = 9.610065335e-01 rel_err =-1.174323454e-08
i = 620000 x = 1.200158238e+00 y = 3.320642233e+00 exp_dbl = 3.320642334e+00 rel_err =-3.054731957e-08
i = 744000 x = 2.441616058e+00 y = 1.149159622e+01 exp_dbl = 1.149159684e+01 rel_err =-5.342903415e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.921588898e+00 y = 1.372204742e+02 exp_dbl = 1.372204694e+02 rel_err = 3.563072184e-08
AVX2算法(x8并行浮点参数)单线程:1分45秒
AVX2 算法(x8 并行浮点参数)4 个线程:28 秒
AVX2 算法(x8 并行浮点参数)8 个线程:27 秒
相对误差**:
i x y = exp256_ps(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.516632080e+00 y = 8.073104918e-02 exp_dbl = 8.073104510e-02 rel_err = 5.057888540e-08
i = 372000 x =-1.279417157e+00 y = 2.781994045e-01 exp_dbl = 2.781993997e-01 rel_err = 1.705288467e-08
i = 496000 x =-3.954863176e-02 y = 9.612231851e-01 exp_dbl = 9.612232069e-01 rel_err =-2.269774967e-08
i = 620000 x = 1.199879169e+00 y = 3.319715738e+00 exp_dbl = 3.319715775e+00 rel_err =-1.119642824e-08
i = 744000 x = 2.440370798e+00 y = 1.147729492e+01 exp_dbl = 1.147729571e+01 rel_err =-6.896860199e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.923286438e+00 y = 1.374535980e+02 exp_dbl = 1.374536045e+02 rel_err =-4.676466368e-08
**相对错误是相同的,因为SSE2和AVX2的代码使用相同的算法,而且很可能也是库函数exp(x)的代码。
源码AVX2算法多线程
/* Performance test of a multithreaded exp(x) algorithm
based on AVX implementation of Giovanni Garberoglio
Copyright (C) 2020 Antonio R.
AVX implementation of exp:
Modified code from this source: https://github.com/reyoung/avx_mathfun
Based on "sse_mathfun.h", by Julien Pommier
http://gruntthepeon.free.fr/ssemath/
Copyright (C) 2012 Giovanni Garberoglio
Interdisciplinary Laboratory for Computational Science (LISC)
Fondazione Bruno Kessler and University of Trento
via Sommarive, 18
I-38123 Trento (Italy)
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
(this is the zlib license)
*/
/* gcc -O3 -m64 -mavx2 -march=haswell expc_multi.c -pthread -lm */
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 4
typedef float FLOAT32[MAXSIZE] __attribute__((aligned(4)));
static FLOAT32 xv;
static FLOAT32 yv;
__m256 exp256_ps(__m256 x) {
/*
To increase the compatibility across different compilers the original code is
converted to plain AVX2 intrinsics code without ingenious macro's,
gcc style alignment attributes etc.
Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
This modified code is not thoroughly tested!
*/
__m256 exp_hi = _mm256_set1_ps(88.3762626647949f);
__m256 exp_lo = _mm256_set1_ps(-88.3762626647949f);
__m256 cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256 inv_LOG2EF = _mm256_set1_ps(0.693147180559945f);
__m256 cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256 cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256 cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256 cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256 cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256 cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256 fx;
__m256i imm0;
__m256 one = _mm256_set1_ps(1.0f);
x = _mm256_min_ps(x, exp_hi);
x = _mm256_max_ps(x, exp_lo);
/* express exp(x) as exp(g + n*log(2)) */
fx = _mm256_mul_ps(x, cephes_LOG2EF);
fx = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256 z = _mm256_mul_ps(fx, inv_LOG2EF);
x = _mm256_sub_ps(x, z);
z = _mm256_mul_ps(x,x);
__m256 y = cephes_exp_p0;
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p1);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p2);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p3);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p4);
y = _mm256_mul_ps(y, x);
y = _mm256_add_ps(y, cephes_exp_p5);
y = _mm256_mul_ps(y, z);
y = _mm256_add_ps(y, x);
y = _mm256_add_ps(y, one);
/* build 2^n */
imm0 = _mm256_cvttps_epi32(fx);
imm0 = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
imm0 = _mm256_slli_epi32(imm0, 23);
__m256 pow2n = _mm256_castsi256_ps(imm0);
y = _mm256_mul_ps(y, pow2n);
return y;
}
void* run(void *received_Val){
int single_val = *((int *) received_Val);
int r;
int i;
float p;
float *xp;
float *yp;
for (r = 0; r < REIT; r++) {
p = XXX + 0.00001*single_val*MAXSIZE/num_threads;
xp = xv + single_val*MAXSIZE/num_threads;
yp = yv + single_val*MAXSIZE/num_threads;
for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i += 8) {
__m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
__m256 y = exp256_ps(x);
_mm256_store_ps(xp,x);
_mm256_store_ps(yp,y);
xp += 8;
yp += 8;
p += 0.00008;
}
}
return 0;
}
int main(){
int i;
pthread_t tid[num_threads];
for (i=0;i<num_threads;i++){
int *arg = malloc(sizeof(*arg));
if ( arg == NULL ) {
fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
exit(1);
}
*arg = i;
pthread_create(&(tid[i]), NULL, run, arg);
}
for(i=0; i<num_threads; i++)
{
pthread_join(tid[i], NULL);
}
for (i=0;i<MAXSIZE;i++){
printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
}
return 0;
}
图表概览:
我认为这与 pthread
没有太大关系,因为您的代码似乎只在线程加入后才调用 printf
。相反,性能不佳可能是由于需要在打印循环的每次迭代中从 xv
和 yv
数组读取缓存未命中。
当您不在程序末尾打印数组时,gcc 编译器足够聪明,可以意识到计算结果没有可观察到的效果。因此,编译器决定没有必要将结果实际写入数组,因为这些结果从未被使用过。相反,这些写入被编译器优化掉了。
此外,当您不打印结果时,库函数 exp
没有任何可观察到的效果,前提是调用它的输入不会高到会导致 floating-point溢出(这会导致函数 raise a floating point exception)。这也允许编译器优化这些函数调用。
正如你在assembly instructions emitted by the gcc compiler for your code which does not print the results中看到的,编译后的程序不会无条件地调用函数exp
,而是测试函数exp
的输入是否会更高比 7.09e2
(以确保不会发生溢出)。只有发生溢出时,程序才会跳转到调用函数exp
的代码。下面是相关的汇编代码:
ucomisd xmm1, xmm3
jnb .L9
在上面的汇编代码中,CPU寄存器xmm3
包含double-precisionfloating-point值7.09e2
。如果输入高于此常量,函数 exp
将导致 floating-point 溢出,因为结果无法用 double-precision floating-point 值表示。
由于您的输入始终有效且低到不会导致 floating-point 溢出,您的程序永远不会执行此跳转,因此它永远不会实际调用函数 exp
.
这解释了为什么不打印结果时您的代码速度如此之快。如果您不打印结果,您的编译器将确定计算没有可观察到的效果,因此它将优化它们。
因此,如果想让编译器真正执行计算,就必须保证计算有一些可观察的效果。这并不意味着您必须实际打印所有结果(几兆字节大)。如果您只打印一行取决于所有结果(例如所有结果的总和)就足够了。
但是,如果您将对库函数 exp
的函数调用替换为对其他自定义函数的调用,那么,至少在我的测试中,编译器不够智能,无法实现该函数调用没有明显的影响。在那种情况下,即使不打印计算结果,也无法优化函数调用。
出于上述原因,如果您想比较这两个函数的性能,您必须通过确保结果具有可观察的效果来确保计算确实发生。否则,您 运行 编译器至少会优化部分计算的风险,并且比较将不公平。