How do I know which malloc is being used?
As far as I understand, there are many different malloc implementations:
- dlmalloc – general-purpose allocator
- ptmalloc2 – glibc
- jemalloc – FreeBSD and Firefox
- tcmalloc – Google
- libumem – Solaris
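Is there any way to determine which malloc is actually used on my (Linux) system?
I read that "due to ptmalloc2’s threading support, it became the default memory allocator for linux." Is there any way for me to check this myself?
I am asking because I do not seem to get any speedup from parallelizing my malloc loop in the code below: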
#include <cstdlib>
#include <iostream>
#include <vector>
#include <omp.h>
#include <boost/date_time/posix_time/posix_time.hpp>

// Allocate mallocCnt blocks of 100 bytes using `parallelism` OpenMP threads,
// then free them, and print how long each phase took.
void parallelMalloc(int parallelism, int mallocCnt = 10000000) {
    omp_set_num_threads(parallelism);
    std::vector<char*> ptrStore(mallocCnt);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::local_time();
    #pragma omp parallel for
    for (int i = 0; i < mallocCnt; i++) {
        ptrStore[i] = static_cast<char*>(malloc(100 * sizeof(char)));
    }
    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::local_time();
    #pragma omp parallel for
    for (int i = 0; i < mallocCnt; i++) {
        free(ptrStore[i]);
    }
    boost::posix_time::ptime t3 = boost::posix_time::microsec_clock::local_time();

    boost::posix_time::time_duration malloc_time = t2 - t1;
    boost::posix_time::time_duration free_time = t3 - t2;
    std::cout << " parallelism = " << parallelism << "\t itr = " << mallocCnt
              << "\t malloc_time = " << malloc_time.total_milliseconds()
              << "\t free_time = " << free_time.total_milliseconds() << std::endl;
}

int main() {
    // Run the benchmark with 1 to 16 threads.
    for (int i = 1; i <= 16; i += 1) {
        parallelMalloc(i);
    }
}
This gives me the following output:
parallelism = 1 itr = 10000000 malloc_time = 1225 free_time = 1517
parallelism = 2 itr = 10000000 malloc_time = 1614 free_time = 1112
parallelism = 3 itr = 10000000 malloc_time = 1619 free_time = 687
parallelism = 4 itr = 10000000 malloc_time = 2325 free_time = 620
parallelism = 5 itr = 10000000 malloc_time = 2233 free_time = 550
parallelism = 6 itr = 10000000 malloc_time = 2207 free_time = 489
parallelism = 7 itr = 10000000 malloc_time = 2778 free_time = 398
parallelism = 8 itr = 10000000 malloc_time = 1813 free_time = 389
parallelism = 9 itr = 10000000 malloc_time = 1997 free_time = 350
parallelism = 10 itr = 10000000 malloc_time = 1922 free_time = 291
parallelism = 11 itr = 10000000 malloc_time = 2480 free_time = 257
parallelism = 12 itr = 10000000 malloc_time = 1614 free_time = 256
parallelism = 13 itr = 10000000 malloc_time = 1387 free_time = 289
parallelism = 14 itr = 10000000 malloc_time = 1481 free_time = 248
parallelism = 15 itr = 10000000 malloc_time = 1252 free_time = 297
parallelism = 16 itr = 10000000 malloc_time = 1063 free_time = 281
I read that "due to ptmalloc2’s threading support, it became the default memory allocator for linux." Is there any way for me to check this myself?
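glibc uses ptmalloc2 internally, and this is not a recent development. Either way, it isn't hard to run getconf GNU_LIBC_VERSION and then cross-check that version to see whether it uses ptmalloc2, but I bet you'd be wasting your time.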
I am asking because I do not seem to get any speedup by parallelizing my malloc loop in the code below
I converted your example into an MVCE (code omitted here for brevity) and compiled it with g++ -Wall -pedantic -O3 -pthread -fopenmp using g++ 5.3.1. Here are my results.
With OpenMP:
parallelism = 1 itr = 10000000 malloc_time = 746 free_time = 263
parallelism = 2 itr = 10000000 malloc_time = 541 free_time = 267
parallelism = 3 itr = 10000000 malloc_time = 405 free_time = 259
parallelism = 4 itr = 10000000 malloc_time = 324 free_time = 221
parallelism = 5 itr = 10000000 malloc_time = 330 free_time = 242
parallelism = 6 itr = 10000000 malloc_time = 287 free_time = 244
parallelism = 7 itr = 10000000 malloc_time = 257 free_time = 226
parallelism = 8 itr = 10000000 malloc_time = 270 free_time = 225
parallelism = 9 itr = 10000000 malloc_time = 253 free_time = 225
parallelism = 10 itr = 10000000 malloc_time = 236 free_time = 226
parallelism = 11 itr = 10000000 malloc_time = 225 free_time = 239
parallelism = 12 itr = 10000000 malloc_time = 276 free_time = 258
parallelism = 13 itr = 10000000 malloc_time = 241 free_time = 228
parallelism = 14 itr = 10000000 malloc_time = 254 free_time = 225
parallelism = 15 itr = 10000000 malloc_time = 278 free_time = 272
parallelism = 16 itr = 10000000 malloc_time = 235 free_time = 220
23.87 user  2.11 system  0:10.41 elapsed  249% CPU
Without OpenMP:
parallelism = 1 itr = 10000000 malloc_time = 748 free_time = 263
parallelism = 2 itr = 10000000 malloc_time = 344 free_time = 256
parallelism = 3 itr = 10000000 malloc_time = 751 free_time = 254
parallelism = 4 itr = 10000000 malloc_time = 339 free_time = 262
parallelism = 5 itr = 10000000 malloc_time = 748 free_time = 253
parallelism = 6 itr = 10000000 malloc_time = 330 free_time = 256
parallelism = 7 itr = 10000000 malloc_time = 734 free_time = 260
parallelism = 8 itr = 10000000 malloc_time = 334 free_time = 259
parallelism = 9 itr = 10000000 malloc_time = 750 free_time = 256
parallelism = 10 itr = 10000000 malloc_time = 339 free_time = 255
parallelism = 11 itr = 10000000 malloc_time = 743 free_time = 267
parallelism = 12 itr = 10000000 malloc_time = 342 free_time = 261
parallelism = 13 itr = 10000000 malloc_time = 739 free_time = 252
parallelism = 14 itr = 10000000 malloc_time = 333 free_time = 252
parallelism = 15 itr = 10000000 malloc_time = 740 free_time = 252
parallelism = 16 itr = 10000000 malloc_time = 330 free_time = 252
13.38 user  4.66 system  0:18.08 elapsed  99% CPU
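As a sanity check on the two time summaries: GNU time computes its CPU percentage as (user + system) / elapsed, so the figures work out as follows, and the elapsed times differ by about 7.7 s.

```latex
\[
\text{CPU}_{\text{OpenMP}} = \frac{23.87 + 2.11}{10.41} = \frac{25.98}{10.41} \approx 2.496 \quad (\text{reported as } 249\%)
\]
\[
\text{CPU}_{\text{no OpenMP}} = \frac{13.38 + 4.66}{18.08} = \frac{18.04}{18.08} \approx 0.998 \quad (\text{reported as } 99\%)
\]
\[
\Delta t_{\text{elapsed}} = 18.08\,\text{s} - 10.41\,\text{s} = 7.67\,\text{s}
\]
```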
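The parallel version seems to be about 8 seconds faster (0:10.41 vs. 0:18.08 elapsed). Still not convinced? Fine. I went ahead and grabbed dlmalloc and ran make, which produced libmalloc.a. My new command is g++ -Wall -pedantic -O3 -pthread -fopenmp -L$HOME/Development/test/dlmalloc/lib test.cpp -lmalloc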
With OpenMP:
parallelism = 1 itr = 10000000 malloc_time = 814 free_time = 277
I hit Ctrl-C after 37 seconds.
Without OpenMP:
parallelism = 1 itr = 10000000 malloc_time = 772 free_time = 271
parallelism = 2 itr = 10000000 malloc_time = 780 free_time = 272
parallelism = 3 itr = 10000000 malloc_time = 783 free_time = 272
parallelism = 4 itr = 10000000 malloc_time = 792 free_time = 277
parallelism = 5 itr = 10000000 malloc_time = 813 free_time = 281
parallelism = 6 itr = 10000000 malloc_time = 800 free_time = 275
parallelism = 7 itr = 10000000 malloc_time = 795 free_time = 277
parallelism = 8 itr = 10000000 malloc_time = 790 free_time = 273
parallelism = 9 itr = 10000000 malloc_time = 788 free_time = 277
parallelism = 10 itr = 10000000 malloc_time = 784 free_time = 276
parallelism = 11 itr = 10000000 malloc_time = 786 free_time = 284
parallelism = 12 itr = 10000000 malloc_time = 807 free_time = 279
parallelism = 13 itr = 10000000 malloc_time = 791 free_time = 277
parallelism = 14 itr = 10000000 malloc_time = 790 free_time = 273
parallelism = 15 itr = 10000000 malloc_time = 785 free_time = 276
parallelism = 16 itr = 10000000 malloc_time = 787 free_time = 275
6.48 user  11.27 system  0:17.81 elapsed  99% CPU
A very significant difference. I suspect the problem lies in your more complex code, or that something is wrong with your benchmark.