Memory leak in Google sparse_hash_map
This week I have been trying to track down some unusual memory behaviour: when I run my code with a single thread it has a certain memory footprint, but if I run it with multiple threads the footprint grows considerably, for no apparent reason.
I know that odd things can happen in multithreaded programs because of their non-deterministic execution, so it is better to debug them with some deterministic approach. However, even in the very deterministic scenario below I cannot see why the memory footprint should increase.
This is the smallest program I was able to write that reproduces the problem:
test.cpp
#include <chrono>
#include <cstdint>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <openssl/sha.h>
#include <sparsehash/sparse_hash_map>
#include <thread>

struct mystruct
{
    uint64_t values[16];
    inline bool operator==(const mystruct &other) const {
        return !memcmp(values, other.values, sizeof(values));
    }
    inline bool operator!=(const mystruct &other) const {
        return !(*this == other);
    }
};

namespace std {
template<>
class hash<mystruct> {
public:
    inline size_t operator()(const mystruct &s) const
    {
        return s.values[0];
    }
};
}

google::sparse_hash_map<mystruct, uint64_t, std::hash<mystruct>> h;

const uint64_t N = 1ull * 1024ull * 1024ull * 1024ull;
const int step = 128;
volatile uint64_t next = 0;   // crude ticket counter; forces a strict insertion order
uint64_t total_keys = 0;
const uint64_t THREADS = 1;

void insert_data(uint64_t data) {
    unsigned char buf[256];
    mystruct b;
    for (uint64_t j = 0; j < N / THREADS; j++) {
        SHA256((unsigned char*)&data, 8, buf);   // derive a pseudo-random key from the counter
        memcpy(&b, buf, sizeof(mystruct));
        while (next != data)                     // spin until it is this thread's turn
            ;
        h[b] = 1;
        total_keys++;
        next += step;
        data += THREADS * step;
    }
}

int main() {
    std::thread t[THREADS];
    for (uint64_t i = 0; i < THREADS; i++) {
        t[i] = std::thread(insert_data, i * step);
    }
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    while (total_keys < N) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        uint64_t elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
        double speed = 1000.0 * (double)total_keys / (double)elapsed_ms;
        std::cout << std::fixed << "Processed " << total_keys << " items in " << elapsed_ms
                  << " ms (" << std::setprecision(0) << speed << " keys/s)"
                  << ", next key is " << next << std::endl;
    }
    for (uint64_t i = 0; i < THREADS; i++) {
        t[i].join();
    }
}
The idea of the program is pretty simple:
- insert a known, predetermined sequence of 1 gazillion keys into the hash table in a loop
- have the program dump to the console, every 100 ms, how many keys it has inserted so far and what the next key to insert is
- wait until the OS kills the program
So the goal is to verify that, for a predefined insertion sequence, memory consumption is the same regardless of how many threads push data into the hash table, i.e. that the number of keys inserted before the kill is roughly the same (I wouldn't mind small differences, e.g. due to the exact moment the OS decides to kill the application). Well, it turns out the difference is huge.
The insert_data function uses a simple spin loop to guarantee the insertion sequence, i.e. the numbers 0, 128, 256 and so on... (using a lock/mutex makes no difference, as in the sketch below).
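For reference, a minimal sketch of the lock/mutex variant just mentioned, plugged into the same globals (h, next, total_keys, N, THREADS, step) as the program above; the names m, cv and insert_data_locked are mine, not from the original code:

// Hypothetical mutex/condition-variable variant of the spin loop;
// the memory behaviour is reportedly the same as with the spin loop.
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;

void insert_data_locked(uint64_t data) {
    unsigned char buf[256];
    mystruct b;
    for (uint64_t j = 0; j < N / THREADS; j++) {
        SHA256((unsigned char*)&data, 8, buf);
        memcpy(&b, buf, sizeof(mystruct));
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return next == data; });  // block until it is this thread's turn
        h[b] = 1;
        total_keys++;
        next += step;
        cv.notify_all();                            // wake the thread holding the next ticket
        data += THREADS * step;
    }
}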
I ran the program in the following environment:
- Debian 8.3 Linux virtual machine with 4 GB of RAM
- no swap (swapoff -a)
- GCC 4.9.2 with the -O3 flag
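For reference, a plausible compile invocation under that setup (the original post does not give the exact command line; -std=c++11 is needed for std::thread on GCC 4.9, and the program links against OpenSSL's libcrypto):

g++ -std=c++11 -O3 -pthread test.cpp -o test -lcrypto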
These are the results when the above code runs with different values of the THREADS variable:
THREADS = 1
Processed 54241 items in 100 ms (542410 keys/s), next key is 6942848
Processed 104857 items in 200 ms (524285 keys/s), next key is 13421696
...
Processed 13410458 items in 35698 ms (375664 keys/s), next key is 1716538624
Processed 13421773 items in 35799 ms (374920 keys/s), next key is 1717986944
...
... 6 seconds of data because hashtable is being resized...
...
Processed 13421773 items in 41245 ms (325416 keys/s), next key is 1717986944
Processed 13438143 items in 41345 ms (325025 keys/s), next key is 1720082432
...
Processed 23580346 items in 65567 ms (359637 keys/s), next key is 3018284416
Killed
THREADS = 2
Processed 69922 items in 101 ms (692297 keys/s), next key is 8950016
Processed 121766 items in 202 ms (602802 keys/s), next key is 15586048
...
Processed 13409271 items in 37098 ms (361455 keys/s), next key is 1716386688
Processed 13421773 items in 37198 ms (360820 keys/s), next key is 1717986944
...
... 6 seconds of data because hashtable is being resized...
...
Processed 13421773 items in 43204 ms (310660 keys/s), next key is 1717986944
Processed 13443021 items in 43304 ms (310434 keys/s), next key is 1720706688
...
Processed 20024294 items in 64129 ms (312250 keys/s), next key is 2563110656
Killed
THREADS = 4
Processed 31 items in 100 ms (310 keys/s), next key is 3968
Processed 33882 items in 200 ms (169410 keys/s), next key is 4336896
...
Processed 13399262 items in 48949 ms (273739 keys/s), next key is 1715105536
Processed 13421773 items in 49049 ms (273640 keys/s), next key is 1717986944
...
... 5 seconds of data because hashtable is being resized...
...
Processed 13421773 items in 54091 ms (248133 keys/s), next key is 1717986944
Processed 13421773 items in 54201 ms (247630 keys/s), next key is 1717986944
Killed
As you can see, all three tests identify the same key, 1717986944, as the one that triggers the hash table resize, and in all of them it happens at the same inserted-key count, n. 13421773; this confirms that, regardless of the number of running threads, the insertions always happen in strictly the same order.
However, THREADS=1 manages to insert 3556052 more keys (corresponding to 434 MB) than THREADS=2, and 6602521 more keys (corresponding to 805 MB) than THREADS=4.
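A quick sanity check on those figures (my own arithmetic, not part of the original post): assuming sparse_hash_map's default maximum occupancy of 0.8, a table of 2^24 = 16777216 buckets fills up at 0.8 × 16777216 = 13421772.8 entries, which matches the resize being triggered on insert n. 13421773. Likewise, sizeof(mystruct) is 128 bytes, and 3556052 × 128 bytes ≈ 434 MB while 6602521 × 128 bytes ≈ 806 MB, so the quoted sizes are simply the raw key payload of the extra inserts.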
Can anyone explain why I get such strange memory consumption even under these deterministic conditions? Am I really missing something?
Posting this as an answer, since I was able to understand what was going on.
After a lot of time and debugging, I found that the root cause of the memory consumption is a pathological memory alloc/dealloc pattern that produces heap fragmentation. There is no memory leak or anything of the sort. I was also able to reproduce the problem with a custom hash table implementation.
Strangely though, in both cases adding a malloc_trim(0); at the end of the main loop, right after the line
std::cout << std::fixed << "Processed " << total_keys << " items in " << elapsed_ms << " ms (" << std::setprecision(0) << speed << " keys/s)" << ", next key is " << next << std::endl;
compensates somewhat and allows the program to reclaim more memory.
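For concreteness, this is how that placement looks in the monitoring loop of main above; malloc_trim(0) is a glibc-specific call, declared in <malloc.h>, that asks the allocator to return free heap memory to the OS:

#include <malloc.h>   // malloc_trim (glibc-specific)

    while (total_keys < N) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        uint64_t elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
        double speed = 1000.0 * (double)total_keys / (double)elapsed_ms;
        std::cout << std::fixed << "Processed " << total_keys << " items in " << elapsed_ms
                  << " ms (" << std::setprecision(0) << speed << " keys/s)"
                  << ", next key is " << next << std::endl;
        malloc_trim(0);   // release free heap memory back to the OS
    }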