How to control memory usage in multithreading?

Question

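I am processing images with multiple threads.

It works fine on my own computer, which has enough memory (usage grows by 2~3 GB when processing a large number of images), but my server has only 1 GB of memory and the code does not run properly there.

Sometimes it ends with a Segmentation fault, and sometimes with: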

Exception in thread Thread-13:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "passportRecognizeNew.py", line 267, in doSomething
  ...

Code:

import threading

def doSomething(image):
    # picture processing code
    print("processing over")

threads = []

for i in range(20):
    thread = threading.Thread(target=doSomething, args=("image",))
    threads.append(thread)

for t in threads:
    t.setDaemon(True)
    t.start()

# wait for every thread, not only the last one started
for t in threads:
    t.join()

print("All over")

How can I solve this problem, or is there any way to control the memory usage?

Answer 1

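I think you are looking at this from the wrong angle. Your code starts n threads. Those threads then execute the work you defined for them.

If that work requires them to allocate a lot of memory, what should anything "outside" of that context do about it? What should happen? Should some of the threads be killed? Should a malloc somewhere deep down in the C code ... not happen ... and then what?

What I mean is: your problem is most likely that you are simply starting too many threads.

So the answer is: do not try to fix things after you have broken them. Better make sure you do not break them in the first place: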

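Profile carefully to understand your application, so that you can assess how much memory a single thread needs to do its "work"
Then change your "main" program to query the hardware it is running on (for example: check the available memory and the number of available physical CPUs)
And based on that assessment, start only as many threads as should work given those hardware details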
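The three steps above can be sketched roughly as follows. This is only an illustration in Python 3, not code from the answer: the function name `pick_worker_count`, the 200 MB reserve, and the 150 MB per-image figure are all assumptions made up for the example. The per-task figure has to come from profiling your own code.

```python
import os

MB = 1024 * 1024

def pick_worker_count(available_bytes, per_task_bytes, cpu_count=None,
                      reserve_bytes=200 * MB):
    """Return how many workers fit in memory, capped by CPU count (at least 1)."""
    if per_task_bytes <= 0:
        raise ValueError("per_task_bytes must be positive")
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    # keep a safety reserve for the OS and the rest of the program (assumed value)
    usable = max(available_bytes - reserve_bytes, 0)
    by_memory = usable // per_task_bytes
    # never return 0: a single worker is the slowest but safest fallback
    return max(1, min(cpu_count, by_memory))

# Hypothetical numbers: 512 MB free, 150 MB per image, 4 cores
print(pick_worker_count(512 * MB, 150 * MB, cpu_count=4))
```

In a real program `available_bytes` could come from `psutil.virtual_memory().available`, which is the call used in Answer 2 below.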

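Beyond that: this is a very common pattern. The developer works on a "powerful" machine and implicitly assumes that any target system running the product will have the same or better characteristics. That is simply not true.

In other words: when you do not know what the hardware your code will run on looks like, there is only one reasonable thing to do: acquire that knowledge first. After that, act differently based on real data.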
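One possible way to get "real data" about a single task is to run it once under the standard library's `tracemalloc` and read the peak. This is a Python 3 sketch with made-up helper names (`measure_peak_bytes`, `fake_task`). Note the limitation: `tracemalloc` only sees allocations made through Python's allocators, so buffers allocated directly by a C image library may not be counted, and the number should be treated as a lower bound.

```python
import tracemalloc

def measure_peak_bytes(func, *args, **kwargs):
    """Run func once and return (result, peak bytes traced during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

def fake_task(size):
    # stand-in for one image-processing job: allocate a buffer of `size` bytes
    buf = bytearray(size)
    return len(buf)

result, peak = measure_peak_bytes(fake_task, 10 * 1024 * 1024)
print("peak: %.1f MB" % (peak / (1024.0 * 1024.0)))
```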

Answer 2

With GhostCat's help, I solved the memory usage problem with the code below.

import Queue  # Python 2 module name; it is called "queue" in Python 3
import threading
import multiprocessing
import time
import psutil


class ThreadSomething(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # check available memory
            virtualMemoryInfo = psutil.virtual_memory()
            availableMemory = virtualMemoryInfo.available

            print(str(availableMemory / 1024 / 1024) + "M")

            if availableMemory > MEMORY_WARNING:
                # image from queue
                image = self.queue.get()

                # do something
                doSomething(image)

                # signals to queue job is done
                self.queue.task_done()
            else:
                print("memory warning!")
                # back off briefly instead of spinning while memory is low
                time.sleep(1)

def doSomething(image):
    # picture processing code, cost time and memory
    print("processing over")

# After testing, creating more threads than CPU_COUNT seems to bring no benefit:
# the execution time does not decrease.
CPU_COUNT = multiprocessing.cpu_count()
MEMORY_WARNING = 200*1024*1024  # 200M

images = ["1.png", "2.png", "3.png", "4.png", "5.png"]
queue = Queue.Queue()

def main():
    # spawn a pool of threads, and pass them queue instance
    for i in range(CPU_COUNT):
        t = ThreadSomething(queue)
        t.setDaemon(True)
        t.start()

    # populate queue with data (once, outside the thread-spawning loop)
    for image in images:
        queue.put(image)

    # wait on the queue until everything has been processed
    queue.join()

start = time.time()
main()
print('All over. Elapsed Time: %s' % (time.time() - start))

I use the psutil module to get the available memory.

Reference code: yosemitebandit/ibm_queue.py

The code in my question had the problem of creating more threads than CPU_COUNT.
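The memory check in the worker loop can also be pulled out into a small helper that sleeps between polls. The sketch below is Python 3 and is not part of the original solution. The name `wait_for_memory` and its parameters are invented for illustration, and the memory reading is passed in as a callable so the logic can be tried without psutil.

```python
import time

def wait_for_memory(get_available, threshold, poll_seconds=0.5, timeout=None):
    """Block until get_available() exceeds threshold; return False on timeout."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while get_available() <= threshold:
        if deadline is not None and time.monotonic() >= deadline:
            return False
        # sleep between polls so a low-memory period does not burn CPU
        time.sleep(poll_seconds)
    return True

# Simulated readings: memory is below the threshold twice, then recovers.
readings = iter([100, 150, 400])
ok = wait_for_memory(lambda: next(readings), threshold=200, poll_seconds=0.01)
print(ok)
```

In a real worker the callable would be something like `lambda: psutil.virtual_memory().available`, with `threshold` set to `MEMORY_WARNING`.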
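For readers on Python 3, the same cap on the number of threads can be had without a hand-written queue. The sketch below is not the code from either answer. It swaps in `concurrent.futures.ThreadPoolExecutor` from the standard library, and `do_something` and `process_all` are placeholders for the real processing code.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def do_something(image):
    # placeholder for the real picture-processing code
    return "processed " + image

def process_all(images, max_workers=None):
    """Process every image with at most max_workers threads running at once."""
    if max_workers is None:
        max_workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order; leaving the with-block joins the workers
        return list(pool.map(do_something, images))

results = process_all(["1.png", "2.png", "3.png", "4.png", "5.png"], max_workers=2)
print(results)
```

The pool only limits how many tasks run at once. It does not watch memory, so `max_workers` still has to be chosen from a per-task memory estimate.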

python

multithreading

memory-management

image-processing

out-of-memory