Efficient search algorithm in Python to search strings in all sheets of an Excel workbook and return matching sheet numbers

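How can I search for a string/pattern in all sheets of a workbook and return the numbers of all matching sheets?

I can loop over the sheets of an Excel workbook one by one and search each sheet for the string (essentially a linear search), but that is inefficient and takes a long time, and I have hundreds of workbooks or more to process.

Update 1: Sample code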

import glob
import os
from multiprocessing.dummy import Pool as ThreadPool

def searchSheets(fnames):
    # Search logic here
    # Loop over each sheet
    # Search for the string 'Balance' in each sheet
    # Return the matching sheet numbers
    pass

if __name__ == '__main__':
    __spec__ = None

    folder = "C://AB//"
    files = []
    if os.path.exists(folder):
        files = glob.glob(folder + "*.xlsx")

    # Multithreading
    pool = ThreadPool(processes=10)
    # Suggested by @Dan D
    pool.map(searchSheets, files)  # It did not work
    pool.close()

Update 2: Error

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "C:\temp3.py", line 36, in searchSheet
    wb = xl_wb(f)
  File "C:\ProgramData\Anaconda3\lib\site-packages\xlrd\__init__.py", line 116, in open_workbook
    with open(filename, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'C'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\temp3.py", line 167, in <module>
    pool.map(searchSheet,files)
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 644, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: 'C'
>>>

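Searching one sheet does not depend on the result of searching any other sheet, and searching one workbook does not depend on any other workbook. This is a typical case where you can use multithreading.

This post describes how to do it in Python: How to use threading in Python?

So, in pseudocode:

  • Search every sheet of every workbook in parallel
  • Aggregate and display the results.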
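
The two steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the original poster's code: it assumes the third-party openpyxl package is installed (the traceback in the question shows xlrd instead), and the names `search_workbook` and `search_all` are hypothetical. Each workbook is handed to one worker thread, and the per-workbook results are collected into a dictionary.

```python
from multiprocessing.dummy import Pool as ThreadPool


def search_workbook(path, needle="Balance"):
    """Return the 1-based numbers of the worksheets in `path` that contain `needle`."""
    # Assumption: openpyxl is installed. Imported lazily so that the
    # parallel driver below can also be used with another search function.
    from openpyxl import load_workbook

    wb = load_workbook(path, read_only=True, data_only=True)
    try:
        matches = []
        # wb.worksheets excludes chart sheets, so numbers count worksheets only.
        for number, ws in enumerate(wb.worksheets, start=1):
            for row in ws.iter_rows(values_only=True):
                if any(isinstance(v, str) and needle in v for v in row):
                    matches.append(number)
                    break  # one hit is enough; move on to the next sheet
        return matches
    finally:
        wb.close()


def search_all(paths, search=search_workbook, workers=10):
    """Run `search` on every path in parallel and map each path to its result."""
    with ThreadPool(processes=workers) as pool:
        results = pool.map(search, paths)  # one call per path, order preserved
    return dict(zip(paths, results))
```

A call such as `search_all(glob.glob("C://AB//*.xlsx"))` would then return a mapping from each workbook path to its list of matching sheet numbers. Because parsing a workbook is largely CPU-bound, threads may give only a limited speedup; `multiprocessing.Pool` offers the same `map` interface with processes and may be worth trying instead.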

Solution

import glob
import multiprocessing
import os
from multiprocessing.dummy import Pool as ThreadPool

def searchSheets(fnames):
    # Search logic here
    # Loop over each sheet
    # Search for the string 'Balance' in each sheet
    # Return the matching sheet numbers
    pass

if __name__ == '__main__':
    # Only needed on Windows; found it in many other posts
    multiprocessing.freeze_support()
    __spec__ = None

    folder = "C://AB//"
    files = []
    if os.path.exists(folder):
        files = glob.glob(folder + "*.xlsx")

    # Multithreading
    pool = ThreadPool(processes=10)
    # Suggested by @Dan D
    # pool.map(searchSheets, files)  # It did not work
    pool.map(searchSheets, [workbook for workbook in files])
    pool.close()
    # pool.join()  # Removed because it made all the workers wait
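
One plausible reading of the `FileNotFoundError` for `'C'` in the traceback above, offered as an assumption because the body of `searchSheets` is not shown: `pool.map` calls the function once per element of `files`, so each call receives a single path string. If the function then loops over its argument as if it were a list of names, it iterates over the characters of the path, and the first "file" it tries to open is `'C'`. The sketch below reproduces that effect with a hypothetical stand-in function and made-up file names; it opens no files.

```python
from multiprocessing.dummy import Pool as ThreadPool


def first_item_seen(fnames):
    # Hypothetical stand-in for a search function that loops over its argument.
    # It returns the first item of the loop instead of opening it as a file.
    for f in fnames:
        return f


# Made-up paths, used only as strings.
files = ["C://AB//book1.xlsx", "C://AB//book2.xlsx"]

with ThreadPool(processes=2) as pool:
    # Each call receives ONE path string, not the whole list.
    per_call = pool.map(first_item_seen, files)

print(per_call)  # ['C', 'C']
```

If this is what happened, treating the argument as a single file name inside the function (rather than looping over it) should be enough for `pool.map(searchSheets, files)` to work.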