What should I use between a generator and a function with return for resume parsing in Python, when I need to process lots of resumes at a time?

Question

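I just need to determine which performs better, because at the moment I am using functions with return and it takes too long to display the whole result. Below is the approach using yield: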

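import os
from io import StringIO

# Assumes pdfminer (pdfminer.six on Python 3), which provides these classes.
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

# Backslashes are escaped so the trailing one does not swallow the closing quote.
dirpath = "E:\\Python_Resumes\\"

def getResumeList(dirpath):
    # Lazily yield the name of every PDF file in the directory.
    files = os.listdir(dirpath)
    for file in files:
        if file.endswith(".pdf"):
            yield file

fileObject = getResumeList(dirpath)

def convertToRawText(fileObject):
    for file in fileObject:
        fContent = open(dirpath + file, 'rb')
        rsrcmgr = PDFResourceManager()
        sio = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fContent):
            interpreter.process_page(page)
            # sio is never cleared, so this is the text of the file read so far.
            rawText = sio.getvalue()
            yield rawText

result = convertToRawText(fileObject)

for r in result:
    print(r)
    print("\n")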

Below is the approach using return:

def getResumeList(dirpath):
    resumes = []
    files = os.listdir(dirpath)  # Get all the files in that directory
    for file in files:
        if file.endswith(".pdf"):
            resumes.append(file)
    return resumes

listOfFiles = getResumeList(dirpath)

def convertToRawText(files):
    rawText = ""
    resumeContent = {}
    for file in files:
        fContent = open(dirpath + file, 'rb')
        rsrcmgr = PDFResourceManager()
        sio = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fContent):
            interpreter.process_page(page)
            rawText = sio.getvalue()
        # Store the full text of the file once all of its pages are processed.
        resumeContent[file] = rawText
    return resumeContent

bulkResumesText = {}
bulkResumesText = convertToRawText(list(listOfFiles))

for b in bulkResumesText:
    print(bulkResumesText[b])

From a performance and efficiency standpoint, which one is better?

Answer 1

First of all, I strongly recommend writing clean code, which means that when you write Python you should not write it like C#/Java (a.k.a. follow PEP8).

Another point: try to be Pythonic (sometimes it can even make your code faster). For example, instead of getResumeList() in the generator example, try a generator expression:

def get_resume_list(dir_path):
    files = os.listdir(dir_path)
    return (f for f in files if f.endswith(".pdf"))

Or a list comprehension, in the second example:

def get_resume_list(dir_path):
    files = os.listdir(dir_path)
    return [f for f in files if f.endswith(".pdf")]

When you open a file, try to use with, because people tend to forget to close files.
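A minimal sketch of the with advice, using a throwaway temp file rather than a real resume. The helper name read_bytes and the dummy file contents are hypothetical, made up for illustration only:

```python
import os
import tempfile


def read_bytes(path):
    # The with block closes the file when it exits, even if read() raises.
    with open(path, "rb") as f:
        data = f.read()
    # f is already closed at this point.
    return data, f.closed


# Throwaway file standing in for a PDF, purely for illustration.
fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
os.close(fd)
with open(tmp_path, "wb") as out:
    out.write(b"%PDF-1.4 dummy")

content, was_closed = read_bytes(tmp_path)
os.remove(tmp_path)
print(content, was_closed)
```

The same pattern would apply to the open(dirpath + file, 'rb') call in convertToRawText, so each PDF is closed as soon as its pages have been processed.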

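Regarding efficiency, it's clear that generators were created for exactly this. With a generator, you can see each result as soon as it is ready, instead of waiting for the whole code to finish processing.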
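To make the "as soon as it is ready" point concrete, here is a small sketch with no PDF library involved. The functions parse_all and parse_lazily are hypothetical, and name.upper() is only a stand-in for the slow parsing step:

```python
def parse_all(names):
    # Eager: builds the full list before returning anything.
    results = []
    for name in names:
        results.append(name.upper())  # stand-in for slow PDF parsing
    return results


def parse_lazily(names, log):
    # Lazy: hands back each result as soon as it is produced.
    for name in names:
        log.append("parsed " + name)
        yield name.upper()


names = ["a.pdf", "b.pdf", "c.pdf"]
log = []
gen = parse_lazily(names, log)
first = next(gen)               # only the first file has been "parsed" so far
parsed_after_first = list(log)  # snapshot of the work done up to this point
rest = list(gen)                # drain the remaining results
```

After next(gen), the log holds a single entry, so the caller can already print the first resume while the others are still waiting. parse_all, by contrast, returns nothing until every name has been handled.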

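Regarding performance, I don't know how many PDF files you are going to parse, but I ran a little test on 1056 PDF files, and the iterator (the list-returning version) was faster by a couple of seconds, which is usually the case when it comes to speed. Generators are there for efficiency; take a look at this answer by Raymond Hettinger (Python core developer) explaining when not to use a generator.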
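If you want to measure this yourself, a rough timing sketch could look like the following. It is not the test I ran: fake_parse is a hypothetical stand-in for the PDF-to-text step, the file names are made up, and the numbers will differ from machine to machine and with real parsing work:

```python
import timeit


def fake_parse(name):
    # Hypothetical stand-in for the PDF-to-text step.
    return name * 10


def with_list(names):
    return [fake_parse(n) for n in names]


def with_generator(names):
    return (fake_parse(n) for n in names)


names = ["resume_%d.pdf" % i for i in range(1056)]

# Both variants are fully consumed so that the same work is measured.
list_time = timeit.timeit(lambda: sum(len(t) for t in with_list(names)), number=50)
gen_time = timeit.timeit(lambda: sum(len(t) for t in with_generator(names)), number=50)
print("list: %.4fs  generator: %.4fs" % (list_time, gen_time))
```

Swapping fake_parse for your real parsing function should give a fairer picture, since the parsing itself is likely to dominate the total time.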

Conclusion: in your case, using a generator is more efficient, while using the iterator (list) version is faster.

performance

text-parsing

python-3.x