Gensim returns "ValueError: input must have more than one sentence" in for loop through list of paragraphs

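I'm trying to use gensim's summarize() to condense the paragraphs in job descriptions. I scraped a batch of job descriptions from the web with the selenium package and stored them in a list.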

descriptions=[]
for link in job_urls:
    driver.get(link)
    jd = driver.find_element_by_xpath('//div[@id="jobDescriptionText"]').text
    # The div element whose id attribute is jobDescriptionText
    descriptions.append(jd)

The output is a list of strings; each item contains multiple paragraphs.

If I summarize one item at a time by index, the code works:

    text = descriptions[2] # Change index to desired job description.
    summarize(str(text), ratio=0.5)
'The core function of this opening is to conduct regional studies and mapping.\nAs the successful candidate you would be expected to conduct regional exploration studies and evaluations of the petroleum system elements, and possess the experience to integrate geological and geophysical data to create regional maps.\nYou should have the aptitude for, and tireless energy around data mining and analysis, with high level computer mapping skills.\nMinimum Requirements\nYou will be required to perform the following:\nConduct regional exploration studies and evaluations of the petroleum system elements, and integrate available geological and geophysical data to create regional maps.\nDevelop gross depositional environment maps, effectiveness maps, common risk segment maps of all petroleum system elements (source, reservoir seal), and composite common risk segment maps of different plays, to develop new play concepts and exploration opportunities.\nAnalyze data mining with high level of computer mapping skills, using major Exploration software packages, preferably Petrel.'

But if I loop over the list, the function raises a ValueError:

for text in descriptions:
    text = str(text)
    summarize(text, ratio=0.5)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-b3969fcb2610> in <module>
      4 for text in descriptions:
      5     text = str(text)
----> 6     summarize(text, ratio=0.5)

~\Anaconda3\lib\site-packages\gensim\summarization\summarizer.py in summarize(text, ratio, word_count, split)
    426     # If only one sentence is present, the function raises an error (Avoids ZeroDivisionError).
    427     if len(sentences) == 1:
--> 428         raise ValueError("input must have more than one sentence")
    429 
    430     # Warns if the text is too short.
ValueError: input must have more than one sentence

The same happens with a list comprehension:

summary = [summarize(str(text),ratio=0.5) for text in descriptions]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-31-8d79c7c19d53> in <module>
      1 #text = descriptions[2] # Change index to desired job description.
      2 #summarize(str(text), ratio=0.5)
----> 3 summary = [summarize(str(text),ratio=0.5) for text in descriptions]
      4 #for text in descriptions:
      5    # print(str(text)+"\n")

<ipython-input-31-8d79c7c19d53> in <listcomp>(.0)
      1 #text = descriptions[2] # Change index to desired job description.
      2 #summarize(str(text), ratio=0.5)
----> 3 summary = [summarize(str(text),ratio=0.5) for text in descriptions]
      4 #for text in descriptions:
      5    # print(str(text)+"\n")

~\Anaconda3\lib\site-packages\gensim\summarization\summarizer.py in summarize(text, ratio, word_count, split)
    426     # If only one sentence is present, the function raises an error (Avoids ZeroDivisionError).
    427     if len(sentences) == 1:
--> 428         raise ValueError("input must have more than one sentence")
    429 
    430     # Warns if the text is too short.

ValueError: input must have more than one sentence

The items have more than one sentence, and summarize() works on them individually. Why does summarize() raise this error inside a loop or list comprehension?

Note that the summarization module will be removed in the next Gensim release:

https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#11-removed-gensimsummarization

(Its approach was quite idiosyncratic, hard to generalize, and not actively maintained.)

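That said, if you're getting the "input must have more than one sentence" error, you're probably feeding it a single sentence, or at least something that looks like a single sentence to its very crude sentence splitter.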

Have you tried printing the specific text values that trigger this error, to verify that they really contain more than one sentence?

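As usual, I figured it out a day later. As gojomo mentioned, one of the job descriptions was only a single sentence (who does that?). I was able to catch it by exploring the character lengths and watching the webdriver as it scraped.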
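The character-length exploration mentioned above could look roughly like this. It is only a minimal sketch: `length_report` is a hypothetical helper name, and the `sample` list is made-up data standing in for the scraped `descriptions`.

```python
def length_report(descriptions):
    """Return (index, character count, line count) per description, shortest first."""
    rows = [
        (i, len(str(text)), len(str(text).splitlines()))
        for i, text in enumerate(descriptions)
    ]
    return sorted(rows, key=lambda row: row[1])


# Made-up stand-in data; in the question this would be the scraped list.
sample = [
    "First sentence here.\nSecond sentence follows.\nA third one as well.",
    "Apply now",
    "",
]

for index, n_chars, n_lines in length_report(sample):
    print(f"description {index}: {n_chars} chars, {n_lines} lines")
```

Very short or single-line entries float to the top of the report, which makes the suspicious descriptions easy to inspect by index.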

The debugging code I used:

for i, text in enumerate(descriptions):
    try:
        summarize(text, ratio=0.5)
    except ValueError:
        # summarize() raises ValueError when it finds only one sentence
        print("Job description {} could not be summarized".format(i))

Lesson learned: don't assume every job description has more than one sentence.
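Building on that lesson, one way to keep the loop from aborting is to fall back to the original text whenever summarization fails. The sketch below is an assumption, not part of gensim: `summarize_all` is a hypothetical helper, and `fake_summarize` is a stand-in so the example runs without gensim installed. With Gensim 3.x you would pass the real `summarize` from `gensim.summarization` instead.

```python
def summarize_all(descriptions, summarize_fn, ratio=0.5):
    """Summarize each text; keep the original text when summarization fails."""
    results = []
    skipped = []
    for i, text in enumerate(descriptions):
        text = str(text)
        try:
            results.append(summarize_fn(text, ratio=ratio))
        except ValueError:
            # The error seen in the question for single-sentence input.
            skipped.append(i)
            results.append(text)
    return results, skipped


def fake_summarize(text, ratio=0.5):
    """Stand-in for gensim's summarize(): rejects single-line input."""
    lines = text.splitlines()
    if len(lines) <= 1:
        raise ValueError("input must have more than one sentence")
    return "\n".join(lines[: max(1, int(len(lines) * ratio))])


summaries, skipped = summarize_all(
    ["One.\nTwo.\nThree.\nFour.", "Only one sentence"],
    fake_summarize,
)
print("Indexes left unsummarized:", skipped)
```

Returning the skipped indexes alongside the results keeps the output list aligned with `descriptions` while still recording which entries need a closer look.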