Scrapy: find strings from a previously defined list in an HTML document

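I want to use Scrapy to take the strings from a predefined list, bacteria_species, and match them one by one against the elements of the HTML document at http://www.microbiologyresearch.org/content/journal/ijsem. If a string occurs in an element of the HTML, the full text of that element should be returned.

Here is my code: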

import scrapy

class BacteriaSpider(scrapy.Spider):
    name = 'bacteria'
    allowed_domains = ['https://www.microbiologyresearch.org/content/journal/ijsem']
    start_urls = ['http://www.microbiologyresearch.org/content/journal/ijsem/']

    def parse(self, response):

        bacteria_species = ['Abditibacterium utsteinense',
                            'Abiotrophia defectiva',
                            'Abyssibacter profundi',
                            'Abyssicoccus albus',
                            'Abyssivirga alkaniphila',
                            'Acanthopleuribacter pedis',
                            'Acaricomes phytoseiuli',
                            'Acetanaerobacterium elongatum',
                            'Acetanaerobacterium sp.',
                            'Acetatifactor muris']

        for bacteria in bacteria_species:
            response.xpath("//*/text()[contains(., bacteria)]").getall()   # select the text of all nodes
        pass

Unfortunately, it doesn't work.

Does anyone have a better idea?

Is the code you posted your exact code?

You have

start_urls = ['http://https://www.{...}']

which is invalid, because it contains both http:// and https://.

It should be https://www.{...}.

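I think the problem is that your XPath is not built correctly: inside the string literal, bacteria is part of the XPath text, not the value of the Python variable. You need to build the expression dynamically, like this:

def parse(self, response):
    bacteria_species = ['prokaryotes', 'Malaciobacter']

    # Template with a placeholder for the quoted search term
    search_xpath = "//*/text()[contains(., {0})]"
    for bact in bacteria_species:
        # Wrap the term in double quotes so XPath treats it as a string literal
        searchfor = search_xpath.format('"' + bact + '"')
        print(searchfor)
        results = response.xpath(searchfor).getall()
        for item in results:
            yield {
                "bacteria": bact,
                "results": item,
            }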
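
One caveat with wrapping the term in double quotes by hand: a term that itself contains a double quote would break the expression, since XPath 1.0 has no escape syntax inside string literals. A minimal sketch of a quoting helper follows; the names xpath_literal and contains_text_xpath are hypothetical and not part of Scrapy.

```python
def xpath_literal(text: str) -> str:
    """Return `text` as an XPath 1.0 string literal (hypothetical helper)."""
    if '"' not in text:
        return f'"{text}"'
    if "'" not in text:
        return f"'{text}'"
    # Both quote types are present: split on double quotes and glue the
    # pieces back together with concat(), quoting each '"' with apostrophes.
    pieces = []
    for i, part in enumerate(text.split('"')):
        if i > 0:
            pieces.append("'\"'")
        if part:
            pieces.append(f'"{part}"')
    return "concat(" + ", ".join(pieces) + ")"


def contains_text_xpath(term: str) -> str:
    """Build the same expression as above, with the term safely quoted."""
    return f"//*/text()[contains(., {xpath_literal(term)})]"


if __name__ == "__main__":
    print(contains_text_xpath("Abiotrophia defectiva"))
```

Inside the spider this would be used as response.xpath(contains_text_xpath(bact)).getall(). Scrapy's selector documentation also describes XPath variables (for example "//*/text()[contains(., $term)]" with term=bact passed as a keyword argument), which should avoid the quoting problem altogether.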

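I changed bacteria_species to a couple of terms that I saw on the web page, and it found them. One thing to note is that XPath is case-sensitive, which may be a problem depending on the data.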
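
If case does turn out to matter, XPath 1.0 has no lower-case function, but the usual workaround is translate(), which maps each upper-case letter to its lower-case counterpart before contains() is applied. A minimal sketch, assuming the terms contain no double quotes and only ASCII letters need folding (contains_ci_xpath is a hypothetical helper name):

```python
import string

UPPER = string.ascii_uppercase
LOWER = string.ascii_lowercase


def contains_ci_xpath(term: str) -> str:
    """XPath selecting text nodes that contain `term`, ignoring ASCII case.

    Assumption: `term` contains no double quotes.
    """
    # Fold the search term the same way translate() folds the node text,
    # i.e. ASCII letters only.
    folded = term.translate(str.maketrans(UPPER, LOWER))
    return (
        f'//*/text()[contains(translate(., "{UPPER}", "{LOWER}"), '
        f'"{folded}")]'
    )


if __name__ == "__main__":
    print(contains_ci_xpath("Abyssicoccus albus"))
```

In the spider, response.xpath(contains_ci_xpath(bact)).getall() would then match "abyssicoccus albus", "Abyssicoccus Albus" and so on, while still returning the text with its original capitalisation.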