Extract item for each spider in scrapy project

Question

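I have more than a dozen spiders in one Scrapy project, each extracting various items from a different source. Among other elements, there is one that forces me to copy the same regex code into every spider over and over, for example: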

item['element'] = re.findall('my_regex', response.text)

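I use this regex to get the same element, which is defined in the Scrapy project's items. Is there a way to avoid the copying? Where in the project should I put it so that I don't have to repeat it in every spider and only add the elements that differ?

My project structure is the default one.

Any help is appreciated. Thanks in advance.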

Answer 1

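So, if I understand your question correctly, you want to use the same regular expression in multiple spiders.

You can do this:

Create a Python module named regex_to_use and put your regular expressions in that module.

Example: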

# regex_to_use.py

regex_one = 'test'

You can then access this expression in your spider.

# spider.py

import regex_to_use
import re as regex

find_string = regex.search(regex_to_use.regex_one, ' this is a test')
print(find_string)
# output 
<re.Match object; span=(11, 15), match='test'>
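
Applied to the question, the shared module could hold a compiled pattern and a small helper, so that each spider only calls the helper. This is a minimal sketch: the pattern `\d+` and the names `ELEMENT_PATTERN` and `extract_element` are placeholders for illustration, not part of the original code.

```python
# regex_to_use.py (sketch; the pattern and names are hypothetical)
import re

# Compile once at import time so every spider reuses the same object.
ELEMENT_PATTERN = re.compile(r'\d+')


def extract_element(text):
    """Return all matches of the shared pattern in text."""
    return ELEMENT_PATTERN.findall(text)


# In a spider this would be: item['element'] = extract_element(response.text)
print(extract_element('order 12 shipped in 3 days'))
# output
# ['12', '3']
```

If the pattern changes later, only this one module needs to be edited.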

You could also do something like this in the regex_to_use module:

# regex_to_use.py

import re as regex

class CustomRegularExpressions(object):
    
    def __init__(self, text):
        """
        :param text: string containing the variable to search for
        """
        self._text = text

    def search_text(self):
        find_xyx = regex.search('test', self._text)
        return find_xyx

You would call it from your spider like this:

# spider.py

from regex_to_use import CustomRegularExpressions


find_word = CustomRegularExpressions('this is a test').search_text()
print(find_word)
# output
<re.Match object; span=(10, 14), match='test'>

If you have multiple regular expressions, you could do this:

# regex_to_use.py

import re as regex

class CustomRegularExpressions(object):

    def __init__(self, text):
        """
        :param text: string containing the variable to search for
        """
        self._text = text

    def search_text(self, regex_to_use):
        # Look the pattern up by name; an unknown name raises KeyError.
        regular_expressions = {"regex_one": 'test', "regex_two": 'test_2'}
        expression = regular_expressions[regex_to_use]
        find_xyx = regex.search(expression, self._text)
        return find_xyx

# spider.py

from regex_to_use import CustomRegularExpressions

find_word = CustomRegularExpressions('this is a test').search_text('regex_one')
print(find_word)
# output
<re.Match object; span=(10, 14), match='test'>

You could also use a staticmethod in the class CustomRegularExpressions:

# regex_to_use.py

import re as regex

class CustomRegularExpressions:

    @staticmethod
    def search_text(regex_to_use, text_to_search):
        # Look the pattern up by name; an unknown name raises KeyError.
        regular_expressions = {"regex_one": 'test', "regex_two": 'test_2'}
        expression = regular_expressions[regex_to_use]
        find_xyx = regex.search(expression, text_to_search)
        return find_xyx

# spider.py

from regex_to_use import CustomRegularExpressions

# find_word would be replaced with item['element'] 
# this is a test would be replaced with response.text
find_word = CustomRegularExpressions.search_text('regex_one', 'this is a test')
print(find_word)
# output
<re.Match object; span=(10, 14), match='test'>

If you add a docstring to the function search_text(), you can use it to show which regular expressions are in the Python dictionary.
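
A minimal sketch of that idea follows, with the dictionary moved to module level and the available names listed in the docstring so that help() displays them. The pattern values are placeholders, and both the module-level name `REGULAR_EXPRESSIONS` and the KeyError on an unknown name are assumptions made for this example.

```python
# regex_to_use.py (sketch; pattern values and names are placeholders)
import re

REGULAR_EXPRESSIONS = {"regex_one": r'test', "regex_two": r'\d+'}


def search_text(regex_to_use, text_to_search):
    """Search text_to_search with one of the shared patterns.

    Available names for regex_to_use:
        regex_one: the literal word 'test'
        regex_two: one or more digits

    Returns a re.Match object, or None if nothing matches.
    Raises KeyError if regex_to_use is not a known name.
    """
    return re.search(REGULAR_EXPRESSIONS[regex_to_use], text_to_search)


# help(search_text) prints the docstring, including the pattern names.
match = search_text('regex_two', 'page 42')
print(match.group(0))
# output
# 42
```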

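To show how all of this works...

Here is a python project that I wrote and published. Take a look at the utilities folder. It holds functions that I can use throughout my code without copying and pasting the same code over and over.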

Answer 2

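A lot of common data is typically shared by multiple spiders, such as regexes or even XPaths.

It's a good idea to isolate them.

You can use something like this: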

/project
    /site_data
        handle_responses.py
        ...
    /spiders
        your_spider.py
        ...

Isolate the functions that share a common purpose.

# handle_responses.py

# imports ...
from re import search


def get_specific_commom_data(text: str):
    # It is probably a good idea to handle predictable errors here (`try`/`except`).
    return search('your_regex', text)
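
For the error handling mentioned in the comment, one option is to treat a missing match as an expected case and return a default instead of letting the spider fail. This is a minimal sketch, assuming a placeholder pattern with one capture group; the names `PRICE_PATTERN` and `get_first_group` are hypothetical.

```python
# handle_responses.py (sketch; the pattern and function name are hypothetical)
import re
from typing import Optional

PRICE_PATTERN = re.compile(r'price:\s*(\d+)')


def get_first_group(text: str, default: Optional[str] = None) -> Optional[str]:
    """Return the first captured group, or default when there is no match."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        # re.search returns None rather than raising when nothing matches.
        return default
    return match.group(1)


print(get_first_group('price: 30 EUR'))
print(get_first_group('sold out', default='n/a'))
# output
# 30
# n/a
```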

And use it only where that functionality is needed.

# your_spider.py

# imports ...
import scrapy
from site_data.handle_responses import get_specific_commom_data


class YourSpider(scrapy.Spider):
    # ... previous code
    def your_method(self, response):
        # ... previous code
        item['element'] = get_specific_commom_data(response.text)

Try to keep it simple, and do what you need to do to solve your problem.

Answer 3

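I could copy the regex into multiple spiders instead of importing objects from other .py files. I understand that those approaches have their use cases, but here I don't want to add anything to any spider and still want the element in the results.

There are some good answers here, but they don't really solve that problem. After searching for a few days I found this solution, and I hope it is useful to others looking for a similar answer.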

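# middlewares.py
import re

from yourproject.items import YourItem

# Find this method in the spider middleware class and add your element.
def process_spider_output(self, response, result, spider):
    # Pass along everything the spider itself produced.
    for i in result:
        yield i
    # Then emit one more item carrying the shared regex result.
    item = YourItem()
    item['element'] = re.findall('my_regex', response.text)
    yield item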

Now uncomment the middleware in settings.py:

#settings.py

SPIDER_MIDDLEWARES = {
    'yourproject.middlewares.YoursprojectMiddleware': 543,
}

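Every spider will now include the element in its result data. I am still looking for a better solution, because this slows the spiders down, and I will update the answer when I find one.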
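
A variation that may avoid emitting a separate item is to attach the element to every item the spider already yields. This is only a sketch: the class name `ElementSpiderMiddleware` and the pattern are hypothetical, it assumes items are dicts or other mutable mappings that accept an `element` field, and the class still has to be enabled under SPIDER_MIDDLEWARES. The regex runs once per response rather than once per item.

```python
# middlewares.py (sketch; class name and pattern are hypothetical)
import re
from collections.abc import MutableMapping

ELEMENT_PATTERN = re.compile(r'my_regex')


class ElementSpiderMiddleware:

    def process_spider_output(self, response, result, spider):
        # Run the shared regex once per response.
        element = ELEMENT_PATTERN.findall(response.text)
        for entry in result:
            # Requests and other non-mapping objects pass through unchanged.
            if isinstance(entry, MutableMapping):
                entry['element'] = element
            yield entry
```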

scrapy

python-3.x