Possible to replace Scrapy's default lxml parser with Beautiful Soup's html5lib parser?
Question: Is there a way to integrate BeautifulSoup's html5lib parser into a Scrapy project, instead of Scrapy's default lxml parser?
Scrapy's parser fails (for some elements) on the pages I scrape.
This only happens on about 2 of every 20 pages.
As a fix, I've added BeautifulSoup's parser to the project (which works).
That said, I feel like I'm doubling the work with conditionals and multiple parsers... at a certain point, what's the reason for using Scrapy's parser at all?
The code does work... it just feels like a hack.
I'm not an expert: is there a more elegant way to do this?
Thanks in advance.
Update:
Adding a middleware class to Scrapy (from the Python package scrapy-beautifulsoup) works like a charm. Apparently Scrapy's lxml is not as robust as BeautifulSoup's lxml. I didn't have to resort to the html5lib parser, which is 30x+ slower.
from bs4 import BeautifulSoup


class BeautifulSoupMiddleware(object):
    def __init__(self, crawler):
        super(BeautifulSoupMiddleware, self).__init__()
        self.parser = crawler.settings.get('BEAUTIFULSOUP_PARSER', "html.parser")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        """Overridden process_response would "pipe" response.body through BeautifulSoup."""
        return response.replace(body=str(BeautifulSoup(response.body, self.parser)))
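For reference, Scrapy only runs a downloader middleware once it is enabled in settings.py. A minimal sketch, assuming the import path used by the scrapy-beautifulsoup package (check its README for the exact path and a sensible priority):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical import path; verify against the installed package.
    'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 543,
}

# Optional: the parser the middleware reads via the BEAUTIFULSOUP_PARSER
# setting ("html.parser" is the fallback in the class above).
BEAUTIFULSOUP_PARSER = 'lxml'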
Original:
import scrapy
from scrapy.item import Item, Field
from scrapy.loader.processors import TakeFirst, MapCompose
from scrapy import Selector
from scrapy.loader import ItemLoader
from w3lib.html import remove_tags
from bs4 import BeautifulSoup

# CompanyItem and ReviewItem are item classes defined elsewhere in the project.


class SimpleSpider(scrapy.Spider):
    name = 'SimpleSpider'
    allowed_domains = ['totally-above-board.com']
    start_urls = [
        'https://totally-above-board.com/nefarious-scrape-page.html'
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            'crawler.spiders.simple_spider.Pipeline': 400
        }
    }

    def parse(self, response):
        yield from self.parse_company_info(response)
        yield from self.parse_reviews(response)

    def parse_company_info(self, response):
        print('parse_company_info')
        print('==================')

        loader = ItemLoader(CompanyItem(), response=response)
        loader.add_xpath('company_name',
                         '//h1[contains(@class,"sp-company-name")]//span//text()')
        yield loader.load_item()

    def parse_reviews(self, response):
        print('parse_reviews')
        print('=============')

        selector = Selector(response)

        # Total reviews reported on the page, e.g. 49
        search = '//span[contains(@itemprop,"reviewCount")]//text()'
        review_count = selector.xpath(search).get()
        review_count = int(float(review_count))

        # Number of review elements Scrapy's lxml could find, e.g. 0
        search = '//div[@itemprop="review"]'
        review_element_count = len(selector.xpath(search))

        # Use Scrapy or Beautiful Soup?
        if review_count > review_element_count:
            # Try Beautiful Soup
            soup = BeautifulSoup(response.text, "lxml")
            root = soup.findAll("div", {"itemprop": "review"})
            for review in root:
                loader = ItemLoader(ReviewItem(), selector=review)
                review_text = review.find("span", {"itemprop": "reviewBody"}).text
                loader.add_value('review_text', review_text)
                author = review.find("span", {"itemprop": "author"}).text
                loader.add_value('author', author)
                yield loader.load_item()
        else:
            # Try Scrapy
            review_list_xpath = '//div[@itemprop="review"]'
            selector = Selector(response)
            for review in selector.xpath(review_list_xpath):
                loader = ItemLoader(ReviewItem(), selector=review)
                loader.add_xpath('review_text',
                                 './/span[@itemprop="reviewBody"]//text()')
                loader.add_xpath('author',
                                 './/span[@itemprop="author"]//text()')
                yield loader.load_item()

        yield from self.paginate_reviews(response)

    def paginate_reviews(self, response):
        print('paginate_reviews')
        print('================')

        # Try Scrapy
        selector = Selector(response)
        search = '''//span[contains(@class,"item-next")]
                    //a[@class="next"]/@href
                 '''
        next_reviews_link = selector.xpath(search).get()

        # Try Beautiful Soup
        if next_reviews_link is None:
            soup = BeautifulSoup(response.text, "lxml")
            try:
                next_reviews_link = soup.find("a", {"class": "next"})['href']
            except Exception:
                pass

        if next_reviews_link:
            yield response.follow(next_reviews_link, self.parse_reviews)
This is a common feature request for Parsel, the Scrapy library for XML/HTML scraping.
However, you don't need to wait for such a feature to be implemented. You can fix the HTML code using BeautifulSoup, and use Parsel on the fixed HTML:
from bs4 import BeautifulSoup
# …
response = response.replace(body=str(BeautifulSoup(response.body, "html5lib")))
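A minimal sketch of where that one-liner fits, assuming it replaces the dual-parser branching in a callback like parse_reviews above (the XPaths and ReviewItem are taken from the original spider):

def parse_reviews(self, response):
    # Re-serialize the body through html5lib so Parsel/lxml sees
    # well-formed HTML, then query it with plain XPath as usual.
    response = response.replace(
        body=str(BeautifulSoup(response.body, "html5lib")))
    for review in response.xpath('//div[@itemprop="review"]'):
        loader = ItemLoader(ReviewItem(), selector=review)
        loader.add_xpath('review_text', './/span[@itemprop="reviewBody"]//text()')
        loader.add_xpath('author', './/span[@itemprop="author"]//text()')
        yield loader.load_item()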
If the original page was not utf-8 encoded, using the answer from @Gallaecio can lead to charset errors, because the response has already been set to a different encoding.
So you have to switch the encoding first.
In addition, there can be a problem with character escaping.
For example, if the character < occurs in the text of the HTML, it must be escaped as &lt;. Otherwise "lxml" deletes it together with the text near it, treating it as a broken HTML tag.
"html5lib" escapes such characters, but it is slow:
response = response.replace(encoding='utf-8',
                            body=str(BeautifulSoup(response.body, 'html5lib')))
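A quick illustrative comparison (my own example, not from the original answer; the exact lxml output can vary by libxml2 version):

from bs4 import BeautifulSoup

broken = "<p>price < 100 euros</p>"
# lxml treats the stray "<" as the start of a bad tag and may drop
# it along with nearby text.
print(BeautifulSoup(broken, "lxml"))
# html5lib follows the HTML5 recovery rules and keeps the text,
# escaped as &lt;, e.g.:
# <html><head></head><body><p>price &lt; 100 euros</p></body></html>
print(BeautifulSoup(broken, "html5lib"))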
"html.parser" 更快,但还必须指定 from_encoding
(例如 'cp1251')。
response = response.replace(encoding='utf-8',
                            body=str(BeautifulSoup(response.body, 'html.parser',
                                                   from_encoding='cp1251')))
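As a variation (an assumption on my part, not part of the original answer), you can reuse the encoding Scrapy already detected instead of hard-coding it, since TextResponse exposes it as response.encoding:

# Assumes Scrapy detected the source charset correctly.
response = response.replace(
    encoding='utf-8',
    body=str(BeautifulSoup(response.body, 'html.parser',
                           from_encoding=response.encoding)))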