Crawling kununu - 0 items back with scrapy
I'm new to Python and trying to scrape kununu with Scrapy. When I run the crawl, I get 0 pages crawled and 0 items scraped.
Output:
...
'scrapy.extensions.logstats.LogStats']
2021-07-25 11:56:08 [scrapy.core.engine] INFO: Spider opened
2021-07-25 11:56:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 11:56:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-07-25 11:56:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.kununu.com/de/joimax1/kommentare> from <GET https://www.kununu.com/de/joimax1/kommentare/>
2021-07-25 11:56:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.kununu.com/de/joimax1/kommentare> from <GET http://www.kununu.com/de/joimax1/kommentare>
Aktuelle Seite : https://www.kununu.com/de/joimax1/kommentare
....
import scrapy
import logging


class KununuSpider(scrapy.Spider):
    name = "kununu"
    allowed_domains = ["kununu.com"]

    # Reduce Log-Level of some Loggers to avoid "spam" messages in Command line
    def __init__(self, *args, **kwargs):
        logger = logging.getLogger('scrapy.core.scraper')
        logger.setLevel(logging.INFO)
        logger2 = logging.getLogger('scrapy.core.engine')
        logger2.setLevel(logging.INFO)
        logger3 = logging.getLogger('scrapy.middleware')
        logger3.setLevel(logging.WARNING)
        logger4 = logging.getLogger('kununu')
        logger4.setLevel(logging.WARNING)
        super().__init__(*args, **kwargs)

    def start_requests(self):
        yield scrapy.Request('https://www.kununu.com/de/joimax1/kommentare/', self.parse)

    def parse(self, response):
        print("Aktuelle Seite : {}".format(response.url))
        review_list = response.css('article.company-profile-review')
        print(review_list)
        for elem in review_list:
            item = {
                'url': response.url,
                'date': elem.css('span::text')[1].extract(),
                'title': elem.css('a::text')[0].extract(),
                'rating': elem.css('div.tile-heading::text')[0].extract()
            }
            yield item
        next_page_url = response.css('a.btn.btn-default.btn-block::attr(href)')  # does this attribute exist at all or is returned an empty list?
        if next_page_url:
            next_page_url = next_page_url[0].extract()
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
        else:
            self.log('Last page reached: ' + response.url)
            self.log('Last page contained {} item(s)'.format(len(review_list)))
This happens because the website rejects your requests when you use Scrapy's default user-agent.
You can check this with:
scrapy shell "https://www.kununu.com/de/joimax1/kommentare/"
view(response)
This opens the downloaded response in your browser, so you can see what Scrapy actually received.
Send a custom header with the following code:
request = scrapy.Request(
    url=url,
    headers={
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
    }
)
fetch(request)
view(response)
Now you will see the real page content.
Also, your CSS path is not correct.
The reviews are stored in elements with class="index__reviewBlock__27gnB".
I don't know this website well, but it looks like these class names are randomly generated, so it is better to match them by the stable prefix, like this (a sketch of how to use this inside parse follows after the shell output):
In []: response.xpath("//*[contains(@class,'index__reviewBlock')]")
Out[]:
[<Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
<Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
<Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
<Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
<Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
<Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
<Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
<Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
<Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
<Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>]
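As a rough sketch (the inner selectors for title, date and rating are only assumptions here and need to be checked against the real markup with view(response) first), the parse method could then iterate those blocks like this:

def parse(self, response):
    # Match review blocks by the stable class prefix; the hashed suffix may change
    review_list = response.xpath("//*[contains(@class,'index__reviewBlock')]")
    for elem in review_list:
        yield {
            'url': response.url,
            # Placeholder inner selectors - verify them in the browser/shell first
            'title': elem.xpath(".//a/text()").get(),
            'date': elem.xpath(".//time/text()").get(),
            'rating': elem.xpath(".//span/text()").get(),
        }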
Hope this helps :)
Edit: In your code, you would send the request like this:
def start_requests(self):
    yield scrapy.Request(
        url="https://www.kununu.com/de/joimax1/kommentare/",
        headers={
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
        },
        callback=self.parse
    )
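As an alternative (not from the original answer, just a common Scrapy pattern), you could also set the user agent once for the whole spider via custom_settings, so every request uses it without repeating the headers argument:

class KununuSpider(scrapy.Spider):
    name = "kununu"
    # USER_AGENT overrides Scrapy's default "Scrapy/x.y (+https://scrapy.org)" agent
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    }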
0 items come back because the data is generated by the backend and loaded via JavaScript. Open Chrome DevTools, go to the Network tab and filter by XHR; the Headers tab gives you the request URL, and the Preview tab shows the data.
Here is a working solution:
import scrapy
import json


class KununuSpider(scrapy.Spider):
    name = 'kununu'
    headers = {
        "authority": "www.kununu.com",
        "path": "/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=2",
        "scheme": "https",
        "accept": "application/json",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9,bn;q=0.8,es;q=0.7,ar;q=0.6",
        "content-type": "application/json",
        "referer": "https://www.kununu.com/de/joimax1/kommentare",
        # "sec-ch-ua": '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "same-origin",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
        "x-lang": "de_DE"
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1',
            callback=self.parse,
            method="GET",
            headers=self.headers
        )

    def parse(self, response):
        # The endpoint returns JSON, so decode the body instead of using selectors
        data = json.loads(response.body)
        for resp in data['reviews']:
            yield {
                'title': resp['title'],
                'date': resp['createdAt'],
                'rating': resp['roundedScore']
            }
Output:
2021-07-25 17:28:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1> (referer: https://www.kununu.com/de/joimax1/kommentare)
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Mit viel Abstand betrachtet leider viel Negatives und wenig Positives', 'date': '2021-06-30T00:00:00+00:00', 'rating': 2}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Eigene Meinung ist nicht willkommen.', 'date': '2021-02-01T00:00:00+00:00', 'rating': 1}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Gar nicht so schlimm', 'date': '2021-04-21T00:00:00+00:00', 'rating': 4}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Es könnte alles so schön sein...', 'date': '2021-01-30T00:00:00+00:00', 'rating': 3}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Außen Hui...', 'date': '2020-12-16T00:00:00+00:00', 'rating': 3}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Mirco-Managment as its best', 'date': '2020-08-20T00:00:00+00:00', 'rating': 2}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Katastrophal', 'date': '2020-07-01T00:00:00+00:00', 'rating': 1}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Licht und Schatten sind sehr nahe beieinander.', 'date': '2020-05-01T00:00:00+00:00', 'rating': 3.5}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Leider keine Empfehlung von mir', 'date': '2019-11-19T00:00:00+00:00', 'rating': 2.5}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Wohl und Weh nahe beieinander', 'date': '2019-03-30T00:00:00+00:00', 'rating': 3}
2021-07-25 17:28:12 [scrapy.core.engine] INFO: Closing spider (finished)
2021-07-25 17:28:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 748,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 12850,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
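Note that this spider only requests page=1. If you need all pages, one possible sketch (not part of the original answer; it assumes that an out-of-range page simply returns an empty "reviews" list, which should be verified in DevTools) is to replace the parse method above with one that keeps incrementing the page parameter until nothing comes back:

    def parse(self, response):
        data = json.loads(response.body)
        reviews = data.get('reviews', [])
        for resp in reviews:
            yield {
                'title': resp['title'],
                'date': resp['createdAt'],
                'rating': resp['roundedScore']
            }
        # Assumption: an empty "reviews" list means we are past the last page
        if reviews:
            current_page = int(response.url.split('page=')[-1])
            next_url = response.url.replace(
                'page={}'.format(current_page),
                'page={}'.format(current_page + 1)
            )
            yield scrapy.Request(next_url, callback=self.parse, headers=self.headers)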