Crawling kununu - 0 items back with scrapy

I'm new to Python and I'm trying to scrape kununu with scrapy. When I run the crawl, I get 0 pages crawled and 0 items scraped.

Output:

...
'scrapy.extensions.logstats.LogStats']
2021-07-25 11:56:08 [scrapy.core.engine] INFO: Spider opened
2021-07-25 11:56:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 11:56:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-07-25 11:56:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.kununu.com/de/joimax1/kommentare> from <GET https://www.kununu.com/de/joimax1/kommentare/>
2021-07-25 11:56:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.kununu.com/de/joimax1/kommentare> from <GET http://www.kununu.com/de/joimax1/kommentare>
Aktuelle Seite : https://www.kununu.com/de/joimax1/kommentare
....
import scrapy
import logging


class KununuSpider(scrapy.Spider):
    name = "kununu"
    allowed_domains = ["kununu.com"]

    # Reduce Log-Level of some Loggers to avoid "spam" messages in Command line
    def __init__(self, *args, **kwargs):
        logger = logging.getLogger('scrapy.core.scraper')
        logger.setLevel(logging.INFO)
        logger2 = logging.getLogger('scrapy.core.engine')
        logger2.setLevel(logging.INFO)
        logger3 = logging.getLogger('scrapy.middleware')
        logger3.setLevel(logging.WARNING)
        logger4 = logging.getLogger('kununu')
        logger4.setLevel(logging.WARNING)
        super().__init__(*args, **kwargs)

    def start_requests(self):
        yield scrapy.Request('https://www.kununu.com/de/joimax1/kommentare/', self.parse)

    def parse(self, response):
        print("Aktuelle Seite : {}".format(response.url))
        review_list = response.css('article.company-profile-review')
        print(review_list)
        for elem in review_list:
            item = {
                'url': response.url,
                'date': elem.css('span::text')[1].extract(),
                'title': elem.css('a::text')[0].extract(),
                'rating': elem.css('div.tile-heading::text')[0].extract()
            }
            yield item

        next_page_url = response.css('a.btn.btn-default.btn-block::attr(href)')  # does this attribute exist at all, or is an empty list returned?
        if next_page_url:
            next_page_url = next_page_url[0].extract()
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
        else:
            self.log('Last page reached: ' + response.url)
            self.log('Last page contained {} item(s)'.format(len(review_list)))

This happens because the website rejects your requests when you use Scrapy's default user-agent.

You can check this with:

scrapy shell "https://www.kununu.com/de/joimax1/kommentare/"
view(response)

This will open the response in your browser, showing you what Scrapy actually received.

Send a custom header with the following code:

request = scrapy.Request(
              url = url,
              headers={
                  "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
              }
          )
fetch(request)
view(response)

Now you will see the actual page content.

Also, your CSS path is incorrect.

The review blocks are stored in elements with the class class="index__reviewBlock__27gnB".

I don't know this site well, but these class names look randomly generated, so it's better to select them like this:

In []: response.xpath("//*[contains(@class,'index__reviewBlock')]")

Out[]: 
[<Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
 <Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
 <Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
 <Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
 <Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
 <Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
 <Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
 <Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
 <Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>,
 <Selector xpath="//*[contains(@class,'index__reviewBlock')]" data='<div class="index__reviewBlock__27gnB...'>]

Hope this helps :)

Edit: in your code, you would call it like this:

def start_requests(self):
    yield scrapy.Request(
        url = "https://www.kununu.com/de/joimax1/kommentare/",
        headers={
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
        },
        callback=self.parse
    )

0 items are returned because the backend generates the data with JavaScript. Open Chrome DevTools, go to the Network tab, then the XHR tab; click the Headers tab to find the URL, then click the Preview tab to see the data.

Here is a working solution:

import scrapy
import json


class KununuSpider(scrapy.Spider):
    name = 'kununu'
    
    headers = {
        "authority": "www.kununu.com",
        "path": "/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=2",
        "scheme": "https",
        "accept": "application/json",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9,bn;q=0.8,es;q=0.7,ar;q=0.6",
        "content-type": "application/json",
        "referer": "https://www.kununu.com/de/joimax1/kommentare",
        # "sec-ch-ua": '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest":"empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "same-origin",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
        "x-lang": "de_DE"
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1',
            callback=self.parse,
            method="GET",
            headers=self.headers
        )

    def parse(self, response):
        data = json.loads(response.body)
        for review in data['reviews']:
            item = {
                'title': review['title'],
                'date': review['createdAt'],
                'rating': review['roundedScore']
            }
            yield item
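The JSON handling in parse() can be exercised without running the spider. A minimal sketch; the payload below is an illustrative stand-in with the same field names as above, not real kununu data:

```python
import json


def extract_reviews(body):
    """Parse a reviews JSON payload and yield simplified items."""
    data = json.loads(body)
    for review in data["reviews"]:
        yield {
            "title": review["title"],
            "date": review["createdAt"],
            "rating": review["roundedScore"],
        }


# Illustrative payload mirroring the field names used by the API response
body = json.dumps({
    "reviews": [
        {"title": "Katastrophal", "createdAt": "2020-07-01T00:00:00+00:00", "roundedScore": 1},
        {"title": "Gar nicht so schlimm", "createdAt": "2021-04-21T00:00:00+00:00", "roundedScore": 4},
    ]
})

items = list(extract_reviews(body))
print(items[0]["rating"])  # 1
```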
   

Output:

2021-07-25 17:28:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1> (referer: https://www.kununu.com/de/joimax1/kommentare)
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Mit viel Abstand betrachtet leider viel Negatives und wenig Positives', 'date': '2021-06-30T00:00:00+00:00', 'rating': 2}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Eigene Meinung ist nicht willkommen.', 'date': '2021-02-01T00:00:00+00:00', 'rating': 1}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Gar nicht so schlimm', 'date': '2021-04-21T00:00:00+00:00', 'rating': 4}     
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Es könnte alles so schön sein...', 'date': '2021-01-30T00:00:00+00:00', 'rating': 3}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Außen Hui...', 'date': '2020-12-16T00:00:00+00:00', 'rating': 3}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Mirco-Managment as its best', 'date': '2020-08-20T00:00:00+00:00', 'rating': 2}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Katastrophal', 'date': '2020-07-01T00:00:00+00:00', 'rating': 1}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Licht und Schatten sind sehr nahe beieinander.', 'date': '2020-05-01T00:00:00+00:00', 'rating': 3.5}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Leider keine Empfehlung von mir', 'date': '2019-11-19T00:00:00+00:00', 'rating': 2.5}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Wohl und Weh nahe beieinander', 'date': '2019-03-30T00:00:00+00:00', 'rating': 3}
2021-07-25 17:28:12 [scrapy.core.engine] INFO: Closing spider (finished)
2021-07-25 17:28:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 748,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 12850,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,