Extracting data with Scrapy / XPath
I created the following spider, which produces these problems when run:
- The headline is truncated, probably because of the <em> tags inside it
- The location contains extra whitespace and \n characters
I'm still trying to find a solution for these two remaining problems.
import scrapy


class GitHubSpider(scrapy.Spider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]

    def parse(self, response):
        for github in response.css(".Box-row"):
            yield {
                "github_link": github.css(".mr-1::attr(href)").get(),
                "name": github.css(".mr-1::text").get(),
                "headline": github.css(".mb-1::text").get(),
                "location": github.css(".mr-3:nth-child(1)::text").get(),
            }
Expected result:
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
'github_link': '/djangofan',
'name': 'Jon Austen',
'headline': 'Software Engineer interested in Java, Python, Ruby, Groovy, Bash, Clojure, React-Native, and Docker. Focus: Testing, CI, and Micro-Services.',
'location': 'Portland, OR'
}
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
'github_link': '/django-wong',
'name': ' Wong',
'headline': 'PHP / Node.js / Dart (Flutter) / React Native / Scala',
'location': 'China'
}
[...]
Actual result:
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
'github_link': '/djangofan',
'name': 'Jon Austen',
'headline': 'Software Engineer interested in Java, Python, Ruby, Groovy, Bash, Clojure, ',
'location': '\n Portland, OR\n '
}
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
'github_link': '/django-wong',
'name': ' Wong',
'headline': 'PHP / Node.js / Dart (Flutter) / ',
'location': '\n China\n '
}
[...]
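The truncation can be reproduced outside Scrapy: when a paragraph contains an inline <em> tag, its text is split into several text nodes, and taking only the first match drops everything after the tag. A minimal sketch with the standard library (simplified stand-in markup, not the actual GitHub HTML):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a search-result headline containing an <em> tag.
html = '<p class="mb-1">Software Engineer interested in <em>Django</em> and React.</p>'
p = ET.fromstring(html)

# Only the first text node - what a single ::text / text() .get() returns:
first_text_node = p.text
# All text nodes joined - what XPath string() covers:
full_text = "".join(p.itertext())

print(first_text_node)  # 'Software Engineer interested in '
print(full_text)        # 'Software Engineer interested in Django and React.'
```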
The first problem can be solved with XPath and string().
The second problem can be solved with strip().
import scrapy


class GitHubSpider(scrapy.Spider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]

    def strip_string(self, string):
        if string is not None:
            return string.strip()

    def parse(self, response):
        for github in response.css(".Box-row"):
            github_link = self.strip_string(github.css(".mr-1::attr(href)").get())
            name = self.strip_string(github.css(".mr-1::text").get())
            # The XPath must be relative (".//") so string() is evaluated
            # against the current row rather than the first match on the page.
            headline = self.strip_string(github.xpath('string(.//p[@class="mb-1"])').get())
            location = self.strip_string(github.css(".mr-3:nth-child(1)::text").get())
            yield {
                "github_link": github_link,
                "name": name,
                "headline": headline,
                "location": location,
            }
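As an alternative to strip(), the whitespace cleanup can also be done entirely in Python with a small helper that mimics XPath's normalize-space(), which trims the ends and also collapses internal runs of whitespace. This is a sketch, not part of the original spider:

```python
def normalize_space(value):
    """Python equivalent of XPath normalize-space(): trims leading/trailing
    whitespace and collapses internal whitespace runs to single spaces."""
    if value is None:
        return None
    return " ".join(value.split())

print(normalize_space("\n          Portland, OR\n        "))  # 'Portland, OR'
```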
The solution is as follows.
Code:
import scrapy


class GitHubSpider(scrapy.Spider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]

    def parse(self, response):
        for github in response.xpath('//*[@class="flex-auto"]'):
            yield {
                "github_link": github.xpath('.//*[@class="color-text-secondary"]/@href').get(),
                "name": github.xpath('.//*[@class="mr-1"]/text()').get(),
                "headline": github.xpath('.//*[@class="mb-1"]//text()').get(),
                "location": github.xpath('normalize-space(.//*[@class="mr-3"]/text())').get(),
            }
Output:
{'github_link': '/djangofan', 'name': 'Jon Austen', 'headline': 'Software Engineer interested in Java, Python, Ruby, Groovy, Bash, Clojure, ', 'location': 'Portland, OR'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/django-wong', 'name': ' Wong', 'headline': 'PHP / Node.js / Dart (Flutter) / ', 'location': 'China'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/DipanshKhandelwal', 'name': 'Dipansh Khandelwal', 'headline': 'React', 'location': 'Bengaluru, India'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/usj-django-dev', 'name': 'Utsho Sadhak Joy', 'headline': 'const Joy = (', 'location': 'Khulna,Bangladesh'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/kpnemre', 'name': 'Emre Kapan', 'headline': 'React', 'location': ''}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/indraasura', 'name': 'Swarup Hegde', 'headline': 'Proficient in JavaScript, Python, ', 'location': 'Indore, India'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/pongstr', 'name': 'Pongstr', 'headline': 'Vue. ', 'location': 'Tallinn, Estonia'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/wencakisa', 'name': 'Ventsislav Tashev', 'headline': 'Django', 'location': 'Sofia, Bulgaria'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/novelview9', 'name': 'GarakdongBigBoy', 'headline': 'Django', 'location': ''}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/willemarcel', 'name': 'Wille Marcel', 'headline': 'Software engineer. Python, ', 'location': 'Salvador-BA-Brazil'}
2021-08-07 17:19:10 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-07 17:19:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 327,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 24422,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.271329,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 8, 7, 11, 19, 10, 249326),
'httpcompression/response_bytes': 132530,
'httpcompression/response_count': 1,
'item_scraped_count': 10,