Issues while extracting with Scrapy

Question

I'm experimenting with Scrapy and am currently trying the following:

scrapy shell "https://github.com/search?p=1&q=React+Django&type=Users"

# FName LName
response.css(".mr-1::text").get()

# Headline
response.css(".mb-1::text").get()

# Location
response.css("#user_search_results .mr-3:nth-child(1)::text").get()

# Email
response.css(".Link--muted::attr(href)").get()

I'm running into these two problems:

response.css(".mb-1::text").get()

Expected: Software Engineer interested in Java, Python, Ruby, Groovy, Bash, Clojure, React-Native, and Docker. Focus: Testing, CI, and Micro-Services.

Result: Software Engineer interested in Java, Python, Ruby, Groovy, Bash, Clojure,

response.css(".Link--muted::attr(href)").get()

Expected: djangofan@gmail.com

Result: None

Do you have any suggestions as to what I'm doing wrong here?

Answer 1

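Use XPath rather than CSS for these cases. The selector .mb-1::text matches only the text nodes that are direct children of the element, and .get() returns just the first of them, so the result stops where the first child element begins. Several elements also carry the mb-1 class, so you need to isolate the first one and collect its text together with the text of all its descendants.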

Example:

''.join(response.xpath('(//p[contains(@class, "mb-1")])[1]//text()').extract())

This gives you:

Software Engineer interested in Java, Python, Ruby, Groovy, Bash, Clojure, React-Native, and Docker.  Focus: Testing, CI, and Micro-Services.
