Extracting data with Scrapy / XPath
I created the following spider, which produces these problems when run:
- The headline is truncated, probably because of the <em> tags inside it
- The location contains extra whitespace and \n characters
I'm still trying to find a solution for these two remaining problems.
import scrapy


class GitHubSpider(scrapy.Spider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]

    def parse(self, response):
        for github in response.css(".Box-row"):
            yield {
                "github_link": github.css(".mr-1::attr(href)").get(),
                "name": github.css(".mr-1::text").get(),
                "headline": github.css(".mb-1::text").get(),
                "location": github.css(".mr-3:nth-child(1)::text").get(),
            }
Expected result:
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
'github_link': '/djangofan',
'name': 'Jon Austen',
'headline': 'Software Engineer interested in Java, Python, Ruby, Groovy, Bash, Clojure, React-Native, and Docker. Focus: Testing, CI, and Micro-Services.',
'location': 'Portland, OR'
}
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
'github_link': '/django-wong',
'name': ' Wong',
'headline': 'PHP / Node.js / Dart (Flutter) / React Native / Scala',
'location': 'China'
}
[...]
Actual result:
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
'github_link': '/djangofan',
'name': 'Jon Austen',
'headline': 'Software Engineer interested in Java, Python, Ruby, Groovy, Bash, Clojure, ',
'location': '\n Portland, OR\n '
}
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
'github_link': '/django-wong',
'name': ' Wong',
'headline': 'PHP / Node.js / Dart (Flutter) / ',
'location': '\n China\n '
}
[...]
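The truncation can be reproduced outside Scrapy: when a paragraph contains an inline <em> tag, its text is split into several text nodes, and taking only the first match drops everything after the tag. A minimal sketch with the standard library (simplified stand-in markup, not the actual GitHub HTML):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a search-result headline containing an <em> tag.
html = '<p class="mb-1">Software Engineer interested in <em>Django</em> and React.</p>'
p = ET.fromstring(html)

# Only the first text node - what a single ::text / text() .get() returns:
first_text_node = p.text
# All text nodes joined - what XPath string() covers:
full_text = "".join(p.itertext())

print(first_text_node)  # 'Software Engineer interested in '
print(full_text)        # 'Software Engineer interested in Django and React.'
```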
The first problem can be solved with XPath and string().
The second problem can be solved with strip().
import scrapy


class GitHubSpider(scrapy.Spider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]

    def strip_string(self, string):
        if string is not None:
            return string.strip()

    def parse(self, response):
        for github in response.css(".Box-row"):
            github_link = self.strip_string(github.css(".mr-1::attr(href)").get())
            name = self.strip_string(github.css(".mr-1::text").get())
            # The XPath must be relative (".//") so string() is evaluated
            # against the current row rather than the first match on the page.
            headline = self.strip_string(github.xpath('string(.//p[@class="mb-1"])').get())
            location = self.strip_string(github.css(".mr-3:nth-child(1)::text").get())
            yield {
                "github_link": github_link,
                "name": name,
                "headline": headline,
                "location": location,
            }
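As an alternative to strip(), the whitespace cleanup can also be done entirely in Python with a small helper that mimics XPath's normalize-space(), which trims the ends and also collapses internal runs of whitespace. This is a sketch, not part of the original spider:

```python
def normalize_space(value):
    """Python equivalent of XPath normalize-space(): trims leading/trailing
    whitespace and collapses internal whitespace runs to single spaces."""
    if value is None:
        return None
    return " ".join(value.split())

print(normalize_space("\n          Portland, OR\n        "))  # 'Portland, OR'
```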
The solution is as follows.
Code:
import scrapy


class GitHubSpider(scrapy.Spider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]

    def parse(self, response):
        for github in response.xpath('//*[@class="flex-auto"]'):
            yield {
                "github_link": github.xpath('.//*[@class="color-text-secondary"]/@href').get(),
                "name": github.xpath('.//*[@class="mr-1"]/text()').get(),
                "headline": github.xpath('.//*[@class="mb-1"]//text()').get(),
                "location": github.xpath('normalize-space(.//*[@class="mr-3"]/text())').get(),
            }
Output:
{'github_link': '/djangofan', 'name': 'Jon Austen', 'headline': 'Software Engineer interested in Java, Python, Ruby, Groovy, Bash, Clojure, ', 'location': 'Portland, OR'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/django-wong', 'name': ' Wong', 'headline': 'PHP / Node.js / Dart (Flutter) / ', 'location': 'China'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/DipanshKhandelwal', 'name': 'Dipansh Khandelwal', 'headline': 'React', 'location': 'Bengaluru, India'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/usj-django-dev', 'name': 'Utsho Sadhak Joy', 'headline': 'const Joy = (', 'location': 'Khulna,Bangladesh'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/kpnemre', 'name': 'Emre Kapan', 'headline': 'React', 'location': ''}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/indraasura', 'name': 'Swarup Hegde', 'headline': 'Proficient in JavaScript, Python, ', 'location': 'Indore, India'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/pongstr', 'name': 'Pongstr', 'headline': 'Vue. ', 'location': 'Tallinn, Estonia'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/wencakisa', 'name': 'Ventsislav Tashev', 'headline': 'Django', 'location': 'Sofia, Bulgaria'}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/novelview9', 'name': 'GarakdongBigBoy', 'headline': 'Django', 'location': ''}
2021-08-07 17:19:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{'github_link': '/willemarcel', 'name': 'Wille Marcel', 'headline': 'Software engineer. Python, ', 'location': 'Salvador-BA-Brazil'}
2021-08-07 17:19:10 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-07 17:19:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 327,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 24422,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.271329,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 8, 7, 11, 19, 10, 249326),
'httpcompression/response_bytes': 132530,
'httpcompression/response_count': 1,
'item_scraped_count': 10,