Scrapy - Comparing Data
I'm new to scrapy and not sure how to proceed with my project. The idea is that I want to scrape the first 2 pages of Hacker News and print out every article/title with a score above 300. With my limited knowledge, the code below is the best way I've figured out to get the information I want. My end goal is to compare the id with the post id to match them up, attach the points to the corresponding match, and then filter out anything under 300 points. I'm not sure how to compare the dictionary values I've already been able to scrape. Code below:
import scrapy


class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    start_urls = [
        'https://news.ycombinator.com',
        # 'https://news.ycombinator.com/news?p=2'
    ]

    def parse(self, response):
        link = response.css('tr.athing')
        score = response.css('td.subtext')
        for website in link:
            yield {
                'title': website.css('tr.athing td.title a.storylink::text').get(),
                'link': website.css('tr.athing td.title a::attr(href)').get(),
                'id': website.css('tr::attr(id)').get(),
            }
        for points in score:
            yield {
                'post_id': points.css('span::attr(id)').get(),
                'points': points.css('span.score::text').get()
            }
Is there a better way to accomplish what I'm trying to do?
The posts and scores lists have the same length and the same order. On each iteration, check whether the score of the corresponding post is >= 300.
import scrapy


class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    start_urls = [
        'https://news.ycombinator.com',
        #'https://news.ycombinator.com/news?p=2'
    ]

    def parse(self, response):
        posts = response.css('tr.athing')
        scores = response.css('td.subtext')
        for i in range(len(posts)):
            # get the post
            post = posts[i]
            # get the score of the corresponding post
            score = scores[i]
            score_point = score.css('span.score::text').get()
            # handle posts that have no score
            score_point = int(score_point.split(' ')[0]) if score_point else 0
            if score_point >= 300:
                yield {
                    'title': post.css('tr.athing td.title a.storylink::text').get(),
                    'link': post.css('tr.athing td.title a::attr(href)').get(),
                    'id': post.css('tr::attr(id)').get(),
                    'points': score_point
                }
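A side note on the loop, not part of the original answer: since posts and scores line up one-to-one, the same pairing can be written as for post, score in zip(posts, scores): and the posts[i] / scores[i] lookups dropped; the behaviour is identical (see the sketch after the output below).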
The posts with a score >= 300 are printed:
{'title': 'One man’s fight for the right to repair broken MacBooks', 'link': 'https://columbianewsservice.com/2021/05/21/one-mans-fight-for-the-right-to-repair-broken-macbooks/', 'id': '27254719', 'points': 1138}
2021-05-24 18:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.ycombinator.com>
{'title': 'Why I prefer making useless stuff', 'link': 'https://web.eecs.utk.edu/~azh/blog/makinguselessstuff.html', 'id': '27256867', 'points': 604}
2021-05-24 18:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.ycombinator.com>
{'title': 'Why Decentralised Applications Don’t Work', 'link': 'https://ingrids.space/posts/why-distributed-systems-dont-work/', 'id': '27259321', 'points': 320}
2021-05-24 18:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.ycombinator.com>
{'title': 'Freesound just reached 500K Creative Commons sounds', 'link': 'https://blog.freesound.org/?p=1340', 'id': '27232297', 'points': 696}
2021-05-24 18:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.ycombinator.com>
{'title': 'The Limits to Blockchain Scalability', 'link': 'https://vitalik.ca/general/2021/05/23/scaling.html', 'id': '27257641', 'points': 378}
2021-05-24 18:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.ycombinator.com>
{'title': 'Teardown of a PC Power Supply', 'link': 'https://www.righto.com/2021/05/teardown-of-pc-power-supply.html', 'id': '27256515', 'points': 351}
2021-05-24 18:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.ycombinator.com>
{'title': 'Dorodango: the Japanese art of making shiny mud balls (2019)', 'link': 'https://www.laurenceking.com/blog/2019/09/26/dorodango-blog/', 'id': '27255755', 'points': 313}
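The question also asks for the first 2 pages. The simplest option is to uncomment the second entry in start_urls. Alternatively, the spider can follow the "More" link at the bottom of each page. Below is a minimal sketch of that approach, not part of the answer above: it assumes Hacker News still exposes the "More" button as a.morelink and uses the same row/subtext markup and selectors as the answer, which may need updating if the site's HTML changes.

import scrapy


class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['https://news.ycombinator.com']

    def parse(self, response):
        posts = response.css('tr.athing')
        scores = response.css('td.subtext')
        # pair each story row with its subtext cell and keep items with >= 300 points
        for post, score in zip(posts, scores):
            score_text = score.css('span.score::text').get()
            points = int(score_text.split(' ')[0]) if score_text else 0
            if points >= 300:
                yield {
                    'title': post.css('td.title a.storylink::text').get(),
                    'link': post.css('td.title a::attr(href)').get(),
                    'id': post.attrib.get('id'),
                    'points': points,
                }
        # follow the "More" link, but stop after the second page
        page = response.meta.get('page', 1)
        if page < 2:
            next_page = response.css('a.morelink::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse,
                                      meta={'page': page + 1})

The page counter travels in the request meta, so the crawl stops after two pages no matter how many "More" links the site offers.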