Scrapy: follow links to scrape additional information for each item
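I am trying to scrape a website that lists information about roughly 15 articles on each page. For each article, I want to get the title and the date, and then follow the "Read More" link to get additional information (e.g., the source of the article).
So far, I have successfully scraped the title and date of every article across all pages and stored them in a CSV file.
My problem is that I cannot follow the Read More link to get the additional information (source) for each article. I have read many similar questions and their answers, but I still could not solve it.
Here is my code: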
import scrapy

class PoynterFakenewsSpider(scrapy.Spider):
    name = 'Poynter_FakeNews'
    allowed_domains = ['poynter.org']
    start_urls = ['https://www.poynter.org/ifcn-covid-19-misinformation//']
    custom_settings={ 'FEED_URI': "crawlPoynter_%(time)s.csv", 'FEED_FORMAT': 'csv'}

    def parse(self, response):
        print("procesing:"+response.url)
        Title = response.xpath('//h2[@class="entry-title"]/a/text()').extract()
        Date = response.xpath('//p[@class="entry-content__text"]/strong/text()').extract()
        ReadMore_links = response.xpath('//a[@class="button entry-content__button entry-content__button--smaller"]/@href').extract()
        for link in ReadMore_links:
            yield scrapy.Request(response.urljoin(links, callback=self.parsepage2)

        def parsepage2(self, response):
            Source = response.xpath('//p[@class="entry-content__text entry-content__text--smaller"]/text()').extract_first()
            return Source

        row_data = zip(Title, Date, Source)
        for item in row_data:
            scraped_info = {
                'page':response.url,
                'Title': item[0],
                'Date': item[1],
                'Source': item[2],
            }
            yield scraped_info

        next_page = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
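You need to process each article individually: get its Date, Title and "Read More" link, then yield another scrapy.Request for that link and pass the already-extracted values along with cb_kwargs (or request.meta in older versions):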
import scrapy

class PoynterFakenewsSpider(scrapy.Spider):
    name = 'Poynter_FakeNews'
    allowed_domains = ['poynter.org']
    start_urls = ['https://www.poynter.org/ifcn-covid-19-misinformation//']
    custom_settings={ 'FEED_URI': "crawlPoynter_%(time)s.csv", 'FEED_FORMAT': 'csv'}

    def parse(self, response):
        # Iterate over the articles one by one so that the title, date and
        # link extracted below always belong to the same article.
        for article in response.xpath('//article'):
            Title = article.xpath('.//h2[@class="entry-title"]/a/text()').get()
            Date = article.xpath('.//p[@class="entry-content__text"]/strong/text()').get()
            ReadMore_link = article.xpath('.//a[@class="button entry-content__button entry-content__button--smaller"]/@href').get()

            # Request the detail page and hand the values collected so far
            # to the callback as keyword arguments.
            yield scrapy.Request(
                url=response.urljoin(ReadMore_link),
                callback=self.parse_article_details,
                cb_kwargs={
                    'article_title': Title,
                    'article_date': Date,
                }
            )

        # Follow the pagination link, if there is one.
        next_page = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_article_details(self, response, article_title, article_date):
        # The source is only available on the article's own page.
        Source = response.xpath('//p[@class="entry-content__text entry-content__text--smaller"]/text()').extract_first()
        scraped_info = {
            'page':response.url,
            'Title': article_title,
            'Date': article_date,
            'Source': Source,
        }
        yield scraped_info
UPDATE
Everything works correctly on my side:
2020-05-14 00:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=japanese-schools-re-opened-then-were-closed-again-due-to-a-second-wave-of-coronavirus>
{'page': 'https://www.poynter.org/?ifcn_misinformation=japanese-schools-re-opened-then-were-closed-again-due-to-a-second-wave-of-coronavirus', 'Title': ' Japanese schools re-opened then were closed again due to a second wave of coronavirus.', 'Date': '2020/05/12 | France', 'Source': "This false claim originated from: CGT Educ'Action", 'files': []}
2020-05-14 00:59:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19>
{'page': 'https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19', 'Title': ' Famous French blue cheese, roquefort, is a “medecine against Covid-19”.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents>
{'page': 'https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents', 'Title': ' Administrative documents French people need to fill to go out are a copy paste from 1940 documents.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: IndignezVous', 'files': []}
2020-05-14 00:59:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable>
{'page': 'https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable', 'Title': ' Spanish and French masks prices are comparable.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown>
{'page': 'https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown', 'Title': ' French President Macron and its spouse are jetskiing during the lockdown.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air>
{'page': 'https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air', 'Title': ' French Minister of Justice Nicole Belloubet threathened the famous anchor Jean-Pierre Pernaut after he criticized the government policy about the pandemic on air.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
You might also want to look at the follow_all function, which is a better option than calling urljoin yourself:
https://docs.scrapy.org/en/latest/intro/tutorial.html#more-examples-and-patterns