
Scrapy: follow links to scrape additional information for each item

I am trying to scrape a website that shows some information for about 15 articles on each page. For each article I want to get the title and date, and then follow the "Read More" link to collect additional information (for example, the article's source).

So far I can successfully scrape the title and date of every article across all pages and store them in a CSV file.

My problem is that I cannot follow the Read More link to get the additional information (source) for each article. I have looked at many similar questions and their answers, but I still cannot get it to work.

Here is my code:

import scrapy
class PoynterFakenewsSpider(scrapy.Spider):
    name = 'Poynter_FakeNews'
    allowed_domains = ['poynter.org']
    start_urls = ['https://www.poynter.org/ifcn-covid-19-misinformation//']

    custom_settings={ 'FEED_URI': "crawlPoynter_%(time)s.csv", 'FEED_FORMAT': 'csv'} 

    def parse(self, response):
        print("procesing:"+response.url)
        Title = response.xpath('//h2[@class="entry-title"]/a/text()').extract()
        Date = response.xpath('//p[@class="entry-content__text"]/strong/text()').extract()
        
        ReadMore_links = response.xpath('//a[@class="button entry-content__button entry-content__button--smaller"]/@href').extract()
        for link in ReadMore_links:
            yield scrapy.Request(response.urljoin(link), callback=self.parsepage2)

    def parsepage2(self, response):
        Source = response.xpath('//p[@class="entry-content__text entry-content__text--smaller"]/text()').extract_first()
        return Source 

    row_data = zip(Title, Date, Source)
    for item in row_data:
        scraped_info = {
            'page':response.url,
            'Title': item[0], 
            'Date': item[1],
            'Source': item[2],
        }
        yield scraped_info

    next_page = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

You need to process each article, grab the date, title and "Read More" link, and then yield another scrapy.Request, passing the data along with cb_kwargs (or request.meta in older versions):

import scrapy


class PoynterFakenewsSpider(scrapy.Spider):
    name = 'Poynter_FakeNews'
    allowed_domains = ['poynter.org']
    start_urls = ['https://www.poynter.org/ifcn-covid-19-misinformation//']

    custom_settings={ 'FEED_URI': "crawlPoynter_%(time)s.csv", 'FEED_FORMAT': 'csv'} 

    def parse(self, response):

        for article in response.xpath('//article'):
            Title = article.xpath('.//h2[@class="entry-title"]/a/text()').get()
            Date = article.xpath('.//p[@class="entry-content__text"]/strong/text()').get()
            ReadMore_link = article.xpath('.//a[@class="button entry-content__button entry-content__button--smaller"]/@href').get()

            yield scrapy.Request(
                url=response.urljoin(ReadMore_link), 
                callback=self.parse_article_details,
                cb_kwargs={
                    'article_title': Title,
                    'article_date': Date,
                }
            )
        next_page = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
        if next_page: 
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_article_details(self, response, article_title, article_date):
        Source = response.xpath('//p[@class="entry-content__text entry-content__text--smaller"]/text()').extract_first()
        scraped_info = {
            'page':response.url,
            'Title': article_title, 
            'Date': article_date,
            'Source': Source,
        }
        yield scraped_info

UPDATE: everything works fine on my side:

2020-05-14 00:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=japanese-schools-re-opened-then-were-closed-again-due-to-a-second-wave-of-coronavirus>
{'page': 'https://www.poynter.org/?ifcn_misinformation=japanese-schools-re-opened-then-were-closed-again-due-to-a-second-wave-of-coronavirus', 'Title': ' Japanese schools re-opened then were closed again due to a second wave of coronavirus.', 'Date': '2020/05/12 | France', 'Source': "This false claim originated from: CGT Educ'Action", 'files': []}
2020-05-14 00:59:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19>
{'page': 'https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19', 'Title': ' Famous French blue cheese, roquefort, is a “medecine against Covid-19”.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents>
{'page': 'https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents', 'Title': ' Administrative documents French people need to fill to go out are a copy paste from 1940 documents.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: IndignezVous', 'files': []}
2020-05-14 00:59:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable>
{'page': 'https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable', 'Title': ' Spanish and French masks prices are comparable.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown>
{'page': 'https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown', 'Title': ' French President Macron and its spouse are jetskiing during the lockdown.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air>
{'page': 'https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air', 'Title': ' French Minister of Justice Nicole Belloubet threathened the famous anchor Jean-Pierre Pernaut after he criticized the government policy about the pandemic on air.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}

You may also want to take a look at the follow_all function, which is a better option than urljoin:

https://docs.scrapy.org/en/latest/intro/tutorial.html#more-examples-and-patterns