Scrapy css selector not getting text in <i></i>

I'm a beginner trying to scrape some quotes from goodreads.com, but I can't get the text = ... part to work. I'm not sure what I'm missing and would appreciate some help.

for quote in response.css("div.quoteDetails"):
    text = quote.css("div.quoteText:not(.authorOrTitle)::text").getall() # not getting <i>
    author = quote.css("span.authorOrTitle::text").get().strip()
    book = quote.css("a.authorOrTitle::text").get()
    tags = quote.css("div.quoteFooter div.left a::text").getall()
    print(dict(text=text, author=author, book=book, tags=tags))

I've tried a few permutations, such as text = quote.css("div.quoteText :not(span):not(script) ::text").getall()

The closest I've gotten is text = quote.css("div.quoteText:not(.authorOrTitle)::text").getall(), which returns the following (the <i>smiles all the time</i> part is missing):

{'text': ["\n      “God does not play dice with the universe; He plays an ineffable game of His own devising, which might be compared, from the perspective of any of the other players [i.e. everybody], to being involved in an obscure and complex variant of poker in a pitch-dark room, with blank cards, for infinite stakes, with a Dealer who won't tell you the rules, and who ", '.”\n  ', '  ―\n  ', '\n    ', '\n    \n\n\n', '\n\n'], 
'author': 'Terry Pratchett,', 
'book': 'Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch', 
'tags': ['einstein', 'gaiman', 'god', 'humor']}

HTML snippet from the page I'm trying to scrape, https://www.goodreads.com/quotes/tag/god?page=1:

<div class="quoteDetails ">
        <a class="leftAlignedImage" href="/author/show/1654.Terry_Pratchett">
      <img alt="Terry Pratchett" src="https://images.gr-assets.com/authors/1235562205p2/1654.jpg">
</a>
<div class="quoteText">
      “God does not play dice with the universe; He plays an ineffable game of His own devising, which might be compared, from the perspective of any of the other players [i.e. everybody], to being involved in an obscure and complex variant of poker in a pitch-dark room, with blank cards, for infinite stakes, with a Dealer who won't tell you the rules, and who <i>smiles all the time</i>.”
  <br>  ―
  <span class="authorOrTitle">
    Terry Pratchett,
  </span>
    <span id="quote_book_link_12067">
      <a class="authorOrTitle" href="/work/quotes/4110990">Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch</a>
    </span>
    


<script>
//<![CDATA[
      var newTip = new Tip($('quote_book_link_12067'), "\n\n  <h2><a class=\"readable bookTitle\" href=\"https://www.goodreads.com/book/show/12067.Good_Omens?from_choice=false&amp;from_home_module=false\">Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch<\/a><\/h2>\n\n      <div>\n        by <a class=\"authorName\" href=\"/author/show/1654.Terry_Pratchett\">Terry Pratchett<\/a>\n      <\/div>\n\n          <div class=\"smallText uitext darkGreyText\">\n            <span class=\"minirating\"><span class=\"stars staticStars notranslate\"><span size=\"12x12\" class=\"staticStar p10\"><\/span><span size=\"12x12\" class=\"staticStar p10\"><\/span><span size=\"12x12\" class=\"staticStar p10\"><\/span><span size=\"12x12\" class=\"staticStar p10\"><\/span><span size=\"12x12\" class=\"staticStar p3\"><\/span><\/span> 4.24 avg rating &mdash; 595,193 ratings<\/span>            &mdash; published 1990\n          <\/div>\n\n    <div class=\"addBookTipDescription\">\n      \n<span id=\"freeTextContainer8297402955929030295\">‘Armageddon only happens once, you know. They don’t let you go around again until you get it right.’\n\nPeople have been predicting the end of the world almost from its very beginning, so it’s only natural to be sceptical when a new date is set for Jud<\/span>\n  <span id=\"freeText8297402955929030295\" style=\"display:none\">‘Armageddon only happens once, you know. They don’t let you go around again until you get it right.’\n\nPeople have been predicting the end of the world almost from its very beginning, so it’s only natural to be sceptical when a new date is set for Judgement Day. But what if, for once, the predictions are right, and the apocalypse really is due to arrive next Saturday, just after tea?\n\nYou could spend the time left drowning your sorrows, giving away all your possessions in preparation for the rapture, or laughing it off as (hopefully) just another hoax. 
Or you could just try to do something about it.\n\nIt’s a predicament that Aziraphale, a somewhat fussy angel, and Crowley, a fast-living demon now finds themselves in. They’ve been living amongst Earth’s mortals since The Beginning and, truth be told, have grown rather fond of the lifestyle and, in all honesty, are not actually looking forward to the coming Apocalypse.\n\nAnd then there’s the small matter that someone appears to have misplaced the Antichrist…<\/span>\n  <a data-text-id=\"8297402955929030295\" href=\"#\" onclick=\"swapContent($(this));; return false;\">...more<\/a>\n\n    <\/div>\n\n      <div class=\'wtrButtonContainer wtrSignedOut\' id=\'10_book_12067\'>\n<div class=\'wtrUp wtrLeft\'>\n<form action=\"/shelf/add_to_shelf\" accept-charset=\"UTF-8\" method=\"post\"><input name=\"utf8\" type=\"hidden\" value=\"&#x2713;\" /><input type=\"hidden\" name=\"authenticity_token\" value=\"eHxnso0cZ8mac56WZu5b+f9QTJuvU6CbzkAY03nf4vcdXMimazuX3oOLs4umgVsafX94R4eGUAdsRhvwmfdIaA==\" />\n<input type=\"hidden\" name=\"book_id\" id=\"book_id\" value=\"12067\" />\n<input type=\"hidden\" name=\"name\" id=\"name\" value=\"to-read\" />\n<input type=\"hidden\" name=\"unique_id\" id=\"unique_id\" value=\"10_book_12067\" />\n<input type=\"hidden\" name=\"wtr_new\" id=\"wtr_new\" value=\"true\" />\n<input type=\"hidden\" name=\"from_choice\" id=\"from_choice\" value=\"false\" />\n<input type=\"hidden\" name=\"from_home_module\" id=\"from_home_module\" value=\"false\" />\n<input type=\"hidden\" name=\"ref\" id=\"ref\" value=\"\" class=\"wtrLeftUpRef\" />\n<input type=\"hidden\" name=\"existing_review\" id=\"existing_review\" value=\"false\" class=\"wtrExisting\" />\n<input type=\"hidden\" name=\"page_url\" id=\"page_url\" value=\"/quotes/tag/god\" />\n<button class=\'wtrToRead\' type=\'submit\'>\n<span class=\'progressTrigger\'>Want to Read<\/span>\n<span class=\'progressIndicator\'>saving…<\/span>\n<\/button>\n<\/form>\n\n<\/div>\n\n<div class=\'wtrRight wtrUp\'>\n<form 
class=\"hiddenShelfForm\" action=\"/shelf/add_to_shelf\" accept-charset=\"UTF-8\" method=\"post\"><input name=\"utf8\" type=\"hidden\" value=\"&#x2713;\" /><input type=\"hidden\" name=\"authenticity_token\" value=\"SKP2jUje5gC8x8+/fKoz/fZEAz53hITka0SMTXDRT2Etg1mZrvkWF6U/4qK8xTMedGs34l9RdHjJQo9ukPnl/g==\" />\n<input type=\"hidden\" name=\"unique_id\" id=\"unique_id\" value=\"10_book_12067\" />\n<input type=\"hidden\" name=\"book_id\" id=\"book_id\" value=\"12067\" />\n<input type=\"hidden\" name=\"a\" id=\"a\" />\n<input type=\"hidden\" name=\"name\" id=\"name\" />\n<input type=\"hidden\" name=\"from_choice\" id=\"from_choice\" value=\"false\" />\n<input type=\"hidden\" name=\"from_home_module\" id=\"from_home_module\" value=\"false\" />\n<input type=\"hidden\" name=\"page_url\" id=\"page_url\" value=\"/quotes/tag/god\" />\n<\/form>\n\n<button class=\'wtrShelfButton\'><\/button>\n<\/div>\n\n<div class=\'ratingStars wtrRating\'>\n<div class=\'starsErrorTooltip hidden\'>\nError rating book. 
Refresh and try again.\n<\/div>\n<div class=\'myRating uitext greyText\'>Rate this book<\/div>\n<div class=\'clearRating uitext\'>Clear rating<\/div>\n<div class=\"stars\" data-resource-id=\"12067\" data-user-id=\"0\" data-submit-url=\"/review/rate/12067?page_url=%2Fquotes%2Ftag%2Fgod&rate_books_page=false&stars_click=false&wtr_button_id=10_book_12067\" data-rating=\"0\"><a class=\"star off\" title=\"did not like it\" href=\"#\" ref=\"\">1 of 5 stars<\/a><a class=\"star off\" title=\"it was ok\" href=\"#\" ref=\"\">2 of 5 stars<\/a><a class=\"star off\" title=\"liked it\" href=\"#\" ref=\"\">3 of 5 stars<\/a><a class=\"star off\" title=\"really liked it\" href=\"#\" ref=\"\">4 of 5 stars<\/a><a class=\"star off\" title=\"it was amazing\" href=\"#\" ref=\"\">5 of 5 stars<\/a><\/div>\n<\/div>\n\n<\/div>\n\n\n\n\n", { style: 'addbook', stem: 'leftMiddle', hook: { tip: 'leftMiddle', target: 'rightMiddle' }, offset: { x: 5, y: 0 }, hideOn: false, width: 400, hideAfter: 0.05, delay: 0.35 });
      $('quote_book_link_12067').observe('prototip:shown', function() {
        if (this.up('#box')) {
          $$('div.prototip').each(function(i){i.setStyle({zIndex: $('box').getStyle('z-index')})});
        } else {
          $$('div.prototip').each(function(i){i.setStyle({zIndex: 6000})});
        }
      });

      newTip['wrapper'].addClassName('prototipAllowOverflow');

        $('quote_book_link_12067').observe('prototip:shown', function () {
          $$('div.prototip').each(function (e) {
            if ($('quote_book_link_12067').hasClassName('ignored')) {
              e.setStyle({'display': 'none'});
              return;
            }
            e.setStyle({'overflow': 'visible'});
          });
        });
      $('quote_book_link_12067').observe('prototip:hidden', function () {
        $$('span.elementTwo').each(function (e) {
          if (e.getStyle('display') !== 'none') {
            var lessLink = e.next();
            swapContent(lessLink);
          }
        });
      });

//]]>
</script>

</div>


<div class="quoteFooter">
   <div class="greyText smallText left">
     tags:
       <a href="/quotes/tag/einstein">einstein</a>,
       <a href="/quotes/tag/gaiman">gaiman</a>,
       <a href="/quotes/tag/god">god</a>,
       <a href="/quotes/tag/humor">humor</a>
   </div>
   <div class="right">
     <a class="smallText" title="View this quote" href="/quotes/11285-god-does-not-play-dice-with-the-universe-he-plays">2464 likes</a>
   </div>
</div>

  </div>

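I couldn't figure out how to do this with CSS selectors, so I used an XPath selector instead. I then used MapCompose to strip the whitespace from each fragment and Join to concatenate them.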

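# spider.py snippet

from scrapy.loader import ItemLoader
# GoodreadsItem is defined in items.py (below); the import path depends on your project layout.

def parse(self, response):
    # Method of the spider class: build one item per quote block.
    for quote in response.css("div.quoteDetails"):
        loader = ItemLoader(item=GoodreadsItem(), selector=quote)
        # Union of the div's own text nodes and the text inside its <i> children
        loader.add_xpath("text", './/div[@class="quoteText"]/text() | .//div[@class="quoteText"]/i/text()')
        yield loader.load_item()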

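# items.py snippet

import scrapy
from itemloaders.processors import Join, MapCompose  # scrapy.loader.processors on older Scrapy versions

class GoodreadsItem(scrapy.Item):
    text = scrapy.Field(
        input_processor=MapCompose(str.strip),  # strip whitespace from each fragment
        output_processor=Join(),  # join the fragments with a single space
    )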
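
To see what these processors do to the extracted fragments, here is a rough plain-Python approximation: MapCompose applies the function to every value, and Join concatenates the results with a single space by default. The fragment list is abridged from the output shown in the question, and `clean_quote` is a hypothetical helper, not part of Scrapy. The sketch suggests that stripping each fragment before joining leaves a space before the closing punctuation and a trailing "―", so joining first and normalizing afterwards may give a cleaner string.

```python
# Fragments as the XPath union would return them (quote abridged with "...").
fragments = [
    "\n      “... with a Dealer who won't tell you the rules, and who ",
    "smiles all the time",
    ".”\n  ",
    "  ―\n  ",
    "\n    ",
    "\n\n",
]

# Rough equivalent of MapCompose(str.strip) followed by Join():
# strip every fragment, then join everything with a single space.
stripped = [fragment.strip() for fragment in fragments]
joined = " ".join(stripped)


def clean_quote(parts):
    """Hypothetical alternative: join first, then normalize whitespace."""
    text = " ".join("".join(parts).split())
    # Drop the trailing "―" that separates the quote from the author.
    return text.rstrip("― ")


cleaned = clean_quote(fragments)
print(repr(joined))
print(repr(cleaned))
```

If that cleanup is what you want, a function like `clean_quote` could be used as the field's output_processor in place of Join(), since an output processor receives the full list of collected values.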