How to scrape the text of <li> and children

I am trying to scrape the <li> tags and the content inside them.

The HTML looks like this:

 <div class="insurancesAccepted">
   <h4>What insurance does he accept?*</h4>
   <ul class="noBottomMargin">
      <li class="first"><span>Aetna</span></li>
      <li>
         <a title="See accepted plans" class="insurancePlanToggle arrowUp">AvMed</a>
         <ul style="display: block;" class="insurancePlanList">
            <li class="last first">Open Access</li>
         </ul>
      </li>
      <li>
         <a title="See accepted plans" class="insurancePlanToggle arrowUp">Blue Cross Blue Shield</a>
         <ul style="display: block;" class="insurancePlanList">
            <li class="last first">Blue Card PPO</li>
         </ul>
      </li>
      <li>
         <a title="See accepted plans" class="insurancePlanToggle arrowUp">Cigna</a>
         <ul style="display: block;" class="insurancePlanList">
            <li class="first">Cigna HMO</li>
            <li>Cigna PPO</li>
            <li class="last">Great West Healthcare-Cigna PPO</li>
         </ul>
      </li>
      <li class="last">
         <a title="See accepted plans" class="insurancePlanToggle arrowUp">Empire Blue Cross Blue Shield</a>
         <ul style="display: block;" class="insurancePlanList">
            <li class="last first">Empire Blue Cross Blue Shield HMO</li>
         </ul>
      </li>
   </ul>
  </div>

The main problem is that when I try to get the content with:

doc.css('.insurancesAccepted li').text.strip

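it returns the text of every <li> at once. I want "AvMed" and "Open Access" scraped together with a relationship parameter, so that I can insert them into my MySQL table with a reference between them.

The problem is that doc.css('.insurancesAccepted li') matches all nested list items, not just the direct children. To match only direct children, use the parent > child CSS selector. To accomplish your task, you need to assemble the results of the iteration carefully: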

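require 'nokogiri'

# html is assumed to hold the page source shown above
doc = Nokogiri::HTML(html)
# Iterate over the top-level <li> elements only
doc.css('div.insurancesAccepted > ul > li').each do |li|
  chapter = li.css('span').text.strip # plain entry, e.g. "Aetna"
  section = li.css('a').text.strip    # toggle link, e.g. "AvMed"
  # Plans nested under this entry
  subsections = li.css('ul > li').map(&:text).map(&:strip)

  puts "#{chapter} ⇒ [ #{section} ⇒ [ #{subsections.join(', ')} ] ]"
  puts '=' * 40
end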

Result:

# Aetna ⇒ [  ⇒ [  ] ]
# ========================================
#  ⇒ [ AvMed ⇒ [ Open Access ] ]
# ========================================
#  ⇒ [ Blue Cross Blue Shield ⇒ [ Blue Card PPO ] ]
# ========================================
#  ⇒ [ Cigna ⇒ [ Cigna HMO, Cigna PPO, Great West Healthcare-Cigna PPO ] ]
# ========================================
#  ⇒ [ Empire Blue Cross Blue Shield ⇒ [ Empire Blue Cross Blue Shield HMO ] ]
# ========================================
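
To get from this output to something that can be inserted into MySQL with a reference, one option is to collect the pairs instead of printing them and then flatten them into rows that carry a parent_id. The following is only a sketch under assumptions: the scraped array is written out by hand from the sample HTML (in the loop above, the insurer name would be whichever of chapter or section is non-empty), the to_rows helper and the id / parent_id / name column names are hypothetical, and the ids are assigned locally rather than by MySQL auto-increment.

```ruby
# Hypothetical input: what the loop above could collect instead of printing,
# as [insurer, plans] pairs taken from the sample HTML.
scraped = [
  ['Aetna', []],
  ['AvMed', ['Open Access']],
  ['Blue Cross Blue Shield', ['Blue Card PPO']],
  ['Cigna', ['Cigna HMO', 'Cigna PPO', 'Great West Healthcare-Cigna PPO']],
  ['Empire Blue Cross Blue Shield', ['Empire Blue Cross Blue Shield HMO']]
]

# Flatten the nested pairs into rows with an id and a parent_id.
# Insurers get parent_id nil; each plan points at the id of its insurer.
# (to_rows and the column names are illustrative, not part of Nokogiri.)
def to_rows(pairs)
  rows = []
  next_id = 0
  pairs.each do |insurer, plans|
    insurer_id = (next_id += 1)
    rows << { id: insurer_id, parent_id: nil, name: insurer }
    plans.each do |plan|
      rows << { id: (next_id += 1), parent_id: insurer_id, name: plan }
    end
  end
  rows
end

rows = to_rows(scraped)
rows.each { |row| p row }
```

Each plan row then references its insurer through parent_id, so "Open Access" points at the row for "AvMed". With a real database you would more likely insert the insurer first and reuse the id that MySQL generates for it.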