When web scraping with Watir, how do I parse results that share the same class and enter them into separate CSV cells?

I am using Watir to scrape search results from a website and write them to a CSV file. When I run a search, the results are split into span classes, so the HTML looks like:

<span class="sn_auth_name">foo</span>
<span class="sn_target_lang">English</span>

My code looks like this:

sn_auth_name   = row.xpath('span[@class="sn_auth_name"]/text()').text.strip
sn_target_lang = row.xpath('span[@class="sn_target_lang"]/text()').text.strip

CSV.open("file.csv", "a") do |csv|
  csv << [sn_auth_name, sn_target_lang]
end

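The problem is that for some search results, multiple items are assigned to the same class. That is, sometimes there is only one sn_auth_name, and sometimes there are three! Right now, the multiple results all end up crammed into the same cell in my CSV file.

Is there a way to handle the cases where multiple results are assigned to the same class, so that the second (or third) result goes into a separate cell?

Thanks!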


Someone asked for more details, so here is the output I typically get.

<table class="restable"><tr>
<td class="res1">1/1</td>
<td class="res2">
    <span class="sn_auth_name">Imām</span>, 
    <span class="sn_auth_firstname">Abū Bakr</span>:
    <span class="sn_target_title">Al-Kalām rasmāl</span> [
    <span class="sn_target_lang">Arabic</span>]/ 
    <span class="sn_transl_name">Ḥijāzī al-Sayyid</span>, 
    <span class="sn_transl_firstname">Muṣṭafā</span> /
    <span class="sn_pub">
      <span class="place">Al-Qāhirah</span>: 
      <span class="publisher">Al-Majlis al-Alā lil-Thaqāfah</span> [
      <span class="sn_country">Egypt</span>]</span>,
    <span class="sn_year">2000</span>.
    <span class="sn_pagination">588 p.</span>
    <span class="sn_orig_title">Magana jarice</span> [
    <span class="sn_orig_lang">Afrikaans</span>]
</td></tr>
</table>

Scraping this is no problem, because every piece of text I want to grab has its own class. But every once in a while, I get a result like this:

<tr>
<td class="res1">7/8</td>
<td class="res2">
    <span class="sn_auth_name">Plenge</span>, 
    <span class="sn_auth_firstname">Vagn</span>;
    <span class="sn_auth_name">Wyk</span>, 
    <span class="sn_auth_firstname">Chris van</span>:
    <span class="sn_target_title">Opbrud</span> [
    <span class="sn_target_lang">Danish</span>] / 
    <span class="sn_transl_name">Hansen</span>, 
    <span class="sn_transl_firstname">Finn Holten</span>;
    <span class="sn_transl_name">Madelung</span>, 
    <span class="sn_transl_firstname">Marianne</span>;
    <span class="sn_transl_name">Seiketso</span>, 
    <span class="sn_transl_firstname">Helen Gaohenngwe</span> /
    <span class="sn_pub">
      <span class="place">Frederiksberg</span>: 
      <span class="publisher">AKS</span>,
      <span class="place">Frederiksberg</span>: 
      <span class="publisher">Hjulet</span> [
      <span class="sn_country">Denmark</span>]</span>,
    <span class="sn_year">2000</span>.
    <span class="sn_pagination">247 p.</span> [
    <span class="sn_orig_lang">Afrikaans</span>], [
    <span class="sn_orig_lang">English</span>]
</td></tr>

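Here, for example, there are multiple entries for sn_auth_name. What ends up in my CSV file is a single cell containing PlengeWyk. The ideal would be for the script to create an sn_auth_name2 value and record it in a separate cell, i.e. Plenge and Wyk in two cells.

Any ideas?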

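The #xpath method returns a NodeSet, which is a collection of the matching nodes. NodeSet includes Enumerable, which provides many methods for iterating over the collection. Rather than taking the text of the whole NodeSet, you want to iterate over each node and collect its text.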

sn_auth_name = row.xpath('span[@class="sn_auth_name"]').map { |node| node.text.strip }
#=> ["Plenge", "Wyk"]

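As an array of names, sn_auth_name would still be written to a single cell in the CSV. To write each name to its own cell, you need to flatten the array. You can use a splat to flatten a single column: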

csv << [*sn_auth_name, sn_target_lang]

If several columns need flattening, you can also flatten the whole array:

csv << [sn_auth_name, sn_target_lang].flatten
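As a rough illustration of the two approaches, the sketch below uses hard-coded sample values in place of scraped data (the `sn_transl_name` array is an assumption added here to show a second multi-valued column) and prints the resulting CSV line with the standard library's CSV.generate_line.

```ruby
require 'csv'

# Sample values standing in for scraped data (illustrative only).
sn_auth_name   = ['Plenge', 'Wyk']
sn_transl_name = ['Hansen', 'Madelung', 'Seiketso']
sn_target_lang = 'Danish'

# Splat: expands one array in place, leaving the other elements untouched.
splatted = [*sn_auth_name, sn_target_lang]

# flatten: expands every nested array in the row at once.
flattened = [sn_auth_name, sn_transl_name, sn_target_lang].flatten

splat_line = CSV.generate_line(splatted)

puts splatted.inspect  # ["Plenge", "Wyk", "Danish"]
puts flattened.inspect # ["Plenge", "Wyk", "Hansen", "Madelung", "Seiketso", "Danish"]
puts splat_line        # Plenge,Wyk,Danish
```

The splat is handy when only one column can hold several values; flatten is less typing once two or more columns can.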

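Doing the above means each row will have a different number of columns. You can pad all the rows so that they have the same number of columns: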

# Index of the author name column within each collected row
col_auth_name = 0

# Collect the data from the table into an Array
data = []
doc.css('td.res2').each do |row|
  sn_auth_name = row.xpath('span[@class="sn_auth_name"]').map { |node| node.text.strip }
  sn_target_lang = row.xpath('span[@class="sn_target_lang"]/text()').text.strip
  data << [sn_auth_name, sn_target_lang]
end

# Determine max number of names in a row
max_auth_name = data.map { |row| row[col_auth_name].length }.max

CSV.open("file.csv", "a") do |csv|
  data.each do |row|
    # Fill the Array of names to meet the max length
    row[col_auth_name].fill('', row[col_auth_name].length..(max_auth_name - 1))

    # Write to the CSV file
    csv << row.flatten
  end
end
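If you also want a header row with numbered names such as sn_auth_name2, as the question asks, the same padding idea can be sketched like this. The helpers `pad_cells` and `numbered_headers` are hypothetical names introduced here, and the `data` rows are hard-coded samples rather than scraped results. Unlike the fill call above, this variant builds new arrays instead of modifying the collected ones.

```ruby
# Hypothetical helper: pad `values` with empty strings up to `width` cells,
# returning a new array rather than modifying the original.
def pad_cells(values, width)
  values + Array.new([width - values.length, 0].max, '')
end

# Hypothetical helper: builds "name", "name2", "name3", ... for `width` columns.
def numbered_headers(name, width)
  (1..width).map { |i| i == 1 ? name : "#{name}#{i}" }
end

# Sample rows in the same shape as `data` above: [author names, target language].
data = [
  [['Imām'], 'Arabic'],
  [['Plenge', 'Wyk'], 'Danish']
]

# Widest author list determines how many author columns every row gets.
max_auth_name = data.map { |names, _lang| names.length }.max

header = numbered_headers('sn_auth_name', max_auth_name) + ['sn_target_lang']
rows   = data.map { |names, lang| pad_cells(names, max_auth_name) + [lang] }

puts header.inspect # ["sn_auth_name", "sn_auth_name2", "sn_target_lang"]
rows.each { |row| puts row.inspect }
```

Each of `header` and the entries in `rows` is already flat, so it can be appended to the CSV directly with `csv << header` and `csv << row`.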