How to repair a Nokogiri (Yahoo) table scraper?

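Eighteen months ago we wrote a small table scraper in Ruby with Nokogiri that writes its output to a CSV file. Changes to the page structure have since left the output less than desirable. Here is a simplified version of what we use: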

#!/usr/bin/ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'

url = "http://finance.yahoo.com/q/op?s=FISV&date=1426809600"#mar
doc = Nokogiri::HTML(open(url))
csv = CSV.open("output.csv", 'w')
doc.xpath('//table//tr').each do |row|
  tarray = [] # temporary array
  row.xpath('td').each do |cell|
    tarray << cell.text # Build array of that row of data.
  end
  csv << tarray # Write that row out to csv file
  # puts "#{row}"
end

csv.close

Current output:

"^M

^M

^M

✕^M

[modify]^M

                    ^M

                "

"^M

        50.00^M

    ","^M

        FISV150320C00050000^M

    ","^M

        19.70^M

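Needless to say, this kind of output is hard to work with.

After trying many combinations of XPath expressions and CSV library options, we finally realized it was time to ask for help.

Consider the following snippet, which does not use the csv library: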

#!/usr/bin/ruby
require 'open-uri'
require 'nokogiri'
url = "http://finance.yahoo.com/q/op?s=FISV&date=1426809600"#mar
#url = "http://finance.yahoo.com/q/op?s=FISV&date=1434672000"#jun
doc = Nokogiri::HTML(open(url))

doc.xpath('//table//tr').each do |row|
  row.xpath('td').each do |cell|
    print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s)   {2,}/m, ''), "\", "
  end
  print "\n"
end

It produces output similar to the following:

" 50.00 ", " FISV150320C00050000 ", " 19.70 ", " 26.90 ", " 30.50 ", " 0.00 ", " 0.00% ", " 5 ", " 0 ", " 83.20% ", 

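What needs to change in the top (CSV-output) version to make it work better?

Assuming you want to dump the data from the "Calls" and "Puts" tables into CSV, you could do something like this: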

require 'csv'
require 'nokogiri'
require 'open-uri'

def options_to_csv(url)
  CSV.generate do |csv|
    doc = Nokogiri::HTML(URI.open(url)) # URI.open: plain open(url) no longer fetches URLs on Ruby 3.0+
    doc.xpath('//tr[@data-row]').each do |tr|
      csv << tr.xpath('td').map { |td| td.text.strip }
    end
  end
end

url = 'http://finance.yahoo.com/q/op?s=FISV&date=1426809600'
options_to_csv(url) # =>
# 50.00,FISV150320C00050000,19.70,26.90,29.00,0.00,0.00%,5,0,110.06%
# 55.00,FISV150320C00055000,11.91,22.00,24.00,0.00,0.00%,21,21,90.33%
# 60.00,FISV150320C00060000,17.48,18.30,19.00,0.00,0.00%,5,22,71.97%
# 65.00,FISV150320C00065000,10.70,13.30,14.00,0.00,0.00%,26,85,54.49%
# 70.00,FISV150320C00070000,8.90,8.40,8.90,0.00,0.00%,1,504,34.42%
# 75.00,FISV150320C00075000,3.80,3.70,4.10,0.00,0.00%,1,318,22.07%
# 80.00,FISV150320C00080000,0.55,0.45,0.60,0.00,0.00%,24,1435,14.55%
# 50.00,FISV150320P00050000,0.55,0.00,0.15,0.00,0.00%,6,10,83.98%
# 55.00,FISV150320P00055000,0.05,0.00,0.15,0.00,0.00%,3,14,68.16%
# 60.00,FISV150320P00060000,0.15,0.00,0.20,0.00,0.00%,1,84,56.06%
# 65.00,FISV150320P00065000,0.20,0.00,0.20,0.00,0.00%,3,166,47.56%
# 70.00,FISV150320P00070000,0.10,0.00,0.20,0.00,0.00%,14,472,32.13%
# 75.00,FISV150320P00075000,0.20,0.15,0.30,0.00,0.00%,42,557,18.80%
# 80.00,FISV150320P00080000,1.60,1.75,2.00,0.00,0.00%,22,91,15.06%
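
The `strip` call is what removes the line breaks and indentation seen in the question's output. A minimal sketch of the difference, using made-up cell strings (`raw_cells` is an assumption modeled on that output, not data fetched from the page):

```ruby
require 'csv'

# Hypothetical raw cell text, modeled on the question's output:
# carriage returns, newlines, and indentation around each value.
raw_cells = [
  "\r\n        50.00\r\n    ",
  "\r\n        FISV150320C00050000\r\n    "
]

# Written as-is, each field contains line breaks, so CSV has to quote it
# and the record sprawls over several lines.
messy = CSV.generate_line(raw_cells)

# Stripping each cell first yields a plain one-line record.
clean = CSV.generate_line(raw_cells.map(&:strip))

puts messy.inspect
puts clean
```

The quoted, multi-line fields in the first record are still valid CSV, which is why the csv library was never the real problem; the whitespace in `cell.text` was.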

Note that these tables also have the IDs "optionsCallsTable" and "optionsPutsTable", so you can use that information to separate the rows easily.