Scraping historic data with Python from a specific website - table created out of a form with many rows with isolated header. Automation needed

Question

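I'm trying to scrape some data from this website: http://www.meteoprog.sk/sk/fwarchive/Bratislava/

I'm basically looking for the monthly weather data (30 dni a noci) from January 2012 to December 2013. I'd like to scrape it automatically and save the data in a txt or csv file.

However, there seems to be a problem with the way the table is built - my Google Chrome scraper can't pick it up.

I wrote some code to see what data I can get from the table: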

import requests
from bs4 import BeautifulSoup


url = 'http://www.meteoprog.sk/sk/fwarchive/Bratislava/'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "fwtab"})

# Print each row in turn (the original printed the whole table once per row)
for row in table.findAll("tr"):
    print(row)

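No values show up; it seems to pick up only what is in the header. Is there any easy way to automate scraping from this website, or from any website that is queried through a form?

Any help is much appreciated.

Thanks!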

Answer 1

# -*- coding: utf-8 -*-
import requests
# BeautifulSoup 3 (Python 2 only); with bs4 installed, use `import bs4 as bs`
import BeautifulSoup as bs
import csv
import calendar as cal
import datetime as dt

url = 'http://www.meteoprog.sk/sk/fwarchive/Bratislava/'

This service expects POST data in the form {'data': '2012-01-31', 'days': '30', 'search': u'Hľadanie'} to return the data for January 2012. Here I build the POST data for each month of 2012 and 2013.

years = (2012, 2013)
months = range(1,13)
data = {'search': u'Hľadanie'}
for y in years:
        for m in months:
                days = cal.monthrange(y,m)[1]
                data['days'] = str(days - 1)
                data['data'] = dt.date(y,m,days).isoformat()

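cal.monthrange returns a tuple for the given year and month: the weekday of the first day of the month and the number of days in the month. I use the number of days as the day in the dt.date call, which gives the last day of the month in '2012-01-31' format, and also to compute the 'days' value for the POST.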
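The date arithmetic can be checked on its own, without contacting the site. The following is a minimal Python 3 sketch and not part of the original answer: the helper name build_payloads is hypothetical, and the field names are simply the ones quoted above, which the site may since have changed.

```python
import calendar
import datetime


def build_payloads(years=(2012, 2013)):
    """Return one POST payload dict per month of the given years."""
    payloads = []
    for year in years:
        for month in range(1, 13):
            # monthrange() returns (weekday of the 1st, days in the month)
            _, num_days = calendar.monthrange(year, month)
            payloads.append({
                'search': 'Hľadanie',
                # Last day of the month, e.g. '2012-01-31'
                'data': datetime.date(year, month, num_days).isoformat(),
                # Same arithmetic as the answer: days in the month minus one
                'days': str(num_days - 1),
            })
    return payloads


payloads = build_payloads()
print(payloads[0])
```

Each dict in the list could then be passed as the data argument of requests.post, as the answer does inside its loop.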

                response = requests.post(url,data=data)
                html = response.content

                soup=bs.BeautifulSoup(html)
                table = soup.find("table",attrs={"class":"fwtab"})

                list_of_rows=[]
                hr = table.find('tr')
                hr.extract()

Because the table includes a header row, hr = table.find('tr'); hr.extract() removes it from the table.

                for row in table.findAll('tr'):
                    list_of_cells = []
                    for cell in row.findAll('td'):
                        # BeautifulSoup 3 leaves '&nbsp;' as text; bs4 decodes it to U+00A0
                        text = cell.text.replace('&nbsp;', '').replace(u'\xa0', '')
                        text = text.encode('utf-8')
                        list_of_cells.append(text)
                    list_of_rows.append(list_of_cells)

                list_of_rows.reverse()

The table lists the most recent values first, so I reverse the list to put it in chronological order.

                # Append this month's rows; the with-block closes the file afterwards
                with open("./meteo.csv", "ab") as otptfile:
                    writer = csv.writer(otptfile)
                    writer.writerows(list_of_rows)
                print('data stored for %s, %s' % (m,y))
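The code above targets Python 2, where the csv module expects a file opened in binary mode. On Python 3 the file is opened in text mode with newline='' instead, and the cells are written as str rather than encoded bytes. A minimal sketch of the reverse-and-append step under that assumption; append_rows, the sample rows, and the temporary path are made up for illustration and are not values from the site.

```python
import csv
import os
import tempfile


def append_rows(path, rows):
    """Append rows to a CSV file, oldest first."""
    # The table lists the newest rows first, so reverse them
    chronological = list(reversed(rows))
    # Python 3: text mode with newline='' replaces Python 2's "ab";
    # the with-block closes the file even if writing fails
    with open(path, 'a', newline='', encoding='utf-8') as handle:
        csv.writer(handle).writerows(chronological)


# Made-up rows for illustration only, newest first as in the table
sample_rows = [['02.01.2012', '3'], ['01.01.2012', '1']]
demo_path = os.path.join(tempfile.mkdtemp(), 'meteo.csv')
append_rows(demo_path, sample_rows)
```

Because the file is opened in append mode, calling append_rows once per month in chronological order keeps the whole file in date order.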
