如何识别 pandas 索引和 header 中的日期？

Question

我正在使用 HTML 解析器从网站中提取 table，如下所示：

当前输出：

          1 Sep         2 Sep       3 Sep        4 Sep  
00:00     11             47          54           10 
01:00     12             49          46           22
...
23:00     15             34          22           40

预期输出：

          2017-09-0    2017-09-02    2017-09-03    2017-09-04   
00:00:00     11             47          54           10 
01:00:00     12             49          46           22
...
23:00:00     15             34          22           40

但索引和 header 都被识别为 objects 而不是日期和小时。那么我该如何解决这个问题？

下面是我正在使用的代码：

url = r'https://www.nratrafficdata.ie/c2/tfmonthreport.asp?sgid=ZvyVmXU8jBt9PJE$c7UXt6&spid=NRA_000000001508&reportdate=2017-09-01&enddate=2017-09-30&sidebar=1'
tables = pd.read_html(url, header=[1], index_col=0)
sp500_table = tables[0]

Answer 1

首先将 Series.str.contains 与正则表达式模式一起用于匹配 dd:dd 值，然后从列表中过滤掉列，最后将列转换为日期时间并将索引转换为时间增量：

cols = sp500_table.columns.difference(['Workday','7 Day','Count'], sort=False)
sp500_table = sp500_table.loc[sp500_table.index.str.contains('^\d{2}:\d{2}$', na=False), cols]
sp500_table.columns = pd.to_datetime(sp500_table.columns + ' 2017', format='%d %b %Y')
sp500_table.index = pd.to_timedelta(sp500_table.index + ':00') 
print (sp500_table.head())
         2017-09-01 2017-09-02 2017-09-03 2017-09-04 2017-09-05 2017-09-06  \
00:00:00       1362       1576       1589       1321       1198       1218   
01:00:00        998       1105       1359        859        722        881   
02:00:00        751        613        844        550        470        600   
03:00:00        822        813        808        749        627        672   
04:00:00       1461       1357       1108       1466       1497       1386   

         2017-09-07 2017-09-08 2017-09-09 2017-09-10  ... 2017-09-21  \
00:00:00       1319       1422       1717       1684  ...       1310   
01:00:00        802        987       1234       1377  ...        807   
02:00:00        525        659        651        834  ...        531   
03:00:00        697        714        796        786  ...        661   
04:00:00       1397       1412       1417       1123  ...       1319   

         2017-09-22 2017-09-23 2017-09-24 2017-09-25 2017-09-26 2017-09-27  \
00:00:00       1615       1807       1600       1219       1131       1283   
01:00:00       1046       1263       1322        760        732        888   
02:00:00        554        738        842        471        487        528   
03:00:00        754        823        734        595        608        633   
04:00:00       1514       1315       1091       1451       1287       1376   

         2017-09-28 2017-09-29 2017-09-30  
00:00:00       1357       1579       1858  
01:00:00        968       1025       1146  
02:00:00        554        642        704  
03:00:00        767        739        727  
04:00:00       1378       1420       1243  

[5 rows x 30 columns]

如何识别 pandas 索引和 header 中的日期？

How to recognize dates in pandas index and header?

html

python

html-parsing

python-3.x

pandas