How to split unformatted data into several columns in Python?

Dear all,

I recently scraped some information from a website with a crawler, and one column of the result looks like this:

|               **Hotel Info**           |  
| 2014 open    2016 retrofit    50 rooms |  
| 60 rooms                               |       
| 2012 open    100 rooms                 |
| 80 rooms                               |
| 2010 open                              |

In the end, I want this:

| **Hotel Open** | **Hotel Retrofit** | **Hotel Rooms** |
|   2014         |   2016             |   50            |
|   null         |   null             |   60            |
|   2012         |   null             |   100           |
|   null         |   null             |   80            |
|   2010         |   null             |   null          |

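Note:
The original website does not separate these 3 'information blocks'. They all sit inside a single <p>...</p> block, so I cannot avoid this problem at the scraping stage.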

I am using Python and I am completely new to it. Please help me, thanks a lot!!!

Assuming you have the data in a test.xlsx file, you can try this:

import collections
import numpy as np
import pandas as pd

# `sheet_name` is the current keyword; the older `sheetname` spelling is rejected by newer pandas versions
df = pd.read_excel('test.xlsx', sheet_name='Sheet1')
df_dict = collections.defaultdict(list)
for i in df['**Hotel Info**']:
    i_list = i.split('    ')  # the blocks are separated by runs of four spaces
    # for each keyword, collect the text in front of it (empty list if the keyword is absent)
    df_dict['**Hotel Open**'].append([e.split('open')[0].strip() for e in i_list if 'open' in e])
    df_dict['**Hotel Retrofit**'].append([e.split('retrofit')[0].strip() for e in i_list if 'retrofit' in e])
    df_dict['**Hotel Rooms**'].append([e.split('rooms')[0].strip() for e in i_list if 'rooms' in e])
# convert the first match of each row to int, or use NaN when the block is missing
df_dict['**Hotel Open**']=[np.nan if len(item)==0 else int(item[0]) for item in df_dict['**Hotel Open**']]
df_dict['**Hotel Retrofit**']=[np.nan if len(item)==0 else int(item[0]) for item in df_dict['**Hotel Retrofit**']]
df_dict['**Hotel Rooms**']=[np.nan if len(item)==0 else int(item[0]) for item in df_dict['**Hotel Rooms**']]
new_df = pd.DataFrame(df_dict)
print(new_df)

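new_df will look like this (the numbers are shown as floats because, by default, a column that contains NaN is stored with a float dtype):

    **Hotel Open**  **Hotel Retrofit**  **Hotel Rooms**
0   2014.0          2016.0              50.0
1   NaN             NaN                 60.0
2   2012.0          NaN                 100.0
3   NaN             NaN                 80.0
4   2010.0          NaN                 NaN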
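
As an alternative, the same split can be sketched with regular expressions, which does not depend on the blocks being separated by exactly four spaces. This is only a minimal sketch using the standard `re` module: the helper name `parse_hotel_info`, the `PATTERNS` table, and the plain column names are hypothetical, and it assumes each block is always a number followed by one of the keywords `open`, `retrofit`, or `rooms`, as in the sample data from the question.

```python
import re

# Hypothetical lookup table: one pattern per target column. Each pattern
# captures the number that immediately precedes its keyword.
PATTERNS = {
    "Hotel Open": re.compile(r"(\d+)\s*open"),
    "Hotel Retrofit": re.compile(r"(\d+)\s*retrofit"),
    "Hotel Rooms": re.compile(r"(\d+)\s*rooms"),
}


def parse_hotel_info(text):
    """Return a dict of column name -> int, or None when the keyword is absent."""
    row = {}
    for column, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[column] = int(match.group(1)) if match else None
    return row


# Sample strings copied from the question's "Hotel Info" column.
raw = [
    "2014 open    2016 retrofit    50 rooms",
    "60 rooms",
    "2012 open    100 rooms",
    "80 rooms",
    "2010 open",
]

rows = [parse_hotel_info(line) for line in raw]
for row in rows:
    print(row)
```

The resulting list of dicts can then be passed to `pd.DataFrame(rows)` to get a table with the three columns; missing values should show up there as NaN, just as in the approach above.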