How to split unformatted data into several columns in Python?

Question

Dear all,

I recently used a crawler to scrape information from a website, and one column of the resulting data looks like this:

|               **Hotel Info**           |  
| 2014 open    2016 retrofit    50 rooms |  
| 60 rooms                               |       
| 2012 open    100 rooms                 |
| 80 rooms                               |
| 2010 open                              |

In the end I want this:

| **Hotel Open** | **Hotel Retrofit** | **Hotel Rooms** |
|   2014         |   2016             |   50            |
|   null         |   null             |   60            |
|   2012         |   null             |   100           |
|   null         |   null             |   80            |
|   2010         |   null             |   null          |

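Note:
The original website does not separate these three 'information blocks'. They all sit inside a single <p>...</p> block, so I cannot avoid this problem at scraping time.

I am using Python and I am brand new to it. Please help me, thank you very much!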

Answer 1

Assuming your data is in a file named test.xlsx, you can try this:

import collections
import numpy as np
import pandas as pd

# `sheet_name` replaces the older `sheetname` keyword, which recent pandas versions no longer accept
df = pd.read_excel('test.xlsx', sheet_name='Sheet1')
# Map each keyword found in the raw text to its output column
columns = {'open': '**Hotel Open**', 'retrofit': '**Hotel Retrofit**', 'rooms': '**Hotel Rooms**'}
df_dict = collections.defaultdict(list)
for info in df['**Hotel Info**']:
    # The blocks are separated by runs of four spaces
    parts = info.split('    ')
    for keyword, column in columns.items():
        # Keep the number in front of the keyword, or NaN when that block is missing
        values = [p.split(keyword)[0].strip() for p in parts if keyword in p]
        df_dict[column].append(int(values[0]) if values else np.nan)
new_df = pd.DataFrame(df_dict)
new_df

new_df will be:

   **Hotel Open**  **Hotel Retrofit**  **Hotel Rooms**
0          2014.0              2016.0             50.0
1             NaN                 NaN             60.0
2          2012.0                 NaN            100.0
3             NaN                 NaN             80.0
4          2010.0                 NaN              NaN
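
If the spacing between the blocks is not always exactly four spaces, a regular expression per column may be more forgiving. The following is only a sketch under that assumption, not part of the answer above: the names PATTERNS and parse_hotel_info are made up for illustration, the column names drop the ** markers, and the sample strings are copied from the question rather than read from test.xlsx.

```python
import re

# One pattern per target column; each captures the number that
# appears immediately before its keyword.
PATTERNS = {
    "Hotel Open": re.compile(r"(\d+)\s*open"),
    "Hotel Retrofit": re.compile(r"(\d+)\s*retrofit"),
    "Hotel Rooms": re.compile(r"(\d+)\s*rooms"),
}


def parse_hotel_info(text):
    """Return a dict mapping each column name to an int, or None if absent."""
    row = {}
    for column, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[column] = int(match.group(1)) if match else None
    return row


# Sample rows taken from the question.
raw = [
    "2014 open    2016 retrofit    50 rooms",
    "60 rooms",
    "2012 open    100 rooms",
    "80 rooms",
    "2010 open",
]
rows = [parse_hotel_info(line) for line in raw]
```

Here rows is a list of dictionaries, one per hotel. Passing it to pd.DataFrame(rows) should give a table with the three columns, with the missing values shown as NaN.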

Tags: python, split, python-2.7, data-cleaning