How to split unformatted data into several columns in Python?
Dear all,
I recently scraped some information from a website with a crawler, and one column of the data looks like this:
| **Hotel Info** |
| 2014 open 2016 retrofit 50 rooms |
| 60 rooms |
| 2012 open 100 rooms |
| 80 rooms |
| 2010 open |
In the end I want this:
| **Hotel Open** | **Hotel Retrofit** | **Hotel Rooms** |
| 2014 | 2016 | 50 |
| null | null | 60 |
| 2012 | null | 100 |
| null | null | 80 |
| 2010 | null | null |
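Note:
The original website does not separate these 3 'information blocks'; they all sit under a single <p>...</p> block, so I cannot avoid this problem.
I am using Python and am completely new to it. Please help, thank you very much!!!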
Assuming you have the data in a test.xlsx file, you can try this:
import collections
import re

import numpy as np
import pandas as pd

# sheet_name is the current keyword; newer pandas versions reject the old sheetname spelling
df = pd.read_excel('test.xlsx', sheet_name='Sheet1')
df_dict = collections.defaultdict(list)
for i in df['**Hotel Info**']:
    # split into blocks such as '2014 open'; this assumes the blocks are
    # separated by two or more spaces in the raw scraped text
    i_list = re.split(r'\s{2,}', i.strip())
    df_dict['**Hotel Open**'].append([e.split('open')[0].strip() for e in i_list if 'open' in e])
    df_dict['**Hotel Retrofit**'].append([e.split('retrofit')[0].strip() for e in i_list if 'retrofit' in e])
    df_dict['**Hotel Rooms**'].append([e.split('rooms')[0].strip() for e in i_list if 'rooms' in e])
# keep the first match in each row as a number, or NaN when the block is missing
df_dict['**Hotel Open**'] = [np.nan if len(item) == 0 else int(item[0]) for item in df_dict['**Hotel Open**']]
df_dict['**Hotel Retrofit**'] = [np.nan if len(item) == 0 else int(item[0]) for item in df_dict['**Hotel Retrofit**']]
df_dict['**Hotel Rooms**'] = [np.nan if len(item) == 0 else int(item[0]) for item in df_dict['**Hotel Rooms**']]
new_df = pd.DataFrame(df_dict)
new_df
new_df will be:
   **Hotel Open**  **Hotel Retrofit**  **Hotel Rooms**
0            2014                2016               50
1             NaN                 NaN               60
2            2012                 NaN              100
3             NaN                 NaN               80
4            2010                 NaN              NaN
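
If the spacing between the blocks is not reliable, a regular expression may be a simpler route. The following is only a sketch, not the approach used above: it assumes every number is directly followed by its keyword (open, retrofit, rooms), builds the sample rows inline instead of reading test.xlsx, and uses plain column names without the asterisks. The `keywords` mapping is a name made up for this illustration.

```python
import pandas as pd

# Sample rows copied from the question (assumption: plain column name, no asterisks).
df = pd.DataFrame({
    "Hotel Info": [
        "2014 open 2016 retrofit 50 rooms",
        "60 rooms",
        "2012 open 100 rooms",
        "80 rooms",
        "2010 open",
    ]
})

# Hypothetical mapping: output column -> keyword that follows its number.
keywords = {
    "Hotel Open": "open",
    "Hotel Retrofit": "retrofit",
    "Hotel Rooms": "rooms",
}

# For each keyword, capture the digits right before it; rows without a match become missing.
new_df = pd.DataFrame({
    column: pd.to_numeric(
        df["Hotel Info"].str.extract(rf"(\d+)\s*{word}", expand=False)
    ).astype("Int64")  # nullable integer dtype keeps whole numbers next to missing values
    for column, word in keywords.items()
})

print(new_df)
```

Because the pattern looks for digits followed by the keyword, it works whether the blocks are separated by one space or several. The nullable `Int64` dtype is a design choice here: a plain NumPy integer column cannot hold missing values, so pandas would otherwise fall back to floats.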