Pandas: convert list of string tuples to dataframe faster?
From a text field I have the following input Series, which contains geographic coordinate tuples as strings:
import pandas as pd
coords = pd.Series([
'(29.65271977700047, -82.33086252299967)',
'(29.652914019000434, -82.42682220199964)',
'(29.65301114200048, -82.36455186899968)',
'(29.642610841000476, -82.29853169599966)',
])
I want to parse the numbers out of these tuples and get the following result DataFrame:
         lat        lon
0  29.652720 -82.330863
1  29.652914 -82.426822
2  29.653011 -82.364552
3  29.642611 -82.298532
This is what I came up with:
str_coords = coords.str[1:-1].str.split(', ')
latlon = str_coords.apply(pd.Series).astype(float)
latlon.columns = ['lat', 'lon']
My question: the call to .apply(pd.Series) takes "forever" on the real list, which has about 1.2 million entries. Is there a faster way?
Another way to access the first and second elements of each list is, again, through the str accessor:
In [174]: coords = pd.Series([
.....: '(29.65271977700047, -82.33086252299967)',
.....: '(29.652914019000434, -82.42682220199964)',
.....: '(29.65301114200048, -82.36455186899968)',
.....: '(29.642610841000476, -82.29853169599966)'])
In [175]: str_coords = coords.str[1:-1].str.split(', ')
In [176]: coords_df = pd.DataFrame({'lat': str_coords.str[0], 'lon': str_coords.str[1]})
In [177]: coords_df.astype(float).head()
Out[177]:
         lat        lon
0  29.652720 -82.330863
1  29.652914 -82.426822
2  29.653011 -82.364552
3  29.642611 -82.298532
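As a possible variant of the same idea (not part of the answer above), Series.str.split accepts expand=True, which returns the split pieces as DataFrame columns directly, so the two .str[0] / .str[1] lookups are not needed. This is only a minimal sketch; the name latlon is borrowed from the question, and I have not timed this variant against the 1.2 million row data.

```python
import pandas as pd

coords = pd.Series([
    '(29.65271977700047, -82.33086252299967)',
    '(29.652914019000434, -82.42682220199964)',
    '(29.65301114200048, -82.36455186899968)',
    '(29.642610841000476, -82.29853169599966)',
])

# Strip the surrounding parentheses, then split on ', '.
# expand=True makes str.split return a DataFrame (one column per piece)
# instead of a Series of Python lists.
latlon = coords.str[1:-1].str.split(', ', expand=True).astype(float)
latlon.columns = ['lat', 'lon']

print(latlon)
```

This assumes every string has exactly one ', ' separator; rows with a different layout would produce extra or missing columns.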
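Some timings show that both my solution and @ajcr's solution are much faster than the apply(pd.Series) approach (and that the difference between the two is negligible):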
In [197]: coords = pd.Series([
.....: '(29.65271977700047, -82.33086252299967)',
.....: '(29.652914019000434, -82.42682220199964)',
.....: '(29.65301114200048, -82.36455186899968)',
.....: '(29.642610841000476, -82.29853169599966)'])
In [198]: coords = pd.concat([coords]*1000, ignore_index=True)
In [199]: %%timeit
.....: str_coords = coords.str[1:-1].str.split(', ')
.....: df_coords = pd.DataFrame({'lat': str_coords.str[0], 'lon': str_coords.str[1]}, dtype=float)
.....:
100 loops, best of 3: 14.1 ms per loop
In [200]: %%timeit
.....: str_coords = coords.str[1:-1].str.split(', ')
.....: df_coords = str_coords.apply(pd.Series).astype(float)
.....:
1 loops, best of 3: 821 ms per loop
In [201]: %%timeit
.....: df_coords = coords.str.extract(r'\((?P<lat>[\d\.]+),\s+(?P<lon>[^()\s,]+)\)')
.....: df_coords.astype(float)
.....:
100 loops, best of 3: 16.2 ms per loop
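If you want to repeat the comparison outside IPython, something along these lines should work with the standard-library timeit module. It is a rough sketch: the function names via_str_index, via_apply and via_extract are made up for illustration, and the absolute numbers will differ by machine and pandas version, so no expected timings are shown.

```python
import timeit

import pandas as pd

base = pd.Series([
    '(29.65271977700047, -82.33086252299967)',
    '(29.652914019000434, -82.42682220199964)',
    '(29.65301114200048, -82.36455186899968)',
    '(29.642610841000476, -82.29853169599966)',
])
# Same 4000-row test Series as in the timings above.
coords = pd.concat([base] * 1000, ignore_index=True)


def via_str_index(s):
    # Split into lists, then pick elements 0 and 1 through the str accessor.
    parts = s.str[1:-1].str.split(', ')
    return pd.DataFrame({'lat': parts.str[0], 'lon': parts.str[1]}).astype(float)


def via_apply(s):
    # The slow approach from the question: builds one Series per row.
    out = s.str[1:-1].str.split(', ').apply(pd.Series).astype(float)
    out.columns = ['lat', 'lon']
    return out


def via_extract(s):
    # Named capture groups become the column names.
    pattern = r'\((?P<lat>[\-\d\.]+),\s+(?P<lon>[\-\d\.]+)\)'
    return s.str.extract(pattern).astype(float)


for fn in (via_str_index, via_apply, via_extract):
    seconds = timeit.timeit(lambda: fn(coords), number=3) / 3
    print(f'{fn.__name__}: {seconds * 1000:.1f} ms per call')
```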
Another approach is to use the vectorized string method extract:
>>> coords.str.extract(r'\((?P<lat>[\-\d\.]+),\s+(?P<lon>[\-\d\.]+)\)')
                  lat                 lon
0   29.65271977700047  -82.33086252299967
1  29.652914019000434  -82.42682220199964
2   29.65301114200048  -82.36455186899968
3  29.642610841000476  -82.29853169599966
You can pass named regex capture groups to extract; it creates a DataFrame that uses the group names as column names.
You can then convert this DataFrame (called df here, i.e. the result of the extract call) to the float dtype:
>>> df.astype(float)
         lat        lon
0  29.652720 -82.330863
1  29.652914 -82.426822
2  29.653011 -82.364552
3  29.642611 -82.298532
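Putting the two steps together as a script, with the intermediate result actually assigned to df, might look like the sketch below. The pattern here is my own slightly stricter variant (an optional leading minus sign followed by digits and an optional fractional part), not the one used above; it assumes plain decimal notation with no exponents.

```python
import pandas as pd

coords = pd.Series([
    '(29.65271977700047, -82.33086252299967)',
    '(29.652914019000434, -82.42682220199964)',
    '(29.65301114200048, -82.36455186899968)',
    '(29.642610841000476, -82.29853169599966)',
])

# One named group per output column. (?:...) is a non-capturing group,
# so it does not add extra columns to the result.
pattern = r'\((?P<lat>-?\d+(?:\.\d+)?),\s*(?P<lon>-?\d+(?:\.\d+)?)\)'

df = coords.str.extract(pattern)   # columns 'lat' and 'lon', still strings
latlon = df.astype(float)          # convert both columns to float64

print(latlon.dtypes)
```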