How to do keyword mapping in pandas
I have these keywords:
India
Japan
United States
Germany
China
Here is a sample data frame:
id Address
1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan
2 Arcisstraße 21, 80333 München, Germany
3 Liberty Street, Manhattan, New York, United States
4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China
5 Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India
My goal is:
id Address India Japan United States Germany China
1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan 0 1 0 0 0
2 Arcisstraße 21, 80333 München, Germany 0 0 0 1 0
3 Liberty Street, Manhattan, New York, United States 0 0 1 0 0
4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China 0 0 0 0 1
5 Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India 1 0 0 0 0
The basic idea is to build a keyword detector. I was thinking of using str.contains and word2vec, but I can't work out the logic.
In [58]: df = df.join(df.Address.str.extract(r'.*,\s*(.*)', expand=False).str.get_dummies())
In [59]: df
Out[59]:
id Address China Germany India Japan United States
0 1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, J... 0 0 0 1 0
1 2 Arcisstraße 21, 80333 München, Germany 0 1 0 0 0
2 3 Liberty Street, Manhattan, New York, United St... 0 0 0 0 1
3 4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China 1 0 0 0 0
4 5 Vaishnavi Summit,80feet Road,3rd Block,Bangalo... 0 0 1 0 0
Note: this method will not work if the country is not the last item in the Address column, or if the country name itself contains a comma.
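To make that caveat concrete, here is a minimal sketch with two made-up addresses (the second one is hypothetical and puts the country first). It only illustrates the failure mode; it is not part of the original answer.

```python
import pandas as pd

# Two illustrative rows; the second address is a made-up variant
# in which the country comes first instead of last.
df = pd.DataFrame({
    "id": [1, 2],
    "Address": [
        "Arcisstraße 21, 80333 München, Germany",
        "Germany, 80333 München, Arcisstraße 21",
    ],
})

# Capture whatever follows the last comma, skipping leading whitespace
last_part = df.Address.str.extract(r".*,\s*(.*)", expand=False)

# One indicator column per distinct captured value
dummies = last_part.str.get_dummies()
print(dummies)
```

For the second row the captured text is the street, not the country, so a street name ends up as a column and the Germany flag stays 0.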
Using pd.get_dummies():
countries = df.Address.str.extract('(India|Japan|United States|Germany|China)', expand = False)
dummies = pd.get_dummies(countries)
pd.concat([df,dummies],axis = 1)
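A self-contained variant of the snippet above, as a sketch. The names keywords, pattern and result, the third sample address, and the reindex step are my own additions: reindex makes sure every keyword gets a column even when it never appears in the data. Keep in mind that str.extract returns only the first match, so an address mentioning two keywords would be flagged for just one of them.

```python
import re
import pandas as pd

keywords = ["India", "Japan", "United States", "Germany", "China"]

df = pd.DataFrame({
    "id": [1, 2, 3],
    "Address": [
        "1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan",
        "Arcisstraße 21, 80333 München, Germany",
        "Somewhere without a listed country",  # hypothetical row with no match
    ],
})

# Build one alternation pattern; re.escape guards against regex metacharacters
pattern = "(" + "|".join(re.escape(k) for k in keywords) + ")"

# First matching keyword per row (NaN when nothing matches)
countries = df.Address.str.extract(pattern, expand=False)

# Indicator columns, forced into the keyword order; missing keywords become 0
dummies = pd.get_dummies(countries, dtype=int).reindex(columns=keywords, fill_value=0)

result = pd.concat([df, dummies], axis=1)
print(result)
```

Rows with no match come out as all zeros, because get_dummies ignores NaN by default.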
Alternatively, the most straightforward approach is to put the countries in a list and use a for loop, for example:
countries = ['India','Japan','United States','Germany','China']
for c in countries:
    df[c] = df.Address.str.contains(c) * 1
However, this can be slow if you have a lot of data and many countries.
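The loop above, as a runnable sketch on the sample data from the question. Passing regex=False and na=False and calling astype(int) are my own choices, not part of the original answer: regex=False treats each keyword as a literal string, and na=False maps missing addresses to 0.

```python
import pandas as pd

countries = ["India", "Japan", "United States", "Germany", "China"]

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Address": [
        "1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan",
        "Arcisstraße 21, 80333 München, Germany",
        "Liberty Street, Manhattan, New York, United States",
        "30 Shuangqing Rd, Haidian Qu, Beijing Shi, China",
        "Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India",
    ],
})

for c in countries:
    # Literal substring test; the boolean result is converted to 0/1
    df[c] = df["Address"].str.contains(c, regex=False, na=False).astype(int)

print(df[countries])
```

Unlike the str.extract approach, this flags every keyword that occurs in an address, not just the first one.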
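import numpy as np
import pandas as pd

kw = 'India|Japan|United States|Germany|China'.split('|')
# Column vector of addresses, so it broadcasts against the keyword list
a = df.Address.values.astype(str)[:, None]
# np.char.find returns -1 where the keyword does not occur
df.join(
    pd.DataFrame(
        np.char.find(a, kw) >= 0,
        df.index, kw,
        dtype=int
    )
)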
id Address India Japan United States Germany China
0 1 Chome-2-8 Shibakoen, Minat... 0 1 0 0 0
1 2 Arcisstraße 21, 80333 Münc... 0 0 0 1 0
2 3 Liberty Street, Manhattan,... 0 0 1 0 0
3 4 30 Shuangqing Rd, Haidian ... 0 0 0 0 1
4 5 Vaishnavi Summit,80feet Ro... 1 0 0 0 0
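For anyone who wants to try the broadcasting approach end to end, here is a self-contained sketch on the question's sample data. It uses the public np.char.find rather than importing from numpy.core, and cross-checks the result against a plain str.contains version. The names flags, result and expected are mine.

```python
import numpy as np
import pandas as pd

kw = ["India", "Japan", "United States", "Germany", "China"]

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Address": [
        "1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan",
        "Arcisstraße 21, 80333 München, Germany",
        "Liberty Street, Manhattan, New York, United States",
        "30 Shuangqing Rd, Haidian Qu, Beijing Shi, China",
        "Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India",
    ],
})

# Shape (n_rows, 1): broadcasts against the (n_keywords,) keyword list
a = df["Address"].to_numpy().astype(str)[:, None]

# Shape (n_rows, n_keywords): True where the keyword occurs in the address
flags = np.char.find(a, kw) >= 0

result = df.join(pd.DataFrame(flags.astype(int), index=df.index, columns=kw))

# Cross-check against the simpler per-keyword str.contains approach
expected = pd.DataFrame(
    {k: df["Address"].str.contains(k, regex=False).astype(int) for k in kw}
)
same = bool((result[kw].to_numpy() == expected.to_numpy()).all())
print(result)
print(same)
```

Whether this is actually faster than the loop depends on the data size and the number of keywords, so it is worth timing on your own data.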