How do I replace similar-looking values in a pandas DataFrame?

Question

I'm new to pandas. My dataset has the following dtypes. (The dataset is Indian Startup Funding, downloaded from Kaggle.)

Date                datetime64[ns]
StartupName                 object
IndustryVertical            object
CityLocation                object
InvestorsName               object
InvestmentType              object
AmountInUSD                 object
dtype: object

data['AmountInUSD'].groupby(data['CityLocation']).describe()

I ran the above and noticed that many of the city values are near-duplicates, for example:

Bangalore   
Bangalore / Palo Alto
Bangalore / SFO
Bangalore / San Mateo
Bangalore / USA
Bangalore/ Bangkok

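I want to do the following, but I don't know the code for it.

In the CityLocation column, find all cells that start with 'Bang' and replace them all with 'Bangalore'. Any help would be appreciated.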

I tried this:

data[data.CityLocation.str.startswith('Bang')]

but I don't know what to do after that.

Answer 1

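You can use the .loc indexer with a boolean mask to select the rows whose value in that column starts with a given substring, and assign the replacement value of your choice to them.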

import pandas as pd

df = pd.DataFrame({'CityLocation': ['Bangalore', 'Dangerlore', 'Bangalore/USA'], 'Values': [1, 2, 3]})
print(df)
#     CityLocation  Values
# 0      Bangalore       1
# 1     Dangerlore       2
# 2  Bangalore/USA       3


df.loc[df.CityLocation.str.startswith('Bang'), 'CityLocation'] = 'Bangalore'
print(df)
#   CityLocation  Values
# 0    Bangalore       1
# 1   Dangerlore       2
# 2    Bangalore       3
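
One caveat, in case the real column contains missing values: with an object-dtype column, str.startswith can return NaN for those rows, and .loc rejects a mask that contains NaN. A minimal sketch, assuming the column may hold NaN (the sample data below is made up for illustration), passes na=False so missing rows count as "no match" and are left untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data that includes a missing city
df = pd.DataFrame({'CityLocation': ['Bangalore / SFO', np.nan, 'Bangalore/ Bangkok', 'Mumbai']})

# na=False makes missing values evaluate to False in the mask
mask = df['CityLocation'].str.startswith('Bang', na=False)

# Overwrite only the matching rows
df.loc[mask, 'CityLocation'] = 'Bangalore'
print(df)
```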

Answer 2

As of pandas 0.23 there is a good way to handle text: see "Working with Text Data" in the documentation. You can use regular expressions to match and replace text.

import pandas as pd
df = pd.DataFrame({'CityLocation': ["Bangalore / Palo Alto", "Bangalore / SFO", "Other"]})

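# regex=True is required on pandas 2.0+, where str.replace treats the pattern as a literal by default
df['CityLocation'] = df['CityLocation'].str.replace(r"^Bang.*", "Bangalore", regex=True)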

print(df)

This produces:

  CityLocation
0    Bangalore
1    Bangalore
2        Other
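
If the goal is to clean up every city rather than only Bangalore, one possible generalization is to keep just the text before the first "/". This is a sketch under the assumption that the primary city always comes first in the cell; the 'Pune / US' row and the City column name are made up for illustration and are not from the original dataset:

```python
import pandas as pd

# Sample values modeled on the question, plus one hypothetical non-Bangalore row
df = pd.DataFrame({'CityLocation': ['Bangalore', 'Bangalore / Palo Alto',
                                    'Bangalore/ Bangkok', 'Pune / US', 'Mumbai']})

# Split on the first "/", keep the left part, and trim surrounding whitespace
df['City'] = df['CityLocation'].str.split('/', n=1).str[0].str.strip()
print(df['City'].tolist())
```

Writing the result to a separate column keeps the original values available for checking that nothing was collapsed by mistake.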
