Create list based on column value and use that list to extract words from string column in df without overwriting row value with for loop
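OK, I admit it, I'm stuck. I hope someone can help me figure this out; I'll explain as best I can. I have two dataframes. One has a string column and a municipality column; the other has municipalities and streets. For each row I want to build a list of streets for that row's municipality, so that only streets belonging to that municipality are extracted from the string column. The code I have now sort of works, but it keeps iterating over all municipalities, so it extracts streets from other municipalities and adds them to the wrong rows. I hope the code samples below make the problem clearer.
Create the dataframes: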
import pandas as pd
import re
# Sample dataframe with the municipality and string column
data1 = {'municipality': ['Urk','Utrecht','Almere','Utrecht','Huizen'],
         'text': ["I'm going to Plantage, Pollux and Oostvaardersdiep","Tomorrow I'm going to Hoog Catharijne",
                  "I'm not going to the Balijelaan","I'm not going to Socrateshof today",
                  "Next week I'll be going to Socrateshof"]}
df = pd.DataFrame(data1, columns = ['municipality','text'])
print(df)
Output:
   municipality  text
0  Urk           I'm going to Plantage, Pollux and Oostvaarders...
1  Utrecht       Tomorrow I'm going to Hoog Catharijne
2  Almere        I'm not going to the Balijelaan
3  Utrecht       I'm not going to Socrateshof today
4  Huizen        Next week I'll be going to Socrateshof
# Sample dataframe with the municipality and street
data2 = {'municipality': ['Urk','Urk','Utrecht','Almere','Almere','Huizen'],
         'street_name': ['Plantage','Pollux','Balijelaan','Oostvaardersdiep','Catharijne','Socrateshof']}
df2 = pd.DataFrame(data2, columns = ['municipality','street_name'])
print(df2)
Output:
   municipality  street_name
0  Urk           Plantage
1  Urk           Pollux
2  Utrecht       Balijelaan
3  Almere        Oostvaardersdiep
4  Almere        Catharijne
5  Huizen        Socrateshof
Run the function below:
# Function
street = []
def extract_street(txt):
    mun_list_filter = df['municipality'] # I want the streets for this municipality
    df_bag_filter_mun = df2[df2['municipality'].isin(mun_list_filter)] # Filter second df on the wanted municipality
    street_list_mun = list(df_bag_filter_mun['street_name'].unique()) # Select all unique streets for the specific municipality
    st = re.findall(r"\b|".join(street_list_mun), txt) # Find all the streets in the string column 'tekst'
    street.append(st) # Append to empty street list
    return street # As you can see it keeps iterating over all municipalities
# Call function by iterating over rows in string column
for txt in df['text']:
    extract_street(txt)
# Add street list to df
df = df.assign(**{'street_match': street})
df['street_match'] = [', '.join(map(str, l)) for l in df['street_match']]
df
Output:
   municipality  text                                                street_match
0  Urk           I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux, Oostvaardersdiep
1  Utrecht       Tomorrow I'm going to Hoog Catharijne               Catharijne
2  Almere        I'm not going to the Balijelaan                     Balijelaan
3  Utrecht       I'm not going to Socrateshof today                  Socrateshof
4  Huizen        Next week I'll be going to Socrateshof              Socrateshof
As you can see in the first row, for municipality 'Urk', the function added the street 'Oostvaardersdiep', even though it should only match when the row's municipality is 'Almere'. Only the last row is correct, because 'Socrateshof' really is in the municipality 'Huizen'.
Desired result:
   municipality  text                                                street_match
0  Urk           I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1  Utrecht       Tomorrow I'm going to Hoog Catharijne
2  Almere        I'm not going to the Balijelaan
3  Utrecht       I'm not going to Socrateshof today
4  Huizen        Next week I'll be going to Socrateshof              Socrateshof
I know what the problem is, I just don't know how to fix it. I tried apply/lambda too, but had no luck. Thanks!
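One problem with passing in only text is that you can't filter by municipality. That is why you get the street 'Oostvaardersdiep' for 'Urk' even though it is in 'Almere': the name 'Oostvaardersdiep' occurs in the text of the 'Urk' entry, and your extract_street() function doesn't know which municipality to match against.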
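To see why the original filter doesn't narrow anything down, here is a minimal sketch using the sample data from the question. The names all_rows and urk_only are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'municipality': ['Urk', 'Utrecht', 'Almere', 'Utrecht', 'Huizen']})
df2 = pd.DataFrame({
    'municipality': ['Urk', 'Urk', 'Utrecht', 'Almere', 'Almere', 'Huizen'],
    'street_name': ['Plantage', 'Pollux', 'Balijelaan',
                    'Oostvaardersdiep', 'Catharijne', 'Socrateshof'],
})

# .isin() with the whole column keeps every row of df2, because every
# municipality in df2 also occurs somewhere in df
all_rows = df2[df2['municipality'].isin(df['municipality'])]
print(len(all_rows))  # 6

# comparing against a single municipality keeps only that municipality's streets
urk_only = df2[df2['municipality'] == 'Urk']
print(urk_only['street_name'].tolist())  # ['Plantage', 'Pollux']
```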
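The minimal changes to get your code working are:
- pass mun into extract_street() along with txt
- the filter (mun_list_filter in your code) should use mun instead of all the municipalities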
street = []
def extract_street(txt, mun): # Pass in municipality
    df_bag_filter_mun = df2[df2['municipality'] == mun]
    ### everything below is COPY-PASTED from your question
    street_list_mun = list(df_bag_filter_mun['street_name'].unique()) # Select all unique streets for the specific municipality
    st = re.findall(r"\b|".join(street_list_mun), txt) # Find all the streets in the string column 'tekst'
    street.append(st) # Append to empty street list
    return street # As you can see it keeps iterating over all municipalities
# add the 'municipality' for the extract loop
for txt, mun in zip(df['text'], df['municipality']):
    extract_street(txt, mun)
# Add street list to df
df = df.assign(**{'street_match': street})
Output:
   municipality  text                                                street_match
0  Urk           I'm going to Plantage, Pollux and Oostvaardersdiep  [Plantage, Pollux]
1  Utrecht       Tomorrow I'm going to Hoog Catharijne               []
2  Almere        I'm not going to the Balijelaan                     []
3  Utrecht       I'm not going to Socrateshof today                  []
4  Huizen        Next week I'll be going to Socrateshof              [Socrateshof]
Then join the lists to turn them into strings:
df['street_match'] = df['street_match'].str.join(', ')
Output:
   municipality  text                                                street_match
0  Urk           I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1  Utrecht       Tomorrow I'm going to Hoog Catharijne
2  Almere        I'm not going to the Balijelaan
3  Utrecht       I'm not going to Socrateshof today
4  Huizen        Next week I'll be going to Socrateshof              Socrateshof
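In case the .str.join() step is unfamiliar, here is a small standalone sketch of what it does on a column of lists. The Series name matches is invented for this example.

```python
import pandas as pd

# a Series whose elements are lists of strings, like street_match above
matches = pd.Series([['Plantage', 'Pollux'], [], ['Socrateshof']])

# .str.join() joins each list with the separator; an empty list gives ''
joined = matches.str.join(', ')
print(joined.tolist())  # ['Plantage, Pollux', '', 'Socrateshof']
```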
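Adding another answer to show a shorter/simpler way to do what you want. (The first answer was just to fix the part of your code that wasn't working.)
Using .apply(), you can call a modified version of your function on each row of df and then check against the street names in df2.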
def extract_street(row):
    street_list_mun = df2.loc[df2['municipality'] == row['municipality'], 'street_name'].unique()
    streets_regex = r'\b(' + '|'.join(street_list_mun) + r')\b'
    streets_found = set(re.findall(streets_regex, row['text']))
    return ', '.join(streets_found)
    ## or if you want this to return a list of streets
    # return list(streets_found)
df['street_match'] = df.apply(extract_street, axis=1)
df
Output:
   municipality  text                                                street_match
0  Urk           I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1  Utrecht       Tomorrow I'm going to Hoog Catharijne
2  Almere        I'm not going to the Balijelaan
3  Utrecht       I'm not going to Socrateshof today
4  Huizen        Next week I'll be going to Socrateshof              Socrateshof
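Notes:
- There's a problem with your regex: the join part of the expression generates strings like Plantage\b|Pollux. This produces a match if (a) the last street name appears at the start of another word, or (b) any street name other than the last appears at the end of another word. "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep" would match both streets, but it shouldn't. Instead, the word boundary \b should sit at both ends of the list of options, with parentheses grouping them. It should generate strings like \b(Plantage|Pollux)\b, which won't match "Polluxsss" or "NotPlantage". I've made that change in the code above.
- I'm using set to get a unique list of street matches. Without it, a row like "I'm going to Pollux, Pollux, Pollux" would give the result 3 times instead of once.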
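Both notes can be checked with a few lines of plain re. This is only a sketch; the variable names loose, strict, hits and ordered are invented here, and the dict.fromkeys() line is an alternative to set, not something the answer's code uses.

```python
import re

streets = ['Plantage', 'Pollux']
text = "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep"

# pattern from the question: \b only follows the names in front of each '|'
loose = r"\b|".join(streets)                    # 'Plantage\b|Pollux'
print(re.findall(loose, text))                  # ['Plantage', 'Pollux']

# pattern from the answer: \b on both sides of the whole group
strict = r'\b(' + '|'.join(streets) + r')\b'    # '\b(Plantage|Pollux)\b'
print(re.findall(strict, text))                 # []

# repeated mentions: findall returns every hit, set() collapses them
hits = re.findall(strict, "I'm going to Pollux, Pollux, Pollux")
print(hits)                                     # ['Pollux', 'Pollux', 'Pollux']
print(set(hits))                                # {'Pollux'}

# a set has no guaranteed order; dict.fromkeys() also drops duplicates
# but keeps the order in which the streets first appear
ordered = list(dict.fromkeys(re.findall(strict, "Pollux, Plantage, Pollux")))
print(ordered)                                  # ['Pollux', 'Plantage']
```

If the order of streets in street_match matters, the dict.fromkeys() form may be the safer choice, since joining a set can produce the names in a different order from run to run.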
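@aneroid I now want to extract multiple exact matches (as a list) from a similar text column. The code below (based on your regex) works for this simple example, but on my larger, more complex dataset I get a bunch of tuples and empty strings. Do you know how I can improve this code?
# String column
data1 = {'text': ["Today I'm going to Utrecht","Tomorrow I'm going to Utrecht and Urk",
                  "Next week I'll be going to the Amsterdamsestraatweg"]}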
df = pd.DataFrame(data1, columns = ['text'])
print(df)
# City column in other df
data2 = {'city': ['Urk','Utrecht','Almere','Huizen','Amsterdam','Urk']}
df2 = pd.DataFrame(data2, columns = ['city'])
print(df2)
# I create a list of all the unique cities in df2
city_list = list(df2['city'].unique())
len(city_list)
len(set(city_list))
# Extract the words if there is an exact match
df['city_match'] = df['text'].str.findall(r'\b(' + '|'.join(city_list) + r')\b')
df['city_match'] = [', '.join(map(str, l)) for l in df['city_match']]
print(df)
# Output
   text                                               city_match
0  Today I'm going to Utrecht                         Utrecht
1  Tomorrow I'm going to Utrecht and Urk              Utrecht, Urk
2  Next week I'll be going to the Amsterdamsestra...
As you can see, it works. 'Amsterdamsestraatweg' is not an exact match, so it isn't matched. Strangely, on my larger df I get a bunch of tuples and empty strings as output, like this:
0 ('Wijk bij Duurstede', '', '')
6 ('Utrecht', '', '')
7 ('Huizen', '', ''), ('Huizen', '', ''), ('Huiz...
9 ('Utrecht', '', ''), ('Utrecht', '', ''), ('Ut...
10 ('Urk', '', ''), ('Urk', '', '')
11 ('Amersfoort', '', ''), ('Amersfoort', '', '')...
12 ('Lelystad', '', '')
13 ('Utrecht', '', ''), ('Utrecht', '', '')
16 ('Hilversum', '', ''), ('Hilversum', '', ''), ...
18 ('De Bilt', '', ''), ('De Bilt', '', '')
19 ('Urk', '', '')
Thanks again.
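One possible cause, offered as a guess since the larger dataset isn't shown: if some city names contain regex metacharacters such as parentheses, each pair of parentheses becomes an extra capturing group, and findall returns one tuple per match as soon as the pattern has more than one group. The sketch below reproduces that shape of output. The names 'Bergen (L)' and 'Bergen (NH)' are hypothetical stand-ins for such entries, and raw, safe, raw_hits and safe_hits are names made up for the example.

```python
import re
import pandas as pd

# hypothetical city list: two names contain parentheses
city_list = ['Urk', 'Utrecht', 'Bergen (L)', 'Bergen (NH)']
df = pd.DataFrame({'text': ["Tomorrow I'm going to Utrecht and Urk"]})

# unescaped: '(L)' and '(NH)' add two capturing groups, so the pattern
# has three groups in total and findall returns 3-tuples
raw = r'\b(' + '|'.join(city_list) + r')\b'
raw_hits = df['text'].str.findall(raw)[0]
print(raw_hits)   # [('Utrecht', '', ''), ('Urk', '', '')]

# escape every name and use a non-capturing group (?:...) so that
# findall returns plain strings again
safe = r'\b(?:' + '|'.join(re.escape(c) for c in city_list) + r')\b'
safe_hits = df['text'].str.findall(safe)[0]
print(safe_hits)  # ['Utrecht', 'Urk']
```

Note that a trailing \b cannot match right after a name ending in ')', so names like that would need a different boundary check; the sketch only shows how the tuples and empty strings go away.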