Regex within Pandas DataFrame - finding minimum length between characters
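Edit: updated for reproducibility.
I am currently working in a Pandas DataFrame where each row of one column ([Column A]) holds a list of strings. I am trying to extract the minimum distance between the elements of any sublist of a keyword list (ListB):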
ListB = [['abc','def'],['ghi','jkl'],['mno','pqr']]
Each row of the DataFrame column contains a list of strings:
import pandas as pd
import numpy as np
# Plain nested list instead of np.array: the rows are ragged, which newer NumPy versions may reject without dtype=object.
data = pd.DataFrame([['1', '2', ['random string to be searched abc def ghi jkl','random string to be searched abc','abc random string to be searched def']],
    ['4', '5', ['random string to be searched ghi jkl','random string to be searched',' mno random string to be searched pqr']],
    ['7', '8', ['abc random string to be searched def','random string to be searched mno pqr','random string to be searched']]],
    columns=['a', 'b', 'list_of_strings_to_search'])
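At a high level, I am trying to search every string in the lists held in data['list_of_strings_to_search'] for the elements of ListB (both elements of a sublist must be present) and return the ListB sublists that satisfy the condition, from which I can compute the distance (in words) between the elements of each sublist pair.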
import pandas as pd
import numpy as np
import re
def find_distance_between_words(text, word_list):
    '''This function does not work as intended yet.'''
    keyword_list = []
    # iterates through all sublists in ListB:
    for i in word_list:
        # iterates through all strings within list in dataframe column:
        for strings in text:
            # determines the two words to search (iterates through word_list)
            word1, word2 = i[0], i[1]
            # use regex to find both words:
            p = re.compile('.*?'.join((word1, word2)))
            iterator = p.finditer(strings)
            # for each match, append the string:
            for match in iterator:
                keyword_list.append(match.group())
        return keyword_list
data['try'] = data['list_of_strings_to_search'].apply(find_distance_between_words, word_list = ListB)
Expected output:
0 [abc def, ghi jkl, abc random string to be searched def]
1 [ghi jkl, mno random string to be searched pqr]
2 [abc random string to be searched def, mno pqr]
Current output:
0 [abc def, abc random string to be searched def]
1 []
2 [abc random string to be searched def]
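However, from manually inspecting the strings and the output, most of the regex combinations are not returned by the statement below, and I need all combinations that occur in each string: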
for match in iterator:
    keyword_list.append(match.group())
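I intend to return every sublist combination present in each string (hence the loop over the list of candidate sublists) so that I can evaluate the minimum distance between the elements.
Any help is greatly appreciated!!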
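The current output shown above is what one would see if `return keyword_list` ran inside the outer loop, so that only the first sublist (`['abc', 'def']`) is ever searched. The following is a minimal sketch under that assumption (the original indentation is not certain), with the `return` moved after both loops and the keywords passed through `re.escape`. The name `find_keyword_spans` is hypothetical and not part of the original post.

```python
import re

ListB = [['abc', 'def'], ['ghi', 'jkl'], ['mno', 'pqr']]

def find_keyword_spans(strings, word_list):
    """Collect the text spanning each keyword pair found in each string."""
    spans = []
    for word1, word2 in word_list:
        # Non-greedy gap: stop at the first word2 that follows a word1.
        pattern = re.compile(re.escape(word1) + '.*?' + re.escape(word2))
        for s in strings:
            spans.extend(m.group() for m in pattern.finditer(s))
    # Return only after every sublist has been tried.
    return spans

row = ['random string to be searched abc def ghi jkl',
       'random string to be searched abc',
       'abc random string to be searched def']
print(find_keyword_spans(row, ListB))
# ['abc def', 'abc random string to be searched def', 'ghi jkl']
```

Note that the matches come out grouped by keyword pair rather than by string, so the order differs from the expected output listed in the question.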
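Let us iterate over each list in the column list_of_strings_to_search inside a list comprehension, then for every string in the list use re.findall with a non-greedy regex pattern to find, for each keyword pair, the substring running from the first keyword to the nearest following second keyword: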
import re
pat = '|'.join(fr'{x}.*?{y}' for x, y in ListB)
data['result'] = [np.hstack([re.findall(pat, s) for s in l]) for l in data['list_of_strings_to_search']]
Result:
0 [abc def, ghi jkl, abc random string to be searched def]
1 [ghi jkl, mno random string to be searched pqr]
2 [abc random string to be searched def, mno pqr]
Name: result, dtype: object
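To get from the matched substrings to a distance in words, one option is to count the words that sit between the two keywords. The sketch below assumes that "distance" means the number of whitespace-separated tokens strictly between the keywords (adjacent keywords give 0) and that keywords must match whole tokens, which is stricter than the substring matching done by the regex. The helper names `word_distances` and `min_word_distance` are hypothetical. Because `.*?` starts from the first occurrence of the left keyword, a regex match is not guaranteed to be the shortest span in the string, so the sketch works from token positions instead.

```python
from itertools import product

def word_distances(text, word1, word2):
    """Distances (in words) from each word1 to each later word2 in text."""
    tokens = text.split()
    first = [i for i, t in enumerate(tokens) if t == word1]
    second = [i for i, t in enumerate(tokens) if t == word2]
    # Tokens strictly between the two keywords; adjacent keywords give 0.
    return [j - i - 1 for i, j in product(first, second) if j > i]

def min_word_distance(text, word1, word2):
    """Smallest distance for the pair, or None if the pair never occurs in order."""
    distances = word_distances(text, word1, word2)
    return min(distances) if distances else None

print(min_word_distance('abc random string to be searched def', 'abc', 'def'))  # 5
print(min_word_distance('abc x abc def', 'abc', 'def'))                         # 0
print(min_word_distance('random string to be searched abc', 'abc', 'def'))      # None
```

The second call shows the difference from the regex: `abc.*?def` would match the whole of `'abc x abc def'`, while the token-based version finds the adjacent pair.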