Classify values in a list based on original DF (Python 3, Pandas)
A fictional df called KW looks like this:
Group Subgroup Word
orange zebra keys
green lion mouse
blue horse captain
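My current code takes each word found under the "Word" column and replaces certain letters with other letters from a dictionary, one letter at a time. It then builds a list of all these misspellings. So using the KW df: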
kw = df[['Word',"Group","Subgroup"]]
words = kw.to_dict()["Word"].values()
md = {"m":"w","o":"z"}
md = {k: v.split(',') for k, v in md.items()}
newwords = []
for word in words:
    newwords.append(word)
    for c in md:
        occ = word.count(c)
        pos = 0
        for _ in range(occ):
            pos = word.find(c, pos)
            for r in md[c]:
                tmp = word[:pos] + r + word[pos+1:]
                newwords.append(tmp)
            pos += 1
returns
Word
keys
mouse
wouse
mzuse
captain
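What I'd like to do is reclassify these misspellings into the appropriate Group/Subgroup, based on the original word that was manipulated. So ideally, rather than spitting out a standalone list of misspellings, the result would look like this: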
Group Subgroup Word
orange zebra keys
green lion mouse
green lion wouse
green lion mzuse
blue horse captain
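We need some way to associate the new words with the original words. You can do this by storing 2-tuples (e.g. ('mouse', 'wouse')) in newwords. Then you can convert newwords into a DataFrame and use pd.merge to merge newwords with kw by joining on the original word: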
import pandas as pd

# 'data' is a whitespace-delimited file holding the KW table
df = pd.read_table('data', sep=r'\s+')
kw = df[['Word', 'Group', 'Subgroup']]
words = df['Word']
md = {"m": "w", "o": "z"}
md = {k: v.split(',') for k, v in md.items()}
newwords = []
for word in words:
    # Save both the original word and the new word
    newwords.append((word, word))
    for c in md:
        occ = word.count(c)
        pos = 0
        for _ in range(occ):
            pos = word.find(c, pos)
            for r in md[c]:
                tmp = word[:pos] + r + word[pos+1:]
                newwords.append((word, tmp))
            # Move past this match so the next find() locates the next occurrence
            pos += 1
newwords = pd.DataFrame(newwords, columns=['Word', 'New'])
# Merge on the original Word
result = pd.merge(newwords, kw, on='Word', how='left')
result = result[['Group', 'Subgroup', 'New']]
result.columns = ['Group', 'Subgroup', 'Word']
print(result)
yields
Group Subgroup Word
0 orange zebra keys
1 green lion mouse
2 green lion wouse
3 green lion mzuse
4 blue horse captain
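As a possible alternative to the merge, here is a minimal sketch that keeps each misspelling attached to its row from the start, so no join is needed afterwards. It is not part of the answer above: the helper name `variants` is hypothetical, the KW table is built inline instead of being read from a file, md is written directly as lists of replacement letters, and `DataFrame.explode` is assumed to be available (pandas 0.25 or later).

```python
import pandas as pd

# Same toy data as the KW frame in the question, built inline for the sketch.
kw = pd.DataFrame({
    "Group": ["orange", "green", "blue"],
    "Subgroup": ["zebra", "lion", "horse"],
    "Word": ["keys", "mouse", "captain"],
})

# Replacement letters per character (the question's md after the split step).
md = {"m": ["w"], "o": ["z"]}


def variants(word, md):
    """Hypothetical helper: the word itself, then every single-letter substitution."""
    out = [word]
    for c, replacements in md.items():
        for pos, ch in enumerate(word):
            if ch == c:
                out.extend(word[:pos] + r + word[pos + 1:] for r in replacements)
    return out


# Store the list of variants in each row, then give every variant its own row.
# explode repeats Group and Subgroup for each element of the list.
result = (
    kw.assign(Word=kw["Word"].map(lambda w: variants(w, md)))
      .explode("Word")
      .reset_index(drop=True)
)
print(result)
```

For this sample data the result should contain the same five rows as the merged frame. One difference in design: because nothing is joined on the word text, two rows that happen to share the same Word in different groups would not be cross-matched.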