Breaking up column names and using wordnet.synsets() on multiple words instead of one
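I am trying to get a list of synonyms for each word in my column names. However, when I run wordnet.synsets(), it only works on column names that contain a single word. How can I run it on multiple words and produce output like my desired output below? Also, is there a way to show only the first 4 results for readability?
Code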
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import pandas as pd
df = ['Unnamed 0',
      'business id',
      'name',
      'postal code',
      ]
syns = {w : [] for w in df}
for k, v in syns.items():
    for synset in wordnet.synsets(k):
        for lemma in synset.lemmas():
            if lemma.name() not in syns:
                v.append(lemma.name())
pd.DataFrame([syns], columns = syns.keys())
Current output:
Unnamed 0 business id name postal code
[] [] [gens, figure, public_figure, epithet, call, i... []
Desired output:
Unnamed 0 business id name postal code
Unnamed[definitions], business[definitions], [gens, figure, public_figure] postal[definitions],
0[definitions] id[definitions] code[definitions]
A simpler, easier-to-use approach
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
import numpy as np  # needed for np.unique below
import pandas as pd
df = ['Unnamed 0',
      'business id',
      'name',
      'postal code',
      ]
df = pd.DataFrame(
    # one column per (column name, word) pair; rows are the sorted unique lemma names
    {tuple([k, t]): pd.Series(np.unique([l.name()
                                         for s in wordnet.synsets(t)
                                         for l in s.lemmas() if "_" not in l.name()])).to_dict()
     for k in df
     for t in nltk.word_tokenize(k)
     }).fillna("")
df.columns.set_names(["sentence", "word"], inplace=True)
df.loc[:4]  # just first 5 matches...
Just change the list/dict comprehension to meet the format pandas expects:
{"colA":[1,2], "colB":[3,4]}
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
import numpy as np  # needed for np.unique below
import pandas as pd
df = ['Unnamed 0',
      'business id',
      'name',
      'postal code',
      ]
# largest number of space-delimited words in any column name
mr = max([len(k.split(" ")) for k in df])
pd.DataFrame(
    # one column for each requested space-delimited name
    # use an f-string to format as requested
    {k: [f"{v}:{np.unique([l.name() for s in wordnet.synsets(v) for l in s.lemmas()]).tolist()}"
         # pad names that have fewer tokens so every column has the same length, as pandas requires
         for v in f"{k}{(mr - len(k.split(' '))) * ' '}".split(" ")]
     for k in df}).replace({":[]": ""})
Output
Unnamed 0 business id name postal code
0 Unnamed:['nameless', 'unidentified', 'unknown'... business:['business', 'business_concern', 'bus... name:['advert', 'appoint', 'bring_up', 'call',... postal:['postal']
1 0:['0', 'cipher', 'cypher', 'nought', 'zero'] id:['Gem_State', 'I.D.', 'ID', 'Idaho', 'id'] code:['cipher', 'code', 'codification', 'compu...
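The question also asks how to show only the first 4 results per word, which the answer above does not cover. The following is a minimal sketch of one way to do it, not the answerer's method: split each column name on whitespace, look up every word, drop duplicates while keeping first-seen order, and truncate. The names `first_synonyms`, `wordnet_lemma_names`, and the `lookup` and `limit` parameters are hypothetical, introduced only for this illustration, and the WordNet lookup assumes the NLTK `wordnet` corpus has already been downloaded.

```python
from typing import Callable, Dict, Iterable, List


def wordnet_lemma_names(word: str) -> List[str]:
    """Return every lemma name WordNet lists for `word` (duplicates included)."""
    # Imported lazily so first_synonyms can be used with another lookup.
    # Assumption: nltk is installed and nltk.download("wordnet") has been run.
    from nltk.corpus import wordnet
    return [lemma.name()
            for synset in wordnet.synsets(word)
            for lemma in synset.lemmas()]


def first_synonyms(
    columns: Iterable[str],
    lookup: Callable[[str], List[str]] = wordnet_lemma_names,
    limit: int = 4,
) -> Dict[str, Dict[str, List[str]]]:
    """Map each column name to {word: first `limit` distinct names}."""
    result: Dict[str, Dict[str, List[str]]] = {}
    for column in columns:
        per_word: Dict[str, List[str]] = {}
        for word in column.split():
            # dict.fromkeys removes duplicates but keeps first-seen order
            distinct = list(dict.fromkeys(lookup(word)))
            per_word[word] = distinct[:limit]
        result[column] = per_word
    return result
```

Calling `first_synonyms(['Unnamed 0', 'business id', 'name', 'postal code'])` returns a plain nested dict keyed by column name and then by word. Unlike the `np.unique` version above, this keeps WordNet's own ordering instead of sorting alphabetically, so the first 4 entries are the first 4 that WordNet returns.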