Breaking up column names and using wordnet.synsets() on multiple words instead of one

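I am trying to get a list of synonyms for each word in my column names. However, when I run wordnet.synsets(), it only works on column names that contain a single word. How can I run it on multiple words and get output like the desired output below? Also, is there a way to show only the first 4 results for readability?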

Code

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import pandas as pd

df =  ['Unnamed 0',
 'business id',
 'name',
 'postal code',
]

syns = {w : [] for w in df}
for k, v in syns.items():
    for synset in wordnet.synsets(k):
        for lemma in synset.lemmas():
            if lemma.name() not in syns:
                v.append(lemma.name())

pd.DataFrame([syns], columns = syns.keys())

Current output:

Unnamed 0   business id   name                                                postal code
[]          []            [gens, figure, public_figure, epithet, call, i...   []

Desired output:

Unnamed 0               business id               name                            postal code
Unnamed[definitions],   business[definitions],    [gens, figure, public_figure]   postal[definitions],
0[definitions]          id[definitions]                                           code[definitions]

A simpler, easier-to-use approach

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
import numpy as np
import pandas as pd

df =  ['Unnamed 0',
 'business id',
 'name',
 'postal code',
]
df = pd.DataFrame(
{tuple([k, t]):pd.Series(np.unique([l.name() 
                                     for s in wordnet.synsets(t) 
                                     for l in s.lemmas() if "_" not in l.name()])).to_dict()
 for k in df 
 for t in nltk.word_tokenize(k)
}).fillna("")
df.columns.set_names(["sentence","word"],inplace = True)
df.loc[:4] # just first 5 matches...
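
A note on the last line: `.loc` slices by label and includes the end label, so on a default integer index `df.loc[:4]` returns five rows. For exactly four rows, as the question asks, `.iloc[:4]` or `.head(4)` should work. A minimal sketch with made-up data (the `demo` frame is a stand-in, not real WordNet output):

```python
import pandas as pd

# Made-up stand-in for the synonym table; real rows would come from WordNet.
demo = pd.DataFrame({"word": ["a", "b", "c", "d", "e", "f"]})

five_rows = demo.loc[:4]   # label-based slice: the end label 4 is included
four_rows = demo.iloc[:4]  # position-based slice: the end position is excluded
also_four = demo.head(4)   # same rows as iloc[:4]

print(len(five_rows), len(four_rows), len(also_four))
```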



Just change the list/dict comprehension to meet the pandas format {"colA":[1,2], "colB":[3,4]}

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
import numpy as np
import pandas as pd

df =  ['Unnamed 0',
 'business id',
 'name',
 'postal code',
]

mr = max([len(k.split(" ")) for k in df])
pd.DataFrame(
    # one column per requested name, one row per space-delimited word
    # use f-string to format as requested....
    {k:[f"{v}:{np.unique([l.name() for s in wordnet.synsets(v) for l in s.lemmas() ]).tolist()}" 
            # pad names that have fewer tokens so every column has the same length, as pandas requires
            for v in f"{k}{(mr-len(k.split(' ')))*' '}".split(" ")] 
     for k in df}).replace({":[]":""})

Output

    Unnamed 0                                           business id                                         name                                                postal code
0   Unnamed:['nameless', 'unidentified', 'unknown'...   business:['business', 'business_concern', 'bus...   name:['advert', 'appoint', 'bring_up', 'call',...   postal:['postal']
1   0:['0', 'cipher', 'cypher', 'nought', 'zero']       id:['Gem_State', 'I.D.', 'ID', 'Idaho', 'id']                                                           code:['cipher', 'code', 'codification', 'compu...
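
If a plain nested dict is enough, the splitting and the "first 4 results" cap can also be done without pandas. The sketch below is only an illustration: `synonyms_by_word` and `wordnet_lemma_names` are hypothetical helper names, not part of NLTK, and the lookup is passed in as a function so the grouping logic can be tried without the WordNet corpus installed.

```python
from typing import Callable, Dict, Iterable, List


def synonyms_by_word(
    column_names: Iterable[str],
    lookup: Callable[[str], Iterable[str]],
    limit: int = 4,
) -> Dict[str, Dict[str, List[str]]]:
    """Map each column name to {word: first `limit` unique names from `lookup`}."""
    result: Dict[str, Dict[str, List[str]]] = {}
    for column in column_names:
        per_word: Dict[str, List[str]] = {}
        for word in column.split():      # split the column name on whitespace
            seen: List[str] = []
            for name in lookup(word):
                if len(seen) >= limit:   # stop once the cap is reached
                    break
                if name not in seen:     # keep first occurrences only
                    seen.append(name)
            per_word[word] = seen
        result[column] = per_word
    return result


def wordnet_lemma_names(word: str):
    """Yield WordNet lemma names for `word` (assumes the NLTK 'wordnet' corpus is downloaded)."""
    from nltk.corpus import wordnet  # imported lazily so the helper above runs without NLTK
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            yield lemma.name()


# Hypothetical usage with WordNet:
# synonyms_by_word(['Unnamed 0', 'business id', 'name', 'postal code'], wordnet_lemma_names)
```

Unlike `np.unique`, this keeps the synonyms in the order WordNet returns them rather than sorting them, so the capped list holds the first few lemmas from the earliest synsets.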