From a sentence count distinct words per line in a pandas dataframe

I am analyzing data where each row contains one sentence:

PhraseCleaned   
0   get house business distribute sell outside house opportunities  
1   business changing offices culture work business
2   search company best practices 
3   1 let go back desk spaces one

Those are all the sentences. I need to count, for each row, how many times each word appears, and get a result like this:

id    PhraseCleaned 
0   get house business house opportunities  
1   business changing offices culture work business
2   desk big work culture

This is what I really need.

I did this:

tokenaize_data= PraseFinalD.apply(lambda row: nltk.word_tokenize(row['PhraseCleaned']), axis=1)

It splits each sentence into a list of word tokens:

[get, house, business, house, opportunities ]
[business, changing, offices, culture, work, business]
[desk, big, work, culture]
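Since the cleaned phrases contain no punctuation, a plain `str.split` gives the same tokens without needing NLTK at all; a small sketch, using a hypothetical frame standing in for `PraseFinalD` with the column name from the question:

```python
import pandas as pd

# Hypothetical stand-in for the question's dataframe.
df = pd.DataFrame({'PhraseCleaned': [
    'get house business distribute sell outside house opportunities',
    'business changing offices culture work business',
]})

# str.split tokenizes on whitespace, entirely within pandas.
tokens = df['PhraseCleaned'].str.split()
print(tokens[0])
# ['get', 'house', 'business', 'distribute', 'sell', 'outside', 'house', 'opportunities']
```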

Now I am trying to count them, but this just counts all the words together. PhraseFinal is a list; I cleaned the data and removed some things.

word2count = {}
for data in PhraseFinal:
    words = nltk.word_tokenize(data)
    for word in words:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
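The loop above builds a single global count across all sentences. `collections.Counter` does the same thing in one line, and applying it per sentence gives the per-row counts the question asks for; a sketch using `str.split` in place of `nltk.word_tokenize` so it runs without NLTK data:

```python
from collections import Counter

sentences = [
    'get house business distribute sell outside house opportunities',
    'business changing offices culture work business',
]

# Global count, equivalent to the loop above.
word2count = Counter(w for s in sentences for w in s.split())
print(word2count['business'])  # 3

# Per-row counts: one Counter per sentence.
per_row = [Counter(s.split()) for s in sentences]
print(per_row[0]['house'])  # 2
```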
  1. Given your data as df
  2. Create a word-count dict with collections.Counter and convert it with .tolist()
  3. Split it into columns
  4. Join it to df
from collections import Counter
import pandas as pd

# create a word count dict and split it into columns
df1 = pd.DataFrame(df['PhraseCleaned'].apply(lambda x: Counter(x.split())).tolist())

print(df1)

 get  house  business  distribute  sell  outside  opportunities  changing  offices  culture  work  search  company  best  practices    1  let   go  back  desk  spaces  one
 1.0    2.0       1.0         1.0   1.0      1.0            1.0       NaN      NaN      NaN   NaN     NaN      NaN   NaN        NaN  NaN  NaN  NaN   NaN   NaN     NaN  NaN
 NaN    NaN       2.0         NaN   NaN      NaN            NaN       1.0      1.0      1.0   1.0     NaN      NaN   NaN        NaN  NaN  NaN  NaN   NaN   NaN     NaN  NaN
 NaN    NaN       NaN         NaN   NaN      NaN            NaN       NaN      NaN      NaN   NaN     1.0      1.0   1.0        1.0  NaN  NaN  NaN   NaN   NaN     NaN  NaN
 NaN    NaN       NaN         NaN   NaN      NaN            NaN       NaN      NaN      NaN   NaN     NaN      NaN   NaN        NaN  1.0  1.0  1.0   1.0   1.0     1.0  1.0

# join df and df1
df2 = df.join(df1)

print(df2)

                                                  PhraseCleaned  get  house  business  distribute  sell  outside  opportunities  changing  offices  culture  work  search  company  best  practices    1  let   go  back  desk  spaces  one
 get house business distribute sell outside house opportunities  1.0    2.0       1.0         1.0   1.0      1.0            1.0       NaN      NaN      NaN   NaN     NaN      NaN   NaN        NaN  NaN  NaN  NaN   NaN   NaN     NaN  NaN
                business changing offices culture work business  NaN    NaN       2.0         NaN   NaN      NaN            NaN       1.0      1.0      1.0   1.0     NaN      NaN   NaN        NaN  NaN  NaN  NaN   NaN   NaN     NaN  NaN
                                  search company best practices  NaN    NaN       NaN         NaN   NaN      NaN            NaN       NaN      NaN      NaN   NaN     1.0      1.0   1.0        1.0  NaN  NaN  NaN   NaN   NaN     NaN  NaN
                                  1 let go back desk spaces one  NaN    NaN       NaN         NaN   NaN      NaN            NaN       NaN      NaN      NaN   NaN     NaN      NaN   NaN        NaN  1.0  1.0  1.0   1.0   1.0     1.0  1.0
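The joined frame holds NaN wherever a word does not occur in a row. If integer counts with explicit zeros are preferred, `fillna(0)` before the join; a sketch assuming the same two-row df as above:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'PhraseCleaned': [
    'get house business distribute sell outside house opportunities',
    'business changing offices culture work business',
]})

# Same Counter-per-row construction as above.
counts = pd.DataFrame(df['PhraseCleaned'].apply(lambda x: Counter(x.split())).tolist())

# Replace the NaNs with 0 and cast to int for a clean count matrix.
df2 = df.join(counts.fillna(0).astype(int))
print(df2['business'].tolist())  # [1, 2]
```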

Using the scikit-learn vectorizer:

from operator import itemgetter

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# `texts` is the list of sentences, defined in the full example below.
df = pd.DataFrame({'text': texts})

# Initialize the counter.
vectorizer = CountVectorizer()
# Get the unique vocabulary and get the counts.
vectorizer.fit_transform(df['text'])

# Using idiom from https://www.kaggle.com/alvations/basic-nlp-with-nltk/#To-vectorize-any-new-sentences,-we-use--CountVectorizer.transform()
# Print the words sorted by their index
words_sorted_by_index, _ = zip(*sorted(vectorizer.vocabulary_.items(), key=itemgetter(1)))
print('Vocab:', words_sorted_by_index)
print()
print('Matrix/Vectors:\n', vectorizer.transform(df['text']).toarray())

[out]:

Vocab: ('back', 'best', 'business', 'changing', 'company', 'culture', 'desk', 'distribute', 'get', 'go', 'house', 'let', 'offices', 'one', 'opportunities', 'outside', 'practices', 'search', 'sell', 'spaces', 'work')

Matrix/Vectors:
 [[0 0 1 0 0 0 0 1 1 0 2 0 0 0 1 1 0 0 1 0 0]
 [0 0 2 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1]
 [0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0]
 [1 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 0]]

Putting it back into a DataFrame:

from operator import itemgetter

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

texts = """get house business distribute sell outside house opportunities
business changing offices culture work business
search company best practices 
1 let go back desk spaces one""".split('\n')

df = pd.DataFrame({'text': texts})

# Initialize the counter.
vectorizer = CountVectorizer()
# Get the unique vocabulary and get the counts.
vectorizer.fit_transform(df['text'])

# Using idiom from https://www.kaggle.com/alvations/basic-nlp-with-nltk/#To-vectorize-any-new-sentences,-we-use--CountVectorizer.transform()
# Print the words sorted by their index
words_sorted_by_index, _ = zip(*sorted(vectorizer.vocabulary_.items(), key=itemgetter(1)))
matrix = vectorizer.transform(df['text']).toarray()

# Putting it back to the DataFrame.
df_new = pd.concat([df, pd.DataFrame(matrix)], axis=1)
column_names = dict(zip(range(len(words_sorted_by_index)), words_sorted_by_index))
df_new = df_new.rename(column_names, axis=1)

And write it to a CSV file:

df_new.to_csv('data-analogize.csv', index=False)