From a sentence count distinct words per line in a pandas dataframe
I am analyzing data where each row contains one sentence, for example:
PhraseCleaned
0 get house business distribute sell outside house opportunities
1 business changing offices culture work business
2 search company best practices
3 1 let go back desk spaces one
Those are all the sentences. I need to count, for each row, how many times each word appears in that row, and get a result like this:
id PhraseCleaned
0 get house business house opportunities
1 business changing offices culture work business
2 desk big work culture
That output above is what I actually need.
I tried this:
tokenized_data = PhraseFinalD.apply(lambda row: nltk.word_tokenize(row['PhraseCleaned']), axis=1)
which splits each sentence into a comma-separated list of words:
[get, house, business, house, opportunities]
[business, changing, offices, culture, work, business]
[desk, big, work, culture]
Now I am trying to count them, but this just counts all the words from every row together. PhraseFinal is a list; I cleaned the data and removed a few things first.
word2count = {}
for data in PhraseFinal:
    words = nltk.word_tokenize(data)
    for word in words:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
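The loop above keeps a single word2count dict, so counts from every row get merged. A minimal per-row variant, assuming PhraseFinal is a list of cleaned sentence strings as described in the question:
from collections import Counter

import nltk  # word_tokenize needs the 'punkt' data: nltk.download('punkt')

# Sample sentences standing in for the cleaned PhraseFinal list.
PhraseFinal = [
    'get house business distribute sell outside house opportunities',
    'business changing offices culture work business',
]

# One Counter per row instead of one dict shared across all rows.
row_counts = [Counter(nltk.word_tokenize(sentence)) for sentence in PhraseFinal]
for counts in row_counts:
    print(dict(counts))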
- Given your data as df
- Use collections.Counter to create a word count dict, and split it into columns with .tolist()
- Join with df
from collections import Counter
import pandas as pd
# create a word count dict and split it into columns
df1 = pd.DataFrame(df['PhraseCleaned'].apply(lambda x: Counter(x.split())).tolist())
print(df1)
  get house business distribute sell outside opportunities changing offices culture work search company best practices 1 let go back desk spaces one
0 1.0 2.0 1.0 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN 2.0 NaN NaN NaN NaN 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0
# join df and df1
df2 = df.join(df1)
print(df2)
  PhraseCleaned get house business distribute sell outside opportunities changing offices culture work search company best practices 1 let go back desk spaces one
0 get house business distribute sell outside house opportunities 1.0 2.0 1.0 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 business changing offices culture work business NaN NaN 2.0 NaN NaN NaN NaN 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 search company best practices NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN
3 1 let go back desk spaces one NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0
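If the NaN padding is unwanted, the joined counts can be made explicit integers; a small follow-up sketch reusing df and df1 from above:
# Fill missing counts with 0 and cast the count columns to int.
df2 = df.join(df1.fillna(0).astype(int))
print(df2)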
Using a scikit-learn vectorizer:
from operator import itemgetter
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

texts = """get house business distribute sell outside house opportunities
business changing offices culture work business
search company best practices
1 let go back desk spaces one""".split('\n')

df = pd.DataFrame({'text': texts})
# Initialize the counter.
vectorizer = CountVectorizer()
# Get the unique vocabulary and get the counts.
vectorizer.fit_transform(df['text'])
# Using idiom from https://www.kaggle.com/alvations/basic-nlp-with-nltk/#To-vectorize-any-new-sentences,-we-use--CountVectorizer.transform()
# Print the words sorted by their index
words_sorted_by_index, _ = zip(*sorted(vectorizer.vocabulary_.items(), key=itemgetter(1)))
print('Vocab:', words_sorted_by_index)
print()
print('Matrix/Vectors:\n', vectorizer.transform(df['text']).toarray())
[out]:
Vocab: ('back', 'best', 'business', 'changing', 'company', 'culture', 'desk', 'distribute', 'get', 'go', 'house', 'let', 'offices', 'one', 'opportunities', 'outside', 'practices', 'search', 'sell', 'spaces', 'work')
Matrix/Vectors:
[[0 0 1 0 0 0 0 1 1 0 2 0 0 0 1 1 0 0 1 0 0]
[0 0 2 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1]
[0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0]
[1 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 0]]
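Note that the token 1 from the last sentence is missing from this vocabulary: CountVectorizer's default token_pattern, r'(?u)\b\w\w+\b', only keeps tokens of two or more characters. If single-character tokens matter, the pattern can be relaxed; a minimal sketch reusing texts from above:
from sklearn.feature_extraction.text import CountVectorizer

# Keep one-character tokens such as '1' by loosening the default pattern.
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
vectorizer.fit_transform(texts)
print(sorted(vectorizer.vocabulary_))  # now includes '1'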
Putting it back into a DataFrame:
from operator import itemgetter
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
texts = """get house business distribute sell outside house opportunities
business changing offices culture work business
search company best practices
1 let go back desk spaces one""".split('\n')
df = pd.DataFrame({'text': texts})
# Initialize the counter.
vectorizer = CountVectorizer()
# Get the unique vocabulary and get the counts.
vectorizer.fit_transform(df['text'])
# Using idiom from https://www.kaggle.com/alvations/basic-nlp-with-nltk/#To-vectorize-any-new-sentences,-we-use--CountVectorizer.transform()
# Print the words sorted by their index
words_sorted_by_index, _ = zip(*sorted(vectorizer.vocabulary_.items(), key=itemgetter(1)))
matrix = vectorizer.transform(df['text']).toarray()
# Putting it back to the DataFrame.
df_new = pd.concat([df, pd.DataFrame(matrix)], axis=1)
column_names = dict(zip(range(len(words_sorted_by_index)), words_sorted_by_index))
df_new = df_new.rename(column_names, axis=1)
And write it to a CSV file:
df_new.to_csv('data-analogize.csv', index=False)
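To sanity-check the export, the file can be read back (a minimal usage sketch; the filename follows the snippet above):
import pandas as pd

# Round-trip check: the word-count columns should come back as written.
print(pd.read_csv('data-analogize.csv').head())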