How to calculate a column using the most common words calculated from another dataframe in Python?

Question

Example dataframe:

import numpy as np
import pandas as pd

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie', 'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'],
       'Days_Sold': [3, 4, 1, 2, 2, 1]}

cake = pd.DataFrame(data=cup)

cake

I calculated the most common words in the dataframe (with stop words removed):

from collections import Counter

Counter(" ".join(cake['Description']).split()).most_common()
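The snippet above does not show the stop-word removal itself. A minimal sketch of how it might look is below; the STOP_WORDS set is a hypothetical placeholder (not from the original post), and "organic" is included only so the filter has a visible effect on this sample.

```python
from collections import Counter

import pandas as pd

cake = pd.DataFrame({
    "Description": ["strawberry cupcake", "blueberry cupcake", "strawberry cookie",
                    "grape organic cookie", "blueberry organic cookie",
                    "lemon organic cupcake"],
    "Days_Sold": [3, 4, 1, 2, 2, 1],
})

# Hypothetical stop-word set for illustration; substitute your own list.
STOP_WORDS = {"the", "and", "with", "organic"}

# Flatten all descriptions into one list of lowercase tokens.
words = " ".join(cake["Description"]).lower().split()

# Drop stop words before counting.
filtered = [w for w in words if w not in STOP_WORDS]
top_words = Counter(filtered).most_common()
```

In a real project the stop-word list would more likely come from an NLP library or a hand-curated file than from a literal set.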

I put the result into a new dataframe and reset the index:

count = pd.DataFrame(Counter(" ".join(cake['Description']).split()).most_common())

count.columns = ['Words', 'Values']

count.index= np.arange(1, len(count)+1)

count.head()

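The values are in the 'count' dataframe, and Days_Sold is in the 'cake' dataframe. What I want to do now: for each common word in the 'count' dataframe, such as cupcake, work out how many days it took to sell the products containing that word, using the 'cake' dataframe, and then repeat this for every common word in 'count' until done. The answer for cupcake should be (3+4+1) 8.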
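For a single word typed in by hand, the calculation described above could be sketched as follows. This only illustrates the expected arithmetic for "cupcake"; the variable names `word`, `has_word`, and `total_days` are made up for the example.

```python
import pandas as pd

cake = pd.DataFrame({
    "Description": ["strawberry cupcake", "blueberry cupcake", "strawberry cookie",
                    "grape organic cookie", "blueberry organic cookie",
                    "lemon organic cupcake"],
    "Days_Sold": [3, 4, 1, 2, 2, 1],
})

word = "cupcake"

# True for rows whose description contains the word as a whole token.
has_word = cake["Description"].str.split().apply(lambda tokens: word in tokens)

# Sum Days_Sold over the matching rows: 3 + 4 + 1.
total_days = cake.loc[has_word, "Days_Sold"].sum()
print(total_days)  # 8
```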

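My actual dataframe has more than 3000 rows (and is not exactly about cakes). The descriptions are longer. I need more than 40 common words, adjusted to my needs.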

That is why I can't type in every word by hand. I believe this requires a nested for loop, but I'm stuck.

for day in cake:
    for top in count:
        top = count.Words
    day = cake.loc[cake['CleanDescr'] == count, ['Days_Sold']]

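Error message: 'int' object is not iterable

Thanks!

Update:

Thank you all so much for helping me with this big project. I am posting my solution to #3, adapted from Mark Moretto's answer.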

# Split and explode Description
df = cake.iloc[:, 0].str.lower().str.split(r"\W+").explode().reset_index()
df

# Merge counts to main DataFrame ('count' stores the words in its 'Words' column)
df_freq = pd.merge(df, count, left_on="Description", right_on="Words")
df_freq

# Left join cake DataFrame onto df_freq by index values.
df_freq = (pd.merge(df_freq, cake, left_on = "index", right_index = True)
            .loc[:, ["Description_x", "Values", "Days_Sold"]]
            .rename(columns={"Description_x": "Description"})
            )
df_freq

# Group by Description and return the mean of the value fields
df_metrics = df_freq.groupby("Description").mean().round(4)
df_metrics

# Sort first, then take the top 5, so the most frequent words are shown
df_metrics.sort_values(by='Values', ascending=False).head(5)
#print(df_metrics)

Answer 1

Given

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie', 'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'], 
'Days_Sold': [3, 4, 1, 2, 2, 1]}

cake = pd.DataFrame(data=cup)

count = pd.DataFrame(Counter(" ".join(cake['Description']).split()).most_common())

count.columns = ['Words', 'Values']

count.index= np.arange(1, len(count)+1)

Your final count dataframe looks like this (first five rows):

   Words       Values
1  cupcake     3
2  cookie      3
3  organic     3
4  strawberry  2
5  blueberry   2

You can:

Convert the index into a column (see "How to convert index of a pandas dataframe into a column").
Then re-index your count dataframe by Words.
Finally, use .loc[<key>, 'Values'] to get the number for a given word.

# Keep the old index as a column, then index by word
count_by_words = count.reset_index().set_index('Words')
count_by_words.loc['cupcake', 'Values']

The count_by_words DataFrame will look like this:

            index  Values
Words
cupcake         1       3
cookie          2       3
organic         3       3
strawberry      4       2
blueberry       5       2
grape           6       1
lemon           7       1
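Once the frame is indexed by word, several words can be looked up in one call. The sketch below rebuilds the frames so it runs on its own; "brownie" is a made-up word used only to show what happens with a missing key.

```python
from collections import Counter

import pandas as pd

cake = pd.DataFrame({
    "Description": ["strawberry cupcake", "blueberry cupcake", "strawberry cookie",
                    "grape organic cookie", "blueberry organic cookie",
                    "lemon organic cupcake"],
    "Days_Sold": [3, 4, 1, 2, 2, 1],
})

count = pd.DataFrame(
    Counter(" ".join(cake["Description"]).split()).most_common(),
    columns=["Words", "Values"],
)
count_by_words = count.set_index("Words")

# Look up several words at once; the result is a Series indexed by word.
subset = count_by_words.loc[["cupcake", "strawberry"], "Values"]

# .loc raises KeyError for a word that is not in the index;
# reindex returns NaN for it instead.
safe = count_by_words["Values"].reindex(["cupcake", "brownie"])
```

Using `reindex` may be the more forgiving choice when the word list comes from somewhere other than the counted descriptions.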

If the goal is to estimate the maximum number of days sold based on the words in the description, you can try:

import pandas as pd
import numpy as np

from collections import Counter, defaultdict

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie', 'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'], 
'Days_Sold': [3, 4, 1, 2, 2, 1]}

df = pd.DataFrame(data=cup)

word_counter = Counter()       # Running count of each word
word_days = defaultdict(list)  # Every Days_Sold value seen for each word
max_days = {}                  # Largest Days_Sold value seen for each word

# Iterate one row at a time.
for _, s in df.iterrows():
    words = s['Description'].split()
    word_counter += Counter(words)
    for word in words:
        # Record this row's Days_Sold under the word.
        word_days[word].append(s['Days_Sold'])
        # The word's max is the largest value recorded so far.
        max_days[word] = max(word_days[word])

df2 = pd.DataFrame({'max_days_sold': max_days, 'word_count': word_counter})
print(df2)

df2.loc['strawberry', 'max_days_sold']

[Output]:

            max_days_sold  word_count
strawberry              3           2
cupcake                 4           3
blueberry               4           2
cookie                  2           3
grape                   2           1
organic                 2           3
lemon                   1           1
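The same table can probably be produced without an explicit Python loop. The sketch below is an alternative, not the answer's own method: it splits each description into words, explodes to one row per word, and aggregates with groupby. The column name `word` is introduced here for illustration.

```python
import pandas as pd

cake = pd.DataFrame({
    "Description": ["strawberry cupcake", "blueberry cupcake", "strawberry cookie",
                    "grape organic cookie", "blueberry organic cookie",
                    "lemon organic cupcake"],
    "Days_Sold": [3, 4, 1, 2, 2, 1],
})

# One row per (original row, word) pair; Days_Sold is repeated for each word.
words = cake.assign(word=cake["Description"].str.split()).explode("word")

# Per word: the largest Days_Sold and the number of occurrences.
df2 = (words.groupby("word", sort=False)
            .agg(max_days_sold=("Days_Sold", "max"),
                 word_count=("Days_Sold", "count")))
```

On a few thousand rows either version should be fast enough; the groupby form mainly saves code.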

Answer 2

Another way, although I didn't actually filter out any words by frequency or anything like that.

# <...your starter code for dataframe creation...>

# Split and explode Description
df = cake.iloc[:, 0].str.lower().str.split(r"\W+").explode().reset_index()

# Get count of words
df_counts = (df.groupby("Description")
            .size()
            .reset_index()
            .rename(columns={0: "word_count"})
            )

# Merge counts to main DataFrame
df_freq = pd.merge(df, df_counts, on="Description")

# Left join cake DataFrame onto df_freq by index values.
df_freq = (pd.merge(df_freq, cake, left_on = "index", right_index = True)
            .loc[:, ["Description_x", "word_count", "Days_Sold"]]
            .rename(columns={"Description_x": "Description"})
            )

# Group by Description and return max result for value fields
df_metrics = df_freq.groupby("Description").max()

print(df_metrics)

Output:

             word_count  Days_Sold
Description
blueberry             2          4
cookie                3          2
cupcake               3          4
grape                 1          2
lemon                 1          1
organic               3          2
strawberry            2          3
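The question asks for the sum of Days_Sold per common word (cupcake should give 8) rather than the max. A minimal sketch of that variant, restricted to the most frequent words, follows. TOP_N = 3 is an assumption chosen for this small sample (the question mentions 40 or more), and the column name `word` is illustrative.

```python
from collections import Counter

import pandas as pd

cake = pd.DataFrame({
    "Description": ["strawberry cupcake", "blueberry cupcake", "strawberry cookie",
                    "grape organic cookie", "blueberry organic cookie",
                    "lemon organic cupcake"],
    "Days_Sold": [3, 4, 1, 2, 2, 1],
})

TOP_N = 3  # assumption for the sample data; adjust as needed

# One row per (original row, word) pair.
words = cake.assign(word=cake["Description"].str.lower().str.split()).explode("word")

# The TOP_N most frequent words across all descriptions.
top_words = [w for w, _ in Counter(words["word"]).most_common(TOP_N)]

# Sum Days_Sold over every row containing each top word.
days_by_word = (words[words["word"].isin(top_words)]
                .groupby("word")["Days_Sold"]
                .sum()
                .reindex(top_words))
print(days_by_word)
```

Note that a word repeated within one description would contribute that row's Days_Sold once per repetition; dropping duplicate (row, word) pairs before grouping would avoid that if it matters.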

python

nlp

numpy

dataframe

pandas