How to calculate a column using the most common words calculated from another dataframe in Python?
Example dataframe:
import numpy as np
import pandas as pd
cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie', 'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'],
       'Days_Sold': [3, 4, 1, 2, 2, 1]}
cake = pd.DataFrame(data=cup)
cake
I calculated the most common words in the dataframe (with stop words removed):
from collections import Counter
Counter(" ".join(cake['Description']).split()).most_common()
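The snippet above counts every word; the stop-word removal mentioned in the text is not shown. A minimal sketch of one way to do it, assuming a hand-written stop-word set (`STOP_WORDS` is a hypothetical name, and treating "organic" as a stop word is purely for illustration):

```python
from collections import Counter

import pandas as pd

cake = pd.DataFrame({
    "Description": ["strawberry cupcake", "blueberry cupcake", "strawberry cookie",
                    "grape organic cookie", "blueberry organic cookie",
                    "lemon organic cupcake"],
    "Days_Sold": [3, 4, 1, 2, 2, 1],
})

# Hypothetical stop-word set, for illustration only.
STOP_WORDS = {"organic"}

# One word per row, lower-cased so the comparison is case-insensitive.
words = cake["Description"].str.lower().str.split().explode()

# Keep only the words that are not stop words, then count them.
filtered = words[~words.isin(STOP_WORDS)]
top_words = Counter(filtered).most_common()
print(top_words)
```

In a real project the set would more likely come from an NLP library's stop-word list than from a literal.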
I put the result into a new dataframe and reset the index:
count = pd.DataFrame(Counter(" ".join(cake['Description']).split()).most_common())
count.columns = ['Words', 'Values']
count.index= np.arange(1, len(count)+1)
count.head()
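The values are in the 'count' dataframe. Days_Sold is in the 'cake' dataframe. What I want to do now: whenever a common word from the 'count' dataframe appears in a description, for example cupcake, add up how long it took to sell those products according to the 'cake' dataframe, and then repeat this for every common word in 'count' until done. The answer for cupcake should be (3+4+1) 8.
My actual dataframe has over 3000 rows (and is not exactly about cakes). The descriptions are longer. I need more than 40 common words, and the number should be adjustable to my needs.
That is why I can't type in each word by hand. I believe this requires a 'nested for loop', but I am stuck: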
for day in cake:
    for top in count:
        top= count.Words
        day= cake.loc[cake['CleanDescr'] == count, ['Days_Sold']]
The error message is: 'int' object is not iterable
Thanks!
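For reference, the total asked for in the question (8 for cupcake) can be computed without a nested loop. The following is only a sketch under assumptions: `top_n`, `days_per_word` and `Total_Days_Sold` are names introduced here, and words are split on whitespace exactly as in the counting step.

```python
from collections import Counter

import pandas as pd

cake = pd.DataFrame({
    "Description": ["strawberry cupcake", "blueberry cupcake", "strawberry cookie",
                    "grape organic cookie", "blueberry organic cookie",
                    "lemon organic cupcake"],
    "Days_Sold": [3, 4, 1, 2, 2, 1],
})

# Word counts, built the same way as in the question.
count = pd.DataFrame(
    Counter(" ".join(cake["Description"]).split()).most_common(),
    columns=["Words", "Values"],
)

# One row per (original row, word); Days_Sold is repeated for each word.
exploded = cake.assign(Words=cake["Description"].str.split()).explode("Words")

# If a word occurs twice in one description, count that row only once.
exploded = exploded.reset_index().drop_duplicates(subset=["index", "Words"])

# Sum of Days_Sold over all rows whose description contains the word.
days_per_word = exploded.groupby("Words")["Days_Sold"].sum()

# Attach the totals to the top-N most common words (N is an assumed parameter).
top_n = 5
result = count.head(top_n).assign(
    Total_Days_Sold=lambda d: d["Words"].map(days_per_word)
)
print(result)
```

Changing `top_n` to 40 would cover the "40+ common words" case without typing any word by hand.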
Update:
Thank you all very much for helping me with this big project. I am posting my solution as #3, adapted from Mark Moretto's answer.
# Split and explode Description
df = cake.iloc[:, 0].str.lower().str.split(r"\W+").explode().reset_index()
df
# Merge counts to main DataFrame
# The word column is "Description" in df but "Words" in count
df_freq = pd.merge(df, count, left_on="Description", right_on="Words")
df_freq
# Left join cake DataFrame onto df_freq by index values.
df_freq = (pd.merge(df_freq, cake, left_on = "index", right_index = True)
           .loc[:, ["Description_x", "Values", "Days_Sold"]]
           .rename(columns={"Description_x": "Description"})
           )
df_freq
# Group by Description and return the mean of the value fields
df_metrics = df_freq.groupby("Description").mean().round(4)
df_metrics
df_metrics.head(5).sort_values(by='Values', ascending=False)
#print(df_metrics)
Given
cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie', 'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'],
'Days_Sold': [3, 4, 1, 2, 2, 1]}
cake = pd.DataFrame(data=cup)
count = pd.DataFrame(Counter(" ".join(cake['Description']).split()).most_common())
count.columns = ['Words', 'Values']
count.index= np.arange(1, len(count)+1)
Your final count dataframe looks like this:
        Words  Values
1     cupcake       3
2      cookie       3
3     organic       3
4  strawberry       2
5   blueberry       2
You can:
Convert the index into a column, see How to convert index of a pandas dataframe into a column
Then, re-index your count dataframe by Words
Finally, you can use .loc[<key>]['Values'] to get the number
count_by_words = count.reset_index().set_index('Words')
count_by_words.loc['cupcake']['Values']
count_by_words
The DataFrame will look like this:
            index  Values
Words
cupcake         1       3
cookie          2       3
organic         3       3
strawberry      4       2
blueberry       5       2
grape           6       1
lemon           7       1
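Once the words are the index, lookups can take a few forms. A small sketch, assuming the same `count` construction as above; "brownie" is a made-up word used only to show what happens for a label that is absent:

```python
from collections import Counter

import pandas as pd

cake = pd.DataFrame({
    "Description": ["strawberry cupcake", "blueberry cupcake", "strawberry cookie",
                    "grape organic cookie", "blueberry organic cookie",
                    "lemon organic cupcake"],
    "Days_Sold": [3, 4, 1, 2, 2, 1],
})
count = pd.DataFrame(
    Counter(" ".join(cake["Description"]).split()).most_common(),
    columns=["Words", "Values"],
)
count.index = range(1, len(count) + 1)
count_by_words = count.reset_index().set_index("Words")

# Single value: row label and column label in one .loc call.
n_cupcake = count_by_words.loc["cupcake", "Values"]

# Several words at once: pass a list of labels.
subset = count_by_words.loc[["cupcake", "grape"], "Values"]

# reindex returns NaN for a missing word instead of raising KeyError.
safe = count_by_words["Values"].reindex(["cupcake", "brownie"])
print(n_cupcake, subset.tolist(), safe.tolist())
```

Passing both labels to a single `.loc` call avoids the chained indexing of `.loc[...][...]`.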
If the goal is to estimate the maximum days sold based on the words in the description, you can try:
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie', 'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'],
       'Days_Sold': [3, 4, 1, 2, 2, 1]}
df = pd.DataFrame(data=cup)
word_counter = Counter()       # Keeps track of the word count
word_days = defaultdict(list)  # Keeps track of every days_sold value seen per word
max_days = {}                  # Keeps track of the max days sold per word
# Iterate one row at a time.
for _, s in df.iterrows():
    words = s['Description'].split()
    word_counter += Counter(words)
    for word in words:
        # Keep track of the different days_sold values for a specific word.
        word_days[word].append(s['Days_Sold'])
        # If the stored max for the word is lower than the row's days_sold
        if max_days.get(word, 0) < s['Days_Sold']:
            # Set the max_days for the word to the current days_sold
            max_days[word] = s['Days_Sold']
df2 = pd.DataFrame({'max_days_sold': max_days, 'word_count': word_counter})
df2.loc['strawberry', 'max_days_sold']
[Output]:
            max_days_sold  word_count
strawberry              3           2
cupcake                 4           3
blueberry               4           2
cookie                  2           3
grape                   2           1
organic                 2           3
lemon                   1           1
Another way, although I didn't really filter out any words by frequency or anything.
# <...your starter code for dataframe creation...>
# Split and explode Description
df = cake.iloc[:, 0].str.lower().str.split(r"\W+").explode().reset_index()
# Get count of words
df_counts = (df.groupby("Description")
             .size()
             .reset_index()
             .rename(columns={0: "word_count"})
             )
# Merge counts to main DataFrame
df_freq = pd.merge(df, df_counts, on="Description")
# Left join cake DataFrame onto df_freq by index values.
df_freq = (pd.merge(df_freq, cake, left_on = "index", right_index = True)
           .loc[:, ["Description_x", "word_count", "Days_Sold"]]
           .rename(columns={"Description_x": "Description"})
           )
# Group by Description and return max result for value fields
df_metrics = df_freq.groupby("Description").max()
print(df_metrics)
Output:
             word_count  Days_Sold
Description
blueberry             2          4
cookie                3          2
cupcake               3          4
grape                 1          2
lemon                 1          1
organic               3          2
strawberry            2          3
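The merge steps above can also be collapsed into one `groupby` with named aggregation, which makes it easy to keep only the most frequent words. This is a sketch, not the answer author's code: the names `long_df`, `metrics`, `total_days`, `max_days` and `top_n` are assumptions introduced here.

```python
import pandas as pd

cake = pd.DataFrame({
    "Description": ["strawberry cupcake", "blueberry cupcake", "strawberry cookie",
                    "grape organic cookie", "blueberry organic cookie",
                    "lemon organic cupcake"],
    "Days_Sold": [3, 4, 1, 2, 2, 1],
})

# One row per word, with the Days_Sold of the row it came from.
long_df = cake.assign(
    Word=cake["Description"].str.lower().str.split(r"\W+")
).explode("Word")

# Count, sum and max per word in a single pass.
metrics = (
    long_df.groupby("Word")
    .agg(
        word_count=("Days_Sold", "size"),
        total_days=("Days_Sold", "sum"),
        max_days=("Days_Sold", "max"),
    )
    .sort_values("word_count", ascending=False)
)

# Keep only the N most frequent words (N is an assumed parameter).
top_n = 3
print(metrics.head(top_n))
```

Note that splitting on `\W+` can produce empty strings when a description starts or ends with punctuation; those would need to be filtered out on real data.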