Loop through a transaction file to derive average prices for products

Question

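I'm working with a data file containing product sales from various chain stores, e.g. supermarkets (taken from this dataset, in case anyone is familiar with it). The file contains several fields:

id - unique customer ID
chain - ID of the store chain
dept - an aggregate grouping of the category (e.g. water)
category - the product category (e.g. sparkling water)
company - ID of the company that sells the item
brand - ID of the brand the item belongs to
date - the date of purchase
productsize - the amount of product purchased (e.g. 16 oz of water)
productmeasure - the units of the product purchased (e.g. ounces)
purchasequantity - the number of units purchased
purchaseamount - the amount paid for the purchase
productprice - the price of the product (derived as purchaseamount / purchasequantity)

I want to calculate the average price of each product across the whole transaction dataset. For this exercise, I assume that I can define a unique product by the fields category, brand and productsize, so that any unique product corresponds to a unique combination of these 3 fields.

So, first I identify the unique items in the dataset to get a list of all products: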

library(dplyr)  # provides %>%, select() and filter()
# transactions is the name of the data frame
items <- unique(transactions %>% select(category, brand, productsize))

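I can now use this as a lookup table to pick each unique product out of the transaction dataset and derive the average price for each product.

Since I'm a novice, I only managed to get this working with a (not very elegant) for loop: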

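items$meanvalue <- NA_real_  # preallocate the result column
for (i in seq_len(nrow(items))) {
  # narrow the transactions down to the i-th product, one key at a time
  temp1 <- filter(transactions, category == items$category[i])
  temp2 <- filter(temp1, brand == items$brand[i])
  temp3 <- filter(temp2, productsize == items$productsize[i])
  # average price over the matching transactions
  items$meanvalue[i] <- mean(temp3$productprice)
}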

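This works, but of course it is slow. The transactions data frame has 480612 entries and the items data frame has 56658 entries. I have no experience with large datasets, but I'm sure the problem lies in the code rather than the size.

A sample file (300 rows) is on pastebin.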
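For scale: every pass of the loop filters the whole transactions data frame, so the work grows roughly with nrow(items) × nrow(transactions), which is at least 56658 × 480612 ≈ 2.7 × 10^10 row comparisons here. A single grouped pass avoids that. The sketch below uses base R's aggregate() on a made-up toy data frame (the values are invented for illustration, not taken from the dataset), so treat it as one possible approach rather than a definitive one.

```r
# Toy stand-in for `transactions`; all values are made up for illustration.
transactions <- data.frame(
  category     = c(1, 1, 1, 2, 2),
  brand        = c(10, 10, 11, 20, 20),
  productsize  = c(16, 16, 16, 12, 12),
  productprice = c(1.00, 2.00, 3.00, 4.00, 6.00)
)

# One pass over the data: mean productprice for every
# (category, brand, productsize) combination that occurs.
avg_base <- aggregate(
  productprice ~ category + brand + productsize,
  data = transactions,
  FUN  = mean
)

# Rename the aggregated column to match the loop's `meanvalue`.
names(avg_base)[names(avg_base) == "productprice"] <- "meanvalue"
print(avg_base)
```

The result has one row per product, which is the same table the loop builds in items.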

Edit: I found that summarise works well for this!

avgPrice <- transactions %>%
  group_by(category, brand, productsize) %>%
  summarise(avgPrice = mean(productprice))
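One way to gain confidence that the grouped version returns the same numbers as the loop is to run both on a small input and compare. The sketch below does that on made-up toy data (the values are assumptions for illustration only); the `.groups = "drop"` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy stand-in for `transactions`; all values are made up for illustration.
transactions <- data.frame(
  category     = c(1, 1, 1, 2, 2),
  brand        = c(10, 10, 11, 20, 20),
  productsize  = c(16, 16, 16, 12, 12),
  productprice = c(1.00, 2.00, 3.00, 4.00, 6.00)
)

# Grouped version: one row per (category, brand, productsize).
grouped <- transactions %>%
  group_by(category, brand, productsize) %>%
  summarise(avgPrice = mean(productprice), .groups = "drop")

# Loop version, equivalent in intent to the loop in the question.
items <- distinct(transactions, category, brand, productsize)
items$meanvalue <- NA_real_
for (i in seq_len(nrow(items))) {
  rows <- transactions$category == items$category[i] &
    transactions$brand == items$brand[i] &
    transactions$productsize == items$productsize[i]
  items$meanvalue[i] <- mean(transactions$productprice[rows])
}

# Align the two results on the key columns and compare them.
check <- left_join(items, grouped, by = c("category", "brand", "productsize"))
print(all.equal(check$meanvalue, check$avgPrice))
```

Joining on the three key columns before comparing means the check does not depend on the row order of either result.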

Answer 1

Since R is vectorized, this should be a lot faster than using a for loop!

# library(tidyverse) # if needed

# get item combinations: one row per (category, brand, productsize)
# distinct() keeps only the key columns, so the join below does not
# create duplicated columns such as productprice.x / productprice.y
itemCombs <- transactions %>%
  distinct(category, brand, productsize) %>%
  mutate(item = row_number()) %>%
  select(item, everything())

# append item combinations to original dataset and calculate avg price per item
# (the result keeps one row per transaction)
avgPrice <- transactions %>%
  left_join(itemCombs, by = c("category", "brand", "productsize")) %>%
  select(item, productprice) %>%
  arrange(item) %>%
  group_by(item) %>%
  mutate(nItems = n(),
         sumPrice = sum(productprice)) %>%
  ungroup() %>%
  mutate(avgPrice = sumPrice / nItems)
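The intermediate item id is not strictly needed: grouping directly on the three key columns should give the same per-transaction averages without a join. The following is a minimal sketch on made-up toy data (values invented for illustration); perTransaction and perProduct are hypothetical names, not part of the code above.

```r
library(dplyr)

# Toy stand-in for `transactions`; all values are made up for illustration.
transactions <- data.frame(
  category     = c(1, 1, 1, 2, 2),
  brand        = c(10, 10, 11, 20, 20),
  productsize  = c(16, 16, 16, 12, 12),
  productprice = c(1.00, 2.00, 3.00, 4.00, 6.00)
)

# Attach each product's count and average price to every one of its rows.
perTransaction <- transactions %>%
  group_by(category, brand, productsize) %>%
  mutate(nItems = n(),
         avgPrice = sum(productprice) / nItems) %>%
  ungroup()

# Collapse to one row per product when only the price table is needed.
perProduct <- perTransaction %>%
  distinct(category, brand, productsize, nItems, avgPrice)

print(perProduct)
```

Keeping the per-transaction table is handy for comparing each purchase with its product's average; the collapsed table corresponds to what the loop in the question produces.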


loops

r

vectorization