Aggregate by multiple columns, sum one column and keep other columns? Create new column based on aggregated values?
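I have a data frame of sales. I need to aggregate df by two columns, ProductID and Day, and sum the values of a different column, Amount, within each group so that it shows the total. I want to keep the other columns that can be grouped (same value across the rows of a group), which in this example is just Product. The last column, Store, will not be kept because its values can differ within a group. However, I need to add a column UniqueStores that counts the number of unique stores in each group with the same ProductID and Day. For example, the first group, with ProductID = 1 and Day = Monday, has one unique store, "N", so the value is 1.
I tried to draft the table as text here but could not get the formatting right, so here is a picture of how it looks before aggregation:
I have already tried aggregating with group_by + summarize and with df[sum, by], but they do not keep variables that are not given as indexes. Is there a workaround that does not require manually listing every column that should be kept?
Thanks in advance, I hope I made myself clear.
Input: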
df <- data.frame(
  ProductID = c(1, 1, 1, 1, 2, 2, 2, 2),
  Day = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday", "Wednesday", "Friday", "Friday"),
  Amount = c(5, 5, 3, 7, 6, 9, 5, 2),
  Product = c("Food", "Food", "Food", "Food", "Toys", "Toys", "Toys", "Toys"),
  Store = c("N", "N", "W", "N", "S", "W", "S", "S")
)
We can do a grouped operation in dplyr: group by the key columns, then in summarise take the sum of 'Amount' and the n_distinct (number of distinct elements) of 'Store'.
library(dplyr)
df %>%
  group_by(ProductID, Day, Product) %>%
  summarise(Amount = sum(Amount),
            UniqueStores = n_distinct(Store), .groups = 'drop')
# A tibble: 4 x 5
#   ProductID Day       Product Amount UniqueStores
#       <dbl> <chr>     <chr>    <dbl>        <int>
#1          1 Monday    Food        10            1
#2          1 Tuesday   Food        10            2
#3          2 Friday    Toys         7            1
#4          2 Wednesday Toys        15            2
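If there are many columns and we want to aggregate only a subset of them while keeping the rest, one option is to mutate within the groups and then use distinct to keep the first row of each group. Columns that vary within a group, such as Store, then hold the value from that first row.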
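df %>%
  group_by(ProductID, Day, Product) %>%
  # overwrite Amount with the group total and add the store count on every row
  mutate(Amount = sum(Amount),
         UniqueStores = n_distinct(Store), .keep = 'all') %>%
  ungroup() %>%
  # keep one row per group together with all remaining columns
  distinct(ProductID, Day, Product, .keep_all = TRUE)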
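To avoid typing every column that should be kept, one possible approach is to detect the columns that are constant within each (ProductID, Day) group and add them to the grouping programmatically. The following is only a sketch under that assumption, for dplyr 1.0 or later; the names keys, constant_cols and result are illustrative and not part of the answer.

```r
library(dplyr)

df <- data.frame(
  ProductID = c(1, 1, 1, 1, 2, 2, 2, 2),
  Day = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday", "Wednesday", "Friday", "Friday"),
  Amount = c(5, 5, 3, 7, 6, 9, 5, 2),
  Product = c("Food", "Food", "Food", "Food", "Toys", "Toys", "Toys", "Toys"),
  Store = c("N", "N", "W", "N", "S", "W", "S", "S")
)

keys <- c("ProductID", "Day")  # grouping columns (illustrative name)

# Count the distinct values of every non-key column inside each group,
# then keep the names of columns that never have more than one value.
constant_cols <- df %>%
  group_by(across(all_of(keys))) %>%
  summarise(across(everything(), n_distinct), .groups = "drop") %>%
  select(-all_of(keys)) %>%
  select(where(~ all(.x == 1))) %>%
  names()

# Amount is the column being summed, so it is never carried along as a key.
constant_cols <- setdiff(constant_cols, "Amount")

result <- df %>%
  group_by(across(all_of(c(keys, constant_cols)))) %>%
  summarise(Amount = sum(Amount),
            UniqueStores = n_distinct(Store),
            .groups = "drop")
```

With the sample data, constant_cols contains only "Product", so result has the same columns as the summarise call earlier in this answer.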
With data.table:
library(data.table)
setDT(df)[, .(Amount = sum(Amount, na.rm = TRUE),
              UniqueStores = uniqueN(Store, na.rm = TRUE)),
          by = .(ProductID, Day, Product)
]
Output:
   ProductID       Day Product Amount UniqueStores
1:         1    Monday    Food     10            1
2:         1   Tuesday    Food     10            2
3:         2 Wednesday    Toys     15            2
4:         2    Friday    Toys      7            1
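The rows come out in a different order than in the dplyr result because by keeps groups in order of first appearance, while summarise sorts them by the grouping columns. If sorted output is wanted, keyby can be used in place of by. The sketch below also uses as.data.table to work on a copy, since setDT converts df in place; the names dt and result are illustrative.

```r
library(data.table)

df <- data.frame(
  ProductID = c(1, 1, 1, 1, 2, 2, 2, 2),
  Day = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday", "Wednesday", "Friday", "Friday"),
  Amount = c(5, 5, 3, 7, 6, 9, 5, 2),
  Product = c("Food", "Food", "Food", "Food", "Toys", "Toys", "Toys", "Toys"),
  Store = c("N", "N", "W", "N", "S", "W", "S", "S")
)

dt <- as.data.table(df)  # a copy; df itself stays a plain data.frame

# keyby groups like by, but also sorts the result by the grouping columns
result <- dt[, .(Amount = sum(Amount),
                 UniqueStores = uniqueN(Store)),
             keyby = .(ProductID, Day, Product)]
```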