Is there a function to get counts in multiple columns from multiple datasets?
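I have 2 columns of postal codes. One represents my orders and the other represents the issues reported for those orders; the two are in separate datasets.
My orders dataset has a postal code column: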
B0E1H0
B3M0G4
B3K6R6
B3L1J7
B0E1H0
B3K3M2
B3K2Z8
B0E1H0
B3K6R6
B0E1H0
My reported issues dataset has a postal code column:
B3K6R6
B3K6R6
B0E1H0
B0E1H0
B3L1J7
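I'd like to end up with a data frame that has a list of the unique postal codes, the volume, the number of issues, and the issue proportion per postal code, like below: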
Postal code, Volume, Issues, Issue %
B0E1H0, 4, 2, 50%
B3K2Z8, 1, 0, 0%
B3K3M2, 1, 0, 0%
B3K6R6, 2, 2, 100%
B3L1J7, 1, 1, 100%
B3M0G4, 1, 0, 0%
I'm able to get columns 1 and 2 (postal code and volume) by doing the following:
library(plyr)  # count() and rename() as used below are the plyr versions
orders <- read.csv("G:/My Drive/R/R Data/Stuff/Text File/Orders.csv", header = TRUE)
pcvec <- as.vector(orders["Postal.Code"])
unipc <- unique(pcvec,incomparables = F)
unipcvec <- as.vector(unipc)
pccount <- count(orders, "Postal.Code")
nrow(unipc)
x <- data.frame(pccount)
x <- rename(x, c("freq" = "Volume"))
x
Postal.Code Volume
1 B0C1H0 1
2 B0E1B0 3
3 B0E1H0 7
4 B0E1L0 1
5 B0E1N0 1
6 B0E1P0 1
7 B0E1V0 1
8 B0E1W0 1
9 B0E2K0 1
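I have about 5000 rows in my volume dataset and about 300 in my issues dataset. Is it possible to do this easily?
Sorry if I don't have the terminology right; let me know if I can clarify any of this.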
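One way with dplyr, assuming the two data frames are called df1 and df2 and the column is called V1 in both datasets. We count the frequency of each postal code in both data frames and join them on the V1 column, replace the non-matching entries with 0, and calculate the issue percentage by dividing Issues by Volume.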
library(dplyr)
df1 %>%
count(V1) %>%
left_join(df2 %>% count(V1), by = "V1") %>%
rename_all(~c("Postal_Code", "Volume", "Issues")) %>%
tidyr::replace_na(list(Issues = 0)) %>%
mutate(Issue_perc = Issues/Volume * 100)
# A tibble: 6 x 4
# Postal_Code Volume Issues Issue_perc
# <chr> <int> <dbl> <dbl>
#1 B0E1H0 4 2 50
#2 B3K2Z8 1 0 0
#3 B3K3M2 1 0 0
#4 B3K6R6 2 2 100
#5 B3L1J7 1 1 100
#6 B3M0G4 1 0 0
With dplyr it is easy to perform such operations by chaining. Otherwise, we can also do the same using only base R:
temp_df <- merge(stack(table(df1)), stack(table(df2)), by = "ind", all.x = TRUE)
temp_df$values.y[is.na(temp_df$values.y)] <- 0
temp_df$Issue_perc <- temp_df$values.y/temp_df$values.x * 100
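The base R snippet above leaves the result with the columns ind, values.x and values.y. As a rough sketch of the same count, left join, fill and divide steps with readable column names, something like the following might work. The names orders, issues, vol, iss and res are only illustrative, and the sample data is copied from the question:

```r
# Sample data copied from the question (object names are illustrative)
orders <- data.frame(V1 = c("B0E1H0", "B3M0G4", "B3K6R6", "B3L1J7", "B0E1H0",
                            "B3K3M2", "B3K2Z8", "B0E1H0", "B3K6R6", "B0E1H0"))
issues <- data.frame(V1 = c("B3K6R6", "B3K6R6", "B0E1H0", "B0E1H0", "B3L1J7"))

# Count each postal code; a 1-D table converts to columns Var1 and Freq
vol <- as.data.frame(table(orders$V1), stringsAsFactors = FALSE)
iss <- as.data.frame(table(issues$V1), stringsAsFactors = FALSE)
names(vol) <- c("Postal_Code", "Volume")
names(iss) <- c("Postal_Code", "Issues")

# Left join keeps every ordered postal code; codes without issues get NA
res <- merge(vol, iss, by = "Postal_Code", all.x = TRUE)
res$Issues[is.na(res$Issues)] <- 0

# Issue percentage per postal code
res$Issue_perc <- res$Issues / res$Volume * 100
res
```

With the real files, the column would presumably be Postal.Code rather than V1, so only the two table() calls would need to change.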
data
df1 <- structure(list(V1 = c("B0E1H0", "B3M0G4", "B3K6R6", "B3L1J7",
"B0E1H0", "B3K3M2", "B3K2Z8", "B0E1H0", "B3K6R6", "B0E1H0")), row.names
= c(NA, -10L), class = "data.frame")
df2 <- structure(list(V1 = c("B3K6R6", "B3K6R6", "B0E1H0", "B0E1H0",
"B3L1J7")), row.names = c(NA, -5L), class = "data.frame")
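Here is an option with data.table. Convert the 'data.frame' to 'data.table' (setDT(df1), setDT(df2)), get the number of rows (.N) by 'V1', do a join on 'V1', then divide the two count columns to get the percentage, while assigning 0 where the value is NA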
library(data.table)
setnames(setDT(df1)[, .N, V1][setDT(df2)[, .N, V1],
Issues := i.N, on = .(V1)][, Issue_perc:= Issues/N * 100][is.na(Issues),
c('Issues', 'Issue_perc') := 0], 'N', 'Volume')[]
# V1 Volume Issues Issue_perc
#1: B0E1H0 4 2 50
#2: B3M0G4 1 0 0
#3: B3K6R6 2 2 100
#4: B3L1J7 1 1 100
#5: B3K3M2 1 0 0
#6: B3K2Z8 1 0 0
Or another option with dcast
dcast(rbindlist(list(df1, df2), idcol = 'grp')[, .N, .(grp, V1)],
V1 ~ c("Volume", "Issues")[grp], value.var = "N", fill = 0)[,
Issue_perc := Issues/Volume * 100][]
# V1 Issues Volume Issue_perc
#1: B0E1H0 2 4 50
#2: B3K2Z8 0 1 0
#3: B3K3M2 0 1 0
#4: B3K6R6 2 2 100
#5: B3L1J7 1 1 100
#6: B3M0G4 0 1 0
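Or using base R, we create the union of the elements in the 'V1' columns of the two datasets, then convert each column to a factor with the levels specified as 'lvls', get the table, do a merge, and transform to create the 'Issue_perc' column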
lvls <- union(df1$V1, df2$V1)
transform(merge(as.data.frame(table(factor(df1$V1, levels = lvls))),
as.data.frame(table(factor(df2$V1, levels = lvls))), by = 'Var1'),
Issue_perc = Freq.y/Freq.x * 100)
# Var1 Freq.x Freq.y Issue_perc
#1 B0E1H0 4 2 50
#2 B3K2Z8 1 0 0
#3 B3K3M2 1 0 0
#4 B3K6R6 2 2 100
#5 B3L1J7 1 1 100
#6 B3M0G4 1 0 0
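One thing to watch with the union of levels: a postal code that appears only in the issues data gets a Volume of 0, so the division yields Inf. A minimal sketch of guarding against that, assuming NA is an acceptable placeholder in that case ("X0X0X0" is a made-up code used purely for illustration, and the vectors below are not the question's data):

```r
# Tiny illustrative vectors; "X0X0X0" is a made-up code absent from orders
orders <- c("B0E1H0", "B0E1H0", "B3K6R6")
issues <- c("B0E1H0", "X0X0X0")

# Shared levels so both tables line up row by row
lvls <- union(orders, issues)
out <- data.frame(
  Postal_Code = lvls,
  Volume = as.vector(table(factor(orders, levels = lvls))),
  Issues = as.vector(table(factor(issues, levels = lvls)))
)

# Guard the division: report NA instead of Inf when there are no orders
out$Issue_perc <- ifelse(out$Volume > 0, out$Issues / out$Volume * 100, NA)
out
```

Whether NA, 0, or dropping such rows is the right choice depends on how the issues data relates to the orders data.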
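Or an option with tidyverse, where we place the datasets in a list with list, loop over it with map, convert 'V1' to a factor with the levels specified earlier, reduce the list to a single data.frame by doing an inner_join, and then create the percentage column with mutate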
library(tidyverse)
list(df1, df2) %>%
map(~ .x %>%
mutate(V1 = factor(V1, levels = lvls)) %>%
count(V1, .drop = FALSE)) %>%
reduce(inner_join, by = 'V1') %>%
mutate(Issue_perc = n.y/n.x * 100) %>%
rename_at(vars(matches('n\\.')), ~ c("Volume", "Issues"))
# A tibble: 6 x 4
# V1 Volume Issues Issue_perc
# <fct> <int> <int> <dbl>
#1 B0E1H0 4 2 50
#2 B3M0G4 1 0 0
#3 B3K6R6 2 2 100
#4 B3L1J7 1 1 100
#5 B3K3M2 1 0 0
#6 B3K2Z8 1 0 0
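Or a slightly different option is to place the datasets in a list, bind them together with a grouping column, count to get the frequencies, spread to 'wide' format, and then create the new 'perc' column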
list(df1, df2) %>%
bind_rows(.id = 'grp') %>%
count(grp, V1) %>%
mutate(grp = c("Volume", "Issues")[as.integer(grp)]) %>%
spread(grp, n, fill = 0) %>%
mutate(Issue_perc = Issues/Volume * 100)
# A tibble: 6 x 4
# V1 Issues Volume Issue_perc
# <chr> <dbl> <dbl> <dbl>
#1 B0E1H0 2 4 50
#2 B3K2Z8 0 1 0
#3 B3K3M2 0 1 0
#4 B3K6R6 2 2 100
#5 B3L1J7 1 1 100
#6 B3M0G4 0 1 0
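spread has since been superseded by pivot_wider in tidyr, so the same bind, count and reshape idea could be sketched roughly as below. This assumes tidyr 1.0.0 or later; res_wide is just an illustrative name, and naming the arguments to bind_rows lets the .id column carry the labels directly:

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question
df1 <- data.frame(V1 = c("B0E1H0", "B3M0G4", "B3K6R6", "B3L1J7", "B0E1H0",
                         "B3K3M2", "B3K2Z8", "B0E1H0", "B3K6R6", "B0E1H0"))
df2 <- data.frame(V1 = c("B3K6R6", "B3K6R6", "B0E1H0", "B0E1H0", "B3L1J7"))

res_wide <- bind_rows(Volume = df1, Issues = df2, .id = "grp") %>%
  count(grp, V1) %>%                                   # frequency per dataset and code
  pivot_wider(names_from = grp, values_from = n,
              values_fill = list(n = 0L)) %>%          # codes without issues get 0
  mutate(Issue_perc = Issues / Volume * 100) %>%
  select(V1, Volume, Issues, Issue_perc) %>%
  arrange(V1)                                          # fixed row order
res_wide
```

The explicit arrange and select are there only to make the row and column order predictable.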