Count labels per site and create a summary table in R

Question

Here is a portion of a dataset similar to mine:

require(dplyr)
alldata
site    date    percent_rank    Label
01A  2013-01-01    0.32         Normal
01B  2013-01-01    0.12         Low
01C  2013-01-01    0.76         High
02A  2013-01-01     0           N/A
02B  2013-01-01    0.16         Low
02C  2013-01-01    0.5          Normal
01A  2013-01-02    0.67         Normal
01B  2013-01-02    0.01         Low
01C  2013-01-02    0.92         High

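I have assigned a label to each percent_rank based on its value (three categories: 0 to 0.25, 0.25 to 0.75, and 0.75 to 1). I would now like to produce a summary table in this format: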

site  Low  Normal  High  Missing
01A   32   47      92    194
01B   232  23      17    93
01C   82   265     12    6

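where, for each site, the occurrences of the Low, Normal, and High labels are counted across all dates (there is one record for each day of the year), and the N/A values are counted in the "Missing" column.

I have tried the following: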

alldata <- alldata %>% group_by(site) %>% mutate(length(Label == "Low"))

which returns the total number of records rather than the count of "Low" for each site, and

alldata <- alldata %>% group_by(site) %>% mutate(length(which(Label == "Low")))

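which returns a value several thousand higher than the total number of records. My idea was to repeat this call to create four new columns, using four separate mutate lines (one per category), which would produce my summary table. I have also tried some variations of aggregate(), though I am less clear on which function to pass for what I want. This seems like it should be a very simple thing to do (and group_by was a great help in computing the percent ranks and the associated labels), but I have not found a solution yet. Any tips are much appreciated!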

Answer 1

We can use dcast from data.table, which also takes a fun.aggregate argument and is very fast.

library(data.table)
dcast(setDT(alldata), site~Label, length)

Or using dplyr/tidyr

library(dplyr)
library(tidyr)
alldata %>%
    group_by(site, Label) %>%
    tally() %>%
    spread(Label, n)
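
In newer tidyr releases, spread() is superseded by pivot_wider(). The following is a minimal sketch of the same count-then-reshape idea, assuming a small made-up sample that reuses the question's site and Label columns; the object name summary_wide is hypothetical. Passing values_fill replaces the NA that would otherwise appear for site/Label combinations that never occur.

```r
library(dplyr)
library(tidyr)

# Illustrative sample only: same columns as the question, nine rows
alldata <- data.frame(
  site  = c("01A", "01B", "01C", "02A", "02B", "02C", "01A", "01B", "01C"),
  Label = c("Normal", "Low", "High", "N/A", "Low", "Normal",
            "Normal", "Low", "High")
)

summary_wide <- alldata %>%
  count(site, Label) %>%                    # one row per site/Label pair, count in column n
  pivot_wider(names_from  = Label,          # one output column per label
              values_from = n,
              values_fill = list(n = 0))    # 0 instead of NA for absent combinations

summary_wide
```

The columns come out in the order the labels first appear, so a select() or a factor with explicit levels may be needed if a particular column order matters.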

A base R option is

reshape(aggregate(date ~ site + Label, alldata, length),
        idvar = "site", timevar = "Label", direction = "wide")

Answer 2

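There are three ways to do the counting in dplyr, with dcast from reshape2 then spreading the counts into one column per label. The first is the most verbose; the other two use convenience functions to shorten the code: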

library(reshape2)
library(dplyr)

alldata %>% group_by(site, Label) %>% summarise(n=n()) %>% dcast(site ~ Label)

alldata %>% group_by(site, Label) %>% tally %>% dcast(site ~ Label)

alldata %>% count(site, Label) %>% dcast(site ~ Label)
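
The approach attempted in the question can also be made to work without any reshaping. length(Label == "Low") returns the length of the logical vector, which is the number of rows in the group, whereas sum(Label == "Low") counts the TRUE values. Using summarise() instead of mutate() then collapses each site to a single row. This is a minimal sketch on a made-up sample; it assumes missing labels are stored as the literal string "N/A", and the name summary_tbl is hypothetical.

```r
library(dplyr)

# Illustrative sample only: same columns as the question, nine rows
alldata <- data.frame(
  site  = c("01A", "01B", "01C", "02A", "02B", "02C", "01A", "01B", "01C"),
  Label = c("Normal", "Low", "High", "N/A", "Low", "Normal",
            "Normal", "Low", "High")
)

summary_tbl <- alldata %>%
  group_by(site) %>%
  summarise(
    Low     = sum(Label == "Low"),      # sum() of a logical vector counts the TRUEs
    Normal  = sum(Label == "Normal"),
    High    = sum(Label == "High"),
    Missing = sum(Label == "N/A")       # assumption: missing labels are the string "N/A"
  )

summary_tbl
```

If the missing labels were real NA values rather than the string "N/A", the last line would need sum(is.na(Label)) and the other three would need na.rm = TRUE.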

Answer 3

To produce the summary table, you can use table:

with(df, table(site, Label, useNA="ifany"))[, c(2,4,1,3)]

     Label
site  Low Normal High N/A
  01A   0      2    0   0
  01B   2      0    0   0
  01C   0      0    2   0
  02A   0      0    0   1
  02B   1      0    0   0
  02C   0      1    0   0

Data

df <- read.table(header=T, text="site    date    percent_rank    Label
01A  2013-01-01    0.32         Normal
01B  2013-01-01    0.12         Low
01C  2013-01-01    0.76         High
02A  2013-01-01     0           N/A
02B  2013-01-01    0.16         Low
02C  2013-01-01    0.5          Normal
01A  2013-01-02    0.67         Normal
01B  2013-01-02    0.01         Low
01C  2013-01-02    0.92         High")
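
To get the "Missing" column name and a fixed column order without indexing by position, one option is to convert Label to a factor with explicit levels before tabulating. This is a minimal base R sketch that assumes missing labels are stored as the string "N/A"; the names lab, counts, and summary_df are hypothetical.

```r
# Illustrative sample only: site and Label columns from the data above
df <- data.frame(
  site  = c("01A", "01B", "01C", "02A", "02B", "02C", "01A", "01B", "01C"),
  Label = c("Normal", "Low", "High", "N/A", "Low", "Normal",
            "Normal", "Low", "High")
)

# Explicit levels fix the column order; labels rename "N/A" to "Missing"
# (assumption: missing labels are the string "N/A", not a real NA)
lab <- factor(df$Label,
              levels = c("Low", "Normal", "High", "N/A"),
              labels = c("Low", "Normal", "High", "Missing"))

# Cross-tabulate site against the relabelled factor
counts <- table(site = df$site, Label = lab)

# Convert the contingency table to a data frame with one row per site
summary_df <- as.data.frame.matrix(counts)
summary_df
```

Because the levels are declared up front, a label that never occurs in the data still gets a column of zeros.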
