如何查找数据框中每个日期对应的唯一 ID 的数量
How to find number of unique ids corresponding to each date in a data drame
我有一个如下所示的数据框:
date time id datetime
1 2015-01-02 14:27:22.130 999000000007628 2015-01-02 14:27:22
2 2015-01-02 14:41:27.720 989001002807730 2015-01-02 14:41:27
3 2015-01-02 14:41:27.940 989001002807730 2015-01-02 14:41:27
4 2015-01-02 14:41:28.140 989001002807730 2015-01-02 14:41:28
5 2015-01-02 14:41:28.170 989001002807730 2015-01-02 14:41:28
6 2015-01-02 14:41:28.350 989001002807730 2015-01-02 14:41:28
我需要找到该数据框中每个 "date" 的唯一 "id" 的数量。
我试过这个:
sums<-data.frame(date=unique(data$date), numIDs=0)
for(i in unique(data$date)){
sums[sums$date==i,]$numIDs<-length(unique(data[data$date==i,]$id))
}
我收到以下错误:
Error in `$<-.data.frame`(`*tmp*`, "numIDs", value = 0L) :
replacement has 1 row, data has 0
In addition: Warning message:
In `==.default`(data$date, i) :
longer object length is not a multiple of shorter object length
有什么想法吗??谢谢!
希望这对您有所帮助!
data <- structure(list(date = structure(list(sec = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), mday = c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), mon = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), year = c(115L, 115L, 115L, 115L,
115L, 115L, 115L, 115L, 115L, 115L), wday = c(5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L), yday = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), zone = c("PST", "PST", "PST", "PST", "PST",
"PST", "PST", "PST", "PST", "PST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst",
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), time = c("14:27:22.130",
"14:41:27.720", "14:41:27.940", "14:41:28.140", "14:41:28.170",
"14:41:28.350", "14:41:28.390", "14:41:28.520", "14:41:28.630",
"14:41:28.740"), id = c("999000000007628", "989001002807730",
"989001002807730", "989001002807730", "989001002807730", "989001002807730",
"989001002807730", "989001002807730", "989001002807730", "989001002807730"
), datetime = structure(list(sec = c(22.13, 27.72, 27.94, 28.14,
28.17, 28.35, 28.39, 28.52, 28.63, 28.74), min = c(27L, 41L,
41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L), hour = c(14L, 14L, 14L,
14L, 14L, 14L, 14L, 14L, 14L, 14L), mday = c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), mon = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), year = c(115L, 115L, 115L, 115L, 115L, 115L, 115L,
115L, 115L, 115L), wday = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L), yday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), isdst = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), zone = c("PST", "PST", "PST",
"PST", "PST", "PST", "PST", "PST", "PST", "PST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst",
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), site = c("Chivato",
"Chivato", "Chivato", "Chivato", "Chivato", "Chivato", "Chivato",
"Chivato", "Chivato", "Chivato")), .Names = c("date", "time",
"id", "datetime", "site"), row.names = c(NA, 10L), class = "data.frame")
您可以使用 data.table
中的 uniqueN
函数:
library(data.table)
setDT(df)[, uniqueN(id), by = date]
或(根据@Richard Scriven 的评论):
aggregate(id ~ date, df, function(x) length(unique(x)))
或者我们可以使用 library(dplyr)
中的 n_distinct
library(dplyr)
df %>%
group_by(date) %>%
summarise(id=n_distinct(id))
此答案是对 post: group by and then count unique observations which was marked as duplicate as I was writing this draft. This is not in response to the question for the duplicate basis here: 的回应,其中询问有关查找唯一 ID 的问题。我不确定第二个 post 是否真的回答了 OP 的问题,即
"I want to create a table with the number of unique id
for each
combination of group1
and group2
."
这里的关键词是'combination'。解释是每个 id
都有一个特定的 group1
值和一个特定的 group2
值,因此感兴趣的数据集是特定的值集 c(id, group1, group2)
.
这是 OP 提供的 data.frame:
df1 <- data.frame(id=sample(letters, 10000, replace = T),
group1=sample(1:2, 10000, replace = T),
group2=sample(100:101, 10000, replace = T))
使用 data.table
受此启发 post -- :
>library(data.table)
>DT <- data.table(df1)
>DT[, .N, by = .(group1, group2)]
group1 group2 N
1: 1 100 2493
2: 1 101 2455
3: 2 100 2559
4: 2 101 2493
N 是具有特定 group1
值和特定 group2
值的 id
的计数。扩展以包括 id
和 returns 104 个独特的 id
、group1
、group2
组合的 table。
>DT[, .N, by = .(id, group1, group2)]
id group1 group2 N
1: t 1 100 107
2: g 1 101 85
3: l 1 101 98
4: a 1 100 83
5: j 1 101 98
---
100: p 1 101 96
101: r 2 101 91
102: y 1 101 104
103: g 1 100 83
104: r 2 100 77
我有一个如下所示的数据框:
date time id datetime
1 2015-01-02 14:27:22.130 999000000007628 2015-01-02 14:27:22
2 2015-01-02 14:41:27.720 989001002807730 2015-01-02 14:41:27
3 2015-01-02 14:41:27.940 989001002807730 2015-01-02 14:41:27
4 2015-01-02 14:41:28.140 989001002807730 2015-01-02 14:41:28
5 2015-01-02 14:41:28.170 989001002807730 2015-01-02 14:41:28
6 2015-01-02 14:41:28.350 989001002807730 2015-01-02 14:41:28
我需要找到该数据框中每个 "date" 的唯一 "id" 的数量。
我试过这个:
sums<-data.frame(date=unique(data$date), numIDs=0)
for(i in unique(data$date)){
sums[sums$date==i,]$numIDs<-length(unique(data[data$date==i,]$id))
}
我收到以下错误:
Error in `$<-.data.frame`(`*tmp*`, "numIDs", value = 0L) :
replacement has 1 row, data has 0
In addition: Warning message:
In `==.default`(data$date, i) :
longer object length is not a multiple of shorter object length
有什么想法吗??谢谢!
希望这对您有所帮助!
data <- structure(list(date = structure(list(sec = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), mday = c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), mon = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), year = c(115L, 115L, 115L, 115L,
115L, 115L, 115L, 115L, 115L, 115L), wday = c(5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L), yday = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), zone = c("PST", "PST", "PST", "PST", "PST",
"PST", "PST", "PST", "PST", "PST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst",
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), time = c("14:27:22.130",
"14:41:27.720", "14:41:27.940", "14:41:28.140", "14:41:28.170",
"14:41:28.350", "14:41:28.390", "14:41:28.520", "14:41:28.630",
"14:41:28.740"), id = c("999000000007628", "989001002807730",
"989001002807730", "989001002807730", "989001002807730", "989001002807730",
"989001002807730", "989001002807730", "989001002807730", "989001002807730"
), datetime = structure(list(sec = c(22.13, 27.72, 27.94, 28.14,
28.17, 28.35, 28.39, 28.52, 28.63, 28.74), min = c(27L, 41L,
41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L), hour = c(14L, 14L, 14L,
14L, 14L, 14L, 14L, 14L, 14L, 14L), mday = c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), mon = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), year = c(115L, 115L, 115L, 115L, 115L, 115L, 115L,
115L, 115L, 115L), wday = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L), yday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), isdst = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), zone = c("PST", "PST", "PST",
"PST", "PST", "PST", "PST", "PST", "PST", "PST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst",
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), site = c("Chivato",
"Chivato", "Chivato", "Chivato", "Chivato", "Chivato", "Chivato",
"Chivato", "Chivato", "Chivato")), .Names = c("date", "time",
"id", "datetime", "site"), row.names = c(NA, 10L), class = "data.frame")
您可以使用 data.table
中的 uniqueN
函数:
library(data.table)
setDT(df)[, uniqueN(id), by = date]
或(根据@Richard Scriven 的评论):
aggregate(id ~ date, df, function(x) length(unique(x)))
或者我们可以使用 library(dplyr)
n_distinct
library(dplyr)
df %>%
group_by(date) %>%
summarise(id=n_distinct(id))
此答案是对 post: group by and then count unique observations which was marked as duplicate as I was writing this draft. This is not in response to the question for the duplicate basis here:
"I want to create a table with the number of unique
id
for each combination ofgroup1
andgroup2
."
这里的关键词是'combination'。解释是每个 id
都有一个特定的 group1
值和一个特定的 group2
值,因此感兴趣的数据集是特定的值集 c(id, group1, group2)
.
这是 OP 提供的 data.frame:
df1 <- data.frame(id=sample(letters, 10000, replace = T),
group1=sample(1:2, 10000, replace = T),
group2=sample(100:101, 10000, replace = T))
使用 data.table
受此启发 post -- :
>library(data.table)
>DT <- data.table(df1)
>DT[, .N, by = .(group1, group2)]
group1 group2 N
1: 1 100 2493
2: 1 101 2455
3: 2 100 2559
4: 2 101 2493
N 是具有特定 group1
值和特定 group2
值的 id
的计数。扩展以包括 id
和 returns 104 个独特的 id
、group1
、group2
组合的 table。
>DT[, .N, by = .(id, group1, group2)]
id group1 group2 N
1: t 1 100 107
2: g 1 101 85
3: l 1 101 98
4: a 1 100 83
5: j 1 101 98
---
100: p 1 101 96
101: r 2 101 91
102: y 1 101 104
103: g 1 100 83
104: r 2 100 77