Can I access the row index in a function used in apply()?
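We need to fill in a taxonomic classification data table. I tend to write too many for loops, and I'm trying to figure out how to do this with apply() instead. I scan the last column for non-missing values, then fill in each column to its left with the value above it, but only along the diagonal. So if there are 3 columns, this fills in the values for the last column. I would then repeat it for each 'higher taxonomic level', i.e. the next column to the left: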
# fills in for Family-level taxonomy
for (i in seq_len(nrow(DataFrame))) {   # nrow(), not nrows(); iterate over every row
    if (is.na(DataFrame[[4]][i])) next
    DataFrame[[3]][i] <- DataFrame[[3]][i - 1]
    DataFrame[[2]][i] <- DataFrame[[2]][i - 2]
    DataFrame[[1]][i] <- DataFrame[[1]][i - 3]
}
# Repeat to fill in Order's higher taxonomy (Phylum and Class)
for (i in seq_len(nrow(DataFrame))) {   # fills in for Order
    if (is.na(DataFrame[[3]][i])) next
    # one level up the diagonal: Class is one row above, Phylum two rows above
    DataFrame[[2]][i] <- DataFrame[[2]][i - 1]
    DataFrame[[1]][i] <- DataFrame[[1]][i - 2]
}
# And again for each column to the left.
The data might look like this:
Phylum    Class       Order     Family
Annelida
          Polychaeta
                      Eunicida
                                Oenoidae
                                Onuphidae
                      Oweniida
                                Oweniidae
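This would then be repeated for each unique Family within the Order, each unique Order within the Class, and each unique Class within the Phylum. Essentially, we need to fill in the values to the left of each non-missing value, taking them from the nearest non-missing value above. So the end result would be: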
Phylum    Class       Order     Family
Annelida
Annelida  Polychaeta
Annelida  Polychaeta  Eunicida
Annelida  Polychaeta  Eunicida  Oenoidae
Annelida  Polychaeta  Eunicida  Onuphidae
Annelida  Polychaeta  Oweniida
Annelida  Polychaeta  Oweniida  Oweniidae
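We can't simply copy values down each column, because once we reach a new Phylum, the Class column should have one missing value, the Order column may have two, and so on.
I think the challenge is that I need the value of DataFrame[[j]][i-n] inside whatever function I pass to apply. When apply passes 'x' to the function, does it pass an object with attributes (such as an index or row names), or just the values?
Or is this a wasted line of thinking, and I should just use for loops and turn to Rcpp if I really need speed? This is done once a year, on a data frame of about 8,000 rows and 13 columns. I don't think performance will be an issue... but we haven't tried it yet. Not sure why.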
Here's one approach:
x <- matrix(rnorm(100), 10, 10)
x <- cbind(1:nrow(x), x)   # prepend the row index as the first column
output <- apply(x, 1, function(i) {
    rowID <- as.numeric(i[1])
    x_orig <- unlist(i[-1])
    ## ... do some more stuff
    return(...something...)
})
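To answer the narrower question: as far as I know, apply() hands the function a plain vector of the row's values, named by the column names when the matrix has them, with no row index attached. A minimal sketch of an alternative to binding an ID column is to loop over the row indices themselves with sapply(), so the function can look at neighbouring rows such as i - 1. The matrix m and the names seen_names and prev_diff are hypothetical, for illustration only:

```r
# Small example matrix (hypothetical data) with row and column names
m <- matrix(c(1, 4, 9, 2, 3, 5), nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("a", "b")))

# What apply() passes: a vector named by the column names, no row index
seen_names <- apply(m, 1, function(x) paste(names(x), collapse = ","))

# Iterate over row indices instead, so row i can refer to row i - 1
prev_diff <- sapply(seq_len(nrow(m)), function(i) {
  if (i == 1) return(NA_real_)   # the first row has no previous row
  m[i, "a"] - m[i - 1, "a"]      # difference from the row above
})
```

Here the function only ever receives the index i and pulls whatever rows it needs from m itself, which is closer to the DataFrame[[j]][i-n] access pattern in the question.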
Here's my approach, assuming your data look the way I'm guessing:
library(tidyr)
library(dplyr)
data[data == ""] <- NA
data %>%
    fill(-Family) %>%
    filter(!is.na(Family))
Output:
   Phylum   Class      Order    Family
1  Annelida Polychaeta Eunicida Oenoidae
2  Annelida Polychaeta Eunicida Onuphidae
3  Annelida Polychaeta Oweniida Oweniidae
If you want the blank rows as well, you can try this, which allows for arbitrary nesting and unnesting:
data %>%
    fill(-Family) %>%
    filter(!is.na(Family)) %>%
    do(plyr::rbind.fill(unlist(lapply(1:nrow(.), function(z) lapply(1:4, function(xx) .[z,][1:xx])), recursive = FALSE))) %>%
    distinct()
   Phylum   Class      Order    Family
1  Annelida <NA>       <NA>     <NA>
2  Annelida Polychaeta <NA>     <NA>
3  Annelida Polychaeta Eunicida <NA>
4  Annelida Polychaeta Eunicida Oenoidae
5  Annelida Polychaeta Eunicida Onuphidae
6  Annelida Polychaeta Oweniida <NA>
7  Annelida Polychaeta Oweniida Oweniidae
8  Annelida blah       <NA>     <NA>
9  Annelida blah       blah     <NA>
10 Annelida blah       blah     blah
Data input:
structure(list(Phylum = c("Annelida", NA, NA, NA, NA, NA, NA,
NA, NA, NA), Class = c(NA, "Polychaeta", NA, NA, NA, NA, NA,
"blah", NA, NA), Order = c(NA, NA, "Eunicida", NA, NA, "Oweniida",
NA, NA, "blah", NA), Family = c(NA, NA, NA, "Oenoidae", "Onuphidae",
NA, "Oweniidae", NA, NA, "blah")), .Names = c("Phylum", "Class",
"Order", "Family"), row.names = c(NA, -10L), class = "data.frame")
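As an alternative to the other solutions, you can also use the na.locf function from the zoo package, which replaces NA values with the last observed value (locf = last observation carried forward).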
# replace empty spaces with NA values
df[df == ""] <- NA
# use na.locf to replace the NA values
library(zoo)
df <- na.locf(df)
This results in:
> df
  Phylum   Class      Order    Family
1 Annelida <NA>       <NA>     <NA>
2 Annelida Polychaeta <NA>     <NA>
3 Annelida Polychaeta Eunicida <NA>
4 Annelida Polychaeta Eunicida Oenoidae
5 Annelida Polychaeta Eunicida Onuphidae
6 Annelida Polychaeta Oweniida Onuphidae
7 Annelida Polychaeta Oweniida Oweniidae
Data used:
df <- structure(list(Phylum = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "Annelida"), class = "factor"),
Class = structure(c(1L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "Polychaeta"), class = "factor"),
Order = structure(c(1L, 1L, 2L, 1L, 1L, 3L, 1L), .Label = c("", "Eunicida", "Oweniida"), class = "factor"),
Family = structure(c(1L, 1L, 1L, 2L, 3L, 1L, 4L), .Label = c("", "Oenoidae", "Onuphidae", "Oweniidae"), class = "factor")),
.Names = c("Phylum", "Class", "Order", "Family"), class = "data.frame", row.names = c(NA, -7L))
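Note that in the na.locf output above, row 6 carries "Onuphidae" forward into Family, whereas the desired result in the question leaves that cell empty. A minimal base-R sketch of one way to fill only to the left of each original entry is below. It assumes blanks are empty strings or NA and that the columns run from highest to lowest taxonomic level; fill_down and fill_taxonomy are hypothetical helper names, not functions from zoo or any other package:

```r
# Example data as in the question: blanks are empty strings
df <- data.frame(
  Phylum = c("Annelida", "", "", "", "", "", ""),
  Class  = c("", "Polychaeta", "", "", "", "", ""),
  Order  = c("", "", "Eunicida", "", "", "Oweniida", ""),
  Family = c("", "", "", "Oenoidae", "Onuphidae", "", "Oweniidae"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: carry the last non-NA value down a vector
fill_down <- function(v) {
  idx <- cumsum(!is.na(v))        # position of the latest non-NA value (0 = none yet)
  c(NA, v[!is.na(v)])[idx + 1]
}

# Hypothetical helper: fill only to the left of each original entry
fill_taxonomy <- function(d) {
  d[] <- lapply(d, function(x) {
    x <- as.character(x)
    x[x %in% ""] <- NA            # treat empty strings as missing
    x
  })
  # deepest originally filled column in each row (0 if the row is empty)
  depth <- apply(!is.na(d), 1, function(r) if (any(r)) max(which(r)) else 0L)
  d[] <- lapply(d, fill_down)
  for (j in seq_along(d)) d[[j]][j > depth] <- NA  # blank cells right of that column
  d
}

result <- fill_taxonomy(df)
```

The idea is to record each row's original depth before filling, carry values down every column, and then blank anything to the right of that depth, so an Order-level row keeps its Family cell missing.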