根据查找 table 替换数据框中的值
Replace values in a dataframe based on lookup table
我在替换数据框中的值时遇到了一些问题。我想根据单独的 table 替换值。下面是我正在尝试做的一个例子。
我有一个 table,其中每一行都是一个客户,每一列都是他们购买的动物。让我们称这个数据框为 table
.
> table
# P1 P2 P3
# 1 cat lizard parrot
# 2 lizard parrot cat
# 3 parrot cat lizard
我还有一个 table,我将引用名为 lookUp
。
> lookUp
# pet class
# 1 cat mammal
# 2 lizard reptile
# 3 parrot bird
我想做的是创建一个名为 new
的新 table,其中包含一个函数,将 table
中的所有值替换为 [=15= 中的 class
列].我自己尝试使用 lapply
函数,但收到以下警告。
new <- as.data.frame(lapply(table, function(x) {
gsub('.*', lookUp[match(x, lookUp$pet) ,2], x)}), stringsAsFactors = FALSE)
Warning messages:
1: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
2: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
3: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
关于如何使这项工作有任何想法吗?
任何时候你有两个独立的 data.frame
s 并试图将信息从一个传递给另一个,答案是 merge.
每个人在 R 中都有自己最喜欢的合并方法。我的是 data.table
。
此外,由于您想对许多列执行此操作,因此 melt
和 dcast
会更快——而不是遍历列,将其应用于重塑的 table,然后再次整形。
library(data.table)
#the row names will be our ID variable for melting
setDT(table, keep.rownames = TRUE)
setDT(lookUp)
#now melt, merge, recast
# melting (reshape wide to long)
table[ , melt(.SD, id.vars = 'rn')
# merging
][lookup, new_value := i.class, on = c(value = 'pet')
#reform back to original shape
][ , dcast(.SD, rn ~ variable, value.var = 'new_value')]
# rn P1 P2 P3
# 1: 1 mammal reptile bird
# 2: 2 reptile bird mammal
# 3: 3 bird mammal reptile
如果您发现 dcast
/melt
有点令人生畏,这里有一种只在列上循环的方法; dcast
/melt
只是回避了这个问题的循环。
setDT(table) #don't need row names this time
setDT(lookUp)
sapply(names(table), #(or to whichever are the relevant columns)
function(cc) table[lookUp, (cc) := #merge, replace
#need to pass a _named_ vector to 'on', so use setNames
i.class, on = setNames("pet", cc)])
另一个选项是tidyr
和dplyr
的组合
library(dplyr)
library(tidyr)
table %>%
gather(key = "pet") %>%
left_join(lookup, by = "pet") %>%
spread(key = pet, value = class)
您在问题中发布了一个不错的方法。这是一个简单的方法:
new <- df # create a copy of df
# using lapply, loop over columns and match values to the look up table. store in "new".
new[] <- lapply(df, function(x) look$class[match(x, look$pet)])
另一种更快的方法是:
new <- df
new[] <- look$class[match(unlist(df), look$pet)]
请注意,我在两种情况下都使用空括号 ([]
) 以保持 new
的结构不变 (a data.frame)。
(我在回答中使用 df
而不是 table
和 look
而不是 lookup
)
制作一个命名向量,遍历每一列并匹配,参见:
# make lookup vector with names
lookUp1 <- setNames(as.character(lookUp$class), lookUp$pet)
lookUp1
# cat lizard parrot
# "mammal" "reptile" "bird"
# match on names get values from lookup vector
res <- data.frame(lapply(df1, function(i) lookUp1[i]))
# reset rownames
rownames(res) <- NULL
# res
# P1 P2 P3
# 1 mammal reptile bird
# 2 reptile bird mammal
# 3 bird mammal reptile
数据
df1 <- read.table(text = "
P1 P2 P3
1 cat lizard parrot
2 lizard parrot cat
3 parrot cat lizard", header = TRUE)
lookUp <- read.table(text = "
pet class
1 cat mammal
2 lizard reptile
3 parrot bird", header = TRUE)
显示如何在 dplyr 中执行此操作的答案 没有回答问题,table 充满了 NA。这有效,我将不胜感激任何显示更好方法的评论:
# Add a customer column so that we can put things back in the right order
table$customer = seq(nrow(table))
classTable <- table %>%
# put in long format, naming column filled with P1, P2, P3 "petCount"
gather(key="petCount", value="pet", -customer) %>%
# add a new column based on the pet's class in data frame "lookup"
left_join(lookup, by="pet") %>%
# since you wanted to replace the values in "table" with their
# "class", remove the pet column
select(-pet) %>%
# put data back into wide format
spread(key="petCount", value="class")
请注意,保留包含客户、宠物、宠物的种类 (?) 及其 class 的长 table 可能会很有用。这个例子简单地添加了一个中间保存到一个变量:
table$customer = seq(nrow(table))
petClasses <- table %>%
gather(key="petCount", value="pet", -customer) %>%
left_join(lookup, by="pet")
custPetClasses <- petClasses %>%
select(-pet) %>%
spread(key="petCount", value="class")
我尝试了其他方法,但它们在我非常大的数据集上花费了很长时间。我改用了以下内容:
# make table "new" using ifelse. See data below to avoid re-typing it
new <- ifelse(table1 =="cat", "mammal",
ifelse(table1 == "lizard", "reptile",
ifelse(table1 =="parrot", "bird", NA)))
此方法需要您为代码编写更多文本,但 ifelse
的矢量化使其 运行 更快。您必须根据自己的数据来决定是花更多时间编写代码还是等待计算机 运行。如果你想确保它有效(你的 iflese
命令中没有任何拼写错误),你可以使用 apply(new, 2, function(x) mean(is.na(x)))
。
数据
# create the data table
table1 <- read.table(text = "
P1 P2 P3
1 cat lizard parrot
2 lizard parrot cat
3 parrot cat lizard", header = TRUE)
我是使用 factor
内置的。
table$P1 <- factor(table$P1, levels=lookUp$pet, labels=lookUp$class)
table$P2 <- factor(table$P2, levels=lookUp$pet, labels=lookUp$class)
table$P3 <- factor(table$P3, levels=lookUp$pet, labels=lookUp$class)
基准
出于强烈的好奇心,我只是 运行 我想与您分享一些方法的基准。我不太相信答案中关于性能的一些陈述,并试图在此澄清这一点。为了不被不同的rows/columns比率误导,我考虑三种情况:
ncol == nrow
ncol << nrow
ncol >> nrow.
事先强制 as.matrix
可能会有好处,所以我将其作为 附加解决方案(unlist_mat).
microbenchmark::microbenchmark(
lapply=Dat1[col_set] <- lapply(Dat1[col_set], function(x) Look$class[match(x, Look$pet)]),
unlist=Dat2[col_set] <- Look$class[match(unlist(Dat2[col_set]), Look$pet)],
unlist_mat=Mat[, col_set] <- Look$class[match(as.vector(Mat[, col_set]), Look$pet)], ## added
ifelse=Dat3[col_set] <- ifelse(Dat3[col_set] == "cat", "mammal",
ifelse(Dat3[col_set] == "lizard", "reptile",
ifelse(Dat3[col_set] == "parrot", "bird", NA))),
look_vec=Dat4[] <- lapply(Dat4, function(i) look[i]),
times=3L
)
## 1e3 x 1e3
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# lapply 40.42905 63.47053 78.03831 86.51201 96.84294 107.17387 3 a
# unlist 513.25197 540.55981 656.25420 567.86766 727.75531 887.64297 3 b
# unlist_mat 45.91743 56.51087 68.50595 67.10432 79.80021 92.49611 3 a
# ifelse 117.83513 153.23771 366.16708 188.64030 490.33306 792.02581 3 ab
# look_vec 58.54449 88.40293 112.91165 118.26137 140.09522 161.92908 3 a
## 1e4 x 1e4
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 2.427077 3.558234 3.992481 4.689390 4.775183 4.860977 3 a
# unlist 73.125989 79.203107 94.027433 85.280225 104.478155 123.676084 3 b
# unlist_mat 4.940254 5.011684 5.576553 5.083114 5.894703 6.706291 3 a
# ifelse 9.714553 14.444899 36.176777 19.175244 49.407889 79.640535 3 a
# look_vec 8.460969 8.558600 8.784463 8.656230 8.946209 9.236188 3 a
## 1e5 x 1e3
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 2.314427 2.403001 3.270708 2.491575 3.748848 5.006120 3 a
# unlist 64.098825 66.850221 81.402676 69.601616 90.054601 110.507586 3 b
# unlist_mat 5.018869 5.060865 5.638499 5.102861 5.948314 6.793767 3 a
# ifelse 6.244744 16.488266 39.208119 26.731788 55.689807 84.647825 3 ab
# look_vec 4.512672 6.434651 7.496267 8.356630 8.988064 9.619498 3 a
## 1e3 x 1e5
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 52.833019 55.373432 71.308981 57.913845 80.546963 103.180080 3 ab
# unlist 164.901805 168.710285 186.454796 172.518765 197.231292 221.943819 3 c
# unlist_mat 3.872551 4.422904 4.695393 4.973257 5.106814 5.240372 3 a
# ifelse 72.592437 76.473418 103.930063 80.354399 119.598876 158.843354 3 b
# look_vec 56.444824 58.904604 62.677267 61.364383 65.793488 70.222593 3 ab
注意: 在 Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz 上执行 使用 R --vanilla
.
all(sapply(list(Dat2, as.data.frame(Mat), Dat3, Dat4), identical, Dat1)) ## *
# [1] TRUE
## *manipulate the data first outside the benchmark, of course!
结论
如果列数 low/lower 而不是行数,则将 lapply
与查找矩阵一起使用似乎是一个不错的选择。如果我们有很多列,尤其是与行相比,我们可能会受益于首先将数据帧的各个列强制转换为矩阵,这应该只需要眨眼。
set.seed(42)
n <- 1e4; m <- 1e4
Dat <- data.frame(matrix(sample(c("cat", "lizard", "parrot"), n*m, replace=TRUE), n, m))
Look <- structure(list(pet = c("cat", "lizard", "parrot"), class = c("mammal", "reptile", "bird")),
class = "data.frame", row.names = c("1", "2", "3"))
look <- setNames(as.character(Look$class), Look$pet)
col_set <- names(Dat)
system.time(
Mat <- as.matrix(Dat)
)
# user system elapsed
# 0.844 0.318 1.161
Dat1 <- Dat2 <- Dat3 <- Dat4 <- Dat
我在替换数据框中的值时遇到了一些问题。我想根据单独的 table 替换值。下面是我正在尝试做的一个例子。
我有一个 table,其中每一行都是一个客户,每一列都是他们购买的动物。让我们称这个数据框为 table
.
> table
# P1 P2 P3
# 1 cat lizard parrot
# 2 lizard parrot cat
# 3 parrot cat lizard
我还有一个 table,我将引用名为 lookUp
。
> lookUp
# pet class
# 1 cat mammal
# 2 lizard reptile
# 3 parrot bird
我想做的是创建一个名为 new
的新 table,其中包含一个函数,将 table
中的所有值替换为 [=15= 中的 class
列].我自己尝试使用 lapply
函数,但收到以下警告。
new <- as.data.frame(lapply(table, function(x) {
gsub('.*', lookUp[match(x, lookUp$pet) ,2], x)}), stringsAsFactors = FALSE)
Warning messages:
1: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
2: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
3: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
关于如何使这项工作有任何想法吗?
任何时候你有两个独立的 data.frame
s 并试图将信息从一个传递给另一个,答案是 merge.
每个人在 R 中都有自己最喜欢的合并方法。我的是 data.table
。
此外,由于您想对许多列执行此操作,因此 melt
和 dcast
会更快——而不是遍历列,将其应用于重塑的 table,然后再次整形。
library(data.table)
#the row names will be our ID variable for melting
setDT(table, keep.rownames = TRUE)
setDT(lookUp)
#now melt, merge, recast
# melting (reshape wide to long)
table[ , melt(.SD, id.vars = 'rn')
# merging
][lookup, new_value := i.class, on = c(value = 'pet')
#reform back to original shape
][ , dcast(.SD, rn ~ variable, value.var = 'new_value')]
# rn P1 P2 P3
# 1: 1 mammal reptile bird
# 2: 2 reptile bird mammal
# 3: 3 bird mammal reptile
如果您发现 dcast
/melt
有点令人生畏,这里有一种只在列上循环的方法; dcast
/melt
只是回避了这个问题的循环。
setDT(table) #don't need row names this time
setDT(lookUp)
sapply(names(table), #(or to whichever are the relevant columns)
function(cc) table[lookUp, (cc) := #merge, replace
#need to pass a _named_ vector to 'on', so use setNames
i.class, on = setNames("pet", cc)])
另一个选项是tidyr
和dplyr
library(dplyr)
library(tidyr)
table %>%
gather(key = "pet") %>%
left_join(lookup, by = "pet") %>%
spread(key = pet, value = class)
您在问题中发布了一个不错的方法。这是一个简单的方法:
new <- df # create a copy of df
# using lapply, loop over columns and match values to the look up table. store in "new".
new[] <- lapply(df, function(x) look$class[match(x, look$pet)])
另一种更快的方法是:
new <- df
new[] <- look$class[match(unlist(df), look$pet)]
请注意,我在两种情况下都使用空括号 ([]
) 以保持 new
的结构不变 (a data.frame)。
(我在回答中使用 df
而不是 table
和 look
而不是 lookup
)
制作一个命名向量,遍历每一列并匹配,参见:
# make lookup vector with names
lookUp1 <- setNames(as.character(lookUp$class), lookUp$pet)
lookUp1
# cat lizard parrot
# "mammal" "reptile" "bird"
# match on names get values from lookup vector
res <- data.frame(lapply(df1, function(i) lookUp1[i]))
# reset rownames
rownames(res) <- NULL
# res
# P1 P2 P3
# 1 mammal reptile bird
# 2 reptile bird mammal
# 3 bird mammal reptile
数据
df1 <- read.table(text = "
P1 P2 P3
1 cat lizard parrot
2 lizard parrot cat
3 parrot cat lizard", header = TRUE)
lookUp <- read.table(text = "
pet class
1 cat mammal
2 lizard reptile
3 parrot bird", header = TRUE)
显示如何在 dplyr 中执行此操作的答案
# Add a customer column so that we can put things back in the right order
table$customer = seq(nrow(table))
classTable <- table %>%
# put in long format, naming column filled with P1, P2, P3 "petCount"
gather(key="petCount", value="pet", -customer) %>%
# add a new column based on the pet's class in data frame "lookup"
left_join(lookup, by="pet") %>%
# since you wanted to replace the values in "table" with their
# "class", remove the pet column
select(-pet) %>%
# put data back into wide format
spread(key="petCount", value="class")
请注意,保留包含客户、宠物、宠物的种类 (?) 及其 class 的长 table 可能会很有用。这个例子简单地添加了一个中间保存到一个变量:
table$customer = seq(nrow(table))
petClasses <- table %>%
gather(key="petCount", value="pet", -customer) %>%
left_join(lookup, by="pet")
custPetClasses <- petClasses %>%
select(-pet) %>%
spread(key="petCount", value="class")
我尝试了其他方法,但它们在我非常大的数据集上花费了很长时间。我改用了以下内容:
# make table "new" using ifelse. See data below to avoid re-typing it
new <- ifelse(table1 =="cat", "mammal",
ifelse(table1 == "lizard", "reptile",
ifelse(table1 =="parrot", "bird", NA)))
此方法需要您为代码编写更多文本,但 ifelse
的矢量化使其 运行 更快。您必须根据自己的数据来决定是花更多时间编写代码还是等待计算机 运行。如果你想确保它有效(你的 iflese
命令中没有任何拼写错误),你可以使用 apply(new, 2, function(x) mean(is.na(x)))
。
数据
# create the data table
table1 <- read.table(text = "
P1 P2 P3
1 cat lizard parrot
2 lizard parrot cat
3 parrot cat lizard", header = TRUE)
我是使用 factor
内置的。
table$P1 <- factor(table$P1, levels=lookUp$pet, labels=lookUp$class)
table$P2 <- factor(table$P2, levels=lookUp$pet, labels=lookUp$class)
table$P3 <- factor(table$P3, levels=lookUp$pet, labels=lookUp$class)
基准
出于强烈的好奇心,我只是 运行 我想与您分享一些方法的基准。我不太相信答案中关于性能的一些陈述,并试图在此澄清这一点。为了不被不同的rows/columns比率误导,我考虑三种情况:
ncol == nrow
ncol << nrow
ncol >> nrow.
事先强制 as.matrix
可能会有好处,所以我将其作为 附加解决方案(unlist_mat).
microbenchmark::microbenchmark(
lapply=Dat1[col_set] <- lapply(Dat1[col_set], function(x) Look$class[match(x, Look$pet)]),
unlist=Dat2[col_set] <- Look$class[match(unlist(Dat2[col_set]), Look$pet)],
unlist_mat=Mat[, col_set] <- Look$class[match(as.vector(Mat[, col_set]), Look$pet)], ## added
ifelse=Dat3[col_set] <- ifelse(Dat3[col_set] == "cat", "mammal",
ifelse(Dat3[col_set] == "lizard", "reptile",
ifelse(Dat3[col_set] == "parrot", "bird", NA))),
look_vec=Dat4[] <- lapply(Dat4, function(i) look[i]),
times=3L
)
## 1e3 x 1e3
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# lapply 40.42905 63.47053 78.03831 86.51201 96.84294 107.17387 3 a
# unlist 513.25197 540.55981 656.25420 567.86766 727.75531 887.64297 3 b
# unlist_mat 45.91743 56.51087 68.50595 67.10432 79.80021 92.49611 3 a
# ifelse 117.83513 153.23771 366.16708 188.64030 490.33306 792.02581 3 ab
# look_vec 58.54449 88.40293 112.91165 118.26137 140.09522 161.92908 3 a
## 1e4 x 1e4
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 2.427077 3.558234 3.992481 4.689390 4.775183 4.860977 3 a
# unlist 73.125989 79.203107 94.027433 85.280225 104.478155 123.676084 3 b
# unlist_mat 4.940254 5.011684 5.576553 5.083114 5.894703 6.706291 3 a
# ifelse 9.714553 14.444899 36.176777 19.175244 49.407889 79.640535 3 a
# look_vec 8.460969 8.558600 8.784463 8.656230 8.946209 9.236188 3 a
## 1e5 x 1e3
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 2.314427 2.403001 3.270708 2.491575 3.748848 5.006120 3 a
# unlist 64.098825 66.850221 81.402676 69.601616 90.054601 110.507586 3 b
# unlist_mat 5.018869 5.060865 5.638499 5.102861 5.948314 6.793767 3 a
# ifelse 6.244744 16.488266 39.208119 26.731788 55.689807 84.647825 3 ab
# look_vec 4.512672 6.434651 7.496267 8.356630 8.988064 9.619498 3 a
## 1e3 x 1e5
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 52.833019 55.373432 71.308981 57.913845 80.546963 103.180080 3 ab
# unlist 164.901805 168.710285 186.454796 172.518765 197.231292 221.943819 3 c
# unlist_mat 3.872551 4.422904 4.695393 4.973257 5.106814 5.240372 3 a
# ifelse 72.592437 76.473418 103.930063 80.354399 119.598876 158.843354 3 b
# look_vec 56.444824 58.904604 62.677267 61.364383 65.793488 70.222593 3 ab
注意: 在 Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz 上执行 使用 R --vanilla
.
all(sapply(list(Dat2, as.data.frame(Mat), Dat3, Dat4), identical, Dat1)) ## *
# [1] TRUE
## *manipulate the data first outside the benchmark, of course!
结论
如果列数 low/lower 而不是行数,则将 lapply
与查找矩阵一起使用似乎是一个不错的选择。如果我们有很多列,尤其是与行相比,我们可能会受益于首先将数据帧的各个列强制转换为矩阵,这应该只需要眨眼。
set.seed(42)
n <- 1e4; m <- 1e4
Dat <- data.frame(matrix(sample(c("cat", "lizard", "parrot"), n*m, replace=TRUE), n, m))
Look <- structure(list(pet = c("cat", "lizard", "parrot"), class = c("mammal", "reptile", "bird")),
class = "data.frame", row.names = c("1", "2", "3"))
look <- setNames(as.character(Look$class), Look$pet)
col_set <- names(Dat)
system.time(
Mat <- as.matrix(Dat)
)
# user system elapsed
# 0.844 0.318 1.161
Dat1 <- Dat2 <- Dat3 <- Dat4 <- Dat