根据查找 table 替换数据框中的值

Replace values in a dataframe based on lookup table

我在替换数据框中的值时遇到了一些问题。我想根据单独的 table 替换值。下面是我正在尝试做的一个例子。

我有一个 table,其中每一行都是一个客户,每一列都是他们购买的动物。让我们称这个数据框为 table.

> table
#       P1     P2     P3
# 1    cat lizard parrot
# 2 lizard parrot    cat
# 3 parrot    cat lizard

我还有一个 table,我将引用名为 lookUp

> lookUp
#      pet   class
# 1    cat  mammal
# 2 lizard reptile
# 3 parrot    bird

我想做的是创建一个名为 new 的新 table,其中包含一个函数,将 table 中的所有值替换为 [=15= 中的 class 列].我自己尝试使用 lapply 函数,但收到以下警告。

new <- as.data.frame(lapply(table, function(x) {
  gsub('.*', lookUp[match(x, lookUp$pet) ,2], x)}), stringsAsFactors = FALSE)

Warning messages:
1: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
  argument 'replacement' has length > 1 and only the first element will be used
2: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
  argument 'replacement' has length > 1 and only the first element will be used
3: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
  argument 'replacement' has length > 1 and only the first element will be used

关于如何使这项工作有任何想法吗?

任何时候你有两个独立的 data.frames 并试图将信息从一个传递给另一个,答案是 merge.

每个人在 R 中都有自己最喜欢的合并方法。我的是 data.table

此外,由于您想对许多列执行此操作,因此 meltdcast 会更快——而不是遍历列,将其应用于重塑的 table,然后再次整形。

library(data.table)

#the row names will be our ID variable for melting
setDT(table, keep.rownames = TRUE) 
setDT(lookUp)

#now melt, merge, recast
# melting (reshape wide to long)
table[ , melt(.SD, id.vars = 'rn')     
       # merging
       ][lookup, new_value := i.class, on = c(value = 'pet') 
         #reform back to original shape
         ][ , dcast(.SD, rn ~ variable, value.var = 'new_value')]
#    rn      P1      P2      P3
# 1:  1  mammal reptile    bird
# 2:  2 reptile    bird  mammal
# 3:  3    bird  mammal reptile

如果您发现 dcast/melt 有点令人生畏,这里有一种只在列上循环的方法; dcast/melt 只是回避了这个问题的循环。

setDT(table) #don't need row names this time
setDT(lookUp)

sapply(names(table), #(or to whichever are the relevant columns)
       function(cc) table[lookUp, (cc) := #merge, replace
                            #need to pass a _named_ vector to 'on', so use setNames
                            i.class, on = setNames("pet", cc)])

另一个选项是tidyrdplyr

的组合
library(dplyr)
library(tidyr)
table %>%
   gather(key = "pet") %>%
   left_join(lookup, by = "pet") %>%
   spread(key = pet, value = class)

您在问题中发布了一个不错的方法。这是一个简单的方法:

new <- df  # create a copy of df
# using lapply, loop over columns and match values to the look up table. store in "new".
new[] <- lapply(df, function(x) look$class[match(x, look$pet)])

另一种更快的方法是:

new <- df
new[] <- look$class[match(unlist(df), look$pet)]

请注意,我在两种情况下都使用空括号 ([]) 以保持 new 的结构不变 (a data.frame)。

(我在回答中使用 df 而不是 tablelook 而不是 lookup

制作一个命名向量,遍历每一列并匹配,参见:

# make lookup vector with names
lookUp1 <- setNames(as.character(lookUp$class), lookUp$pet)
lookUp1    
#      cat    lizard    parrot 
# "mammal" "reptile"    "bird" 

# match on names get values from lookup vector
res <- data.frame(lapply(df1, function(i) lookUp1[i]))
# reset rownames
rownames(res) <- NULL

# res
#        P1      P2      P3
# 1  mammal reptile    bird
# 2 reptile    bird  mammal
# 3    bird  mammal reptile

数据

df1 <- read.table(text = "
       P1     P2     P3
 1    cat lizard parrot
 2 lizard parrot    cat
 3 parrot    cat lizard", header = TRUE)

lookUp <- read.table(text = "
      pet   class
 1    cat  mammal
 2 lizard reptile
 3 parrot    bird", header = TRUE)

显示如何在 dplyr 中执行此操作的答案 没有回答问题,table 充满了 NA。这有效,我将不胜感激任何显示更好方法的评论:

# Add a customer column so that we can put things back in the right order
table$customer = seq(nrow(table))
classTable <- table %>% 
    # put in long format, naming column filled with P1, P2, P3 "petCount"
    gather(key="petCount", value="pet", -customer) %>% 
    # add a new column based on the pet's class in data frame "lookup"
    left_join(lookup, by="pet") %>%
    # since you wanted to replace the values in "table" with their
    # "class", remove the pet column
    select(-pet) %>% 
    # put data back into wide format
    spread(key="petCount", value="class")

请注意,保留包含客户、宠物、宠物的种类 (?) 及其 class 的长 table 可能会很有用。这个例子简单地添加了一个中间保存到一个变量:

table$customer = seq(nrow(table))
petClasses <- table %>% 
    gather(key="petCount", value="pet", -customer) %>% 
    left_join(lookup, by="pet")

custPetClasses <- petClasses %>%
    select(-pet) %>% 
    spread(key="petCount", value="class")

我尝试了其他方法,但它们在我非常大的数据集上花费了很长时间。我改用了以下内容:

    # make table "new" using ifelse. See data below to avoid re-typing it
    new <- ifelse(table1 =="cat", "mammal",
                        ifelse(table1 == "lizard", "reptile",
                               ifelse(table1 =="parrot", "bird", NA)))

此方法需要您为代码编写更多文本,但 ifelse 的矢量化使其 运行 更快。您必须根据自己的数据来决定是花更多时间编写代码还是等待计算机 运行。如果你想确保它有效(你的 iflese 命令中没有任何拼写错误),你可以使用 apply(new, 2, function(x) mean(is.na(x)))

数据

    # create the data table
    table1 <- read.table(text = "
       P1     P2     P3
     1    cat lizard parrot
     2 lizard parrot    cat
     3 parrot    cat lizard", header = TRUE)

我是使用 factor 内置的。

table$P1 <- factor(table$P1, levels=lookUp$pet, labels=lookUp$class)
table$P2 <- factor(table$P2, levels=lookUp$pet, labels=lookUp$class)
table$P3 <- factor(table$P3, levels=lookUp$pet, labels=lookUp$class)

基准

出于强烈的好奇心,我只是 运行 我想与您分享一些方法的基准。我不太相信答案中关于性能的一些陈述,并试图在此澄清这一点。为了不被不同的rows/columns比率误导,我考虑三种情况:

  1. ncol == nrow

  2. ncol << nrow

  3. ncol >> nrow.

事先强制 as.matrix 可能会有好处,所以我将其作为 附加解决方案unlist_mat).

microbenchmark::microbenchmark(
  lapply=Dat1[col_set] <- lapply(Dat1[col_set], function(x) Look$class[match(x, Look$pet)]),
  unlist=Dat2[col_set] <- Look$class[match(unlist(Dat2[col_set]), Look$pet)],
  unlist_mat=Mat[, col_set] <- Look$class[match(as.vector(Mat[, col_set]), Look$pet)],  ## added
  ifelse=Dat3[col_set] <- ifelse(Dat3[col_set] == "cat", "mammal",
                                 ifelse(Dat3[col_set] == "lizard", "reptile",
                                        ifelse(Dat3[col_set] == "parrot", "bird", NA))),
  look_vec=Dat4[] <- lapply(Dat4, function(i) look[i]),
  times=3L
)

## 1e3 x 1e3
# Unit: milliseconds
#       expr       min        lq      mean    median        uq       max neval cld
#     lapply  40.42905  63.47053  78.03831  86.51201  96.84294 107.17387     3  a 
#     unlist 513.25197 540.55981 656.25420 567.86766 727.75531 887.64297     3   b
# unlist_mat  45.91743  56.51087  68.50595  67.10432  79.80021  92.49611     3  a 
#     ifelse 117.83513 153.23771 366.16708 188.64030 490.33306 792.02581     3  ab
#   look_vec  58.54449  88.40293 112.91165 118.26137 140.09522 161.92908     3  a 

## 1e4 x 1e4
# Unit: seconds
#       expr       min        lq      mean    median         uq        max neval cld
#     lapply  2.427077  3.558234  3.992481  4.689390   4.775183   4.860977     3  a 
#     unlist 73.125989 79.203107 94.027433 85.280225 104.478155 123.676084     3   b
# unlist_mat  4.940254  5.011684  5.576553  5.083114   5.894703   6.706291     3  a 
#     ifelse  9.714553 14.444899 36.176777 19.175244  49.407889  79.640535     3  a 
#   look_vec  8.460969  8.558600  8.784463  8.656230   8.946209   9.236188     3  a 

## 1e5 x 1e3
# Unit: seconds
#       expr       min        lq      mean    median        uq        max neval cld
#     lapply  2.314427  2.403001  3.270708  2.491575  3.748848   5.006120     3  a 
#     unlist 64.098825 66.850221 81.402676 69.601616 90.054601 110.507586     3   b
# unlist_mat  5.018869  5.060865  5.638499  5.102861  5.948314   6.793767     3  a 
#     ifelse  6.244744 16.488266 39.208119 26.731788 55.689807  84.647825     3  ab
#   look_vec  4.512672  6.434651  7.496267  8.356630  8.988064   9.619498     3  a 

## 1e3 x 1e5
# Unit: seconds
#       expr        min         lq       mean     median         uq        max neval cld
#     lapply  52.833019  55.373432  71.308981  57.913845  80.546963 103.180080     3 ab 
#     unlist 164.901805 168.710285 186.454796 172.518765 197.231292 221.943819     3   c
# unlist_mat   3.872551   4.422904   4.695393   4.973257   5.106814   5.240372     3 a  
#     ifelse  72.592437  76.473418 103.930063  80.354399 119.598876 158.843354     3  b 
#   look_vec  56.444824  58.904604  62.677267  61.364383  65.793488  70.222593     3 ab 

注意:Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz 上执行 使用 R --vanilla.

all(sapply(list(Dat2, as.data.frame(Mat), Dat3, Dat4), identical, Dat1))  ## *
# [1] TRUE
## *manipulate the data first outside the benchmark, of course!

结论

如果列数 low/lower 而不是行数,则将 lapply 与查找矩阵一起使用似乎是一个不错的选择。如果我们有很多列,尤其是与行相比,我们可能会受益于首先将数据帧的各个列强制转换为矩阵,这应该只需要眨眼。


set.seed(42)
n <- 1e4; m <- 1e4
Dat <- data.frame(matrix(sample(c("cat", "lizard", "parrot"), n*m, replace=TRUE), n, m))
Look <- structure(list(pet = c("cat", "lizard", "parrot"), class = c("mammal", "reptile", "bird")),
                  class = "data.frame", row.names = c("1", "2", "3"))
look <- setNames(as.character(Look$class), Look$pet)
col_set <- names(Dat)
system.time(
  Mat <- as.matrix(Dat)
)
#  user  system elapsed 
# 0.844   0.318   1.161 
Dat1 <- Dat2 <- Dat3 <- Dat4 <- Dat