Scale across multiple columns

Question

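I have a data frame with 160 columns and more than 30k rows. I want to scale the values in each column, but the catch is that each column belongs to one of three groups, and the scaling should be computed across all values within each of the three groups.

Here is an example: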

data <- data.frame(cbind(apple.fruit=1:3, dog.pet=1:3, pear.fruit=10001:10003, cat.pet=11:13))

The resulting data frame looks like this:

apple.fruit    dog.pet    pear.fruit    cat.pet
          1          1         10001         11
          2          2         10002         12
          3          3         10003         13

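I am hoping for a clever way to find all columns whose names contain the word "fruit" and scale all the fruit values together across those columns (and do the same for "pet"), ending up with this: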

apple.fruit    dog.pet    pear.fruit    cat.pet
   -0.91305   -1.08112      0.91268     0.72075
   -0.91287   -0.90093      0.91287     0.90093
   -0.91268   -0.72075      0.91305     1.08112

In other words, instead of scaling apple.fruit like this:

scale(data$apple.fruit)

I want to scale it like this:

scale(c(data$apple.fruit, data$pear.fruit))[1:3]
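To make the target concrete, here is a minimal sketch of what pooled scaling computes for the fruit group: every fruit value is centered on the pooled mean and divided by the pooled sample standard deviation. The variable names `fruit`, `manual`, and `pooled` are illustrative only and do not appear in the question.

```r
# Example data from the question
data <- data.frame(apple.fruit = 1:3, dog.pet = 1:3,
                   pear.fruit = 10001:10003, cat.pet = 11:13)

# Pool every fruit value into a single vector
fruit <- c(data$apple.fruit, data$pear.fruit)

# scale() centers on the mean and divides by the sample standard deviation,
# so doing the arithmetic by hand should give the same numbers
manual <- (fruit - mean(fruit)) / sd(fruit)
pooled <- as.vector(scale(fruit))

all.equal(manual, pooled)  # TRUE
pooled[1:3]                # the scaled apple.fruit values
pooled[4:6]                # the scaled pear.fruit values
```

The first three entries correspond to the apple.fruit column of the desired output, and the last three to pear.fruit.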

Answer 1

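Convert your data to long format and scale one column at a time. Here is a method using data.table::melt, which conveniently lets you melt multiple columns at once based on naming patterns.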

library(data.table)
setDT(data)
roots = unique(sub(".*\\.", "", names(data)))  # keep the suffix after the last dot
result = melt(data, measure.vars = patterns(roots))
setnames(result, old = paste0("value", 1:length(roots)), new = roots)
for (j in names(result)[-1]) set(result, j = j, value = scale(result[[j]]))
result
#    variable      fruit        pet
# 1:        1 -0.9130535 -1.0811250
# 2:        1 -0.9128709 -0.9009375
# 3:        1 -0.9126883 -0.7207500
# 4:        2  0.9126883  0.7207500
# 5:        2  0.9128709  0.9009375
# 6:        2  0.9130535  1.0811250

Otherwise, I think a for loop is quite simple:

data = as.data.frame(data) # in case you converted to data.table above
roots = unique(sub(".*\\.", "", names(data)))  # keep the suffix after the last dot

for (suffix in roots) {
  cols = grep(paste0(suffix, "$"), names(data))
  data[cols] = scale(unlist(data[cols]))
}
#   apple.fruit    dog.pet pear.fruit   cat.pet
# 1  -0.9130535 -1.0811250  0.9126883 0.7207500
# 2  -0.9128709 -0.9009375  0.9128709 0.9009375
# 3  -0.9126883 -0.7207500  0.9130535 1.0811250
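For a wide data frame like the one in the question (160 columns, three groups), the same loop can be wrapped in a function. The sketch below is one possible arrangement, not part of the original answer: `scale_by_suffix` is a hypothetical name, and it assumes every column name ends in `.group`. It groups columns by exact suffix with `split()` rather than matching with `grep()`, which avoids accidental matches when one suffix happens to end another (for example "pet" and "carpet").

```r
# Hypothetical helper: scale columns jointly within each name suffix.
# Assumes every column is named like "something.group".
scale_by_suffix <- function(df) {
  suffixes <- sub(".*\\.", "", names(df))        # "apple.fruit" -> "fruit"
  for (cols in split(seq_along(df), suffixes)) { # column indices per group
    vals <- unlist(df[cols], use.names = FALSE)  # pool the group's values
    scaled <- (vals - mean(vals)) / sd(vals)     # same arithmetic as scale()
    df[cols] <- matrix(scaled, nrow = nrow(df))  # refill column by column
  }
  df
}

data <- data.frame(apple.fruit = 1:3, dog.pet = 1:3,
                   pear.fruit = 10001:10003, cat.pet = 11:13)
scaled_data <- scale_by_suffix(data)
scaled_data
```

Because `unlist()` concatenates the columns in order and `matrix()` fills column by column, each scaled value lands back in the cell it came from.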

Answer 2

A tidyverse approach: convert the data to "long" tidy format, group by type (fruit, pet, and so on), then scale within each group.

library(tidyverse)

data <- data.frame(cbind(apple.fruit=1:3, dog.pet=1:3, pear.fruit=10001:10003, cat.pet=11:13))
data.tidy <- data %>%
  gather(key="id",value = "value") %>%
  mutate(type = gsub(".*\\.(.*$)","\\1",id),
         name = gsub("(.*)\\..*$","\\1",id)) %>%
  group_by(type) %>%
  mutate(scaleit = as.numeric(scale(value)))  # scale() returns a matrix; keep a plain numeric column

data.tidy
#> # A tibble: 12 x 5
#> # Groups:   type [2]
#>    id          value type  name  scaleit
#>    <chr>       <int> <chr> <chr>   <dbl>
#>  1 apple.fruit     1 fruit apple  -0.913
#>  2 apple.fruit     2 fruit apple  -0.913
#>  3 apple.fruit     3 fruit apple  -0.913
#>  4 dog.pet         1 pet   dog    -1.08 
#>  5 dog.pet         2 pet   dog    -0.901
#>  6 dog.pet         3 pet   dog    -0.721
#>  7 pear.fruit  10001 fruit pear    0.913
#>  8 pear.fruit  10002 fruit pear    0.913
#>  9 pear.fruit  10003 fruit pear    0.913
#> 10 cat.pet        11 pet   cat     0.721
#> 11 cat.pet        12 pet   cat     0.901
#> 12 cat.pet        13 pet   cat     1.08

Created on 2018-08-23 by the reprex package (v0.2.0.9000).
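Both long-format answers leave the result in long form, whereas the question asks for the original wide layout. As a rough sketch of the full round trip using only base R (this swaps in `stack()`, `ave()`, and `unstack()` in place of the packages used above, and the names `long` and `wide` are illustrative):

```r
data <- data.frame(apple.fruit = 1:3, dog.pet = 1:3,
                   pear.fruit = 10001:10003, cat.pet = 11:13)

# Wide -> long: one row per value, with the source column name in `ind`
long <- stack(data)
long$values <- as.numeric(long$values)

# Group label is the suffix after the last dot ("fruit" or "pet")
long$type <- sub(".*\\.", "", long$ind)

# Scale within each group; ave() returns results in the original row order
long$scaled <- ave(long$values, long$type,
                   FUN = function(v) (v - mean(v)) / sd(v))

# Long -> wide: one column per original column name
wide <- unstack(long, scaled ~ ind)
wide
```

This keeps the grouping logic of the long-format answers while returning a data frame with the same column names as the input.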
