Scale across multiple columns
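I have a data frame with 160 columns and more than 30k rows. I want to scale the values in each column, but the catch is that each column belongs to one of three groups, and the scaling should be done across all values in each of the three groups.
Here is an example: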
data <- data.frame(cbind(apple.fruit=1:3, dog.pet=1:3, pear.fruit=10001:10003, cat.pet=11:13))
The resulting data frame looks like this:
apple.fruit dog.pet pear.fruit cat.pet
1 1 10001 11
2 2 10002 12
3 3 10003 13
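I am hoping to find a clever way to pick out all columns whose name contains the word "fruit" and scale all fruit values together across those columns (and do the same for "pet"), ending up with this: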
apple.fruit dog.pet pear.fruit cat.pet
-0.91305 -1.08112 0.91268 0.72075
-0.91287 -0.90093 0.91287 0.90093
-0.91268 -0.72075 0.91305 1.08112
In other words: instead of apple.fruit being scaled this way:
scale(data$apple.fruit)
I want it to be scaled this way:
scale(c(data$apple.fruit, data$pear.fruit))[1:3]
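To make the target concrete, here is a small sketch of what pooled scaling computes for the fruit group: each value minus the mean of all fruit values, divided by their standard deviation. The variable names `pooled`, `manual` and `via_scale` are only illustrative.

```r
data <- data.frame(apple.fruit = 1:3, dog.pet = 1:3,
                   pear.fruit = 10001:10003, cat.pet = 11:13)

# Pool every value that belongs to the "fruit" group
pooled <- c(data$apple.fruit, data$pear.fruit)

# Center and scale apple.fruit with the pooled mean and standard deviation
manual <- (data$apple.fruit - mean(pooled)) / sd(pooled)

# The same numbers via scale() on the pooled vector
via_scale <- scale(pooled)[1:3]

all.equal(manual, via_scale)
```

By contrast, `scale(data$apple.fruit)` uses only that column's own mean and standard deviation, so it returns -1, 0, 1.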
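Convert your data to long format and scale one column at a time. Here is a way using data.table::melt, which conveniently lets you melt multiple columns at once based on a naming pattern.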
library(data.table)
setDT(data)
roots = unique(sub(".*\\.", "", names(data)))  # group suffixes: "fruit", "pet"
result = melt(data, measure.vars = patterns(roots))
setnames(result, old = paste0("value", seq_along(roots)), new = roots)
# as.numeric() drops the one-column matrix that scale() returns
for (j in names(result)[-1]) set(result, j = j, value = as.numeric(scale(result[[j]])))
result
# variable fruit pet
# 1: 1 -0.9130535 -1.0811250
# 2: 1 -0.9128709 -0.9009375
# 3: 1 -0.9126883 -0.7207500
# 4: 2 0.9126883 0.7207500
# 5: 2 0.9128709 0.9009375
# 6: 2 0.9130535 1.0811250
Otherwise, I think a for loop is pretty simple:
data = as.data.frame(data) # in case you converted to data.table above
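roots = unique(sub(".*\\.", "", names(data)))
for (suffix in roots) {
  cols = grep(paste0(suffix, "$"), names(data))
  # as.numeric() flattens scale()'s one-column matrix so the values refill the columns in order
  data[cols] = as.numeric(scale(unlist(data[cols])))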
}
# apple.fruit dog.pet pear.fruit cat.pet
# 1 -0.9130535 -1.0811250 0.9126883 0.7207500
# 2 -0.9128709 -0.9009375 0.9128709 0.9009375
# 3 -0.9126883 -0.7207500 0.9130535 1.0811250
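The loop above could also be wrapped in a small helper. This is only a sketch under a few assumptions: `scale_by_suffix` is a hypothetical name, the group is taken to be everything after the last dot in the column name, and columns are matched on the exact suffix rather than with `grep`, so that a group called "pet" would not also pick up a column ending in "carpet".

```r
# Hypothetical helper: scale each group of columns with the group's pooled mean and sd
scale_by_suffix <- function(df) {
  # Group label = text after the last "." in each column name
  suffixes <- sub(".*\\.", "", names(df))
  for (s in unique(suffixes)) {
    cols <- which(suffixes == s)
    pooled <- unlist(df[cols], use.names = FALSE)
    m <- mean(pooled, na.rm = TRUE)
    sdev <- sd(pooled, na.rm = TRUE)
    # Replace every column of the group using the pooled statistics
    df[cols] <- lapply(df[cols], function(x) (x - m) / sdev)
  }
  df
}

data <- data.frame(apple.fruit = 1:3, dog.pet = 1:3,
                   pear.fruit = 10001:10003, cat.pet = 11:13)
scaled <- scale_by_suffix(data)
scaled
```

With 160 columns in three groups this touches each column once and never reshapes the data, which may matter for >30k rows.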
A tidyverse approach: convert the data to a "long" tidy format, group by fruit/pet etc., then scale within each group.
library(tidyverse)
data <- data.frame(cbind(apple.fruit=1:3, dog.pet=1:3, pear.fruit=10001:10003, cat.pet=11:13))
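data.tidy <- data %>%
  gather(key = "id", value = "value") %>%
  # split "apple.fruit" into type = "fruit" and name = "apple"
  mutate(type = gsub(".*\\.(.*$)", "\\1", id),
         name = gsub("(.*)\\..*$", "\\1", id)) %>%
  group_by(type) %>%
  # as.numeric() keeps scaleit a plain double column instead of a one-column matrix
  mutate(scaleit = as.numeric(scale(value)))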
data.tidy
#> # A tibble: 12 x 5
#> # Groups: type [2]
#> id value type name scaleit
#> <chr> <int> <chr> <chr> <dbl>
#> 1 apple.fruit 1 fruit apple -0.913
#> 2 apple.fruit 2 fruit apple -0.913
#> 3 apple.fruit 3 fruit apple -0.913
#> 4 dog.pet 1 pet dog -1.08
#> 5 dog.pet 2 pet dog -0.901
#> 6 dog.pet 3 pet dog -0.721
#> 7 pear.fruit 10001 fruit pear 0.913
#> 8 pear.fruit 10002 fruit pear 0.913
#> 9 pear.fruit 10003 fruit pear 0.913
#> 10 cat.pet 11 pet cat 0.721
#> 11 cat.pet 12 pet cat 0.901
#> 12 cat.pet 13 pet cat 1.08
Created on 2018-08-23 by the reprex package (v0.2.0.9000).
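The tidyverse result stays in long format. If the original wide layout is needed, one possible sketch is to carry a row index through the reshape and pivot back. This assumes tidyr 1.0.0 or later (for `pivot_longer` and `pivot_wider`); the `row` column and the name `data.wide` are introduced here only for illustration.

```r
library(dplyr)
library(tidyr)

data <- data.frame(apple.fruit = 1:3, dog.pet = 1:3,
                   pear.fruit = 10001:10003, cat.pet = 11:13)

data.wide <- data %>%
  mutate(row = row_number()) %>%                     # remember the original row
  pivot_longer(-row, names_to = "id", values_to = "value") %>%
  mutate(type = sub(".*\\.", "", id)) %>%            # "fruit" or "pet"
  group_by(type) %>%
  mutate(value = as.numeric(scale(value))) %>%       # pooled scaling per group
  ungroup() %>%
  select(-type) %>%
  pivot_wider(names_from = id, values_from = value) %>%
  select(-row)

data.wide
```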