How to select max numeric value out of numeric characters?

Question

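I have a dataset that I group by the Gene column. Some of the values grouped into each row are just "." (joined as ".,"), so I remove them, leaving only a few numeric strings per row and column.

To do this, I am using the following code: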

#Group by Gene:
data <- setDT(df2)[, lapply(.SD, paste, collapse = ", "), by = Genes]

#Remove ., from anywhere in the dataframe
dat <- data.frame(lapply(data, function(x) {
  gsub("\\.,|\\.$|\\,$|(, .$)", "", x)
}))

Before removing ., and after grouping by Gene, the data looks like this:

Gene    col1                     col2                  col3           col4
ACE     0.3, 0.4, 0.5, 0.5       .                      ., ., .        1, 1, 1, 1, 1
NOS2    ., .                     .                      ., ., ., .     0, 0, 0, 0, 0
BRCA1   .                                               ., .           1, 1, 1, 1, 1
HER2    .                        0.1, ., .,  0.2, 0.1   .              1, 1, 1, 1, 1

After removing ., my data looks like this:

Gene    col1                 col2               col3     col4
ACE     0.3, 0.4, 0.5, 0.5                               1, 1, 1, 1, 1
NOS2                                                     0, 0, 0, 0, 0
BRCA1                                                    1, 1, 1, 1, 1
HER2                         0.1,      0.2, 0.1          1, 1, 1, 1, 1

I am now trying to select the min or max value per row and column.

Expected example output:

Gene    col1                 col2            col3    col4
ACE     0.5                                           1
NOS2                                                  0
BRCA1                                                 1
HER2                          0.1                     1

#For col1 I need the max value per row (so for ACE 0.5 is selected)
#For col2 I need the min value per row

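Note that my actual data has 100 columns and 20,000 rows; different columns need either the max or the min value selected per gene.

However, with the code I am using, I only get the expected output for col4, while my other columns repeat the selected value twice (I get 0.5, 0.5 and 0.1, 0.1, and I don't know why).

The code I use to select the min/max values is: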

library(dplyr)
library(stringr)

#Max value per feature and row
max2 = function(x) if(all(is.na(x))) NA else max(x,na.rm = T)
getmax = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
  lapply(.,function(x)max2(as.numeric(x)) ) %>%
  unlist() 

#Min value per feature and row
min2 = function(x) if(all(is.na(x))) NA else min(x,na.rm = T)
getmin = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
  lapply(.,function(x)min2(as.numeric(x)) ) %>%
  unlist() 

data <- dt %>%
  mutate_at(names(dt)[2],getmax)

data <- dt %>%
  mutate_at(names(dt)[3],getmin)

data <- dt %>%
  mutate_at(names(dt)[4],getmax)

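Why are these selection functions not working for all of my columns? All columns are of class character. I am also wondering whether I need to remove ., at all, or whether I can jump straight to selecting the max/min value per row and column?

Example input data: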

structure(list(Gene = c("ACE", "NOS2", "BRCA1", "HER2"), col1 = c("0.3, 0.4, 0.5, 0.5", 
"", "", ""), col2 = c("", "", "", "  0.1,      0.2 0.,1"), col3 = c(NA, 
NA, NA, NA), col4 = c("                         1, 1, 1, 1, 1", 
"                                     0, 0, 0, 0, 0", "                                     1, 1, 1, 1, 1", 
"     1, 1, 1, 1, 1")), row.names = c(NA, -4L), class = c("data.table", 
"data.frame"))

Answer 1

You can use type.convert and set its argument na.strings to ".". You may also want to use the range function to get both the min and the max in one go.

Assume your data.table looks like this
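As a minimal sketch of that idea on a single split string (the vector vals is made up for illustration and is not part of the original data), type.convert turns "." into NA, after which range gives the min and max:

```r
# Hypothetical values for illustration: "." marks a missing entry.
vals <- c("0.1", ".", ".", "0.2", "0.1")

# Treat "." as NA while converting the strings to numbers.
# as.is = TRUE avoids the warning newer R versions emit when it is unspecified.
nums <- type.convert(vals, na.strings = ".", as.is = TRUE)

# range() returns c(min, max); na.rm = TRUE skips the NA entries.
rng <- range(nums, na.rm = TRUE)
rng
```

Here rng[1] is the minimum and rng[2] is the maximum, so there is no need to strip the "." entries beforehand.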

> dt
    Gene               col1                 col2       col3          col4
1:   ACE 0.3, 0.4, 0.5, 0.5                    .    ., ., . 1, 1, 1, 1, 1
2:  NOS2               ., .                    . ., ., ., . 0, 0, 0, 0, 0
3: BRCA1                  .                            ., . 1, 1, 1, 1, 1
4:  HER2                  . 0.1, ., .,  0.2, 0.1          . 1, 1, 1, 1, 1

Consider a function like this

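library(data.table)
library(stringr)

# Returns a two-column matrix (min, max) with one row per element of x
get_range <- function(x) {
  # Split on commas, then convert to numeric, treating "." (and blanks) as NA
  x <- type.convert(str_split(x, ",\\s+", simplify = TRUE), na.strings = ".", as.is = TRUE)
  x <- t(apply(x, 1L, function(i) {
    i <- i[!is.na(i)]
    if (length(i) < 1L) c(NA_real_, NA_real_) else range(i)
  }))
  dimnames(x)[[2L]] <- c("min", "max")
  x
}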

Then you can do

dt[, c(Gene = .(Gene), lapply(.SD, get_range)), .SDcols = -"Gene"]

Output

    Gene col1.min col1.max col2.min col2.max col3.min col3.max col4.min col4.max
1:   ACE      0.3      0.5       NA       NA       NA       NA        1        1
2:  NOS2       NA       NA       NA       NA       NA       NA        0        0
3: BRCA1       NA       NA       NA       NA       NA       NA        1        1
4:  HER2       NA       NA      0.1      0.2       NA       NA        1        1

Note that you do not need to do this by Gene, because the function get_range is already vectorised.
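Since the question asks for the max in some columns and the min in others, one possible variation is sketched below. It is not part of the approach above: the helper pick_value, the vectors max_cols and min_cols, and the toy data are all assumptions made for illustration, and the helper uses base strsplit rather than stringr.

```r
library(data.table)

# Toy data mirroring the question's layout ("." marks a missing value).
dt <- data.table(
  Gene = c("ACE", "NOS2", "BRCA1", "HER2"),
  col1 = c("0.3, 0.4, 0.5, 0.5", "., .", ".", "."),
  col2 = c(".", ".", "", "0.1, ., .,  0.2, 0.1")
)

# Hypothetical helper: apply `fun` (e.g. max or min) to the numbers
# found in each comma-separated string; returns NA when there are none.
pick_value <- function(x, fun) {
  vapply(strsplit(x, ",\\s*"), function(parts) {
    nums <- suppressWarnings(as.numeric(trimws(parts)))  # "." and "" become NA
    nums <- nums[!is.na(nums)]
    if (length(nums) == 0L) NA_real_ else fun(nums)
  }, numeric(1))
}

max_cols <- "col1"  # assumed: columns that need the per-row maximum
min_cols <- "col2"  # assumed: columns that need the per-row minimum

res <- copy(dt)     # work on a copy so dt itself is left unchanged
res[, (max_cols) := lapply(.SD, pick_value, fun = max), .SDcols = max_cols]
res[, (min_cols) := lapply(.SD, pick_value, fun = min), .SDcols = min_cols]
res
```

With 100 columns, max_cols and min_cols would simply be longer character vectors of column names; no grouping by Gene is needed because the helper already works element by element.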


Tags: r, max, min, dplyr, data.table