R and Data Selection

Question

I have a data.table dt, shown below:

structure(list(IM = c(0.830088495575221, 0.681436210847976, 0.498810939357907, 
  0.47265400115141, 0.527908540685945, 0.580763582966226, 0.408069043807859, 
  0.467368671545006, 0.44662887412295, 0.0331974034502217, 0.0368210899219588, 
  0.0333698233772947, 0.0294312465832275, 0.578743426515361, 0.566950053134963, 
  0.808756701221038, 0.585507838980771, 0.61507839619537, 0.586388329979879, 
  0.794196637085474), CM = c(0.876991150442478, 0.996180290297937, 
  0.651605231866825, 0.824409902130109, 0.94418291862811, 0.961820851688693, 
  0.943861532396347, 1.10137922144883, 1.1524325077831, 0.128868067469359, 
  0.155932251596297, 0.159414951213752, 0.196968075413411, 1.19678937171326, 
  0.901168969181722, 3.42528220866977, 2.4377239516641, 2.0040870054458, 
  1.86099597585513, 1.51928615911568), RM = c(0.601769911504425, 
  0.495034377387319, 0.405469678953627, 0.368451352907311, 0.361802286482851, 
  0.320851688693098, 0.791548118347242, 0.816050925099649, 0.786622368849031, 
  0.545805622636092, 0.594370732740163, 0.594771872860171, 0.536043514857356, 
  0.617215610296153, 0.619287991498406, 0.602602774009141, 0.634069706132375, 
  0.596543561108693, 0.582203219315895, 0.695985131558462)), .Names = c("IM",  "CM", "RM"), class = c("data.table", "data.frame"), row.names
  = c(NA, 
  -20L), .internal.selfref = <pointer: 0x00000000003f0788>)

I wrote the following function:

DSanity.markWinsorize <- function(dt, colnames)
{
     PERnames <- unlist(lapply(colnames, function(x) paste0("PER",x)));
     print(dt[,colnames])
     if(length(colnames)>1)
     {dt[,PERnames] <- sapply(dt[,colnames], Num.calPtile);}
     else
     {dt[,PERnames] <- Num.calPtile(dt[,colnames]);}

     return(dt)
}

## Calculate Percentile score of a data vector
Num.calPtile <- function(x)
{
     return((ecdf(x))(x))
}

The function is meant to create new columns holding the percentile score of every data point in the columns passed to DSanity.markWinsorize.
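As a small illustration of what Num.calPtile computes, here is a sketch on toy vectors (the values are made up for illustration and are not from the table above). It only relies on base R's ecdf().

```r
# Toy vector (hypothetical values, not taken from the question's data).
x <- c(10, 20, 30, 40)

# ecdf(x) returns a step function. Calling it on x gives, for each element,
# the fraction of values in x that are less than or equal to that element.
ptile <- ecdf(x)(x)
print(ptile)

# Tied values receive the same score.
y <- c(5, 5, 7, 9)
ptile_ties <- ecdf(y)(y)
print(ptile_ties)
```

Note that ecdf() expects numeric input, so it has to be given the column values rather than the column names.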

Here I try to run DSanity.markWinsorize:

colnames <- c('CM','AM','BM')
DSanity.markWinsorize(dt,colnames)

I get the following error:

> sdc1 <- DSanity.markWinsorize(sdc,colnames)
[1] "CM" "AM" "BM"
 Show Traceback

 Rerun with Debug

Error in approxfun(vals, cumsum(tabulate(match(x, vals)))/n, method = "constant",  : 
  zero non-NA points In addition: Warning message:
In xy.coords(x, y) : NAs introduced by coercion

It would be great if some of you could help. Thanks.
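A likely reading of the output above: the print() call shows `[1] "CM" "AM" "BM"`, which suggests that `dt[,colnames]` evaluated to the character vector itself instead of selecting columns, so ecdf() was applied to strings ("NAs introduced by coercion", then "zero non-NA points"). More recent data.table versions may raise an error at that point instead. Also, the sample dt shown has the columns IM, CM and RM, with no AM or BM. The following is a minimal sketch of one way to write the function with data.table, assuming the goal is to add one PER column per input column. The name mark_percentiles and the toy values are hypothetical.

```r
library(data.table)

# Small stand-in table (hypothetical values, same column names as the question).
dt <- data.table(IM = c(0.83, 0.68, 0.50, 0.47),
                 CM = c(0.88, 1.00, 0.65, 0.82),
                 RM = c(0.60, 0.50, 0.41, 0.37))

# Percentile score of each element via the empirical CDF (as in the question).
Num.calPtile <- function(x) ecdf(x)(x)

# Hypothetical rewrite of DSanity.markWinsorize.
mark_percentiles <- function(dt, cols) {
  # Fail early with a readable message if a requested column is absent.
  missing_cols <- setdiff(cols, names(dt))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  out <- copy(dt)                    # leave the caller's table untouched
  per_names <- paste0("PER", cols)
  # .SDcols selects columns by name; := adds the new columns by reference.
  out[, (per_names) := lapply(.SD, Num.calPtile), .SDcols = cols]
  out[]
}

res <- mark_percentiles(dt, c("CM", "IM"))
print(res)
```

Selecting through `.SDcols` avoids the separate single-column branch, because lapply() over `.SD` works the same way for one column or several.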

Answer 1

Your approach is clumsy. I'd recommend a completely different one.

library(dplyr)

colnames <- c("CM", "AM", "BM")

dt %>%
  select_(.dots = colnames) %>%
  mutate_each(funs(ntile(., 100)))

I think this does what you need (possibly with an added %>% bind_cols(dt)).
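For readers on a current dplyr release, where select_(), mutate_each() and funs() are deprecated, here is a sketch of a comparable pipeline. It assumes dplyr 1.0.0 or later and swaps ntile(., 100) for cume_dist(), because cume_dist() returns the same proportion as ecdf(x)(x) in the question, whereas ntile() returns bucket numbers. The toy values and the choice of columns are hypothetical.

```r
library(dplyr)

# Small stand-in data frame (hypothetical values, column names from the question).
dt <- data.frame(IM = c(0.83, 0.68, 0.50, 0.47),
                 CM = c(0.88, 1.00, 0.65, 0.82),
                 RM = c(0.60, 0.50, 0.41, 0.37))

cols <- c("CM", "IM")

# across() applies cume_dist to each selected column; .names keeps the
# originals and adds PER-prefixed columns next to them.
res <- dt %>%
  mutate(across(all_of(cols), cume_dist, .names = "PER{.col}"))

print(res)
```

Because the new columns are added alongside the originals, no bind_cols() step is needed in this variant. all_of() raises an error if a requested column is missing.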

r

percentile