Best way to count unique elements in a string in R

Question

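I am still a beginner in R and I have a question!

I have a data frame with 222,000 observations, and I am interested in one particular column named id. The problem is that a single string can hold several IDs separated by ','. I want to count the unique elements in each string (that is, in each string of the original data frame). For example: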

id                         results
0000001,0000003            2
0000002,0000002            1
0010001,0001006,0010001    2

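I used the 'str_split_fixed' function to separate all the IDs within each string and put the results in a new data frame (so each cell holds either a single ID or nothing). The problem is that one string can hold up to 68 comma-separated IDs, so the new data frame is huge, with 68 columns and 220,000 observations, and building it takes a long time (maybe 15 seconds). After that I use an apply function to find all the unique values.

Does anyone know a more efficient way, or have any ideas?

In the end, I used the code below: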

sapply(id, function(x) 
           length(    # count items
             unique(   # that are unique
                scan(   # when arguments are presented to scan as text 
                      text=x, what="", sep =",",  # when separated by ","
                      quiet=TRUE)))  )

But I get an error message:

Error in textConnection(text, encoding = "UTF-8") : 
  argument 'text' incorrect 
6 textConnection(text, encoding = "UTF-8") 
5 scan(text = x, what = "", sep = ",", quiet = TRUE) 
4 unique(scan(text = x, what = "", sep = ",", quiet = TRUE)) 
3 FUN(X[[i]], ...) 
2 lapply(X = X, FUN = FUN, ...) 
1 sapply(id, function(x) length(unique(scan(text = x, 
    what = "", sep = ",", quiet = TRUE))))

My R version is:

 R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.0.0 plyr_1.8.3   

loaded via a namespace (and not attached):
[1] magrittr_1.5  tools_3.2.2   Rcpp_0.12.2   stringi_1.0-1
>

I tried this: Encoding(id) <- "UTF-8", but the result was:

Error in `Encoding<-`(`*tmp*`, value = "UTF-8")

The output of dput(id) is:

  [9987,] "2320212,2320230"
  [9988,] "4530090,4530917"
  [9989,] "8532412"
  [9990,] "4560292"
  [9991,] "4540375"
  [9992,] "3311324"
  [9993,] "4540030"
  [9994,] "9010000"
  [9995,] "2811810"
  [9996,] "3311000"
  [9997,] "4540030"
  [9998,] "4540215"
  [9999,] "1541201"
 [10000,] "2423810"
 [ getOption("max.print") est atteint -- 90000 lignes omises ]

The output is huge, so I am posting only the end and the first line:

 [9002,] "9460000"

And dput(head(data$id)):

"9460000,9433000", "9460000,9436000", "9460000,9437000", 
"9510000", "9510010", "9510030", "9510090", "9910000", "9910020", 
"9910040", "9910090", "D", "FIELD_NOT_FOUND", "I"), class = "factor")

Thanks in advance, Jeff

Answer 1

# Example input, inferred from the names in the printed result below.
id <- c("1", "2,3,4", "1,1")

sapply(id, function(x)
           length(    # count items
             unique(   # that are unique
                scan(   # when arguments are presented to scan as text
                      text=x, what="", sep =",",  # when separated by ","
                      quiet=TRUE)))  )
# --- result: first typed line is 'names' of the items, not the results.
    1 2,3,4   1,1 
    1     3     1

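The text = x argument should let scan accept a length-1 character element and split it into pieces at the value of the sep argument. The elements of the id vector are passed to the anonymous function one at a time (row by row if id comes from a data frame).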
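The traceback in the question is consistent with id being a factor (the dput output there ends in class = "factor"): scan(text = x) needs a character string, so textConnection rejects a factor element. Below is a minimal sketch of one way around it, assuming id is a factor or a character vector. The helper name count_unique_ids is made up for illustration, and strsplit is swapped in for scan because it is vectorised over the whole column.

```r
# Hypothetical helper (not from the original post): count the distinct
# comma-separated IDs in each element of `id`.
count_unique_ids <- function(id) {
  # as.character() turns factor values back into plain strings;
  # strsplit() and scan(text = ) both refuse a factor.
  parts <- strsplit(as.character(id), ",", fixed = TRUE)
  # One integer per input element: the number of distinct pieces.
  vapply(parts, function(p) length(unique(p)), integer(1))
}

# Sample data taken from the example in the question, stored as a factor.
id <- factor(c("0000001,0000003",
               "0000002,0000002",
               "0010001,0001006,0010001"))

counts <- count_unique_ids(id)
print(counts)
```

The same conversion should also let the scan version run, by passing as.character(id) to sapply instead of id. The strsplit route avoids opening one text connection per row, which is likely to matter with a couple of hundred thousand rows.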
