Best way to count unique elements in a string in R
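I'm still a beginner in R and I have a question!
I have a data frame with 222,000 observations, and I'm interested in a specific column named id. The problem is that a single string can hold several IDs separated by ',', and I want to count the unique elements in each string (I mean in each string of the original data frame).
For example: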
id results
0000001,0000003 2
0000002,0000002 1
0010001,0001006,0010001 2
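I have already used the function 'str_split_fixed' to separate all the IDs in each string and put the results in a new data frame (so I know each cell holds either a single ID or nothing). The problem is that there can be up to 68 comma-separated entries in one string, so the new data frame is huge, with 68 columns and 220,000 observations, and it takes a long time (maybe 15 seconds) to find the unique values with an apply function afterwards.
Does anyone know a more efficient way, or have an idea?
Finally, I used the code below: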
sapply(id, function(x)
length( # count items
unique( # that are unique
scan( # when arguments are presented to scan as text
text=x, what="", sep =",", # when separated by ","
quiet=TRUE))) )
But I get an error message:
Error in textConnection(text, encoding = "UTF-8") :
argument 'text' incorrect
6 textConnection(text, encoding = "UTF-8")
5 scan(text = x, what = "", sep = ",", quiet = TRUE)
4 unique(scan(text = x, what = "", sep = ",", quiet = TRUE))
3 FUN(X[[i]], ...)
2 lapply(X = X, FUN = FUN, ...)
1 sapply(id, function(x) length(unique(scan(text = x,
what = "", sep = ",", quiet = TRUE))))
My R version is:
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.0.0 plyr_1.8.3
loaded via a namespace (and not attached):
[1] magrittr_1.5 tools_3.2.2 Rcpp_0.12.2 stringi_1.0-1
>
I tried this: Encoding(id) <- "UTF-8"
But the result is:
Error in `Encoding<-`(`*tmp*`, value = "UTF-8")
The output of dput(id) is:
[9987,] "2320212,2320230"
[9988,] "4530090,4530917"
[9989,] "8532412"
[9990,] "4560292"
[9991,] "4540375"
[9992,] "3311324"
[9993,] "4540030"
[9994,] "9010000"
[9995,] "2811810"
[9996,] "3311000"
[9997,] "4540030"
[9998,] "4540215"
[9999,] "1541201"
[10000,] "2423810"
[ getOption("max.print") est atteint -- 90000 lignes omises ]
The output is large, so I'm posting only the end and the first line:
[9002,] "9460000"
And dput( head(data$id) ):
"9460000,9433000", "9460000,9436000", "9460000,9437000",
"9510000", "9510010", "9510030", "9510090", "9910000", "9910020",
"9910040", "9910090", "D", "FIELD_NOT_FOUND", "I"), class = "factor")
Thanks in advance, Jeff
sapply(id, function(x)
length( # count items
unique( # that are unique
scan( # when arguments are presented to scan as text
text=x, what="", sep =",", # when separated by ","
quiet=TRUE))) )
# --- result: first typed line is 'names' of the items, not the results.
1 2,3,4 1,1
1 3 1
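The argument text=x should allow scan to accept a character element of length 1 and break it into parts at the value of the sep argument. These elements are passed to the anonymous function one at a time from the id vector (or row by row if it comes from a data frame).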
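One likely cause of the textConnection error in the question: the dput output ends with class = "factor", and scan(text = ) expects a character vector. The sketch below assumes that is the problem and converts the column first. The sample vector and the names id_chr, n_scan, pieces and n_split are made up for illustration. The strsplit variant is an alternative technique, not part of the code above; it may be faster on 220,000 rows because all strings are split in one call, but that has not been measured here.

```r
# Hypothetical sample data, stored as a factor like the id column in the question.
id <- factor(c("0000001,0000003", "0000002,0000002", "0010001,0001006,0010001"))

# scan(text = ) needs character input, so convert the factor first.
id_chr <- as.character(id)

# Same scan-based approach as above, now applied to character strings.
n_scan <- sapply(id_chr, function(x)
  length(unique(scan(text = x, what = "", sep = ",", quiet = TRUE))),
  USE.NAMES = FALSE)

# Alternative sketch: split every string in one vectorised call,
# then count the distinct pieces of each element.
pieces <- strsplit(id_chr, ",", fixed = TRUE)
n_split <- vapply(pieces, function(p) length(unique(p)), integer(1))

print(n_scan)   # 2 1 2
print(n_split)  # 2 1 2
```

With the real data this would be called on as.character(data$id). Entries such as "D" or "FIELD_NOT_FOUND" contain no comma and so count as a single element.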