stringr::str_sub output is unexpected
Consider the following data.frame:
df <- structure(list(sufix = c("atizado", "atoria", "atório", "auta",
"áutico", "ável"), min_stem_len = c(4, 5, 3, 5, 4, 2), replacement = c("",
"", "", "", "", ""), exceptions = list(character(0), character(0),
character(0), character(0), character(0), c("afável", "razoável",
"potável", "vulnerável"))), .Names = c("sufix", "min_stem_len",
"replacement", "exceptions"), row.names = 21:26, class = c("tbl_df",
"tbl", "data.frame"))
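The variable sufix of this data.frame holds a vector of strings.
Now I have a word, word <- "amável", and for each string in df$sufix I want the suffix of this word that has the same length as that string.
I am using the following code: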
library(stringr)
word <- "amável"
str_sub(word, start = -stringr::str_length(df$sufix))
But this outputs:
> str_sub(word, start = -stringr::str_length(df$sufix))
[1] "amável" "mável" "mável" "vel" "mável" "vel"
> df$sufix
[1] "atizado" "atoria" "atório" "auta" "áutico" "ável"
I expected the last element of the result vector to be "ável", because:
> str_length("ável")
[1] 4
> str_sub(word, start = -4)
[1] "ável"
Here is a simpler reproducible example:
set.seed(100)
a <- sample(1:10, 10000, replace = TRUE)
res <- rep("ábc", 10000) %>% str_sub(start = -a)
sum(ifelse(a > 3, 3, a) != str_length(res))
[1] 2504
If you look closely, all of the results are wrong (except the first one).
They should be:
[1] "amável" "amável" "amável" "ável" "amável" "ável"
This can easily be solved with:
library(stringi)
stri_sub(rep(word, 6), from = -stri_length(df$sufix))
I bet you can adapt your stringr code in the same way.
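The workaround above hard-codes the repetition count 6. A minimal sketch of the same idea with the count taken from the suffix vector follows. The helper name tail_like is hypothetical (it is not part of stringi or stringr), and the accented letters are written as \u escapes only so that the strings are UTF-8 regardless of the session locale.

```r
library(stringi)

# Hypothetical helper: for each element of `suffixes`, return the trailing
# part of `word` that has the same number of characters.
tail_like <- function(word, suffixes) {
  n <- stri_length(suffixes)
  # Repeat `word` explicitly instead of relying on recycling inside stri_sub()
  stri_sub(rep(word, length(n)), from = -n)
}

# "\u00e1" is á and "\u00f3" is ó
sufix <- c("atizado", "atoria", "at\u00f3rio", "auta", "\u00e1utico", "\u00e1vel")
res <- tail_like("am\u00e1vel", sufix)
res
```

Passing df$sufix instead of the literal vector should behave the same way, since only the lengths of the suffixes are used.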
### Edited to add ###
I see what you mean now. There is definitely some odd behavior, most likely related to the special character á. See the following example:
df <- data.frame(suffix = c("Lorem","ipsum","dolor","sit","amet","consectetur","adipiscing", "elit","Donec","arcu"))
df$len <- stri_length(df$suffix)
Then look at the odd behavior of the 7th element of the result:
stri_sub("amavel", from = -df$len)
## [1] "mavel" "mavel" "mavel" "vel" "avel" "amavel" "amavel" "avel"
## [9] "mavel" "avel"
# Compared to
stri_sub("amável", from = -df$len)
## [1] "mável" "mável" "mável" "vel" "ável" "amável" "mável" "ável"
## [9] "mável" "ável"
Oddly, if rep is used, the result in the latter case is corrected:
stri_sub(rep("amável", 10), from = -df$len)
## [1] "mável" "mável" "mável" "vel" "ável" "amável" "amável" "ável"
## [9] "mável" "ável"
# note how the 7th element is now correct.
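So, although it is a bit hacky, the solution provided above should work.
I tried looking at the code of stri_sub, which refers to C_stri_sub, but that was a dead end for me. Perhaps someone who knows more about C and/or string manipulation can help?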
### Second edit ###
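It seems to me that the problem lies in the recycling of the string inside the call to stri_sub. Have a look at this alternative to the code you posted in your edit: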
set.seed(100)
a <- sample(1:10, 10000, replace = TRUE)
res <- stri_sub(rep("ábc", 10000), from = -a)
sum(ifelse(a > 3, 3, a) != stri_length(res))
## [1] 0
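To see whether an installed stringi is affected, the two call forms can be checked side by side against the expected lengths. This is only a sketch: the variable names are illustrative, the sample is smaller than in the thread, and "\u00e1bc" is "ábc" written with an escape so the string is UTF-8 in any locale.

```r
library(stringi)

set.seed(100)
a <- sample(1:10, 1000, replace = TRUE)
# "ábc" has 3 characters, so a suffix can be at most 3 characters long
expected_len <- pmin(a, 3L)

# Form 1: a single string, recycled inside stri_sub()
recycled <- stri_sub("\u00e1bc", from = -a)
# Form 2: the string repeated explicitly before the call
explicit <- stri_sub(rep("\u00e1bc", length(a)), from = -a)

# Number of results with the wrong length under each form
mismatches <- c(
  recycled = sum(stri_length(recycled) != expected_len),
  explicit = sum(stri_length(explicit) != expected_len)
)
mismatches
```

A non-zero count for either form would indicate that the installed version still shows the behavior described in the question.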
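This has been fixed in the development branch of stringi, see https://github.com/gagolews/stringi/issues/227 (the fix matters here because str_sub in stringr relies on stri_sub in stringi). Once an update is available on CRAN, anyone in the "general public" will be able to reproduce the correct behavior, namely: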
str_sub(word, start = -stringr::str_length(df$sufix))
## [1] "amável" "amável" "amável" "ável" "amável" "ável"
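For anyone who cannot update yet, here is a sketch that avoids stri_sub altogether by using base R's substring and nchar. It assumes the strings are UTF-8 (hence the \u escapes for á and ó) so that nchar counts characters rather than bytes; it has not been checked against the affected stringi version, since it does not call stringi at all.

```r
word <- "am\u00e1vel"  # "amável"
sufix <- c("atizado", "atoria", "at\u00f3rio", "auta", "\u00e1utico", "\u00e1vel")

# Number of characters in each suffix
n <- nchar(sufix, type = "chars")
# Start position of each trailing piece, clamped to 1 when the suffix is
# longer than the word
start <- pmax(nchar(word, type = "chars") - n + 1L, 1L)
# substring() recycles `word` over the vector of start positions
res <- substring(word, start)
res
```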