Split a string and keep just one specific part
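In my data.table, I use tstrsplit to split the ValueId column, together with the keep= argument. In this case, however, I don't know in advance which value to pass to keep; I want to use the value in the Level column.
All my attempts have failed. Is this possible? Maybe not in data.table?
Here is a reprex: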
library(data.table)
foo <- data.table(
  Level = c(2, 2, 3, 4, 3),
  ValueId = c("11983:1055521", "11983:1055521-5168:290668-198:100798",
    "11983:1055521-5168:290668-198:100798-92:91604-139:94569-135:94719-5161:290771-5162:290728-5166:290620",
    "11983:1055521-5168:290668-198:100798-92:91604-139:94569-135:94719-5161:290771",
    " 11983:1055521-5168:290676-198:100794-92:91781-139:95090-135:95353"))
# Works: keep is a single hard-coded position
foo[, newvar := tstrsplit(ValueId, "-", fixed = TRUE, keep = 4)]
# Does not do what I want: keep should vary per row, taken from Level
foo[, newvar := tstrsplit(ValueId, "-", fixed = TRUE, keep = Level)]
Thanks!
You can use mapply with `[` to extract, from each vector returned by strsplit, the substring at the position given in foo$Level.
mapply(`[`, strsplit(foo$ValueId, "-", fixed = TRUE), foo$Level)
#[1] NA "5168:290668" "198:100798" "92:91604" "198:100794"
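To store the result as a column, the same idea can be wrapped in a small helper and used inside `:=`. This is only a sketch: `pick_piece` is a hypothetical name, and the toy data below stands in for the question's `foo`. It uses vapply rather than mapply so that the result is guaranteed to be a character vector with one value per row.

```r
library(data.table)

# Toy data standing in for the question's table
foo <- data.table(
  Level   = c(2, 2, 3),
  ValueId = c("a:1", "a:1-b:2-c:3", "a:1-b:2-c:3-d:4")
)

# Hypothetical helper: k-th piece of each string, NA when there are fewer pieces
pick_piece <- function(x, k, sep = "-") {
  pieces <- strsplit(x, sep, fixed = TRUE)
  # Indexing past the end of a character vector yields NA_character_
  vapply(seq_along(pieces), function(i) pieces[[i]][k[i]], character(1))
}

foo[, newvar := pick_piece(ValueId, Level)]
print(foo)
```

Because the indexing happens per element, no grouping with by is needed here.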
There are a few issues. One of them lies in the tstrsplit function itself, which is defined as:
function (x, ..., fill = NA, type.convert = FALSE, keep, names = FALSE)
{
    if (!isTRUEorFALSE(names) && !is.character(names))
        stop("'names' must be TRUE/FALSE or a character vector.")
    ans = transpose(strsplit(as.character(x), ...), fill = fill,
        ignore.empty = FALSE)
    if (!missing(keep)) {
        keep = suppressWarnings(as.integer(keep))
        chk = min(keep) >= min(1L, length(ans)) & max(keep) <=
            length(ans)
        if (!isTRUE(chk))
            stop("'keep' should contain integer values between ",
                min(1L, length(ans)), " and ", length(ans),
                ".")
        ans = ans[keep]
    }
    if (type.convert)
        ans = lapply(ans, type.convert, as.is = TRUE)
    if (isFALSE(names))
        return(ans)
    else if (isTRUE(names))
        names = paste0("V", seq_along(ans))
    if (length(names) != length(ans)) {
        str = if (missing(keep))
            "ans"
        else "keep"
        stop("length(names) (= ", length(names), ") is not equal to length(",
            str, ") (= ", length(ans), ").")
    }
    setattr(ans, "names", names)
    ans
}
<bytecode: 0x0000019bffd6da98>
<environment: namespace:data.table>
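The important thing to note is that the if block checks whether your keep value fits within what is returned, that is, whether it lies between 1 and length(ans), the number of pieces in the longest string of the whole column. In your example, the first row returns NA for that position. Your hard-coded example works because strsplit is vectorized: the short row is processed together with the longer, valid rows, so the if block is not triggered. You can verify this by changing 4 to 40, which gives this message:
Error in tstrsplit(ValueId, "-", fixed = TRUE, keep = 40) : 'keep' should contain integer values between 1 and 9.
because in that case no row has a piece at that position.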
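The behaviour of that range check can be seen on a small vector. The sketch below uses made-up strings (`x`, `parts` and `res` are illustrative names, not from the question) and only relies on tstrsplit padding short rows with NA, as the source above shows via transpose(..., fill = fill).

```r
library(data.table)

x <- c("a", "a-b-c", "a-b-c-d-e")

# One list element per split position; shorter strings are padded with NA
parts <- tstrsplit(x, "-", fixed = TRUE)
length(parts)   # positions available across the whole vector
parts[[4]]      # 4th piece of every string, NA where it does not exist

# keep = 4 passes the check because the longest string has at least 4 pieces
tstrsplit(x, "-", fixed = TRUE, keep = 4)

# keep = 40 exceeds the longest string, so the check stops with an error;
# tryCatch captures the message instead of aborting the script
res <- tryCatch(
  tstrsplit(x, "-", fixed = TRUE, keep = 40),
  error = function(e) conditionMessage(e)
)
print(res)
```

The check is made once against the whole input, not per string, which is why a short row alone never triggers it.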
So what you need to do is redefine the tstrsplit function so that it returns NA instead of stopping:
tstrsplitNA <- function(x, ..., fill = NA, type.convert = FALSE, keep)
{
    ans = transpose(strsplit(as.character(x), ...), fill = fill,
        ignore.empty = FALSE)
    if (!missing(keep)) {
        keep = suppressWarnings(as.integer(keep))
        chk = min(keep) >= min(1L, length(ans)) & max(keep) <=
            length(ans)
        # keep is out of range: fall back to NA instead of stopping
        if (!isTRUE(chk))
            ans <- NA_character_
        ans = ans[keep]
    }
    if (type.convert)
        ans = lapply(ans, type.convert, as.is = TRUE)
    ans
}
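That alone is not enough. Because strsplit is vectorized, foo[, newvar := tstrsplitNA(ValueId, split="-", fixed = TRUE, keep = Level)] does not run the function row by row; it passes the whole ValueId column to strsplit and then transposes the result, which returns garbage relative to what you want.
You can tell data.table to perform the operation row by row simply by using the by argument with Level and ValueId: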
foo[, newvar := tstrsplitNA(ValueId, split="-", fixed = TRUE, keep = Level), by=c('Level','ValueId')]
foo
Level ValueId newvar
1: 2 11983:1055521 NA
2: 2 11983:1055521-5168:290668-198:100798 5168:290668
3: 3 11983:1055521-5168:290668-198:100798-92:91604-139:94569-135:94719-5161:290771-5162:290728-5166:290620 198:100798
4: 4 11983:1055521-5168:290668-198:100798-92:91604-139:94569-135:94719-5161:290771 92:91604
5: 3 11983:1055521-5168:290676-198:100794-92:91781-139:95090-135:95353 198:100794