Trouble with gsub and regex in R
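I am using gsub in R to insert text into the middle of a string. It works well, but for some reason it throws an error when the position is too large. The code is below: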
gsub(paste0('^(.{', as.integer(loc[1])-1, '})(.+)$'), new_cols, sql)
Error in gsub(paste0("^(.{273})(.+)$"), new_cols, sql) : invalid
regular expression '^(.{273})(.+)$', reason 'Invalid contents of {}'
This code works when the number in the braces (273 in this case) is smaller, but it fails when the number is this large.
The following reproduces the error:
sql <- "The cat with the bat went to town. He ate the fat mat and wouldn't stop til the sun came up. He was a fat cat that lived with a rat who owned many hats.The cat with the bat went to town. He ate the fat mat and wouldn't stop til the sun came up. He was a fat cat that lived with a rat who owned many hats."
new_cols <- "happy"
gsub('^(.{125})(.+)$', new_cols, sql) #**Works
gsub('^(.{273})(.+)$', new_cols, sql)
Error in gsub("^(.{273})(.+)$", new_cols, sql) : invalid regular
expression '^(.{273})(.+)$', reason 'Invalid contents of {}'
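Background
R's gsub uses the TRE regex library by default. The bounds in a limiting quantifier are valid only from 0 up to RE_DUP_MAX, which is defined in the TRE code. See this TRE reference: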
A bound is one of the following, where n and m are unsigned decimal integers between 0 and RE_DUP_MAX
It seems that RE_DUP_MAX is set to 255 (see this TRE source file, which shows #define RE_DUP_MAX 255), so you cannot use a larger value in a {n,m} limiting quantifier.
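To see the limit in practice, here is a minimal sketch that checks whether a pattern compiles under each engine. The helper name `compiles` is hypothetical and not part of the original post, and the exact cutoff may depend on the TRE version bundled with your R build.

```r
# Hypothetical helper: TRUE if `pattern` compiles, FALSE if the engine rejects it.
# grepl() against an empty string is only used to force regex compilation.
compiles <- function(pattern, perl = FALSE) {
  tryCatch({
    suppressWarnings(grepl(pattern, "", perl = perl))
    TRUE
  }, error = function(e) FALSE)
}

compiles("^(.{125})(.+)$")               # bound is below the limit, so TRE accepts it
compiles("^(.{273})(.+)$")               # the pattern from the question; TRE rejected it there
compiles("^(.{273})(.+)$", perl = TRUE)  # same pattern compiled by PCRE instead
```

Checking the pattern this way before calling gsub can make it easier to tell an engine limit apart from a genuine syntax mistake.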
Solution
Use the PCRE regex engine by adding perl = TRUE.
> sql <- "The cat with the bat went to town. He ate the fat mat and wouldn't stop til the sun came up. He was a fat cat that lived with a rat who owned many hats.The cat with the bat went to town. He ate the fat mat and wouldn't stop til the sun came up. He was a fat cat that lived with a rat who owned many hats."
> new_cols <- "happy"
> gsub('^(.{273})(.+)$', new_cols, sql, perl=TRUE)
[1] "happy"
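Note that the call above replaces the whole matched string with "happy". If the goal is to insert text at a position, as the question describes, one possible approach is to re-emit the two captured groups around the new text. The sketch below is an assumption about the intended behavior, not the original poster's code: the helper names `insert_at` and `insert_at_substr` are hypothetical, and the `(?s)` flag is added on the assumption that the input may contain newlines, which `.` does not match by default under PCRE.

```r
# Hypothetical helper: insert `new_text` into `x` immediately before the
# 1-based character position `pos`, using PCRE to avoid TRE's bound limit.
# Caveat: backslashes in `new_text` are interpreted as replacement escapes.
insert_at <- function(x, new_text, pos) {
  pattern <- paste0("(?s)^(.{", pos - 1, "})(.*)$")
  # \\1 and \\2 put the captured prefix and suffix back around the new text.
  sub(pattern, paste0("\\1", new_text, "\\2"), x, perl = TRUE)
}

# Regex-free alternative: substr() has no quantifier limit at all.
insert_at_substr <- function(x, new_text, pos) {
  paste0(substr(x, 1, pos - 1), new_text, substr(x, pos, nchar(x)))
}

x <- strrep("a", 300)
result <- insert_at(x, "happy", 274)
same <- identical(result, insert_at_substr(x, "happy", 274))
```

The substr() variant avoids regular expressions entirely, so it may be the simpler choice when the insert position is already known as a number.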