How to correctly extract a numeric component from complex strings in a data frame and substitute the strings with extraction output?

Question

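I have a data.frame with two variables holding string expressions like "ABC`w/XYZ 8", where w is any number between 1 and 999. What I need to do is extract w and replace the whole string with it. I use this code: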

df <- data.frame(a = c("ABC`5/XYZ 8", "A`25/BHU 19", "ach`246/chy 0"), b = c("sfse`3/cjd 65", "jlke`234/Chu 19", "h`45/hy 0"))

df$a <- sapply(df$a, function(x) {substr(df$a[x], regexpr("`[0-9]+/", df$a[x]) +1,
+  regexpr("`[0-9]+/", df$a[x]) + attr(regexpr("`[0-9]+/", df$a[x]), "match.length")-2)})

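It runs, but instead of a = c(5, 25, 246) I get a = c(25, 5, 246). I guess this is because a is of class factor. However, when a is of class character, I get NAs as output. Is there a way to preserve the order of a, or to use sapply and substr on a character vector?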

Answer 1

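We can use sub to extract the number at the position marked 'w' in the string. The pattern matches one or more letters or backticks ([A-Za-z`]+), captures one or more digits as a group ((\d+)), and then matches the remaining characters (.*). The whole match is replaced with the backreference to the captured group, so only the digits remain.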

as.numeric(sub("[A-Za-z`]+(\\d+).*", "\\1", df$a))
#[1]   5  25 246
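
To replace the strings inside the data frame itself, as the question asks, the same idea can be applied to every column. The following is a minimal sketch, assuming both columns follow the letters, backtick, digits, slash layout shown in the question; `extract_w` is a hypothetical helper name used only for illustration, and its pattern is a slightly stricter variant of the one above.

```r
# Example data from the question, kept as character columns
df <- data.frame(
  a = c("ABC`5/XYZ 8", "A`25/BHU 19", "ach`246/chy 0"),
  b = c("sfse`3/cjd 65", "jlke`234/Chu 19", "h`45/hy 0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the digits between the backtick and the slash.
# as.character() makes it safe for factor columns as well.
extract_w <- function(x) {
  as.numeric(sub("^[^`]*`(\\d+)/.*$", "\\1", as.character(x)))
}

# Overwrite each column with its extracted numbers;
# assigning into df[] keeps the data.frame structure and row order
df[] <- lapply(df, extract_w)
df
```

Because sub is vectorised, no element-wise indexing is involved, so the row order is preserved.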

Another option is str_extract, which returns the first run of digits in each string:

library(stringr)
as.numeric(str_extract(df$a, "\\d+"))
#[1]   5  25 246
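
As for why the code in the question reorders the values: the likely cause is that each element x is used as an index (df$a[x]) rather than as the string itself. When a is a factor, indexing by x uses its integer level code, so the elements come back in level order; when a is character, x is looked up as a name, which gives NA. The sketch below keeps the regexpr/substr approach from the question but drops sapply, since both functions are vectorised. It assumes character input; the variable names `a`, `m`, and `w` are illustrative.

```r
# Strings from column a of the question, as a character vector
a <- c("ABC`5/XYZ 8", "A`25/BHU 19", "ach`246/chy 0")

# Position and length of the "`digits/" match in each string
m <- regexpr("`[0-9]+/", a)

# Start one character after the backtick, stop one character before the slash
w <- substr(a, m + 1, m + attr(m, "match.length") - 2)

as.numeric(w)
#[1]   5  25 246
```

For a factor column, wrapping the input in as.character() first gives the same result.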

r

substr

sapply