R - Using gsub() to remove dates from a data set

Question

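I am using gsub() to remove unwanted text from my data. I only want to keep the age in parentheses, not the date of birth. However, this is a large data set with many different birth dates.

Sample data: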

Test1$Age

Sep 10, 1990(27)
Mar 26, 1987(30
Feb 24, 1997(20)

Answer 1

You can do this with str_extract() from the stringr package:

s <- "Sep 10, 1990(27)"

# get the age in parentheses
stringr::str_extract(s, "\\([0-9]+\\)")

# just the age, with parentheses removed
stringr::str_extract(s, "(?<=\\()[0-9]+")

The output is:

> s <- "Sep 10, 1990(27)"
> 
> # get the age in parentheses
> stringr::str_extract(s, "\\([0-9]+\\)")
[1] "(27)"
> 
> # just the age, with parentheses removed
> stringr::str_extract(s, "(?<=\\()[0-9]+")
[1] "27"

The first regular expression matches a pair of parentheses containing one or more digits. The second uses a positive lookbehind to match one or more digits that follow an opening parenthesis.

If your data is in a data.frame df with a column named age, you can do the following:

df$age <- stringr::str_extract(df$age, "\\([0-9]+\\)")

Or, in tidyverse notation:

library(dplyr)  # provides %>% and mutate()
df <- df %>% mutate(age = stringr::str_extract(age, "\\([0-9]+\\)"))
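The difference between the two patterns matters for the sample data, because the second row has no closing parenthesis: a pattern that requires both parentheses does not match it (str_extract() returns NA for a non-match). The following is a minimal base R sketch of that behavior, using grepl() and regexpr()/regmatches() instead of stringr; the vector name ages is hypothetical and is built from the sample rows in the question.

```r
# Hypothetical vector built from the sample rows in the question
ages <- c("Sep 10, 1990(27)", "Mar 26, 1987(30", "Feb 24, 1997(20)")

# Pattern 1 requires both parentheses, so the second row does not match
has_both <- grepl("\\([0-9]+\\)", ages)

# Pattern 2 (lookbehind) only requires the opening parenthesis;
# perl = TRUE is needed for lookbehind in base R
m <- regexpr("(?<=\\()[0-9]+", ages, perl = TRUE)
age_digits <- regmatches(ages, m)

print(has_both)    # TRUE FALSE TRUE
print(age_digits)  # "27" "30" "20"
```

If rows with a missing closing parenthesis should still be kept, the lookbehind pattern is the safer of the two.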

Answer 2

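There seem to be two problems:

the date before the opening parenthesis is not wanted
the closing parenthesis is sometimes missing and needs to be inserted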

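1) sub Both problems can be addressed with sub. Match

any number of characters .* followed by
a literal opening parenthesis [(] followed by
digits in a capture group (\d+) followed by
an optional closing parenthesis [)]?

Then replace the match with an opening parenthesis, the contents of the capture group \1, and a closing parenthesis. Inside an R string literal each backslash must be doubled, so \d is written "\\d" and \1 is written "\\1".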

No packages are used.

pat <- ".*[(](\\d+)[)]?"
transform(test, Age = sub(pat, "(\\1)", Age))

If you want the age as a numeric field instead:

transform(test, Age = as.numeric(sub(pat, "\\1", Age)))
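To see what the pattern does outside of transform(), here is a small sketch that applies it to a plain character vector. The name ages is hypothetical and holds the sample rows from the question; the pattern is the one described above, written with doubled backslashes.

```r
# Hypothetical vector built from the sample rows in the question
ages <- c("Sep 10, 1990(27)", "Mar 26, 1987(30", "Feb 24, 1997(20)")

# .* consumes the date, (\\d+) captures the age, [)]? allows a missing ")"
pat <- ".*[(](\\d+)[)]?"

# Rebuild "(age)" from the capture group, adding ")" where it was missing
bracketed <- sub(pat, "(\\1)", ages)

# Keep only the captured digits and convert them to numbers
numeric_age <- as.numeric(sub(pat, "\\1", ages))

print(bracketed)    # "(27)" "(30)" "(20)"
print(numeric_age)  # 27 30 20
```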

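2) substring/sub Another possibility is to take the substring starting at the 13th character, which gives everything from the opening parenthesis to the end of the string, and then insert a ) if it is missing. This relies on every date having the same width, as in the sample rows. The pattern )?$ matches a closing parenthesis at the end of the string or, if there is none, just the end of the string. The match is replaced with a closing parenthesis. Again, no packages are used.

transform(test, Age = sub(")?$", ")", substring(Age, 13)))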

If we want a numeric age, a variation of this approach is to take everything from the 14th character onward and remove the final ) if present.

transform(test, Age = as.numeric(sub(")", "", substring(Age, 14))))
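The fixed offsets 13 and 14 only work when every date occupies exactly 12 characters. As a hedged variation (not part of the approach above), the position of the opening parenthesis can be looked up per row with regexpr(). The vector ages below is hypothetical and deliberately includes a single-digit day to show the difference.

```r
# Hypothetical input: the first date is one character shorter than the others
ages <- c("Sep 1, 1990(27)", "Mar 26, 1987(30", "Feb 24, 1997(20)")

# Position of the first "(" in each string; fixed = TRUE treats "(" literally
start <- as.integer(regexpr("(", ages, fixed = TRUE))

# Everything from "(" onward, with ")" appended where it is missing
bracketed <- sub("[)]?$", ")", substring(ages, start))

# Digits only: start one character later and drop a trailing ")" if present
numeric_age <- as.numeric(sub("[)]$", "", substring(ages, start + 1L)))

print(bracketed)    # "(27)" "(30)" "(20)"
print(numeric_age)  # 27 30 20
```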

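3) read.table Read the Age field using read.table with sep = "(" and comment.char = ")" and pick off the second column. This gives the numeric age, and we can use sprintf to surround it with parentheses. If Age is character (as opposed to factor), then as.character(Age) can optionally be written as just Age.

Again, no packages are used. This approach does not use regular expressions.

transform(test, Age = 
  sprintf("(%s)", read.table(text = as.character(Age), sep = "(", comment.char = ")")$V2))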
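It may help to look at the intermediate result of read.table on its own. In this small sketch (the names ages and parsed are hypothetical), the "(" separator splits each row into a date column and an age column, and everything from ")" onward is discarded as a comment.

```r
# Hypothetical vector built from the sample rows in the question
ages <- c("Sep 10, 1990(27)", "Mar 26, 1987(30", "Feb 24, 1997(20)")

# V1 holds the date text, V2 holds the age parsed as an integer
parsed <- read.table(text = ages, sep = "(", comment.char = ")")

age_num <- parsed$V2
bracketed <- sprintf("(%s)", age_num)

print(age_num)    # 27 30 20
print(bracketed)  # "(27)" "(30)" "(20)"
```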

Note: The input in reproducible form is:

test <- data.frame(Age = c("Sep 10, 1990(27)", "Mar 26, 1987(30", "Feb 24, 1997(20)"))

Tags: regex, substring, r, gsub