How to subset row values by string label in a single column in R?

I have a column whose row values I want to subset based on the first and last 'string' labels in R. The level values look like this:

[1] "60022 (Location; 9TH FLOOR; Snacks)"
[3] "60024 (Location; 9TH FLOOR; Lg Snacks)"
[5] "60027 (Location; 9TH FLOOR; Sml Snacks)"

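I want to extract the number and the last string delimited by ";". Is there a function or syntax in R to do this? In other words, remove "Location; 9TH FLOOR" and keep only the string after the last ";".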

I tried extracting just the first value, but I could not keep "Snacks" with this code:

#updated_df_2020$Machine <- sub("([A-Za-z]+).*", "\\1", updated_df_2020$Machine)

The final result for each row should be the number followed by the last label (e.g. 60022, then Snacks), as shown below:

[1] "60022 (Snacks)" 
[1] "60024 (Lg Snacks)" 
[1] "60027 (Sml Snacks)" 

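If we need to remove the substring, capture the digits (\\d+) at the start (^) of the string as the first group. Then, after the last ; and zero or more spaces (\\s*), capture a non-whitespace character (\\S) followed by the remaining characters (.*) up to the ) at the end ($) of the string as the second capture group. In the replacement, specify the backreferences of the capture groups (\\1, \\2) and adjust the format by adding the opening (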
updated_df_2020$Machine <- sub("^(\\d+)\\b.*;\\s*\\b(\\S.*\\))$", 
        "\\1 (\\2", updated_df_2020$Machine)
updated_df_2020$Machine
#[1] "60022 (Snacks)"     "60024 (Lg Snacks)"  "60027 (Sml Snacks)"

If the string does not start with digits but you still want to extract its first token, replace (\\d+) with (\\w+)
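As a quick illustration of the (\\w+) variant, here is a minimal sketch. The machine IDs below are hypothetical (they are not from the question) and only serve to show a first token that begins with a letter.

```r
# Hypothetical IDs that start with a letter instead of a digit
ids <- c("A60022 (Location; 9TH FLOOR; Snacks)",
         "B60024 (Location; 9TH FLOOR; Lg Snacks)")

# Same pattern as in this answer, with (\\w+) in place of (\\d+)
res_w <- sub("^(\\w+)\\b.*;\\s*\\b(\\S.*\\))$", "\\1 (\\2", ids)
res_w
#[1] "A60022 (Snacks)"    "B60024 (Lg Snacks)"
```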

data

updated_df_2020 <- data.frame(Machine = c("60022 (Location; 9TH FLOOR; Snacks)",
   "60024 (Location; 9TH FLOOR; Lg Snacks)", "60027 (Location; 9TH FLOOR; Sml Snacks)"),
   stringsAsFactors = FALSE)

You could do

> a <- c("60022 (Location; 9TH FLOOR; Snacks)", "60024 (Location; 9TH FLOOR; Snacks)", "60027 (Location; 9TH FLOOR; Snacks)")
> strs <- strsplit(a, split = " ")
> sapply(strs, function(s) paste(s[1], paste0("(", s[length(s)])))
#
# "60022 (Snacks)" "60024 (Snacks)" "60027 (Snacks)"
#

which is uglier, but I think a bit easier to understand
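Note that splitting on spaces keeps only the final word, so a label such as "Lg Snacks" would come back as "Snacks". A possible variation, sketched here under the assumption that the label is always the last ";"-separated piece and the number is everything before the first space, is to split on ";" instead:

```r
a <- c("60022 (Location; 9TH FLOOR; Snacks)",
       "60024 (Location; 9TH FLOOR; Lg Snacks)",
       "60027 (Location; 9TH FLOOR; Sml Snacks)")

# Leading number: drop everything from the first space onward
num <- sub(" .*", "", a)

# Last ";"-separated piece (still ends with ")"), whitespace trimmed
parts <- strsplit(a, ";", fixed = TRUE)
last <- trimws(vapply(parts, function(p) p[length(p)], character(1)))

# Re-add the opening parenthesis between the two pieces
res_split <- paste0(num, " (", last)
res_split
#[1] "60022 (Snacks)"     "60024 (Lg Snacks)"  "60027 (Sml Snacks)"
```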

We can use sub to extract the digits at the start and everything after the last semicolon:

sub("(\\d+).*;\\s*(.*)", "\\1 (\\2", x)
#[1] "60022 (Snacks)"     "60024 (Lg Snacks)"  "60027 (Sml Snacks)"

where x is

x <- c("60022 (Location; 9TH FLOOR; Snacks)", 
       "60024 (Location; 9TH FLOOR; Lg Snacks)",
       "60027 (Location; 9TH FLOOR; Sml Snacks)")