How to subset row values by string label in a single column in R?
I have a column in R whose row values I want to subset based on the first and last 'string' labels. The level values look like this:
[1] "60022 (Location; 9TH FLOOR; Snacks)"
[3] "60024 (Location; 9TH FLOOR; Lg Snacks)"
[5] "60027 (Location; 9TH FLOOR; Sml Snacks)"
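I want to extract the number and the last string separated by ";". Is there a function or syntax in R to do this? In other words, remove "Location; 9TH FLOOR" and keep the string after the last ";".
I tried extracting just the first value, but I could not keep "Snacks" with this code:
#updated_df_2020$Machine <- sub("([A-Za-z]+).*", "\\1", updated_df_2020$Machine)
The final result for each row should be the number followed by the last label (60022, then Snacks), like this: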
[1] "60022 (Snacks)"
[1] "60024 (Lg Snacks)"
[1] "60027 (Sml Snacks)"
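If we need to remove the substring, capture the digits (\\d+) at the start (^) of the string as the first group. Then capture, as the second group, the non-whitespace character (\\S) that follows a ; and zero or more spaces (\\s*), together with the remaining characters (.*) up to the ) at the end ($) of the string. Because the .* before the ; is greedy, the ; that is matched is the last one. In the replacement, specify the backreferences of the captured groups (\\1, \\2) and add the ( back in front of the second one.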
updated_df_2020$Machine <- sub("^(\\d+)\\b.*;\\s*\\b(\\S.*\\))$",
                               "\\1 (\\2", updated_df_2020$Machine)
updated_df_2020$Machine
#[1] "60022 (Snacks)" "60024 (Lg Snacks)" "60027 (Sml Snacks)"
If the string does not start with digits but you still want to extract the leading token, replace (\\d+) with (\\w+).
Data
updated_df_2020 <- data.frame(Machine = c("60022 (Location; 9TH FLOOR; Snacks)",
"60024 (Location; 9TH FLOOR; Lg Snacks)", "60027 (Location; 9TH FLOOR; Sml Snacks)"),
stringsAsFactors = FALSE)
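As a rough sketch of the (\\w+) variant mentioned in this answer, the snippet below applies it to a made-up vector. The name ids and its letter-prefixed values are assumptions for illustration and do not come from the question's data.

```r
# Hypothetical input: machine IDs that start with a letter instead of a digit
ids <- c("A60022 (Location; 9TH FLOOR; Snacks)",
         "B60024 (Location; 9TH FLOOR; Lg Snacks)")

# (\\w+) captures the leading run of word characters (letters, digits, underscore);
# the rest of the pattern is the same as in the answer above
res <- sub("^(\\w+)\\b.*;\\s*\\b(\\S.*\\))$", "\\1 (\\2", ids)
res
```

With (\\d+) in place of (\\w+), the pattern would not match these values, and sub would return them unchanged.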
You could do:
a <- c("60022 (Location; 9TH FLOOR; Snacks)", "60024 (Location; 9TH FLOOR; Snacks)", "60027 (Location; 9TH FLOOR; Snacks)")
# split each string on spaces, then keep the first and last pieces
strs <- strsplit(a, split = " ")
sapply(strs, function(s) paste(s[1], paste0("(", s[length(s)])))
#[1] "60022 (Snacks)" "60024 (Snacks)" "60027 (Snacks)"
This is uglier, but I think it is a bit easier to understand.
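Splitting on spaces keeps only the last word, so a multi-word label such as "Lg Snacks" from the question would be cut down to "Snacks". A possible variation, sketched here under that assumption, splits on the semicolon instead. The helper name keep_id_and_last is made up for this example.

```r
a <- c("60022 (Location; 9TH FLOOR; Snacks)",
       "60024 (Location; 9TH FLOOR; Lg Snacks)",
       "60027 (Location; 9TH FLOOR; Sml Snacks)")

# Hypothetical helper: keep the leading ID and the last ";"-separated field
keep_id_and_last <- function(v) {
  parts <- strsplit(v, ";\\s*")   # split on ";" plus any spaces that follow it
  id    <- sub("\\s.*$", "", v)   # drop everything from the first whitespace on
  last  <- vapply(parts, function(p) p[length(p)], character(1))
  paste0(id, " (", last)          # the last field already ends with ")"
}

out <- keep_id_and_last(a)
out
```

This stays close to the split-and-paste idea while keeping the whole last field, spaces included.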
We can use sub to extract the number at the start and everything after the last semicolon:
sub("(\\d+).*;(.*)", "\\1 (\\2", x)
#[1] "60022 ( Snacks)" "60024 ( Lg Snacks)" "60027 ( Sml Snacks)"
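The result above keeps the space that follows the last semicolon. One way to drop it, offered as a small variation rather than as this answer's own code, is to consume optional whitespace with \\s* before the second capture group. The vector x is repeated so that the snippet runs on its own.

```r
x <- c("60022 (Location; 9TH FLOOR; Snacks)",
       "60024 (Location; 9TH FLOOR; Lg Snacks)",
       "60027 (Location; 9TH FLOOR; Sml Snacks)")

# \\s* swallows the spaces after the last ";" so they are not captured in group 2
trimmed <- sub("(\\d+).*;\\s*(.*)", "\\1 (\\2", x)
trimmed
```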
Here, x is:
x <- c("60022 (Location; 9TH FLOOR; Snacks)",
"60024 (Location; 9TH FLOOR; Lg Snacks)",
"60027 (Location; 9TH FLOOR; Sml Snacks)")