Using regex and tidyr in R to split column variable on first instance of match
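I'm trying to split a column in an R data frame whose values contain multiple spaces, but I only want to split on the first space. Example data frame: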
df <- data.frame(game = c(1, 2, 3, 4, 5, 6), date = c("Monday Apr 3", "Tuesday Apr 4", "Wednesday Apr 5", "Thursday Apr 6", "Friday Apr 7", "Saturday Apr 8"))
I'm trying to use tidyr to split the 'date' column of df on the first space, so that the day ends up in its own column:
  game       day  date
1    1    Monday Apr 3
2    2   Tuesday Apr 4
3    3 Wednesday Apr 5
4    4  Thursday Apr 6
5    5    Friday Apr 7
6    6  Saturday Apr 8
That is the problem. Below is what I have tried and what went wrong.
According to the tidyr documentation, the default value of 'sep' is 'a regular expression that matches any sequence of non-alphanumeric values.' So if I do:
df %>% separate(date, c("day", "date"))
This splits on spaces, but it splits on both of them (i.e. the space after 'Monday' and the space after 'Apr' in 'Monday Apr 3'), and the third piece is dropped. The result is:
  game       day date
1    1    Monday  Apr
2    2   Tuesday  Apr
3    3 Wednesday  Apr
4    4  Thursday  Apr
5    5    Friday  Apr
6    6  Saturday  Apr
Warning message:
Too many values at 6 locations: 1, 2, 3, 4, 5, 6
I can supply a regex that selects only the first space (I checked that this regex works in Sublime Text):
df %>% separate(date, c("day", "date"), sep = '^[^\\s]*\\K\\s')
But this gives me:
  game             day date
1    1    Monday Apr 3 <NA>
2    2   Tuesday Apr 4 <NA>
3    3 Wednesday Apr 5 <NA>
4    4  Thursday Apr 6 <NA>
5    5    Friday Apr 7 <NA>
6    6  Saturday Apr 8 <NA>
Warning message:
Too few values at 6 locations: 1, 2, 3, 4, 5, 6
So what exactly is going wrong? How can I make this work? Or is there something obvious I'm not understanding?
You need to set the extra argument to "merge":
library(tidyr)
df %>% separate(date, c("day", "date"), extra = "merge")
#  game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8
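Psidom has covered the first warning message, the one about too many values. As for your second approach, where you end up with too few values, part of the reason is that \K does not work in stringi, which separate is using. You can check this yourself with stringi::stri_split_regex(df$date, '^[^\\s]*\\K\\s'). So that regex never produces a split, and you end up with the warning about too few values.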
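For contrast, here is a minimal base R sketch showing that the same pattern does behave as intended under a PCRE engine, which `sub()` uses when `perl = TRUE`. The vector `dates` and the `|` marker are illustrative choices, not part of the original question.

```r
# Illustrative input; the real data lives in df$date.
dates <- c("Monday Apr 3", "Wednesday Apr 5")

# With perl = TRUE, \K discards everything matched so far, so only the
# first whitespace character is replaced by the marker.
marked <- sub("^\\S*\\K\\s", "|", dates, perl = TRUE)

# Split on the literal marker to get the day and the rest of the date.
parts <- strsplit(marked, "|", fixed = TRUE)
print(parts)
```

This only demonstrates the regex difference; for the data frame itself, the `extra = "merge"` answer remains the simpler route.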
You can specify sep as
# a space not followed by a digit
df %>% separate(date, c("day", "date"), sep = "\\s(?!\\d)")
#  game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8
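To see why this pattern only fires on the first space, it can be tried on a couple of strings outside tidyr. The sketch below uses base R's `strsplit()` with `perl = TRUE`; the vector `dates` is an illustrative stand-in for `df$date`.

```r
# Illustrative input; the real data lives in df$date.
dates <- c("Monday Apr 3", "Wednesday Apr 5")

# The negative look-ahead (?!\d) rejects the space in front of the day
# number, so only the space after the weekday is a split point.
parts <- strsplit(dates, "\\s(?!\\d)", perl = TRUE)
print(parts)
```

Note that this relies on the shape of the data: a string with a digit right after the first space would not be split there.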
Some alternative regexes:
\K can't be used, but a variable-length look-behind is possible as long as its quantifier is bounded:
# a space preceded by 3 - 6 characters and "day".
# 3 - 6 characters allows "Monday" and "Wednesday"
"(?<=.{3,6}day)\\s"
# same idea
"(?<=\\S{3,6}day)\\s"
# same idea
"(?<=.?.?.?...day)\\s"
# same idea, but using ^ to anchor and not using "day"
"(?<=^\\S{0,9})\\s"
# space followed by some other characters, a space, digit(s) and the end of the line
"\\s(?=.+\\s\\d+$)"
We can do this easily with base R:
# sub() replaces only the first run of whitespace, turning each value into "day,rest"
cbind(df[1], read.csv(text = sub("\\s+", ",", df$date),
                      header = FALSE, col.names = c("day", "date")))
#  game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8
Another option is extract from tidyr:
library(tidyr)
# first group: everything up to the first whitespace; second group: the rest
extract(df, date, into = c("day", "date"), "(\\S+)\\s+(.*)")
#  game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8