Group string values in a column by their beginning
I have a data frame:
ID value
1 request body: <?xml version="2.0"> values received
2 request body: <code> 'jnwg3425'
3 request body: <?xml version="2.0", <PlatCode>, <code> 'qwefn2'
4 Error in message received
5 Error in message received
6 Push forward message x3535
7 Push forward message <MarkCheckMSG>
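I want to group the values in the second column by the similarity of the beginnings of the strings. How can I get a data frame with the pattern of each group, like this: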
patterns
request body:
Error in message received
Push forward message
How can I do this? Which methods suit my goal better? Should I use regular expressions or string distance methods?
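First, we use stringr::str_extract to extract either the first 3 words or the first 2 words followed by a : character. Alternatively, you can use sub to match the complete value and keep only the captured expression, i.e. sub('^(expre).+$', '\\1', value). The regex pattern is \w+ \w+(:| \w+), i.e. match two words (\w+ \w+), then match either a : or another word. Inside an R string literal each backslash must be doubled.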
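library(dplyr)    # provides %>%, mutate() and group_by()
library(stringr)  # provides str_extract()
df %>%
  # backslashes are doubled because the regex lives in an R string literal
  mutate(beginnings = str_extract(value, "\\w+ \\w+(:| \\w+)")) %>%
  group_by(beginnings)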
# A tibble: 7 x 3
# Groups:   beginnings [3]
     ID value                                                      beginnings
  <int> <fct>                                                      <chr>
1     1 request body: <?xml version=2.0> values received           request body:
2     2 request body: <code> jnwg3425                              request body:
3     3 request body: <?xml version=2.0, <PlatCode>, <code> qwefn2 request body:
4     4 Error in message received                                  Error in message
5     5 Error in message received                                  Error in message
6     6 Push forward message x3535                                 Push forward message
7     7 Push forward message <MarkCheckMSG>                        Push forward message
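The sub alternative mentioned above can be sketched in base R, without stringr. This is only an illustration under assumptions: the data frame df below is rebuilt by hand from the question, and the tail of the pattern uses .* instead of .+ so that a value consisting of nothing but the prefix still matches.

```r
# Hand-built copy of the question's data (an assumption; the real df may differ).
df <- data.frame(
  ID = 1:7,
  value = c(
    "request body: <?xml version=\"2.0\"> values received",
    "request body: <code> 'jnwg3425'",
    "request body: <?xml version=\"2.0\", <PlatCode>, <code> 'qwefn2'",
    "Error in message received",
    "Error in message received",
    "Push forward message x3535",
    "Push forward message <MarkCheckMSG>"
  ),
  stringsAsFactors = FALSE
)

# Anchor at the start, capture the prefix in group 1, and let .* consume the rest;
# the replacement "\\1" keeps only the captured prefix.
df$beginnings <- sub("^(\\w+ \\w+(:| \\w+)).*$", "\\1", df$value)

unique(df$beginnings)
```

Note that sub returns a value unchanged when the pattern does not match, whereas str_extract returns NA, so unmatched rows would show up as groups of their own here.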
Using a different regex, (\w+ )+[a-z]{2,}:? => match as many words followed by a space as possible ((\w+ )+), followed by two or more lowercase letters ([a-z]{2,}) and a : if it is present.
df %>%
  # backslashes are doubled because the regex lives in an R string literal
  mutate(beginings = str_extract(value, "(\\w+ )+[a-z]{2,}:?")) %>%
  group_by(beginings)
# A tibble: 7 x 3
# Groups:   beginings [3]
     ID value                                                      beginings
  <int> <fct>                                                      <chr>
1     1 request body: <?xml version=2.0> values received           request body:
2     2 request body: <code> jnwg3425                              request body:
3     3 request body: <?xml version=2.0, <PlatCode>, <code> qwefn2 request body:
4     4 Error in message received                                  Error in message received
5     5 Error in message received                                  Error in message received
6     6 Push forward message x3535                                 Push forward message
7     7 Push forward message <MarkCheckMSG>                        Push forward message
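To get from the grouped rows to the one-column patterns data frame the question asks for, one option is to keep only the distinct extracted prefixes. The sketch below assumes the second regex and a hand-built copy of the question's data; the names df and patterns are chosen here for illustration.

```r
library(dplyr)
library(stringr)

# Hand-built copy of the question's data (an assumption; the real df may differ).
df <- data.frame(
  ID = 1:7,
  value = c(
    "request body: <?xml version=\"2.0\"> values received",
    "request body: <code> 'jnwg3425'",
    "request body: <?xml version=\"2.0\", <PlatCode>, <code> 'qwefn2'",
    "Error in message received",
    "Error in message received",
    "Push forward message x3535",
    "Push forward message <MarkCheckMSG>"
  ),
  stringsAsFactors = FALSE
)

# Extract the prefix of every value, then keep one row per distinct prefix.
# distinct(patterns) drops the other columns and preserves first-seen order.
patterns <- df %>%
  mutate(patterns = str_extract(value, "(\\w+ )+[a-z]{2,}:?")) %>%
  distinct(patterns)

patterns
```

Replacing distinct(patterns) with count(patterns) would additionally report how many rows fall into each group.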