Group string values in column by their beginning

I have a data frame:

ID    value
1     request body: <?xml version="2.0"> values received
2     request body: <code> 'jnwg3425'
3     request body: <?xml version="2.0", <PlatCode>, <code> 'qwefn2'
4     Error in message received
5     Error in message received
6     Push forward message x3535
7     Push forward message <MarkCheckMSG>

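I want to group the values in the second column by the similarity of the beginning of the strings. How can I get a data frame with the pattern of each group, like this: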

    patterns
request body:
Error in message received
Push forward message

How can I do this? Which methods are better suited to my goal? Should I use regular expressions or string-distance methods?

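First, use stringr::str_extract to extract either the first three words or the first two words followed by a colon. Alternatively, you can use sub to match the whole value and keep only the captured expression, i.e. sub('^(expr).+$', '\\1', value). The regex is \w+ \w+(:| \w+): it matches two words (\w+ \w+) and then either a : or a third word. In an R string literal each backslash must be doubled, so the pattern is written "\\w+ \\w+(:| \\w+)".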

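library(dplyr)    # provides %>%, mutate() and group_by()
library(stringr)
df %>%
    # backslashes are doubled because the regex lives in an R string
    mutate(beginnings = str_extract(value, "\\w+ \\w+(:| \\w+)")) %>%
    group_by(beginnings)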
# A tibble: 7 x 3
# Groups:   beginnings [3]
     ID value                                                      beginnings
  <int> <fct>                                                      <chr>               
1     1 request body: <?xml version=2.0> values received           request body:       
2     2 request body: <code> jnwg3425                              request body:       
3     3 request body: <?xml version=2.0, <PlatCode>, <code> qwefn2 request body:       
4     4 Error in message received                                  Error in message    
5     5 Error in message received                                  Error in message    
6     6 Push forward message x3535                                 Push forward message
7     7 Push forward message <MarkCheckMSG>                        Push forward message
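
The sub alternative mentioned above can be sketched in base R as follows. This is only an illustration: the data frame is rebuilt from the question under the assumed name df, and .* is used instead of .+ so that a value consisting of nothing but the prefix is still matched.

```r
# Sample data rebuilt from the question (the name `df` is an assumption).
df <- data.frame(
  ID = 1:7,
  value = c(
    "request body: <?xml version=\"2.0\"> values received",
    "request body: <code> 'jnwg3425'",
    "request body: <?xml version=\"2.0\", <PlatCode>, <code> 'qwefn2'",
    "Error in message received",
    "Error in message received",
    "Push forward message x3535",
    "Push forward message <MarkCheckMSG>"
  ),
  stringsAsFactors = FALSE
)

# Capture the prefix in group 1 and replace the whole string with it.
# perl = TRUE selects the PCRE engine for the pattern.
df$beginnings <- sub("^(\\w+ \\w+(:| \\w+)).*$", "\\1", df$value, perl = TRUE)

print(unique(df$beginnings))
```

One difference to keep in mind: when the pattern does not match, sub returns the input string unchanged, whereas str_extract returns NA.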

Using a different regex:

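(\w+ )+[a-z]{2,}:? => matches as many words followed by a space as possible ((\w+ )+), then two or more lowercase letters ([a-z]{2,}), then a : if one is present. A token such as x3535 or <MarkCheckMSG> does not start with two lowercase letters, so the match backs off and ends at the last plain word before it.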

df %>%
   # backslashes are doubled because the regex lives in an R string
   mutate(beginnings = str_extract(value, "(\\w+ )+[a-z]{2,}:?")) %>%
   group_by(beginnings)
# A tibble: 7 x 3
# Groups:   beginnings [3]
     ID value                                                      beginnings
  <int> <fct>                                                      <chr>
1     1 request body: <?xml version=2.0> values received           request body:            
2     2 request body: <code> jnwg3425                              request body:            
3     3 request body: <?xml version=2.0, <PlatCode>, <code> qwefn2 request body:            
4     4 Error in message received                                  Error in message received
5     5 Error in message received                                  Error in message received
6     6 Push forward message x3535                                 Push forward message     
7     7 Push forward message <MarkCheckMSG>                        Push forward message
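
To get from the grouped column to the one-column patterns data frame asked for in the question, one possible sketch in base R is shown below. It uses regexpr and regmatches in place of str_extract, and the variable names (value, beginnings, patterns) are assumptions chosen for illustration.

```r
# Values rebuilt from the question.
value <- c(
  "request body: <?xml version=\"2.0\"> values received",
  "request body: <code> 'jnwg3425'",
  "request body: <?xml version=\"2.0\", <PlatCode>, <code> 'qwefn2'",
  "Error in message received",
  "Error in message received",
  "Push forward message x3535",
  "Push forward message <MarkCheckMSG>"
)

# Same regex as above; perl = TRUE selects the backtracking PCRE engine.
m <- regexpr("(\\w+ )+[a-z]{2,}:?", value, perl = TRUE)

# regmatches() only returns the matched elements, so fill them into a
# vector of NAs to keep one entry per input value.
beginnings <- rep(NA_character_, length(value))
beginnings[m != -1] <- regmatches(value, m)

# One row per distinct pattern, in order of first appearance.
patterns <- data.frame(patterns = unique(beginnings), stringsAsFactors = FALSE)
print(patterns)
```

With dplyr, calling distinct(beginnings) on the ungrouped result would be an equivalent way to collapse the column to one row per pattern.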