Switch to new row based on string and extract info within parentheses in R

Question

I have a string that looks like this:

    [1] "1st April 2004 (Queens Cross) — 16th August 2007 (Midstocket/Rosemount) — 19th May 2011 (Dyce/Bucksburn/Danestone) — 23rd June 2011 (Airyhall/Broomhill/Garthdee) — 30th July 2015 (Hilton/Woodside/Stockethill; Kincorth/Nigg/Cove) — 1st October 2015 (George St/Harbour; Midstocket/Rosemount) — 3rd October 2019 (Bridge of Don) — 21st November 2019 (Torry/Ferryhill) — 5th November 2020 (Kincorth/Nigg/Cove)"                                                          
[2] "9th June 2005 (Huntly E.) — 1st May 2008 (Troup) — 23rd April 2009 (Aboyne, Upper Deeside & Donside) — 27th November 2014 (Troup) — 5th November 2015 (Huntly, Strathbogie & Howe of Alford) — 3rd November 2016 (Banff & District; Inverurie & District) — 12th October 2017 (Inverurie & District) — 15th October 2020 (Ellon & District)"

It has more than 300 rows. I want to do two things:

First, start a new row wherever the — separator appears, so the result looks like this:

goal_1
 [1] "1st April 2004 (Queens Cross)"                                   
 [2] "16th August 2007 (Midstocket/Rosemount)"                         
 [3] "19th May 2011 (Dyce/Bucksburn/Danestone)"                        
 [4] "23rd June 2011 (Airyhall/Broomhill/Garthdee)"                    
 [5] "30th July 2015 (Hilton/Woodside/Stockethill; Kincorth/Nigg/Cove)"
 [6] "1st October 2015 (George St/Harbour; Midstocket/Rosemount)"      
 [7] "3rd October 2019 (Bridge of Don)"                                
 [8] "21st November 2019 (Torry/Ferryhill)"                            
 [9] "5th November 2020 (Kincorth/Nigg/Cove)"                          
[10] "9th June 2005 (Huntly E.)"                                       
[11] "1st May 2008 (Troup)"                                            
[12] "23rd April 2009 (Aboyne, Upper Deeside & Donside)"               
[13] "27th November 2014 (Troup)"                                      
[14] "5th November 2015 (Huntly, Strathbogie & Howe of Alford)"        
[15] "3rd November 2016 (Banff & District; Inverurie & District)"      
[16] "12th October 2017 (Inverurie & District)"                        
[17] "15th October 2020 (Ellon & District)"

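Then I want to put the bit between the parentheses into a new object, but (and this is where I am really stuck) if a ; separates the contents, I want each part on its own row. Showing only the first 6 rows, this is what I am looking for: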

    goal_2
  date             name                       
  <chr>            <chr>                      
1 1st April 2004   Queens Cross               
2 16th August 2007 Midstocket/Rosemount       
3 19th May 2011    Dyce/Bucksburn/Danestone   
4 23rd June 2011   Airyhall/Broomhill/Garthdee
5 30th July 2015   Hilton/Woodside/Stockethill
6 30th July 2015   Kincorth/Nigg/Cove

I apologise for posting two tasks at once, but I have been struggling with this for a while, so any help would be great. Thank you very much!

Answer 1

Doing this the tidyverse way:

# Example data: two of the original strings, with events separated by an em dash
df <- data.frame(str = c("1st April 2004 (Queens Cross) — 16th August 2007 (Midstocket/ Rosemount) — 19th May 2011 (Dyce/ Bucksburn/ Danestone) — 23rd June 2011 (Airyhall/ Broomhill/ Garthdee) — 30th July 2015 (Hilton/ Woodside/ Stockethill; Kincorth/ Nigg/ Cove) — 1st October 2015 (George St/ Harbour; Midstocket/ Rosemount) — 3rd October 2019 (Bridge of Don) — 21st November 2019 (Torry/ Ferryhill) — 5th November 2020 (Kincorth/ Nigg/ Cove)", "9th June 2005 (Huntly E.) — 1st May 2008 (Troup) — 23rd April 2009 (Aboyne, Upper Deeside & Donside) — 27th November 2014 (Troup) — 5th November 2015 (Huntly, Strathbogie & Howe of Alford) — 3rd November 2016 (Banff & District; Inverurie & District) — 12th October 2017 (Inverurie & District) — 15th October 2020 (Ellon & District)"))

suppressMessages(library(tidyverse))

df %>% separate_rows(str, sep = '—') %>%                     # one row per event
  separate(str, into = c('date', 'name'), sep = '\\(') %>%   # split date from name at "("
  separate_rows(name, sep = ";") %>%                         # one row per name
  mutate(name = str_remove_all(name, "\\)")) %>%             # drop the closing ")"
  mutate(across(everything(), ~str_trim(.)))                 # trim stray whitespace


# A tibble: 20 x 2
   date               name                                
   <chr>              <chr>                               
 1 1st April 2004     Queens Cross                        
 2 16th August 2007   Midstocket/<U+200B>Rosemount                
 3 19th May 2011      Dyce/<U+200B>Bucksburn/<U+200B>Danestone            
 4 23rd June 2011     Airyhall/<U+200B>Broomhill/<U+200B>Garthdee         
 5 30th July 2015     Hilton/<U+200B>Woodside/<U+200B>Stockethill         
 6 30th July 2015     Kincorth/<U+200B>Nigg/<U+200B>Cove                  
 7 1st October 2015   George St/<U+200B>Harbour                   
 8 1st October 2015   Midstocket/<U+200B>Rosemount                
 9 3rd October 2019   Bridge of Don                       
10 21st November 2019 Torry/<U+200B>Ferryhill                     
11 5th November 2020  Kincorth/<U+200B>Nigg/<U+200B>Cove                  
12 9th June 2005      Huntly E.                           
13 1st May 2008       Troup                               
14 23rd April 2009    Aboyne, Upper Deeside & Donside     
15 27th November 2014 Troup                               
16 5th November 2015  Huntly, Strathbogie & Howe of Alford
17 3rd November 2016  Banff & District                    
18 3rd November 2016  Inverurie & District                
19 12th October 2017  Inverurie & District                
20 15th October 2020  Ellon & District

Created on 2021-05-04 by the reprex package (v2.0.0)
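
The `<U+200B>` markers in the printed output above appear to be zero-width spaces (U+200B) that sit after each slash in the source data. If they are unwanted, they can be stripped before or after the split. A minimal sketch, assuming the character really is U+200B; the sample vector `x` is made up for illustration:

```r
# Hypothetical sample values containing zero-width spaces after the slashes
x <- c("Midstocket/\u200bRosemount", "Dyce/\u200bBucksburn/\u200bDanestone")

# fixed = TRUE treats the pattern as a literal character, not a regex
clean <- gsub("\u200b", "", x, fixed = TRUE)
clean
```

The same `gsub()` call could be applied to the `name` column inside `mutate()`.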

Answer 2

Base R:

newtxt <- unlist(lapply(strsplit(txt, "\u2014"), trimws))
newtxt
#  [1] "1st April 2004 (Queens Cross)"                                       
#  [2] "16th August 2007 (Midstocket/ Rosemount)"                            
#  [3] "19th May 2011 (Dyce/ Bucksburn/ Danestone)"                          
#  [4] "23rd June 2011 (Airyhall/ Broomhill/ Garthdee)"                      
#  [5] "30th July 2015 (Hilton/ Woodside/ Stockethill; Kincorth/ Nigg/ Cove)"
#  [6] "1st October 2015 (George St/ Harbour; Midstocket/ Rosemount)"        
#  [7] "3rd October 2019 (Bridge of Don)"                                    
#  [8] "21st November 2019 (Torry/ Ferryhill)"                               
#  [9] "5th November 2020 (Kincorth/ Nigg/ Cove)"                            
# [10] "9th June 2005 (Huntly E.)"                                           
# [11] "1st May 2008 (Troup)"                                                
# [12] "23rd April 2009 (Aboyne, Upper Deeside & Donside)"                   
# [13] "27th November 2014 (Troup)"                                          
# [14] "5th November 2015 (Huntly, Strathbogie & Howe of Alford)"            
# [15] "3rd November 2016 (Banff & District; Inverurie & District)"          
# [16] "12th October 2017 (Inverurie & District)"                            
# [17] "15th October 2020 (Ellon & District)"

Then

out <- strcapture("^(.*)\\(([^)]*)\\)", newtxt, list(date = "", name = ""))
out
#                   date                                                name
# 1      1st April 2004                                         Queens Cross
# 2    16th August 2007                                Midstocket/ Rosemount
# 3       19th May 2011                           Dyce/ Bucksburn/ Danestone
# 4      23rd June 2011                        Airyhall/ Broomhill/ Garthdee
# 5      30th July 2015  Hilton/ Woodside/ Stockethill; Kincorth/ Nigg/ Cove
# 6    1st October 2015            George St/ Harbour; Midstocket/ Rosemount
# 7    3rd October 2019                                        Bridge of Don
# 8  21st November 2019                                     Torry/ Ferryhill
# 9   5th November 2020                                 Kincorth/ Nigg/ Cove
# 10      9th June 2005                                            Huntly E.
# 11       1st May 2008                                                Troup
# 12    23rd April 2009                      Aboyne, Upper Deeside & Donside
# 13 27th November 2014                                                Troup
# 14  5th November 2015                 Huntly, Strathbogie & Howe of Alford
# 15  3rd November 2016               Banff & District; Inverurie & District
# 16  12th October 2017                                 Inverurie & District
# 17  15th October 2020                                     Ellon & District
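
With 300+ rows, some entries may not follow the `date (name)` pattern. `strcapture()` returns `NA` in every column for an input that does not match, so those entries can be listed before going further. A small sketch, where the second element of `events` is an invented malformed entry:

```r
# Hypothetical input: one well-formed entry and one without parentheses
events <- c("1st April 2004 (Queens Cross)", "undated entry without parentheses")

out <- strcapture("^(.*)\\(([^)]*)\\)", events,
                  data.frame(date = character(), name = character()))

# Entries that failed to match have NA in the captured columns
unmatched <- events[is.na(out$name)]
unmatched
```

If `unmatched` is empty, every entry was parsed.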

Then split the names at each ;

semis <- strsplit(out$name, ";")
data.frame(date = rep(out$date, lengths(semis)), name = unlist(semis))
#                   date                                 name
# 1      1st April 2004                          Queens Cross
# 2    16th August 2007                 Midstocket/ Rosemount
# 3       19th May 2011            Dyce/ Bucksburn/ Danestone
# 4      23rd June 2011         Airyhall/ Broomhill/ Garthdee
# 5      30th July 2015         Hilton/ Woodside/ Stockethill
# 6      30th July 2015                  Kincorth/ Nigg/ Cove
# 7    1st October 2015                    George St/ Harbour
# 8    1st October 2015                 Midstocket/ Rosemount
# 9    3rd October 2019                         Bridge of Don
# 10 21st November 2019                      Torry/ Ferryhill
# 11  5th November 2020                  Kincorth/ Nigg/ Cove
# 12      9th June 2005                             Huntly E.
# 13       1st May 2008                                 Troup
# 14    23rd April 2009       Aboyne, Upper Deeside & Donside
# 15 27th November 2014                                 Troup
# 16  5th November 2015  Huntly, Strathbogie & Howe of Alford
# 17  3rd November 2016                      Banff & District
# 18  3rd November 2016                  Inverurie & District
# 19  12th October 2017                  Inverurie & District
# 20  15th October 2020                      Ellon & District

Data

txt <- c("1st April 2004 (Queens Cross) — 16th August 2007 (Midstocket/ Rosemount) — 19th May 2011 (Dyce/ Bucksburn/ Danestone) — 23rd June 2011 (Airyhall/ Broomhill/ Garthdee) — 30th July 2015 (Hilton/ Woodside/ Stockethill; Kincorth/ Nigg/ Cove) — 1st October 2015 (George St/ Harbour; Midstocket/ Rosemount) — 3rd October 2019 (Bridge of Don) — 21st November 2019 (Torry/ Ferryhill) — 5th November 2020 (Kincorth/ Nigg/ Cove)", "9th June 2005 (Huntly E.) — 1st May 2008 (Troup) — 23rd April 2009 (Aboyne, Upper Deeside & Donside) — 27th November 2014 (Troup) — 5th November 2015 (Huntly, Strathbogie & Howe of Alford) — 3rd November 2016 (Banff & District; Inverurie & District) — 12th October 2017 (Inverurie & District) — 15th October 2020 (Ellon & District)")

Answer 3

Base R

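# Split on the em dash, then remove the parenthesised part to leave the date
part1 = trimws(gsub('\\s\\((.*)\\)', '',  unlist(strsplit(original_string, '—'))))

# Remove everything from the first digit to the last digit to leave the parenthesised part
part2 = trimws(gsub('[0-9](.*)[0-9]|\v', '',  unlist(strsplit(original_string, '—'))))

data.frame(part1, part2)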

               part1                                             part2
1      1st April 2004                                    (Queens Cross)
2    16th August 2007                            (Midstocket/Rosemount)
3       19th May 2011                        (Dyce/Bucksburn/Danestone)
4      23rd June 2011                     (Airyhall/Broomhill/Garthdee)
5      30th July 2015 (Hilton/Woodside/Stockethill; Kincorth/Nigg/Cove)
6    1st October 2015         (George St/Harbour; Midstocket/Rosemount)
7    3rd October 2019                                   (Bridge of Don)
8  21st November 2019                                 (Torry/Ferryhill)
9   5th November 2020                              (Kincorth/Nigg/Cove)
10      9th June 2005                                       (Huntly E.)
11       1st May 2008                                           (Troup)
12    23rd April 2009                 (Aboyne, Upper Deeside & Donside)
13 27th November 2014                                           (Troup)
14  5th November 2015            (Huntly, Strathbogie & Howe of Alford)
15  3rd November 2016          (Banff & District; Inverurie & District)
16  12th October 2017                            (Inverurie & District)
17  15th October 2020                                (Ellon & District)
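
The `part2` column above still has its parentheses and the `;`-separated names share a row, whereas `goal_2` in the question has neither. One possible way to finish, sketched on two sample rows copied from the output; the names `inner`, `pieces` and `goal_2` are only illustrative:

```r
# Two sample rows taken from the output above
part1 <- c("1st April 2004", "30th July 2015")
part2 <- c("(Queens Cross)", "(Hilton/Woodside/Stockethill; Kincorth/Nigg/Cove)")

# Drop the opening and closing parenthesis
inner <- gsub("^\\(|\\)$", "", part2)

# Split at each semicolon, swallowing any whitespace that follows it
pieces <- strsplit(inner, ";\\s*")

# Repeat each date once per name so the columns line up
goal_2 <- data.frame(date = rep(part1, lengths(pieces)),
                     name = unlist(pieces))
goal_2
```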

Data

original_string <- 
c( "1st April 2004 (Queens Cross) — 16th August 2007 (Midstocket/Rosemount) — 19th May 2011 (Dyce/Bucksburn/Danestone) — 23rd June 2011 (Airyhall/Broomhill/Garthdee) — 30th July 2015 (Hilton/Woodside/Stockethill; Kincorth/Nigg/Cove) — 1st October 2015 (George St/Harbour; Midstocket/Rosemount) — 3rd October 2019 (Bridge of Don) — 21st November 2019 (Torry/Ferryhill) — 5th November 2020 (Kincorth/Nigg/Cove)"   ,                                                       
"9th June 2005 (Huntly E.) — 1st May 2008 (Troup) — 23rd April 2009 (Aboyne, Upper Deeside & Donside) — 27th November 2014 (Troup) — 5th November 2015 (Huntly, Strathbogie & Howe of Alford) — 3rd November 2016 (Banff & District; Inverurie & District) — 12th October 2017 (Inverurie & District) — 15th October 2020 (Ellon & District)" )
