Extract a pattern from text in R using a set of candidate patterns
I have a list of codes as follows:
ccode<-c('S','PD','CH','ML','MD','VA','BVI','DB','KD','KE','PW','COL','AD','MET','VP','SI','VR','GAO','LK','RP','PAD','WAN','PWD','PMP','PBR','VN','PPC','NK','K','AH','I','JP','JU','UDZ','CHM','DDN','LN','CL','CLH','DKM','GK','WD','ED','DDK','DLN','DRN','DFD','GZB','DVV','GUR','GGN','ND','HHN','HAS','HYD','HKP','BWF','BBW','BKM','BSN','BL','BIN','ST','KN')
Now I want to extract the substring that starts with one of these codes from samples like the following:
consolidated_csv_v2 <- c("pt paid rs-8488/- remaining amt","Credit Card Sales","ML 2926 VARSHA LAKHANI (AG)","IMRAN KHAN-PW-4798","Deepali Mishra Ah-5564 Tst", "MANJU S-11226 T","SNEHA S-16191","SUMIT SETHI AH-5747 AG","SUJATA VORA AH-5361 AG","Deepali Mishra Ah-5564 Tst")
The data spans 477,326 rows.
The expected output is the code followed by the number. So far I have tried:
str_extract(consolidated_csv_v2, "AH.*$")
[1] NA NA NA NA NA NA
[7] NA "AH-5747 AG" "AH-5361 AG" NA
This only works for the single hard-coded code "AH". How can I match any of the codes in ccode?
We can try:
library(stringr)
pat <- paste0("(?i)\\b(", paste(ccode, collapse="|"),")-.*")
str_extract(v1, pat)
#[1] NA NA NA NA "Ah-5564 Tst" NA "AH-2445 AG" "AH-5747 AG" "AH-5361 AG" "Ah-5564 Tst"
Data
v1 <- c("Head Office", "(cancelled)", "(cancelled)", "(cancelled)",
"Deepali Mishra Ah-5564 Tst", "(cancelled)", "SHRUTI BHAGAT AH-2445 AG",
"SUMIT SETHI AH-5747 AG", "SUJATA VORA AH-5361 AG", "Deepali Mishra Ah-5564 Tst")
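One detail that is easy to get wrong when pasting such patterns together: inside an R string literal the backslash has to be doubled, so the word boundary is written "\\b". A minimal base-R sketch of the difference (the variable names here are only for illustration):

```r
# "\b" in an R string literal is a single backspace character;
# "\\b" is a backslash plus "b", which the regex engine reads as a word boundary.
x <- "SUMIT SETHI AH-5747 AG"

len_single <- nchar("\b")    # length of the backspace string
len_double <- nchar("\\b")   # length of backslash + b

hit_double <- grepl("\\bAH-", x, perl = TRUE)  # word boundary, then AH-
hit_single <- grepl("\bAH-", x, perl = TRUE)   # literal backspace, then AH-

print(c(len_single, len_double))
print(c(hit_double, hit_single))
```

With a single backslash the pattern looks for a backspace character that the data does not contain, so nothing matches.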
I assume you need to extract the substring that starts with a code right after a word boundary and is followed by a hyphen.
In that case, use
"\\b(?:S|PD|CH|ML|MD|VA|BVI|DB|KD|KE|PW|COL|AD|MET|VP|SI|VR|GAO|LK|RP|PAD|WAN|PWD|PMP|PBR|VN|PPC|NK|K|AH|I|JP|JU|UDZ|CHM|DDN|LN|CL|CLH|DKM|GK|WD|ED|DDK|DLN|DRN|DFD|GZB|DVV|GUR|GGN|ND|HHN|HAS|HYD|HKP|BWF|BBW|BKM|BSN|BL|BIN|ST|KN)-\\w*"
where \b matches a word boundary, followed by a non-capturing group of alternative codes ((?:...)), then a hyphen (-), then zero or more alphanumeric or underscore characters (\w*).
Here is a demo:
> consolidated_csv_v2 <- c("Head Office","(cancelled)","(cancelled)","(cancelled)","Deepali Mishra Ah-5564 Tst", "(cancelled)","SHRUTI BHAGAT AH-2445 AG","SUMIT SETHI AH-5747 AG","SUJATA VORA AH-5361 AG","Deepali Mishra Ah-5564 Tst")
> ccode<-c('S','PD','CH','ML','MD','VA','BVI','DB','KD','KE','PW','COL','AD','MET','VP','SI','VR','GAO','LK','RP','PAD','WAN','PWD','PMP','PBR','VN','PPC','NK','K','AH','I','JP','JU','UDZ','CHM','DDN','LN','CL','CLH','DKM','GK','WD','ED','DDK','DLN','DRN','DFD','GZB','DVV','GUR','GGN','ND','HHN','HAS','HYD','HKP','BWF','BBW','BKM','BSN','BL','BIN','ST','KN')
> library(stringr)
> reg <- paste0("\\b(?:", paste(ccode, collapse="|"),")-\\w*")
> str_extract(consolidated_csv_v2, reg)
[1] NA NA NA NA NA NA "AH-2445"
[8] "AH-5747" "AH-5361" NA
>
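Since the expected output is a code plus a number, it may also help to capture the two parts separately. The following is only a sketch in base R, assuming a shortened, hypothetical code list and the hyphen-separated form; the names parts, code and number are illustrative and not from the question:

```r
# Shortened code list and sample rows, for illustration only.
ccode <- c("S", "PW", "ML", "AH")
x <- c("IMRAN KHAN-PW-4798", "SUMIT SETHI AH-5747 AG", "Credit Card Sales")

# Group 1 captures the code, group 2 the digits after the hyphen.
pat <- paste0("\\b(", paste(ccode, collapse = "|"), ")-(\\d+)")

# regexec() returns the full match plus each capture group per element;
# elements without a match yield character(0).
parts <- regmatches(x, regexec(pat, x, perl = TRUE))

code   <- vapply(parts, function(p) if (length(p)) p[2] else NA_character_, character(1))
number <- vapply(parts, function(p) if (length(p)) p[3] else NA_character_, character(1))

print(data.frame(text = x, code = code, number = number))
```

Rows without any code-number pair end up as NA in both columns, which keeps the result aligned with the input rows.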
Update
not all the words are followed by '-', some are followed by a space and some don't have any character in between.
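The requirement is rather vague, but we can meet it by placing a lazy dot pattern (.*?) after the alternation group, so that it matches as few characters as possible (zero or more of any character other than a newline) up to the first run of digits (\d+) followed by a word boundary (\b). Use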
reg <- paste0("(?i)\\b(?:", paste(ccode, collapse="|"),").*?\\d+\\b")
The (?i) placed in front of the first \\b is what makes this pattern case insensitive.
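One caveat with the lazy dot: single-letter codes such as S or I also match at the start of ordinary words, so on a row like "SNEHA S-16191" the match can begin at the S of SNEHA and run through to the digits. A possibly tighter variant, sketched below in base R, assumes the code and the number are separated by at most one hyphen or space; that assumption and the names pat, m and out are mine, not the asker's:

```r
ccode <- c('S','PD','CH','ML','MD','VA','BVI','DB','KD','KE','PW','COL','AD','MET','VP','SI','VR','GAO','LK','RP','PAD','WAN','PWD','PMP','PBR','VN','PPC','NK','K','AH','I','JP','JU','UDZ','CHM','DDN','LN','CL','CLH','DKM','GK','WD','ED','DDK','DLN','DRN','DFD','GZB','DVV','GUR','GGN','ND','HHN','HAS','HYD','HKP','BWF','BBW','BKM','BSN','BL','BIN','ST','KN')
x <- c("pt paid rs-8488/- remaining amt", "Credit Card Sales",
       "ML 2926 VARSHA LAKHANI (AG)", "IMRAN KHAN-PW-4798",
       "Deepali Mishra Ah-5564 Tst", "MANJU S-11226 T", "SNEHA S-16191",
       "SUMIT SETHI AH-5747 AG", "SUJATA VORA AH-5361 AG",
       "Deepali Mishra Ah-5564 Tst")

# Word boundary, one of the codes, an optional single "-" or space,
# then digits ending at a word boundary; (?i) ignores case.
pat <- paste0("(?i)\\b(?:", paste(ccode, collapse = "|"), ")[- ]?\\d+\\b")

# regexpr() gives the first match per element (-1 when there is none);
# fill a result vector so that non-matching rows stay NA.
m <- regexpr(pat, x, perl = TRUE)
out <- rep(NA_character_, length(x))
out[m > 0] <- regmatches(x, m)

print(out)
```

On the sample rows this returns only the code and its number (for example "PW-4798" and "S-16191"), and NA for rows that contain no code-number pair. If the real data uses other separators, the character class [- ] would need to be widened accordingly.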