r regexp - replace title and suffix in any part of string with nothing in large file (> 2 million rows)
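I am working with a large file (more than 2 million rows) and I want to remove all titles and suffixes (personal and/or professional) from every string in it. As you can see from the small test case below, the titles and suffixes appear at different positions in each string.
I used parts of the answers to the following questions: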
regular expression for exact match of a word
How to search for multiple strings and replace them with nothing within a list of strings
test <- c("pan-chr ii", "true ii.", "mr. and mrs panjii", "pans iv prof",
"md trs iv.", "iipan", "a c iii miss clark", "a c iv jones mrs",
"a c jones iv", "a c jr huffman phd.", "a c jr markkula",
"a c sr. goldtrap", "mr & mrs prof dr. a c cjdr iv, esq.",
"false mr petty phd", "abe jr esquibel phd",
"md reginald r dr esquire garcia", "laurence curry, md",
"lawrence mcdonald md phd", "mdonald mr and mrs sebelmd dr jr md phd",
"(van) der walls")
# test
# [1] "pan-chr ii"
# [2] "true ii."
# [3] "mr. and mrs panjii"
# [4] "pans iv prof"
# [5] "md trs iv."
# [6] "iipan"
# [7] "a c iii miss clark"
# [8] "a c iv jones mrs"
# [9] "a c jones iv"
# [10] "a c jr huffman phd."
# [11] "a c jr markkula"
# [12] "a c sr. goldtrap"
# [13] "mr & mrs prof dr. a c cjdr iv, esq."
# [14] "false mr petty phd"
# [15] "abe jr esquibel phd"
# [16] "md reginald r dr esquire garcia"
# [17] "laurence curry, md"
# [18] "lawrence mcdonald md phd"
# [19] "mdonald mr and mrs sebelmd dr jr md phd"
# [20] "(van) der walls"
testresult <- gsub(",? *(mister|sir|madam|mr\\.|mr|mrs\\.|mrs|ms\\.|
mr\\. and mrs\\.|mr and mrs|mr\\. and mrs|mr and mrs\\.|
mr\\. & mrs\\.|mr & mrs|mr\\. & mrs|mr & mrs\\.|& mrs\\.|and mrs\\.|
and mrs\\.|& mrs|and mrs|ms|miss\\.|miss|prof\\.|prof|professor|
doctor|md|md\\.|m\\.d\\.|dr\\.|dr|phd|phd\\.|esq\\.|esq|esquire|
i{2,3}|i{2,3}\\.|iv|iv\\.|jr|jr\\.|sr|sr\\.|\\(|\\))(?![\\w\\d])", "",
test, perl = TRUE)
# testresult
# [1] "pan-chr" "true."
# [3] " panj" "pans"
# [5] " trs." "iipan"
# [7] "a c clark" "a c jones"
# [9] "a c jones" "a c huffman."
# [11] "a c markkula" "a c. goldtrap"
# [13] " a c cj" "false petty"
# [15] "abe esquibel" " reginald r garcia"
# [17] "laurence curry" "lawrence mcdonald"
# [19] "mdonald sebel" "(van der walls"
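1) How should the regular expression in testresult be modified to produce the result below?
2) Is there a faster option than gsub, given that my file has more than 2 million rows?
Thanks.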
# testresult that I want to have
# [1] "pan-chr" "true"
# [3] "panjii" "pans"
# [5] "trs" "iipan"
# [7] "a c clark" "a c jones"
# [9] "a c jones" "a c huffman"
# [11] "a c markkula" "a c goldtrap"
# [13] "a c cjdr" "false petty"
# [15] "abe esquibel" "reginald r garcia"
# [17] "laurence curry" "lawrence mcdonald"
# [19] "mdonald sebelmd" "van der walls"
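I don't think building a single regex for all cases is the best approach. I spent some time trying, and you run into whitespace problems because titles occur at the start, in the middle, and at the end of the strings. Even if you remove every title correctly, you may end up gluing some names together (at least that happened to me) and leaving runs of multiple spaces, which need further gsub calls to clean up. You are also more prone to errors, since you certainly cannot check every specific case and combination across 2 million rows.
I would suggest a different approach. It is surely slower than a perfect regex (if one is possible), but the result is more reliable. Split each string on a set of delimiters, drop the parts you are not interested in, and paste the rest back together. Something like this: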
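To make the whitespace problem concrete, here is a minimal sketch of the kind of follow-up cleanup a single-regex solution tends to need. The input vector is a hypothetical set of intermediate results (not actual output from the question), and `squish` is an illustrative helper name.

```r
# Hypothetical intermediate results after titles were stripped by a regex:
# leading, trailing and doubled spaces are left behind.
x <- c(" panj", "a c  clark", " reginald r  garcia", "pans ")

# Collapse every run of whitespace into one space, then trim both ends.
squish <- function(s) trimws(gsub("\\s+", " ", s))

cleaned <- squish(x)
# cleaned is now: "panj", "a c clark", "reginald r garcia", "pans"
```

This only repairs spacing; it does nothing about names that were truncated or joined by an over-eager pattern.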
test.split <- strsplit(test, "\\s|\\.|\\,|\\(|\\)") # Split on whitespace, dots, commas and parentheses
titles <- c("mr", "mrs", "iv", "md", "phd", "iii", "ii", "and", "&", "miss", "jr", "sr", "iv", "prof", "professor", "esquire", "dr", "esq", "sc", "d", "") #Everything you want to remove that isn't a separator above should be here
test.clear <- sapply(test.split, function(st) paste(st[!(st %in% titles)], collapse=" "), USE.NAMES=FALSE)
test.clear
[1] "pan-chr" "true" "panjii" "pans"
[5] "trs" "iipan" "a c clark" "a c jones"
[9] "a c jones" "a c huffman" "a c markkula" "a c goldtrap"
[13] "a c cjdr" "false petty" "abe esquibel" "reginald r garcia"
[17] "laurence curry" "lawrence mcdonald" "mdonald sebelmd" "van der walls"
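The split / filter / paste steps can also be wrapped in one function so they are easy to apply to a whole column. This is only a sketch: `remove_titles` and `title_words` are illustrative names, and the word list is a shortened assumption that you would extend for real data.

```r
# Sketch: split on separators, drop title tokens, paste the rest back.
remove_titles <- function(x, title_words) {
  # Split on whitespace, dots, commas and parentheses
  parts <- strsplit(x, "\\s|\\.|,|\\(|\\)")
  # Drop title tokens and the empty strings produced by adjacent
  # separators, then join what is left with single spaces.
  vapply(parts,
         function(p) paste(p[!(p %in% c(title_words, ""))], collapse = " "),
         character(1), USE.NAMES = FALSE)
}

# Assumed (incomplete) list of tokens to remove
title_words <- c("mr", "mrs", "miss", "and", "&", "prof", "dr", "md", "phd",
                 "esq", "esquire", "jr", "sr", "ii", "iii", "iv")

cleaned <- remove_titles(
  c("mr & mrs prof dr. a c cjdr iv, esq.", "(van) der walls", "iipan"),
  title_words)
# cleaned is now: "a c cjdr", "van der walls", "iipan"
```

Because whole tokens are compared rather than substrings, names such as "iipan" or "sebelmd" that merely contain a title are left intact.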
To optimize, you can use the stringi package for the split:
library(stringi)
test.split <- stri_split(test, regex="\\s|\\.|\\,|\\(|\\)")
Performance:
> system.time(replicate(10000, strsplit(test, "\\s|\\.|\\,|\\(|\\)"))) #base
user system elapsed
1.99 0.00 2.01
> system.time(replicate(10000, str_split(test, "\\s|\\.|\\,|\\(|\\)"))) #package stringr
user system elapsed
21.97 0.03 25.39
> system.time(replicate(10000, stri_split(test, regex="\\s|\\.|\\,|\\(|\\)"))) #package stringi
user system elapsed
0.78 0.00 0.78
However, I wouldn't use any package for paste(), because base is faster:
> system.time(replicate(50000, paste(letters[1:5])))
user system elapsed
0.28 0.00 0.28
> system.time(replicate(50000, str_join(letters[1:5])))
user system elapsed
1.72 0.00 1.75
> system.time(replicate(50000, stri_join(letters[1:5])))
user system elapsed
0.38 0.00 0.39