使用 R 从大文本中提取城市名称
Extract city names from large text with R
你好,我有一个有趣的问题。假设我有一个长字符,其中包含其他城市名称。
test<-"Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology"
我的目标是提取它的所有城市名称。我通过以下五个步骤实现了它。
#replace | with ,
test2<-str_replace_all(test, "[|]", ", ")
# Remove punctuation from data
test3<-gsub("[[:punct:]\n]","",test2)
# Split data at word boundaries
test4 <- strsplit(test3, " ")
# Load data from package maps
data(world.cities)
# Match on cities in world.cities
citiestest<-lapply(test4, function(x)x[which(x %in% world.cities$name)])
结果可能是正确的
citiestest
[[1]]
[1] "San" "Boston" "Boston" "Washington" "York"
[6] "York" "Kettering" "York" "York" "Charlotte"
[11] "Carolina" "Cleveland" "Nashville" "Seattle" "Seattle"
[16] "Washington" "Asan"
但是如您所见,我无法处理名称由两个词组成的城市(纽约、圣地亚哥等),因为它们是分开的。当然手动修复这个问题不是一个选项,因为我的真实数据集非常大。
这是使用 strsplit
和 sub
的基础 R 选项:
terms <- unlist(strsplit(test, "\s*\|\s*"))
cities <- sapply(terms, function(x) gsub("[^,]+,\s*([^,]+),.*", "\1", x))
cities[1:3]
Ucsd Medical Center, San Diego, California, USA
"San Diego"
Yale Cancer Center, New Haven, Connecticut, USA
"New Haven"
Massachusetts General Hospital., Boston, Massachusetts, USA
"Boston"
我会做什么:
test2 <- str_replace_all(test, "[|]", ", ") #Same as you did
test3 <- unlist(strsplit(test2, split=", ")) #Turns string into a vector
check <- test3 %in% world.cities$name #Check if element vectors match list of city names
test3[check == TRUE] #Select vector elements that match list of city names
[1] "San Diego" "New Haven" "Boston" "Boston" "Saint Louis" "New York" "New York" "New York"
[9] "New York" "Charlotte" "Cleveland" "Nashville" "Seattle" "Washington"
另一种没有循环的方法
pat="(,.\w+,)|(,.\w+.\w+,)"
gsub("(,\s)|,","",regmatches(m<-strsplit(test,"\|")[[1]],regexpr(pat,m)))
[1] "San Diego" "New Haven" "Boston" "Boston" "Saint Louis" "New York" "New York"
[8] "Charlotte" "Cleveland" "Nashville" "Seattle" "Gyeonggi-do" "Seoul" "Seoul"
[15] "Seoul" "Seoul"
本页中给出的其他结果均失败:例如,有一个名为Greonggi-do
的城镇,其他解决方案中未给出。还有一些代码将整个字符串作为 town
一种完全不同的方法,可能或多或少有用,具体取决于手头的数据:将每个地址传递给地理编码 API,然后从响应中提取城市。
library(tidyverse)
places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>%
separate_rows(string, sep = '\|')
places <- places %>%
mutate(geodata = map(string, ~{Sys.sleep(1); ggmap::geocode(.x, output = 'all')}))
places <- places %>%
mutate(address_components = map(geodata, list('results', 1, 'address_components')),
address_components = map(address_components,
~as_data_frame(transpose(.x)) %>%
unnest(long_name, short_name)),
city = map(address_components, unnest),
city = map_chr(city, ~{
l <- set_names(.x$long_name, .x$types);
coalesce(l['locality'], l['administrative_area_level_1'])
}))
对比原图,
places %>% select(city, string)
#> # A tibble: 17 x 2
#> city string
#> <chr> <chr>
#> 1 San Diego Ucsd Medical Center, San Diego, California, USA
#> 2 New Haven Yale Cancer Center, New Haven, Connecticut, USA
#> 3 Boston Massachusetts General Hospital., Boston, Massachusetts, USA
#> 4 Boston Dana Farber Cancer Institute, Boston, Massachusetts, USA
#> 5 St. Louis Washington University, Saint Louis, Missouri, USA
#> 6 New York Mount SInai Medical Center, New York, New York, USA
#> 7 New York Memorial Sloan Kettering Cancer Center, New York, New York, USA
#> 8 Charlotte Carolinas Healthcare System, Charlotte, North Carolina, USA
#> 9 Cleveland University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville Vanderbilt University Medical Center, Nashville, Tennessee, USA
#> 11 Seattle Seattle Cancer Care Alliance, Seattle, Washington, USA
#> 12 Goyang-si National Cancer Center, Gyeonggi-do, Korea, Republic of
#> 13 서울특별시 Seoul National University Hospital, Seoul, Korea, Republic of
#> 14 Seoul Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of
#> 15 Seoul Korea University Guro Hospital, Seoul, Korea, Republic of
#> 16 Seoul Asan Medical Center., Seoul, Korea, Republic of
#> 17 Amsterdam VU MEDISCH CENTRUM; Dept. of Medical Oncology
...好吧,它并不完美。最大的问题是美国城市的城市被分类为 localities
,而韩国的城市被分类为 administrative_area_level_1
(在美国是州)。与其他韩语行不同,12 实际上有一个地方,它不是列出的城市(在响应中作为行政区域)。此外,第 13 行的 "Seoul" 被莫名其妙地翻译成韩语。
好消息是"Saint Louis"已经缩短为"St. Louis",这是一个更规范的形式,最后一行位于阿姆斯特丹。
扩展这种方法可能需要为 API.
的使用支付 Google 一点费用
要扩展 @hrbrmstr 上面的评论,您可以使用 Stanford CoreNLP 库对每个字符串进行命名实体识别 (NER)。对这样一项工作的一个重要警告是,大多数 NER 注释器只能将标记注释为 "location" 或等价物,当城市与州和国家混合时,这不是很有用。不过,除了通常的 NER 注释器之外,CoreNLP 还包含一个额外的正则表达式 NER 注释器,可以将 NER 粒度增加到城市级别。
在 R 中,您可以使用 coreNLP
包来 运行 注释器。它确实需要 rJava
,这在某些情况下可能很难配置。您还需要下载实际的(相当大的)库,这可以通过 coreNLP::downloadCoreNLP
完成,并且,如果您愿意,可以将 ~/.Renviron
中的 CORENLP_HOME
环境变量设置为安装路径。
另请注意,此方法相当缓慢且占用大量资源,因为它在 Java 中做了很多工作。
library(tidyverse)
library(coreNLP)
# set which annotators to use
writeLines('annotators = tokenize, ssplit, pos, lemma, ner, regexner\n', 'corenlp.properties')
initCoreNLP(libLoc = Sys.getenv('CORENLP_HOME'), parameterFile = 'corenlp.properties')
unlink('corenlp.properties') # clean up
places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>%
separate_rows(string, sep = '\|') # separate strings
places_ner <- places %>%
mutate(annotations = map(string, annotateString),
tokens = map(annotations, 'token'),
tokens = map(tokens, group_by, token_id = data.table::rleid(NER)),
city = map(tokens, filter, NER == 'CITY'),
city = map(city, summarise, city = paste(token, collapse = ' ')),
city = map_chr(city, ~if(nrow(.x) == 0) NA_character_ else .x$city))
哪个returns
places_ner %>% select(city, string)
#> # A tibble: 17 x 2
#> city string
#> <chr> <chr>
#> 1 San Diego Ucsd Medical Center, San Diego, California, USA
#> 2 New Haven Yale Cancer Center, New Haven, Connecticut, USA
#> 3 Boston Massachusetts General Hospital., Boston, Massachusetts, USA
#> 4 Boston Dana Farber Cancer Institute, Boston, Massachusetts, USA
#> 5 NA Washington University, Saint Louis, Missouri, USA
#> 6 NA Mount SInai Medical Center, New York, New York, USA
#> 7 NA Memorial Sloan Kettering Cancer Center, New York, New York, USA
#> 8 Charlotte Carolinas Healthcare System, Charlotte, North Carolina, USA
#> 9 Cleveland University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville Vanderbilt University Medical Center, Nashville, Tennessee, USA
#> 11 Seattle Seattle Cancer Care Alliance, Seattle, Washington, USA
#> 12 NA National Cancer Center, Gyeonggi-do, Korea, Republic of
#> 13 Seoul Seoul National University Hospital, Seoul, Korea, Republic of
#> 14 Seoul Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of
#> 15 Seoul Korea University Guro Hospital, Seoul, Korea, Republic of
#> 16 Seoul Asan Medical Center., Seoul, Korea, Republic of
#> 17 NA VU MEDISCH CENTRUM; Dept. of Medical Oncology
失败:
- "New York" 被识别为州或省两次("New York City" 将被识别为州或省)。
- "Saint Louis"被识别为一个人。 "St. Louis" 在我的安装中被识别为一个位置,但 an online version of the same library 将原始位置识别为一个位置,因此这可能是版本问题。
- "Gyeonggi-do" 无法识别,但 "Seoul" 可以。我不确定
regexner
注释器的粒度如何,但考虑到(顾名思义)它通过正则表达式工作,有一个 size/familiarity 阈值,在该阈值下它不包含正则表达式。 You can add your own regex to it 如果值得的话。
cleanNLP
package 还支持 Stanford CoreNLP(和其他几个后端),界面更易于使用(设置仍然很困难),但据我所知不允许使用由于它初始化 CoreNLP 的方式,目前 regexner
。
可以使用tidytext提取bigram-->words-->intersect得到公共部分
library(tidyverse)
libraty(tidytext)
# city is a vector containing pre-defined city name
t2 <- test %>% as_tibble() %>%
unnest_tokens(bigram,value,token = 'ngrams', n =2) %>%
separate(bigram,c('word1','word2'),remove = F)
city_get <- c(intersect(t2$bigram,city),intersect(t2$word1,city))%>%
unique()
你好,我有一个有趣的问题。假设我有一个长字符,其中包含其他城市名称。
test<-"Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology"
我的目标是提取它的所有城市名称。我通过以下五个步骤实现了它。
#replace | with ,
test2<-str_replace_all(test, "[|]", ", ")
# Remove punctuation from data
test3<-gsub("[[:punct:]\n]","",test2)
# Split data at word boundaries
test4 <- strsplit(test3, " ")
# Load data from package maps
data(world.cities)
# Match on cities in world.cities
citiestest<-lapply(test4, function(x)x[which(x %in% world.cities$name)])
结果可能是正确的
citiestest
[[1]]
[1] "San" "Boston" "Boston" "Washington" "York"
[6] "York" "Kettering" "York" "York" "Charlotte"
[11] "Carolina" "Cleveland" "Nashville" "Seattle" "Seattle"
[16] "Washington" "Asan"
但是如您所见,我无法处理名称由两个词组成的城市(纽约、圣地亚哥等),因为它们是分开的。当然手动修复这个问题不是一个选项,因为我的真实数据集非常大。
这是使用 strsplit
和 sub
的基础 R 选项:
terms <- unlist(strsplit(test, "\s*\|\s*"))
cities <- sapply(terms, function(x) gsub("[^,]+,\s*([^,]+),.*", "\1", x))
cities[1:3]
Ucsd Medical Center, San Diego, California, USA
"San Diego"
Yale Cancer Center, New Haven, Connecticut, USA
"New Haven"
Massachusetts General Hospital., Boston, Massachusetts, USA
"Boston"
我会做什么:
test2 <- str_replace_all(test, "[|]", ", ") #Same as you did
test3 <- unlist(strsplit(test2, split=", ")) #Turns string into a vector
check <- test3 %in% world.cities$name #Check if element vectors match list of city names
test3[check == TRUE] #Select vector elements that match list of city names
[1] "San Diego" "New Haven" "Boston" "Boston" "Saint Louis" "New York" "New York" "New York"
[9] "New York" "Charlotte" "Cleveland" "Nashville" "Seattle" "Washington"
另一种没有循环的方法
pat="(,.\w+,)|(,.\w+.\w+,)"
gsub("(,\s)|,","",regmatches(m<-strsplit(test,"\|")[[1]],regexpr(pat,m)))
[1] "San Diego" "New Haven" "Boston" "Boston" "Saint Louis" "New York" "New York"
[8] "Charlotte" "Cleveland" "Nashville" "Seattle" "Gyeonggi-do" "Seoul" "Seoul"
[15] "Seoul" "Seoul"
本页中给出的其他结果均失败:例如,有一个名为Greonggi-do
的城镇,其他解决方案中未给出。还有一些代码将整个字符串作为 town
一种完全不同的方法,可能或多或少有用,具体取决于手头的数据:将每个地址传递给地理编码 API,然后从响应中提取城市。
library(tidyverse)
places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>%
separate_rows(string, sep = '\|')
places <- places %>%
mutate(geodata = map(string, ~{Sys.sleep(1); ggmap::geocode(.x, output = 'all')}))
places <- places %>%
mutate(address_components = map(geodata, list('results', 1, 'address_components')),
address_components = map(address_components,
~as_data_frame(transpose(.x)) %>%
unnest(long_name, short_name)),
city = map(address_components, unnest),
city = map_chr(city, ~{
l <- set_names(.x$long_name, .x$types);
coalesce(l['locality'], l['administrative_area_level_1'])
}))
对比原图,
places %>% select(city, string)
#> # A tibble: 17 x 2
#> city string
#> <chr> <chr>
#> 1 San Diego Ucsd Medical Center, San Diego, California, USA
#> 2 New Haven Yale Cancer Center, New Haven, Connecticut, USA
#> 3 Boston Massachusetts General Hospital., Boston, Massachusetts, USA
#> 4 Boston Dana Farber Cancer Institute, Boston, Massachusetts, USA
#> 5 St. Louis Washington University, Saint Louis, Missouri, USA
#> 6 New York Mount SInai Medical Center, New York, New York, USA
#> 7 New York Memorial Sloan Kettering Cancer Center, New York, New York, USA
#> 8 Charlotte Carolinas Healthcare System, Charlotte, North Carolina, USA
#> 9 Cleveland University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville Vanderbilt University Medical Center, Nashville, Tennessee, USA
#> 11 Seattle Seattle Cancer Care Alliance, Seattle, Washington, USA
#> 12 Goyang-si National Cancer Center, Gyeonggi-do, Korea, Republic of
#> 13 서울특별시 Seoul National University Hospital, Seoul, Korea, Republic of
#> 14 Seoul Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of
#> 15 Seoul Korea University Guro Hospital, Seoul, Korea, Republic of
#> 16 Seoul Asan Medical Center., Seoul, Korea, Republic of
#> 17 Amsterdam VU MEDISCH CENTRUM; Dept. of Medical Oncology
...好吧,它并不完美。最大的问题是美国城市的城市被分类为 localities
,而韩国的城市被分类为 administrative_area_level_1
(在美国是州)。与其他韩语行不同,12 实际上有一个地方,它不是列出的城市(在响应中作为行政区域)。此外,第 13 行的 "Seoul" 被莫名其妙地翻译成韩语。
好消息是"Saint Louis"已经缩短为"St. Louis",这是一个更规范的形式,最后一行位于阿姆斯特丹。
扩展这种方法可能需要为 API.
的使用支付 Google 一点费用要扩展 @hrbrmstr 上面的评论,您可以使用 Stanford CoreNLP 库对每个字符串进行命名实体识别 (NER)。对这样一项工作的一个重要警告是,大多数 NER 注释器只能将标记注释为 "location" 或等价物,当城市与州和国家混合时,这不是很有用。不过,除了通常的 NER 注释器之外,CoreNLP 还包含一个额外的正则表达式 NER 注释器,可以将 NER 粒度增加到城市级别。
在 R 中,您可以使用 coreNLP
包来 运行 注释器。它确实需要 rJava
,这在某些情况下可能很难配置。您还需要下载实际的(相当大的)库,这可以通过 coreNLP::downloadCoreNLP
完成,并且,如果您愿意,可以将 ~/.Renviron
中的 CORENLP_HOME
环境变量设置为安装路径。
另请注意,此方法相当缓慢且占用大量资源,因为它在 Java 中做了很多工作。
library(tidyverse)
library(coreNLP)
# set which annotators to use
writeLines('annotators = tokenize, ssplit, pos, lemma, ner, regexner\n', 'corenlp.properties')
initCoreNLP(libLoc = Sys.getenv('CORENLP_HOME'), parameterFile = 'corenlp.properties')
unlink('corenlp.properties') # clean up
places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>%
separate_rows(string, sep = '\|') # separate strings
places_ner <- places %>%
mutate(annotations = map(string, annotateString),
tokens = map(annotations, 'token'),
tokens = map(tokens, group_by, token_id = data.table::rleid(NER)),
city = map(tokens, filter, NER == 'CITY'),
city = map(city, summarise, city = paste(token, collapse = ' ')),
city = map_chr(city, ~if(nrow(.x) == 0) NA_character_ else .x$city))
哪个returns
places_ner %>% select(city, string)
#> # A tibble: 17 x 2
#> city string
#> <chr> <chr>
#> 1 San Diego Ucsd Medical Center, San Diego, California, USA
#> 2 New Haven Yale Cancer Center, New Haven, Connecticut, USA
#> 3 Boston Massachusetts General Hospital., Boston, Massachusetts, USA
#> 4 Boston Dana Farber Cancer Institute, Boston, Massachusetts, USA
#> 5 NA Washington University, Saint Louis, Missouri, USA
#> 6 NA Mount SInai Medical Center, New York, New York, USA
#> 7 NA Memorial Sloan Kettering Cancer Center, New York, New York, USA
#> 8 Charlotte Carolinas Healthcare System, Charlotte, North Carolina, USA
#> 9 Cleveland University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville Vanderbilt University Medical Center, Nashville, Tennessee, USA
#> 11 Seattle Seattle Cancer Care Alliance, Seattle, Washington, USA
#> 12 NA National Cancer Center, Gyeonggi-do, Korea, Republic of
#> 13 Seoul Seoul National University Hospital, Seoul, Korea, Republic of
#> 14 Seoul Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of
#> 15 Seoul Korea University Guro Hospital, Seoul, Korea, Republic of
#> 16 Seoul Asan Medical Center., Seoul, Korea, Republic of
#> 17 NA VU MEDISCH CENTRUM; Dept. of Medical Oncology
失败:
- "New York" 被识别为州或省两次("New York City" 将被识别为州或省)。
- "Saint Louis"被识别为一个人。 "St. Louis" 在我的安装中被识别为一个位置,但 an online version of the same library 将原始位置识别为一个位置,因此这可能是版本问题。
- "Gyeonggi-do" 无法识别,但 "Seoul" 可以。我不确定
regexner
注释器的粒度如何,但考虑到(顾名思义)它通过正则表达式工作,有一个 size/familiarity 阈值,在该阈值下它不包含正则表达式。 You can add your own regex to it 如果值得的话。
cleanNLP
package 还支持 Stanford CoreNLP(和其他几个后端),界面更易于使用(设置仍然很困难),但据我所知不允许使用由于它初始化 CoreNLP 的方式,目前 regexner
。
可以使用tidytext提取bigram-->words-->intersect得到公共部分
library(tidyverse)
libraty(tidytext)
# city is a vector containing pre-defined city name
t2 <- test %>% as_tibble() %>%
unnest_tokens(bigram,value,token = 'ngrams', n =2) %>%
separate(bigram,c('word1','word2'),remove = F)
city_get <- c(intersect(t2$bigram,city),intersect(t2$word1,city))%>%
unique()