Find all unique strings in R

Question

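I'm fairly new to R. I have a data frame df that looks like this (it has just one character variable; my actual df spans 100k+ rows, but for simplicity let's look at only 5 rows):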

V1
oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy

I'd like to output each unique string, so that the result looks like this:

V1
oximetry
hydrogen peroxide adverse effects
epoprostenol adverse effects
angioedema chemically induced
abo blood group system
imipramine poisoning
adverse effects
isoenzymes
myocardial infarction drug therapy
thrombosis drug therapy

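Should I use the tm package? I tried using a dtm (document-term matrix), but my code is inefficient because it converts the dtm to a matrix, which takes a lot of memory with 100k+ rows.

Please advise. Thanks!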

Answer 1

Try this:

library(stringr)
library(tidyverse)

df <- data.frame(variable = c(
'oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects',
'angioedema chemically induced, angioedema chemically induced, oximetry',
'abo blood group system, imipramine poisoning, adverse effects',
'isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy',
'thrombosis drug therapy'), stringsAsFactors=FALSE)

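# Split each row on ", " into a list-column, expand it to one term per row,
# then drop duplicate rows. Naming the column in unnest() avoids the warning
# that newer tidyr versions give when `cols` is omitted.
df %>%
  mutate(variable = str_split(variable, ", ")) %>%
  unnest(variable) %>%
  distinct()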
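
A related sketch, in case it helps: tidyr's separate_rows() does the split and the expansion in one step. This assumes the column is named V1 as in the question (the answer above calls it variable), and `result` is just an illustrative name. Newer tidyr releases mark separate_rows() as superseded by separate_longer_delim(), though it still works.

```r
library(dplyr)
library(tidyr)

# Sample data from the question; the column name V1 is taken from there
df <- data.frame(V1 = c(
  "oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
  "angioedema chemically induced, angioedema chemically induced, oximetry",
  "abo blood group system, imipramine poisoning, adverse effects",
  "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
  "thrombosis drug therapy"), stringsAsFactors = FALSE)

# One term per row, then keep only the distinct terms
result <- df %>%
  separate_rows(V1, sep = ", ") %>%
  distinct()

result
```

The result keeps the terms in order of first appearance, which matches the desired output in the question.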

Answer 2

Using only base R, you can use strsplit() to split your big string at every "comma+space" or "\n". Then use unique() to return only the unique strings:

text_vec <- c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy")

strsplit(text_vec, ", |\n")[[1]]
# [1] "oximetry"                           "hydrogen peroxide adverse effects" 
# [3] "epoprostenol adverse effects"       "angioedema chemically induced"     
# [5] "angioedema chemically induced"      "oximetry"                          
# [7] "abo blood group system"             "imipramine poisoning"              
# [9] "adverse effects"                    "isoenzymes"                        
# [11] "myocardial infarction drug therapy" "thrombosis drug therapy"           
# [13] "thrombosis drug therapy"   

unique(strsplit(text_vec, ", |\n")[[1]])
# [1] "oximetry"                           "hydrogen peroxide adverse effects" 
# [3] "epoprostenol adverse effects"       "angioedema chemically induced"     
# [5] "abo blood group system"             "imipramine poisoning"              
# [7] "adverse effects"                    "isoenzymes"                        
# [9] "myocardial infarction drug therapy" "thrombosis drug therapy"
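
Since the question starts from a data frame column rather than one long string, the same base R idea can be applied to the column directly. A minimal sketch, assuming the column is named V1 as in the question; `terms`, `unique_terms` and `result` are illustrative names:

```r
# Sample data from the question; the column name V1 is taken from there
df <- data.frame(V1 = c(
  "oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
  "angioedema chemically induced, angioedema chemically induced, oximetry",
  "abo blood group system, imipramine poisoning, adverse effects",
  "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
  "thrombosis drug therapy"), stringsAsFactors = FALSE)

# strsplit() returns one character vector per row; unlist() flattens them.
# fixed = TRUE treats ", " as a literal separator instead of a regex.
terms <- unlist(strsplit(df$V1, ", ", fixed = TRUE))

# trimws() guards against stray whitespace around terms in real data
unique_terms <- unique(trimws(terms))

# Put the result back into a one-column data frame, as in the desired output
result <- data.frame(V1 = unique_terms, stringsAsFactors = FALSE)
result
```

This never builds a document-term matrix; it only holds the flattened vector of terms, so memory use should grow with the total number of terms rather than rows times distinct terms.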
