Pairing a gsub function and text file for corpus cleaning

Question

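I have a large sample of tweets that I'm trying to clean up before analyzing them. The tweets are in a data frame where each cell holds the contents of one tweet (e.g. "i love san francisco" and "proud member of the air force"). However, when I analyze the text in a network visualization, there are some words in each bio that should be combined. I would also like to combine common two-word phrases (e.g. "new york", "san francisco", and "air force"). I have already compiled the list of terms that need to be combined, and have used gsub to combine a handful of them with this line of code: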

twitterdata_cleaning$bio = gsub('air force','airforce',twitterdata_cleaning$bio)

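The line above turns "proud member of the air force" into "proud member of the airforce". I have been able to do this successfully with a few dozen two-word phrases.

However, there are hundreds of two-word phrases in my bios and I would like to keep better track of them, so I have moved all of the terms into two columns in an Excel file. I would like to find a way to apply the approach above using a txt or Excel file: identify the terms in the data frame that look like the terms in the first column of the file, and change them to look like the terms in the second column.

For example, I have xlsx and txt files that look like this: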

    column1                 column2
    san francisco           sanfrancisco
    new york                newyork
    las vegas               lasvegas
    san diego               sandiego
    new hampshire           newhampshire
    good bye                goodbye
    air force               airforce
    video game              videogame
    high school             school
    middle school           school
    elementary school       school
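
A lookup table like this can be loaded into R as a two-column data frame. The sketch below assumes a tab-separated text file with a header row. The file is written to a temporary path only so the example is self-contained, and the names `lookup_path` and `compoundterms_example` are made up for illustration. An .xlsx file would need a reader package such as readxl instead.

```r
# Hypothetical file: a tab-separated lookup table with a header row.
lookup_path <- tempfile(fileext = ".txt")
writeLines(c("column1\tcolumn2",
             "san francisco\tsanfrancisco",
             "new york\tnewyork",
             "high school\tschool"),
           lookup_path)

# Read it back as plain character columns (no factors, no quote handling).
compoundterms_example <- read.delim(lookup_path,
                                    header = TRUE,
                                    sep = "\t",
                                    quote = "",
                                    stringsAsFactors = FALSE)

str(compoundterms_example)
```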

I would like to use gsub in a way that searches the data frame for every term in column 1 and replaces it with the corresponding term in column 2, using something like this:

twitterdata_df$tweet = gsub('textfile$column1','textfile$column2',twitterdata_df$tweet)

to get something like this in the cells:

i love sanfrancisco
can not wait to go to newyork
what happens in lasvegas stays there
at the beach in sandiego
can beat the autumn leave in newhampshire
so done with all the drama goodbye
proud member of the airforce
love this videogame so much
playing at the school tonight 
so sick of school
school was the best and i miss it

Any help would be greatly appreciated.

Answer 1

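Generalized solution

You can feed a named vector to str_replace_all() from the package stringr to accomplish this. In my example, df has a column of old values that are to be replaced by the new values. I assume this is what you mean by using an Excel file to keep track of them.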

library(stringr)

df <- data.frame(old = c("five", "six", "seven"),
                 new = as.character(5:7),
                 stringsAsFactors = FALSE)

text <- c("I am a vector with numbers six and other text five",
          "another vector seven six text five")

str_replace_all(text, setNames(df$new, df$old))

Result:

[1] "I am a vector with numbers 6 and other text 5" "another vector 7 6 text 5"
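
To see why this works, it may help to look at what `setNames(df$new, df$old)` produces: a character vector whose names are the patterns to search for and whose values are the replacements. A minimal base R sketch, reusing the same `df` (the variable name `replacements` is only illustrative):

```r
df <- data.frame(old = c("five", "six", "seven"),
                 new = as.character(5:7),
                 stringsAsFactors = FALSE)

# names = what to look for, values = what to put in its place
replacements <- setNames(df$new, df$old)

names(replacements)    # the search patterns
unname(replacements)   # the replacement strings
```

Note that stringr interprets the names as regular expressions, so terms containing characters such as `.` or `(` would likely need escaping first.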

Specific example

Data

Read in the text file of replacements.

textfile <- read.csv(textConnection("column1,column2
san francisco,sanfrancisco
new york,newyork
las vegas,lasvegas
san diego,sandiego
new hampshire,newhampshire
good bye,goodbye
air force,airforce
video game,videogame
high school,school
middle school,school
elementary school,school"), stringsAsFactors = FALSE)

Load the data frame with the tweets in the column tweet.

twitterdata_df <- data.frame(id = 1:11)
twitterdata_df$tweet <- c("i love san francisco",
                          "can not wait to go to new york",
                          "what happens in las vegas stays there",
                          "at the beach in san diego",
                          "can beat the autumn leave in new hampshire",
                          "so done with all the drama goodbye",
                          "proud member of the air force",
                          "love this video game so much",
                          "playing at the high school tonight",
                          "so sick of middle school",
                          "elementary school was the best and i miss it")

Replacement

twitterdata_df$tweet2 <- str_replace_all(twitterdata_df$tweet, setNames(textfile$column2, textfile$column1))

Result

As you can see, the replacements were made in tweet2.

   id                                        tweet                                    tweet2
1   1                         i love san francisco                       i love sanfrancisco
2   2               can not wait to go to new york             can not wait to go to newyork
3   3        what happens in las vegas stays there      what happens in lasvegas stays there
4   4                    at the beach in san diego                  at the beach in sandiego
5   5   can beat the autumn leave in new hampshire can beat the autumn leave in newhampshire
6   6           so done with all the drama goodbye        so done with all the drama goodbye
7   7                proud member of the air force              proud member of the airforce
8   8                 love this video game so much               love this videogame so much
9   9           playing at the high school tonight             playing at the school tonight
10 10                     so sick of middle school                         so sick of school
11 11 elementary school was the best and i miss it         school was the best and i miss it
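
One caveat, not raised in the answer itself: plain substring matching will also rewrite a term that happens to sit inside longer words. If that matters for your data, one option is to wrap each term in `\\b` word-boundary markers. The sketch below uses base R `gsub()` with made-up example strings, and assumes the terms contain no regex metacharacters.

```r
# Made-up strings for illustration
x <- c("proud member of the air force", "she has fair forces")

# Without boundaries, "air force" also matches inside "fair forces"
unbounded <- gsub("air force", "airforce", x)
unbounded

# With word boundaries, only the standalone phrase is replaced
bounded <- gsub("\\bair force\\b", "airforce", x, perl = TRUE)
bounded
```

The same idea should carry over to the stringr approach by building the names with something like `paste0("\\b", textfile$column1, "\\b")` before calling `setNames()`.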

Answer 2

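Thanks for the help, but I found a workaround. I decided to use a loop that goes through the two columns of my table, searches for each term in the first column, and replaces it with the word in the second column.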

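 for (i in seq_len(nrow(compoundterms))) {
     # Read from and write back to the same column so each replacement builds on the previous ones
     twitterdata_df$tweet <- gsub(compoundterms[i, 1], compoundterms[i, 2], twitterdata_df$tweet)
 }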
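
The loop can also be wrapped in a small helper so it is easy to reuse on other columns. This is only a sketch: `replace_terms` is a hypothetical name, the sample data is a cut-down version of the table in the question, and it assumes the lookup table's first column holds the terms to find and the second holds the replacements, treated as literal strings.

```r
# Hypothetical helper: apply every (term, replacement) pair in turn.
replace_terms <- function(text, terms) {
  for (i in seq_len(nrow(terms))) {
    # fixed = TRUE matches the term literally rather than as a regex
    text <- gsub(terms[i, 1], terms[i, 2], text, fixed = TRUE)
  }
  text
}

compoundterms <- data.frame(
  column1 = c("air force", "high school", "new york"),
  column2 = c("airforce", "school", "newyork"),
  stringsAsFactors = FALSE
)

tweets <- c("proud member of the air force",
            "playing at the high school tonight")

cleaned <- replace_terms(tweets, compoundterms)
cleaned
```

Because each pass feeds its result into the next, the order of rows in the lookup table can matter when one term is a substring of another.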
