R: Use German stopwords in tidyverse but anti_join does not work

Question

I am trying to use the tidyverse (http://tidyverse.org/) to analyse a list of German sentences. I am following this guide (http://tidytextmining.com/).

When I try to use a German stop-word list, it does not work at all.

library(tidyverse)
library(readxl) # read excel
library(tibble) # tibble data frame
library(dplyr) # piping
library(stringr) # character manipulation
library(tidytext)
library(tokenizers)

data <- read_xlsx("C:/R/npsfeedback.xlsx", sheet = "Tabelle1", col_names="feedback")
data
is_tibble(data) # is.tibble() is deprecated in recent tibble versions

# tokenise
data_clean <- data %>% 
  na.omit() %>%
  unnest_tokens(word,feedback)

This is the part that gives me trouble:

# remove stopwords
sw <- tibble(stopwords("de"))
sw

data_clean <- data_clean %>% 
  anti_join(.,sw)

My stop words are in a one-column tibble of type character. But when I try to use anti_join, I get this output:

Error: `by` required, because the data sources have no common variables

Do you know what I have to do?

Answer 1

You need to specify which column in each of the two data frames to anti-join on, so you would have something like this:

anti_join(., sw, by = c("first_df_var" = "second_df_var"))

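Otherwise R does not know which columns to join on. For any join function to work without `by`, both data frames need to have at least one column name in common.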
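As a minimal sketch of this, using toy data rather than the asker's spreadsheet (the `tokens` and `sw` tibbles and the column name `stopword` are made up for illustration):

```r
library(dplyr)
library(tibble)

# Hypothetical tokenised feedback: one word per row
tokens <- tibble(word = c("das", "produkt", "ist", "sehr", "gut"))

# Hypothetical stop-word table whose column name differs from `word`
sw <- tibble(stopword = c("das", "ist", "sehr"))

# Map the column in the left table to the column in the right table
result <- tokens %>%
  anti_join(sw, by = c("word" = "stopword"))

# Only the rows of `tokens` without a match in `sw` remain
result$word
```

The left-hand name in `by` refers to the first data frame and the right-hand name to the second, so the two columns do not need to share a name.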

Answer 2

Without additional arguments, anti_join expects to join data frames that share a column name.

The trick is

sw <- tibble(word = stopwords("de"))

Or use `by`, as sweetmusicality explained.
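To see why naming the column matters, here is a small sketch with a hand-written stop-word vector (`german_sw` and `tokens` are illustrative stand-ins, not the asker's data). An unnamed argument to `tibble()` gets its column name from the expression itself, so it will not match `word`:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the vector returned by stopwords("de")
german_sw <- c("das", "ist", "sehr")

# Unnamed argument: the column is named after the expression
unnamed_names <- names(tibble(german_sw))

# Named argument: the column is called `word`, like the token column
sw <- tibble(word = german_sw)
named_names <- names(sw)

# Hypothetical tokenised feedback
tokens <- tibble(word = c("das", "produkt", "ist", "sehr", "gut"))

# No `by` needed: both tibbles share the column `word`
# (dplyr prints a message reporting the column it joined on)
result <- tokens %>% anti_join(sw)
```

Here `unnamed_names` is `"german_sw"` while `named_names` is `"word"`, which is why only the named version joins without `by`.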

Answer 3

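I had the same problem, but instead of creating a new object I used the pipe operator and gave the token column the same name as the stop-word variable: "word". That way anti_join joins the data frames on the shared column name.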

data_clean <- data %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, feedback) %>%
  anti_join(get_stopwords(language = "de")) %>%
  ungroup()

r

stop-words

tidyverse