R: how do I concatenate a string broken into multiple lines?

Question

I have a data frame like this:

df1 <- data.frame(Question=c("This is the start", "of a question", "This is a second", "question"), 
  Answer = c("Yes", "", "No", ""))

           Question Answer
1 This is the start    Yes
2     of a question       
3  This is a second     No
4          question

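This is dummy data; the real data is extracted from a PDF with tabulizer. Whenever a Question contains a line break in the source document, that question is split across multiple rows. How can I join the pieces back together, using an empty Answer as the condition?

The desired result is simply: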

                     Question     Answer
1 This is the start of a question    Yes
2       This is a second question     No

The logic is simple: if Answer[x] is empty, concatenate Question[x-1] and Question[x], then delete row x.
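Read literally, that rule can be sketched in base R as below. This is only an illustration of the stated logic, not one of the answers; the name `result` and the bottom-up loop order are assumptions made for the sketch, and `stringsAsFactors = FALSE` is added so the columns are plain character vectors.

```r
# Dummy data from the question, kept as character columns
df1 <- data.frame(
  Question = c("This is the start", "of a question", "This is a second", "question"),
  Answer = c("Yes", "", "No", ""),
  stringsAsFactors = FALSE
)

# Walk the rows bottom-up so that deleting row x never shifts rows still to be visited
result <- df1
for (x in rev(seq_len(nrow(result)))) {
  if (x > 1 && result$Answer[x] == "") {
    # Append the continuation line to the row above, then drop row x
    result$Question[x - 1] <- paste(result$Question[x - 1], result$Question[x])
    result <- result[-x, ]
  }
}
rownames(result) <- NULL
result
```

Going bottom-up also handles a question that spans more than two rows, because each continuation line is folded into the row above it before that row is itself examined.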

Answer 1

This can no doubt be improved, but if you are happy to use the tidyverse, perhaps something like this would work?

library(dplyr)
library(tidyr)
library(stringr)

df1 %>% 
  mutate(id = if_else(Answer != "", row_number(), NA_integer_)) %>%
  fill(id) %>% group_by(id) %>%
  summarise(Question = str_c(Question, collapse = " "), Answer = first(Answer))

#> # A tibble: 2 x 3
#>      id                        Question Answer
#>   <int>                           <chr> <fctr>
#> 1     1 This is the start of a question    Yes
#> 2     3       This is a second question     No

Answer 2

If I follow your logic, the following should do it:

# test data
dff <- data.frame(Question=c("This is the start",
                             "of a question",
                             "This is a second",
                             "question",
                             "This is a third",
                             "question",
                             "and more space",
                             "yet even more space",
                             "This is actually another question"),
                  Answer = c("Yes",
                             "",
                             "No",
                             "",
                             "Yes",
                             "",
                             "",
                             "",
                             "No"),
                  stringsAsFactors = FALSE)


# solution
do.call(rbind, lapply(split(dff, cumsum(nchar(dff$Answer)>0)), function(x) {
  data.frame(Question=paste0(x$Question, collapse=" "), Answer=head(x$Answer,1))
}))


#                                                        Question Answer
# 1                             This is the start of a question    Yes
# 2                                   This is a second question     No
# 3 This is a third question and more space yet even more space    Yes
# 4                           This is actually another question     No

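The idea is to apply cumsum to the expression nchar(dff$Answer)>0. This creates a grouping vector to use with the split function. After splitting on that vector, you can build a smaller data frame from each piece by concatenating the values in the Question column and taking the first value of the Answer column. Finally, you rbind the resulting data frames.

Hope this helps.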
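To make the grouping step concrete, here is a small sketch of the intermediate values for the test data above. The names `answer`, `starts` and `grp` are introduced only for this illustration.

```r
# Answer column from the test data above
answer <- c("Yes", "", "No", "", "Yes", "", "", "", "No")

# TRUE wherever a new question starts (non-empty Answer)
starts <- nchar(answer) > 0

# Running count of starts: each continuation row inherits the group of the question above it
grp <- cumsum(starts)
grp
#> [1] 1 1 2 2 3 3 3 3 4

# split() then yields one piece per group; these are the piece sizes
lengths(split(answer, grp))
#> 1 2 3 4 
#> 2 2 4 1 
```

Because the counter only increases on rows with a non-empty Answer, a question spread over any number of rows still ends up in a single group.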

Answer 3

Another (very similar) approach, using dplyr:

library(dplyr)

df1 %>% mutate(id = cumsum(!Answer %in% c('Yes', 'No')),
               Q2 = ifelse(Answer == "", paste(lag(Question), Question), ""),
               A2 = ifelse(Answer == "", as.character(lag(Answer)), "")) %>%
        filter(Q2 != "") %>%
        select(id, Question = Q2, Answer = A2)
