R: How do I concatenate a string broken into multiple lines?
I have a data frame that looks like this:
df1 <- data.frame(Question=c("This is the start", "of a question", "This is a second", "question"),
Answer = c("Yes", "", "No", ""))
Question Answer
1 This is the start Yes
2 of a question
3 This is a second No
4 question
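This is dummy data, but the real data was extracted from a PDF with tabulizer. Whenever a Question in the source document contains a line break, that question gets split across multiple rows. How can I join the pieces back together, using an empty Answer as the condition?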
The desired result is simply:
Question Answer
1 This is the start of a question Yes
2 This is a second question No
The logic is simple: if Answer[x] is empty, append Question[x] to Question[x-1] and delete row x.
This can no doubt be improved, but if you are happy to use the tidyverse, maybe something like this would work?
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(id = if_else(Answer != "", row_number(), NA_integer_)) %>%
fill(id) %>% group_by(id) %>%
summarise(Question = str_c(Question, collapse = " "), Answer = first(Answer))
#> # A tibble: 2 x 3
#> id Question Answer
#> <int> <chr> <fctr>
#> 1 1 This is the start of a question Yes
#> 2 3 This is a second question No
If I follow your logic, the following should do the job:
# test data
dff <- data.frame(Question=c("This is the start",
"of a question",
"This is a second",
"question",
"This is a third",
"question",
"and more space",
"yet even more space",
"This is actually another question"),
Answer = c("Yes",
"",
"No",
"",
"Yes",
"",
"",
"",
"No"),
stringsAsFactors = FALSE)
# solution
do.call(rbind, lapply(split(dff, cumsum(nchar(dff$Answer)>0)), function(x) {
data.frame(Question=paste0(x$Question, collapse=" "), Answer=head(x$Answer,1))
}))
# Question Answer
# 1 This is the start of a question Yes
# 2 This is a second question No
# 3 This is a third question and more space yet even more space Yes
# 4 This is actually another question No
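The idea is to use cumsum on the expression nchar(dff$Answer)>0. The running total goes up by one at every row that has a non-empty Answer, so it gives a grouping vector to use with the split function. After splitting on that vector, you can turn each piece into a smaller data frame by concatenating the values in the Question column and taking the first value of the Answer column. You can then rbind the resulting data frames.
Hope this helps.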
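To make the grouping step easier to see, here is a minimal base R sketch that prints the grouping vector for the same test data and then collapses the questions per group. It uses plain vectors and tapply instead of split/lapply; the variable names (question, answer, grp, joined) are only illustrative and not part of the answer above.

```r
# Same test data as above, kept as plain character vectors
question <- c("This is the start", "of a question",
              "This is a second", "question",
              "This is a third", "question",
              "and more space", "yet even more space",
              "This is actually another question")
answer <- c("Yes", "", "No", "", "Yes", "", "", "", "No")

# TRUE wherever a new question starts (non-empty Answer);
# cumsum turns that into a running group number
grp <- cumsum(nchar(answer) > 0)
print(grp)
# [1] 1 1 2 2 3 3 3 3 4

# Paste all fragments that share a group number into one string
joined <- tapply(question, grp, paste, collapse = " ")
print(as.vector(joined))
```

Every continuation row inherits the group number of the last row that had an answer, which is why questions of any length end up in one group.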
..Another (very similar) approach using dplyr
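library(dplyr)
# Note: lag() only looks one row back, so this handles questions split
# across exactly two rows; single-row questions are dropped by the filter.
df1 %>% mutate(id = cumsum(!(Answer %in% c("Yes", "No"))),
Q2 = ifelse(Answer == "", paste(lag(Question), Question), ""),
A2 = ifelse(Answer == "", as.character(lag(Answer)), "")) %>%
filter(Q2 != "") %>%
select(id, Question = Q2, Answer = A2)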
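Because lag() looks back exactly one row, the snippet above appears to assume that every question spans two rows. As a hedged sketch only, the same dplyr style can be combined with a cumsum grouping so that questions of any length are collapsed. This assumes dplyr is installed and that Answer is a character column; the name res is just for illustration.

```r
library(dplyr)

df1 <- data.frame(
  Question = c("This is the start", "of a question",
               "This is a second", "question"),
  Answer = c("Yes", "", "No", ""),
  stringsAsFactors = FALSE
)

res <- df1 %>%
  # a new group starts at every row with a non-empty Answer
  group_by(id = cumsum(Answer != "")) %>%
  # glue the fragments together and keep the answer from the first row
  summarise(Question = paste(Question, collapse = " "),
            Answer = first(Answer))

print(res)
```

Rows before the first non-empty Answer would fall into group 0, so data that starts with a continuation row needs separate handling.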