R: transforming specific rows into columns

Question

I imported some rather messy data from a json file, and it looks like this:

raw_df <- data.frame(text = c(paste0('text', 1:3), '---------- OUTCOME LINE ----------', paste0('text', 4:6), '---------- OUTCOME LINE ----------'),
                              demand = c('cat1', rep('', 2), 'info', 'cat2', rep('', 2), 'info2')
                     )



raw_df
                                text demand
1                              text1   cat1
2                              text2       
3                              text3       
4 ---------- OUTCOME LINE ----------   info
5                              text4   cat2
6                              text5       
7                              text6       
8 ---------- OUTCOME LINE ----------  info2

(By the way, ---------- OUTCOME LINE ---------- is the actual string in my text column.)

I would like to tidy it up so that it has the following format:

final_df
                  text demand outcome
1 text1. text2. text3.   cat1   info1
2 text4. text5. text6.   cat2   info2

What is the fastest and most efficient way to do this? Thanks for any hints.

Answer 1

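Here we use 'grepl' to build a logical index, 'indx', that marks the rows whose 'text' contains a -. We subset 'raw_df' to remove those rows, replace the '' values in 'demand' with NA, and use na.locf to fill each NA with the previous non-NA value, which gives a grouping column. We then aggregate to paste the 'text' values together, grouped by 'demand'. Finally, 'outcome' is created from 'demand' by subsetting it with 'indx'.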

indx <- grepl("-", raw_df$text)
transform(aggregate(text~demand, transform(raw_df[!indx,], 
  demand = zoo::na.locf(replace(demand, demand=="", NA))), toString),
    outcome = raw_df$demand[indx])
#  demand                text outcome
#1   cat1 text1, text2, text3    info
#2   cat2 text4, text5, text6   info2

Or this can be done with data.table:

library(data.table)
setDT(raw_df)[demand == "", demand := NA][!indx, .(text= paste(text, collapse='. ')),
          .(demand = zoo::na.locf(demand))][, outcome := raw_df$demand[indx]][]
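As a further sketch, not part of the original answer, the same grouping can be derived in base R without zoo by counting the separator rows with cumsum. It assumes that every block ends with exactly one OUTCOME LINE row and contains exactly one non-empty 'demand' value. The names is_sep, block and keep are illustrative only.

```r
raw_df <- data.frame(
  text = c(paste0("text", 1:3), "---------- OUTCOME LINE ----------",
           paste0("text", 4:6), "---------- OUTCOME LINE ----------"),
  demand = c("cat1", rep("", 2), "info", "cat2", rep("", 2), "info2"),
  stringsAsFactors = FALSE
)

# TRUE for the separator rows that carry the outcome value
is_sep <- grepl("OUTCOME LINE", raw_df$text, fixed = TRUE)

# Block id = number of separators seen before the current row,
# so each separator row still belongs to the block it closes
block <- cumsum(c(FALSE, head(is_sep, -1)))

keep <- !is_sep
final_df <- data.frame(
  # paste the text rows of each block together
  text = unname(vapply(split(raw_df$text[keep], block[keep]),
                       paste, character(1), collapse = ". ")),
  # the first non-empty demand value of each block (assumed to exist)
  demand = unname(vapply(split(raw_df$demand[keep], block[keep]),
                         function(x) x[x != ""][1], character(1))),
  # the demand value sitting on each separator row
  outcome = raw_df$demand[is_sep],
  stringsAsFactors = FALSE
)

final_df
```

Like the answers here, this joins the text with ". " and does not add the trailing period shown in the question's desired output.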

Answer 2

A dplyr & tidyr solution:

library(dplyr)
library(tidyr)

raw_df %>% 
    mutate(outcome = demand,
           demand = replace(demand, demand == '', NA),
           outcome = replace(outcome, outcome == '', NA),
           outcome = gsub("^cat\\d+", NA, outcome)) %>% 
    fill(demand) %>% 
    fill(outcome, .direction = "up") %>% 
    filter(!grepl("-----", text)) %>%
    group_by(demand, outcome) %>% 
    summarize(text = gsub(",", ".", toString(text))) %>% 
    select(text, everything())

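Fix the text to be displayed as desired: replace blanks with NA and prepare an outcome column.
fill the demand column in the default downward direction, and the outcome column in the upward direction.
filter out the ----- OUTCOME LINE ------ rows based on their hyphens.
Generate a group_concat-style string for the text column, then replace the default , separator with a period.
select the columns into the desired order.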

# A tibble: 2 x 3
# Groups:   demand [2]
                 text demand outcome
                <chr> <fctr>   <chr>
1 text1. text2. text3   cat1    info
2 text4. text5. text6   cat2   info2
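To see what the two fill() calls contribute, it can help to stop the pipeline just before filter() and inspect the intermediate data frame. The sketch below reuses the steps from this answer. The name filled is illustrative, and 'demand' is assumed to be a character column (hence stringsAsFactors = FALSE).

```r
library(dplyr)
library(tidyr)

raw_df <- data.frame(
  text = c(paste0("text", 1:3), "---------- OUTCOME LINE ----------",
           paste0("text", 4:6), "---------- OUTCOME LINE ----------"),
  demand = c("cat1", rep("", 2), "info", "cat2", rep("", 2), "info2"),
  stringsAsFactors = FALSE
)

filled <- raw_df %>%
  mutate(outcome = demand,
         demand = replace(demand, demand == "", NA),
         outcome = replace(outcome, outcome == "", NA),
         # a replacement of NA turns every matching element into NA
         outcome = gsub("^cat\\d+", NA, outcome)) %>%
  fill(demand) %>%                    # carry cat1 / cat2 downwards
  fill(outcome, .direction = "up")    # carry info / info2 upwards

filled$demand
# "cat1" "cat1" "cat1" "info" "cat2" "cat2" "cat2" "info2"
filled$outcome
# "info" "info" "info" "info" "info2" "info2" "info2" "info2"
```

The separator rows still hold info and info2 in 'demand' at this point, which is why they have to be filtered out before grouping.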
