Combine or iterate dplyr rows on specific columns

Question

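I have a dataset that contains two-person chat conversations. I'd like to merge the dataset into a line-by-line conversation between person 1 and person 2.

Sometimes people type several sentences in a row, and these show up as multiple records in the data frame.

Here is pseudocode for what I'm trying to figure out.

If line_by shows that the same person typed several consecutive lines in the chat:
line_text is combined
timestamp is updated to the latest time

Because the dataset contains multiple ids, each representing one conversation between person 1 and person 2, I'd like this to run separately for each unique id.

Here is what the data frame looks like now: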

id    timestamp line_by line_text
1234    02:54.3 Person1 Text Line 1
1234    03:23.8 Person2 Text Line 2
1234    03:47.0 Person2 Text Line 3
1234    04:46.8 Person1 Text Line 4
1234    05:46.2 Person1 Text Line 5
9876    06:44.5 Person2 Text Line 6
9876    07:27.6 Person1 Text Line 7
9876    08:17.5 Person2 Text Line 8
9876    10:20.3 Person2 Text Line 9
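
For anyone who wants to reproduce this, the sample above could be rebuilt roughly as follows. This is only a sketch: the name df is chosen to match the answers, and the column types (integer id, character timestamp) are assumptions, since the real data may store them differently.

```r
# Hypothetical reconstruction of the sample data shown above.
# Assumption: id is an integer and timestamp is kept as a character string.
df <- data.frame(
  id = c(1234L, 1234L, 1234L, 1234L, 1234L, 9876L, 9876L, 9876L, 9876L),
  timestamp = c("02:54.3", "03:23.8", "03:47.0", "04:46.8", "05:46.2",
                "06:44.5", "07:27.6", "08:17.5", "10:20.3"),
  line_by = c("Person1", "Person2", "Person2", "Person1", "Person1",
              "Person2", "Person1", "Person2", "Person2"),
  line_text = paste("Text Line", 1:9),  # "Text Line 1" ... "Text Line 9"
  stringsAsFactors = FALSE
)
```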

I'd like the data to end up like this:

id    timestamp line_by line_text
1234    02:54.3 Person1 Text Line 1
1234    03:47.0 Person2 Text Line 2Text Line 3
1234    05:46.2 Person1 Text Line 4Text Line 5
9876    06:44.5 Person2 Text Line 6
9876    07:27.6 Person1 Text Line 7
9876    10:20.3 Person2 Text Line 8Text Line 9

Disclosure: I asked the same question for Python with pandas. This is where I'm stuck in R.

Answer 1

Try this:

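library(dplyr)
library(data.table)  # for rleid()
df %>%
  # rleid() gives each run of consecutive identical line_by values its own number
  group_by(id, grp = rleid(line_by)) %>%
  # keep the latest timestamp and join the texts of each run
  summarise(timestamp = last(timestamp),
            line_by = unique(line_by), line_text = paste(line_text, collapse=", ")) %>%
  select(-grp)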

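The trick is to group by rleid(...) in addition to id. rleid(line_by) gives every run of consecutive identical line_by values its own number, so consecutive messages from the same person fall into one group.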
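
To see what the run ids look like, here is a small sketch on a plain vector. The vector line_by is made up to mirror the first conversation in the question, and the base R line is only my own equivalent for comparison, not part of the answer's code.

```r
library(data.table)  # provides rleid()

# Hypothetical vector mirroring id 1234 from the question
line_by <- c("Person1", "Person2", "Person2", "Person1", "Person1")

# One integer per run of consecutive identical values
run_id <- rleid(line_by)
print(run_id)  # 1 2 2 3 3

# A base R equivalent built from rle(), shown for comparison
run_id_base <- with(rle(line_by), rep(seq_along(lengths), lengths))
```

Rows 2 and 3 share run id 2, and rows 4 and 5 share run id 3, which is why summarise() then collapses them into one row each.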

Output

# A tibble: 6 x 4
# Groups:   id [2]
     # id timestamp line_by            line_text
  # <int>     <chr>   <chr>                <chr>
# 1  1234   02:54.3 Person1            TextLine1
# 2  1234   03:47.0 Person2 TextLine2, TextLine3
# 3  1234   05:46.2 Person1 TextLine4, TextLine5
# 4  9876   06:44.5 Person2            TextLine6
# 5  9876   07:27.6 Person1            TextLine7
# 6  9876   10:20.3 Person2 TextLine8, TextLine9

Answer 2

A variant using only dplyr:

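library(dplyr)
df %>%
  # grp increases by 1 whenever line_by differs from the previous row
  group_by(id, grp = cumsum(line_by != lag(line_by, 1, "")), line_by) %>%
  summarise(timestamp = last(timestamp), line_text = paste(line_text, collapse = "")) %>%
  # grp is placed before line_by so rows stay in conversation order; ungroup before dropping it
  ungroup() %>%
  select(-grp)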
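
The run counter used for grp can be checked on its own. This is a minimal sketch on a made-up vector that mirrors the first conversation in the question; dplyr::lag() is called with its namespace so it is not confused with stats::lag().

```r
library(dplyr)

# Hypothetical vector mirroring id 1234 from the question
line_by <- c("Person1", "Person2", "Person2", "Person1", "Person1")

# TRUE wherever the speaker differs from the previous row;
# the default "" makes the first row count as a change
changed <- line_by != dplyr::lag(line_by, 1, "")
print(changed)  # TRUE TRUE FALSE TRUE FALSE

# Running total of changes: one number per run of the same speaker
grp <- cumsum(changed)
print(grp)  # 1 2 2 3 3
```

This produces the same run numbering as rleid() without needing data.table.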
