Combine text elements in a data frame and remove rows the text came from
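This toy data frame represents people's time entries. The format I have to work with supplies multiple text entries for the same person and the same day in a completely random pattern. There can be up to 15 text entries for the same person and day. The rows that hold the additional text entries have nothing in the Person column.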
df <- structure(list(Date = structure(c(1514764800, 1514764800, NA,
1517443200, 1519862400, NA, NA, NA, 1519862400, NA, NA), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Person = c("FMC", "ABC", NA, "FMC",
"ABC", NA, NA, NA, "RWM", NA, NA), Text = c("work on request",
"More text", "third line", "email to re: summary", "work on loan documents",
"sixth line of text", "text seven", "eighth in a series", "conferences with working group",
"line ten", "review and provide comments")), row.names = c(NA,
-11L), class = c("tbl_df", "tbl", "data.frame"))
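How can I combine the text elements so that there is only one row per person per day, remove the rows that are no longer needed (once the text has been pasted together), and arrive at the following object?
EDITED QUESTION: I have left out the for loop that I tried and that failed.
There must be a way to merge all the text for a given person on a given date into one row (for example, ABC has two entries on January 1, 2018) and delete the rows the merged text came from.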
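library(dplyr)
# Collapse a character vector into one space-separated string.
merge_lines <- function(x) paste(x, collapse = ' ')
df %>%
  zoo::na.locf(.) %>%   # carry the last non-missing Date/Person forward
  group_by(Person) %>%
  summarise(Text = merge_lines(Text))   # funs() is deprecated in current dplyr
Result: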
# A tibble: 4 x 2
Person Text
<chr> <chr>
1 ABC More text third line
2 FMC work on request email to re: summary
3 HIL work on loan documents sixth line of text text seven eighth in a series
4 RWM conferences with working group line ten review and provide comments
We can use na.locf to fill the missing values (NA) with the last non-missing value, then group_by consecutive runs of Person and summarise Text by combining the lines with paste.
library(dplyr)
library(zoo)
library(data.table)
df %>%
  na.locf(.) %>%
  group_by(group = rleid(Person)) %>%
  summarise(Text = paste0(Text, collapse = " "))
# group Text
# <int> <chr>
#1 1 work on request
#2 2 More text third line
#3 3 email to re: summary
#4 4 work on loan documents sixth line of text text seven eighth in a series
#5 5 conferences with working group line ten review and provide comments
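To make the grouping step more concrete, here is a small base R sketch of the run ids that rleid produces for the filled Person column. It uses rle() instead of data.table, and the vector person is typed in by hand from the toy data after filling, so treat it as an illustration rather than the answer's own code.

```r
# Person column after the NAs have been filled forward (typed in by hand).
person <- c("FMC", "ABC", "ABC", "FMC", "ABC", "ABC", "ABC", "ABC",
            "RWM", "RWM", "RWM")

# rle() finds runs of identical consecutive values; repeating the run
# index by each run's length gives one id per row.
runs <- rle(person)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

run_id
# 1 2 2 3 4 4 4 4 5 5 5
```

Grouping by these ids keeps the two FMC entries and the two ABC entries apart, whereas grouping by Person alone would merge them into three rows.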
For the updated question, we can do
library(dplyr)
library(zoo)
df %>%
  na.locf(.) %>%
  group_by(Date, Person) %>%
  summarise(Text = paste0(Text, collapse = " "))
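For comparison, the same idea can be sketched in base R without zoo or dplyr. This version assumes that every row with a non-missing Person starts a new entry and that the NA rows below it belong to that entry. It numbers entries with cumsum() rather than grouping by Date and Person, so rows stay in their original order. The data frame is rebuilt here with plain Date values instead of POSIXct, and the names entry and result are only illustrative.

```r
# Toy data from the question, with Date simplified to the Date class.
df <- data.frame(
  Date = as.Date(c("2018-01-01", "2018-01-01", NA, "2018-02-01", "2018-03-01",
                   NA, NA, NA, "2018-03-01", NA, NA)),
  Person = c("FMC", "ABC", NA, "FMC", "ABC", NA, NA, NA, "RWM", NA, NA),
  Text = c("work on request", "More text", "third line",
           "email to re: summary", "work on loan documents",
           "sixth line of text", "text seven", "eighth in a series",
           "conferences with working group", "line ten",
           "review and provide comments"),
  stringsAsFactors = FALSE
)

# Each non-missing Person marks the start of an entry; cumsum() numbers them.
entry <- cumsum(!is.na(df$Person))

# Keep only the first row of each entry and replace its Text with all the
# lines of that entry pasted together.
result <- df[!is.na(df$Person), ]
result$Text <- unname(
  vapply(split(df$Text, entry), paste, character(1), collapse = " ")
)

result
```

Because nothing is filled in, the continuation rows simply disappear once their text has been pasted onto the row that opened the entry.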
No need to overcomplicate it; just use the tidyverse.
Adjusted to reflect the change in the question:
library(tidyverse)
df %>%
  fill(Date, Person) %>% # Fill missing values using the previous entry.
  group_by(Date, Person) %>%
  summarise(Text = paste(Text, collapse = ' '))
# A tibble: 5 x 3
Date Person Text
<dttm> <chr> <chr>
1 2018-01-01 00:00:00 ABC More text third line
2 2018-01-01 00:00:00 FMC work on request
3 2018-02-01 00:00:00 FMC email to re: summary
4 2018-03-01 00:00:00 ABC work on loan documents sixth line of text text seven eighth in a series
5 2018-03-01 00:00:00 RWM conferences with working group line ten review and provide comments
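What fill() and na.locf() do to a single column can be sketched in a few lines of base R. The helper name locf below is made up for this illustration and is not part of tidyr or zoo; unlike the packaged versions it has no options, and it leaves leading NAs in place.

```r
# Hypothetical helper: carry the last non-missing value forward.
locf <- function(x) {
  # Index of the most recent non-missing value seen so far (0 if none yet).
  idx <- cumsum(!is.na(x))
  # Positions before the first non-missing value stay NA.
  idx[idx == 0] <- NA
  x[!is.na(x)][idx]
}

person <- c("FMC", "ABC", NA, "FMC", "ABC", NA, NA, NA, "RWM", NA, NA)
filled <- locf(person)
filled
```

After this step every continuation row carries the Person (and, applied to the Date column, the Date) of the row that opened the entry, which is what makes the later group_by(Date, Person) work.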
Data:
# A tibble: 11 x 3
Date Person Text
<dttm> <chr> <chr>
1 2018-01-01 00:00:00 FMC work on request
2 2018-01-01 00:00:00 ABC More text
3 NA NA third line
4 2018-02-01 00:00:00 FMC email to re: summary
5 2018-03-01 00:00:00 ABC work on loan documents
6 NA NA sixth line of text
7 NA NA text seven
8 NA NA eighth in a series
9 2018-03-01 00:00:00 RWM conferences with working group
10 NA NA line ten
11 NA NA review and provide comments