How to break down one observation into several sub-observations?
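My dataframe contains a collection of articles: df$title holds each title and df$text holds the content of each article. I need to split every article into several paragraphs. Here is how I break down a single article: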
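library(stringr)  # provides str_replace_all()
pattern <- "\\bM(?:rs?|s)\\.\\s"  # "Mr. ", "Mrs. " or "Ms. "; backslashes are doubled inside R strings
aa <- str_replace_all(text1, pattern, "XXXX")  # mark each split point
bb <- unlist(strsplit(aa, "XXXX"))
cc <- bb[-1]  # drop the text before the first marker
dd <- gsub("\\", " ", cc, fixed = TRUE)  # replace literal backslashes with spaces
paragraph_vector <- gsub("[^[:alnum:]]", " ", dd)  # replace remaining non-alphanumeric characters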
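How can I label each paragraph with its article title and apply this splitting to the entire column (df$text)? I want each paragraph to be an observation, rather than each article.
Here is a simple example in which paragraphs are separated by a blank line (two newline characters, "\n\n"):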
library(tidyverse)
data <- tibble(
  title = c("The Book of words", "A poem"),
  text = c("It was a dark and stormy night. \n\n And this is another paragraph.", "This\n\nis\n\nthe\n\nEnd")
)
cat(data$text[[1]])
#> It was a dark and stormy night.
#>
#> And this is another paragraph.
cat(data$text[[2]])
#> This
#>
#> is
#>
#> the
#>
#> End
data %>%
  transmute(
    title,
    paragraph = text %>% map(~ {
      .x %>%
        str_split("\n\n") %>%
        simplify() %>%
        map_chr(str_trim)
    })
  ) %>%
  unnest(paragraph)
#> # A tibble: 6 × 2
#>   title             paragraph
#>   <chr>             <chr>
#> 1 The Book of words It was a dark and stormy night.
#> 2 The Book of words And this is another paragraph.
#> 3 A poem            This
#> 4 A poem            is
#> 5 A poem            the
#> 6 A poem            End
Created on 2021-09-26 by the reprex package (v2.0.1)
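The same idea should carry over to the delimiter from the question. The following is only a sketch under some assumptions: the `df` below is a made-up stand-in for the real data, the honorific pattern ("Mr. ", "Mrs. ", "Ms. ") is taken as the paragraph marker, and the `str_squish()` call that collapses leftover spaces is an addition that was not in the original steps.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical stand-in for the asker's df (title + text columns)
df <- tibble(
  title = c("First article", "Second article"),
  text = c(
    "Intro. Mr. Smith spoke first. Ms. Jones replied.",
    "Preface. Mrs. Brown agreed."
  )
)

# Same pattern as in the question, with backslashes doubled for R strings
pattern <- "\\bM(?:rs?|s)\\.\\s"

paragraphs <- df %>%
  transmute(
    title,
    # str_split() returns a list-column: one character vector per article
    paragraph = str_split(text, pattern),
    # drop the text before the first marker, mirroring bb[-1]
    paragraph = lapply(paragraph, function(p) p[-1])
  ) %>%
  # one row per paragraph, with the title repeated
  unnest(paragraph) %>%
  # replace non-alphanumeric characters with spaces, then tidy whitespace
  mutate(paragraph = str_squish(str_replace_all(paragraph, "[^[:alnum:]]", " ")))

print(paragraphs)
```

Note that an article containing no marker at all ends up with an empty vector and is dropped by `unnest()`; pass `keep_empty = TRUE` if such articles should be kept as a row with `NA`.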