R: Quickly generate partial sequences
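I want to generate text sequences with an RNN trained on text snippets (as I have done before in articles like this).
The first step is to take the text snippets and break them into subsequences for training the model: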
c("E","X","A","M","P","L","E")
would become
c("E")
c("E","X")
c("E","X","A")
...
My current approach is to map over each word:
require(tidyverse)
data <- data_frame(id = c(1,2),word = list(c("E","X","A","M","P","L","E"), c("R","S","T","U","D","I","O")))
result <- data %>%
pmap(function(id,word){
subs <- map(1:length(word),function(i) word[1:i])
data_frame(id = id, sub = subs)
}) %>%
bind_rows()
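But this is very slow on large datasets. Is there a fast way to generate all of these partial sequences?
It turns out the problem was calling data_frame inside the map function. Apparently, creating data frames is slow. If you give up data frames and stick with lists, it runs much faster: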
result <- data %>%
pmap(function(id,word){
map(1:length(word),function(i) list(id = id, sub = word[1:i]))
}) %>%
purrr::flatten()
I was hoping to use bind_rows() at the end to turn it all into a data_frame, but for some reason that function does not work with list columns.
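One possible workaround, sketched here in base R so it runs without extra packages: pull the id values and the sub vectors out of the flat list separately, then attach the vectors as a list column. The small `result` list below is a hypothetical stand-in for the output of the pmap/flatten pipeline, and `words`, `word_ids`, `ids`, `subs`, and `out` are illustrative names, not part of the original code.

```r
# Hypothetical stand-in for the flattened `result` list:
# every element has the shape list(id = <number>, sub = <character vector>).
words <- list(c("E", "X", "A"), c("R", "S"))
word_ids <- c(1, 2)
result <- unlist(
  lapply(seq_along(words), function(k) {
    lapply(seq_along(words[[k]]), function(i) {
      list(id = word_ids[k], sub = words[[k]][seq_len(i)])
    })
  }),
  recursive = FALSE
)

# Extract the two fields separately instead of binding row by row.
ids  <- vapply(result, function(r) r$id, numeric(1))
subs <- lapply(result, function(r) r$sub)

# Build the data frame once, then attach the list column.
out <- data.frame(id = ids)
out$sub <- subs
```

With the tibble package loaded, tibble::tibble(id = ids, sub = subs) should give the same shape as a tibble with a list column.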
You are looking for Reduce with accumulate = TRUE:
a <- c("E","X","A","M","P","L","E")
Reduce(c, a, accumulate = TRUE)
[[1]]
[1] "E"
[[2]]
[1] "E" "X"
[[3]]
[1] "E" "X" "A"
[[4]]
[1] "E" "X" "A" "M"
[[5]]
[1] "E" "X" "A" "M" "P"
[[6]]
[1] "E" "X" "A" "M" "P" "L"
[[7]]
[1] "E" "X" "A" "M" "P" "L" "E"
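For comparison, a small base R sketch showing that the accumulated Reduce produces the same prefixes as slicing by index, which is what the original map call does. The names `by_reduce`, `by_index`, `same`, and `single` are illustrative. The sketch also passes simplify = FALSE, an optional argument of Reduce: with the default simplify = TRUE, accumulated results that all have length one are unlisted, so a one-letter word would come back as a plain vector rather than a list.

```r
a <- c("E", "X", "A", "M", "P", "L", "E")

# Prefixes built by repeatedly appending the next letter.
by_reduce <- Reduce(c, a, accumulate = TRUE, simplify = FALSE)

# Prefixes built by slicing the first i letters.
by_index <- lapply(seq_along(a), function(i) a[seq_len(i)])

# Both approaches give the same list of character vectors.
same <- identical(by_reduce, by_index)

# A one-letter word still yields a list when simplify = FALSE.
single <- Reduce(c, "E", accumulate = TRUE, simplify = FALSE)
```

Using seq_along(a) rather than 1:length(a) also avoids the 1:0 surprise if a word is ever empty.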
So, to apply this to your data, you can do the following:
data %>%
group_by(id) %>%
mutate(word = list(Reduce(c, unlist(word), accumulate = TRUE))) %>%
unnest(word)
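To do the same thing in purrr, you can use the accumulate function:
purrr::accumulate(a,c)
Although this is a purrr function, at least in the version shown here it essentially just calls the base Reduce function. That is: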
purrr::accumulate
function (.x, .f, ..., .init)
{
.f <- as_mapper(.f, ...)
f <- function(x, y) {
.f(x, y, ...)
}
Reduce(f, .x, init = .init, accumulate = TRUE)#THIS IS USING THE BASE FUNCTION Reduce
}
<environment: namespace:purrr>
Using lapply with Reduce may be faster here:
x <- lapply(data$word, function(w){
Reduce(c, w, accumulate = TRUE)})
Then you can bind them back into a data_frame like this:
# repeat each id once per prefix generated for its word
id2 <- rep(data$id, lengths(x))
data2 <- data_frame(id2, subs=unlist(x, recursive=FALSE))
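To check a claim like this on your own data, a rough timing sketch in base R follows. The random `words` input, its size, and the helper names `prefix_reduce` and `prefix_index` are assumptions made up for illustration; timings vary by machine, so none are shown here.

```r
set.seed(1)
# Hypothetical input: 2000 random "words" of 5 to 15 letters each.
words <- replicate(2000, sample(LETTERS, sample(5:15, 1), replace = TRUE),
                   simplify = FALSE)
ids <- seq_along(words)

# Approach 1: accumulate prefixes with Reduce.
prefix_reduce <- function(ws) {
  lapply(ws, function(w) Reduce(c, w, accumulate = TRUE, simplify = FALSE))
}

# Approach 2: slice each word by index.
prefix_index <- function(ws) {
  lapply(ws, function(w) lapply(seq_along(w), function(i) w[seq_len(i)]))
}

t_reduce <- system.time(x1 <- prefix_reduce(words))
t_index  <- system.time(x2 <- prefix_index(words))

# Flatten one level and repeat each id once per prefix.
id2  <- rep(ids, lengths(x1))
subs <- unlist(x1, recursive = FALSE)
```

Printing t_reduce and t_index shows the elapsed time of each approach; id2 and subs can then be combined into a data frame with a list column.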