R taking list objects into table

I have a list that contains many objects. I want to create a table from the properties of this list.

head(cascade.list)
$`444424960908754944`
screen_name     tweet_id        tweet_created_at        retweet_screen_name
NerSeref        6.028628e+17    2015-05-25 11:44:24     Lasthowen
DURULMA_ZAMANI  6.028631e+17    2015-05-25 11:45:32     Lasthowen
ssari75         6.028647e+17    2015-05-25 11:52:10     Lasthowen   
saintserif2009  6.028672e+17    2015-05-25 12:01:48     Lasthowen
Hejinilim       6.028721e+17    2015-05-25 12:21:13     Lasthowen

$`407136916317171712`
screen_name     tweet_id        tweet_created_at        retweet_screen_name
isa_sakar       6.072663e+17    2015-06-06 15:22:18     cavurizmir
canfeda1923     6.072666e+17    2015-06-06 15:23:34     cavurizmir
Apolloniuss_58  6.072669e+17    2015-06-06 15:24:47     cavurizmir

I need to create a table that must contain these columns:

table
retweet_screen_name screen_name         length  life(seconds)
Lasthowen           Hejinilim           5       2209
cavurizmir          Apolloniuss_58      3       149 

I used this function, which solved half of the problem:

get.summary <- function(i){
        curr.frame = cascade.list[[i]]
        return(c(unique(curr.frame$retweet_screen_name),curr.frame$screen_name[nrow(curr.frame)],
                 unique(curr.frame$retweet_created_at), curr.frame$tweet_created_at[nrow(curr.frame)], 
                 nrow(curr.frame)))
}    

and this code:

cdf=data.frame(t(sapply(1:length(cascade.list),get.summary)))

It creates a data frame in which all the variables end up in the same row.

 V1                                                                                     V2
c("EastanbulTimes", "onuryasercan", "2010-12-20 15:18:22", "2015-05-19 18:28:25", "1")  c("Lasthowen", "Apolloniuss_58", "2013-12-01 08:19:39", "2015-06-06 15:24:47", "3")

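I need to fix the data frame structure: it should have 6 columns, and the number of rows should equal the length of the list. I also need to add a time variable.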

Thanks in advance for any suggestions.

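Because cascade.list is a list of dataframes with an equal number of columns, you can bind them together into one dataset and then perform the aggregation you need. An implementation with data.table:
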
# make a list of the dataframes (see below for the used dataframes)
dflist <- list(df1,df2)
# bind the dataframes together into one datatable (which is an enhanced dataframe)
library(data.table)
DT <- rbindlist(dflist)

With the resulting data.table, you can now perform the required summary as follows:

DT[, .(screen_name = screen_name[.N],
       length = .N,
       life_in_seconds = difftime(tweet_created_at[.N], tweet_created_at[1], units="secs")),
   by = .(retweet_screen_name)]

This results in:

   retweet_screen_name    screen_name length life_in_seconds
1:           Lasthowen      Hejinilim      5       2209 secs
2:          cavurizmir Apolloniuss_58      3        149 secs

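Explanation:

  • .N is a special data.table symbol that gives you the number of rows in the group (or in the whole data.table when no grouping is used).
  • screen_name[.N] gives you the last screen_name, because it is indexed by the number of rows and therefore returns the last observation of each group. Likewise, screen_name[1] gives you the first observation of each group.
  • difftime is more or less self-explanatory. With units you can specify how the time difference is expressed. See ?difftime for the possibilities.
  • With by = you specify which columns should be used to group the data.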
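
The points above can be tried out on a small toy table. This is only a sketch: the table toy and its columns grp, name and ts are made up for illustration, and the timestamps are assumed to be in UTC. Wrapping difftime in as.numeric is an optional extra that returns plain numbers instead of "secs" values.

```r
library(data.table)

# Hypothetical toy data (not from the question): two groups with known timestamps
toy <- data.table(
  grp  = c("a", "a", "a", "b", "b"),
  name = c("x1", "x2", "x3", "y1", "y2"),
  ts   = as.POSIXct(c("2015-05-25 11:00:00", "2015-05-25 11:00:30",
                      "2015-05-25 11:02:00", "2015-06-06 15:00:00",
                      "2015-06-06 15:00:45"), tz = "UTC")
)

# Without grouping, .N is the number of rows of the whole table
total_rows <- toy[, .N]

# With grouping, .N is the group size, so name[.N] is the last name per group
res <- toy[, .(last_name = name[.N],
               n         = .N,
               life_secs = as.numeric(difftime(ts[.N], ts[1], units = "secs"))),
           by = grp]
print(res)
```

Note that name[.N] is only the "last" observation in row order, so the rows are assumed to be sorted by time within each group.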

A similar operation can be done with dplyr:

library(dplyr)

newdf <- bind_rows(dflist)

newdf %>% group_by(retweet_screen_name) %>% 
  summarise(screen_name = last(screen_name),
            length = n(),
            life_in_seconds = difftime(last(tweet_created_at), first(tweet_created_at), units="secs"))
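
If you would rather stay close to the original get.summary approach and avoid extra packages, a base R sketch along the following lines may work. The helper name summarise.cascade is hypothetical, and the cascade.list built here is only a stand-in that reuses the screen names and timestamps from the question (UTC is assumed). Returning a one-row data.frame per list element, instead of a vector built with c(), keeps each value in its own column with its own type.

```r
# Stand-in for cascade.list: a named list of data frames (values taken from the question)
cascade.list <- list(
  `444424960908754944` = data.frame(
    screen_name = c("NerSeref", "DURULMA_ZAMANI", "ssari75", "saintserif2009", "Hejinilim"),
    tweet_created_at = as.POSIXct(c(1432547064, 1432547132, 1432547530, 1432548108, 1432549273),
                                  origin = "1970-01-01", tz = "UTC"),
    retweet_screen_name = "Lasthowen",
    stringsAsFactors = FALSE),
  `407136916317171712` = data.frame(
    screen_name = c("isa_sakar", "canfeda1923", "Apolloniuss_58"),
    tweet_created_at = as.POSIXct(c(1433596938, 1433597014, 1433597087),
                                  origin = "1970-01-01", tz = "UTC"),
    retweet_screen_name = "cavurizmir",
    stringsAsFactors = FALSE)
)

# Hypothetical helper: one summary row per cascade
summarise.cascade <- function(df) {
  n <- nrow(df)
  data.frame(
    retweet_screen_name = as.character(df$retweet_screen_name[1]),
    screen_name         = as.character(df$screen_name[n]),   # last retweeter
    length              = n,                                  # cascade size
    life_seconds        = as.numeric(difftime(df$tweet_created_at[n],
                                              df$tweet_created_at[1],
                                              units = "secs")),
    stringsAsFactors = FALSE)
}

# One row per list element, bound into a single data frame
cdf <- do.call(rbind, lapply(cascade.list, summarise.cascade))
print(cdf)
```

This yields one row per element of the list, which is the structure the question asks for.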

Used data:

df1 <- structure(list(screen_name = structure(c(3L, 1L, 5L, 4L, 2L), .Label = c("DURULMA_ZAMANI", "Hejinilim", "NerSeref", "saintserif2009", "ssari75"), class = "factor"), tweet_id = c(6.028628e+17, 6.028631e+17, 6.028647e+17, 6.028672e+17, 6.028721e+17), tweet_created_at = structure(c(1432547064, 1432547132, 1432547530, 1432548108, 1432549273), class = c("POSIXct", "POSIXt"), tzone = ""), retweet_screen_name = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Lasthowen", class = "factor")), .Names = c("screen_name", "tweet_id", "tweet_created_at", "retweet_screen_name"), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(screen_name = structure(c(3L, 2L, 1L), .Label = c("Apolloniuss_58", "canfeda1923", "isa_sakar"), class = "factor"), tweet_id = c(6.072663e+17, 6.072666e+17, 6.072669e+17), tweet_created_at = structure(c(1433596938, 1433597014, 1433597087), class = c("POSIXct", "POSIXt"), tzone = ""), retweet_screen_name = structure(c(1L, 1L, 1L), .Label = "cavurizmir", class = "factor")), .Names = c("screen_name", "tweet_id", "tweet_created_at", "retweet_screen_name"), class = "data.frame", row.names = c(NA, -3L))
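
One caveat with grouping by retweet_screen_name: if the same account started more than one cascade, those cascades would be merged into a single group. Since the list in the question is named by an ID, a possible variant is to keep that name as a column via the idcol argument of rbindlist and group by it instead. The list cascades, its names c1 and c2, the users u1 to u5 and the column name cascade_id are all made up for this sketch.

```r
library(data.table)

# Hypothetical named list: two separate cascades started by the same account
cascades <- list(
  c1 = data.frame(screen_name = c("u1", "u2"),
                  tweet_created_at = as.POSIXct(c(1432547064, 1432547132),
                                                origin = "1970-01-01", tz = "UTC"),
                  retweet_screen_name = "Lasthowen",
                  stringsAsFactors = FALSE),
  c2 = data.frame(screen_name = c("u3", "u4", "u5"),
                  tweet_created_at = as.POSIXct(c(1433596938, 1433597014, 1433597087),
                                                origin = "1970-01-01", tz = "UTC"),
                  retweet_screen_name = "Lasthowen",
                  stringsAsFactors = FALSE)
)

# idcol stores the list names in a new column, so each cascade stays identifiable
DT <- rbindlist(cascades, idcol = "cascade_id")

# Group by the cascade ID rather than by the retweeted account
res <- DT[, .(retweet_screen_name = retweet_screen_name[1],
              screen_name         = screen_name[.N],
              length              = .N,
              life_in_seconds     = as.numeric(difftime(tweet_created_at[.N],
                                                        tweet_created_at[1],
                                                        units = "secs"))),
          by = cascade_id]
print(res)
```

With this grouping the two cascades stay on separate rows even though both have the same retweet_screen_name.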